>>Tasks
● Automate Blacklisting
Fetch a list of child abuse media sites and remove these sites from Elasticsearch. Also add MD5 checksums of child abuse websites to the banned database for others to check.
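A minimal sketch of the checksum step, assuming the MD5 is computed over the lower-cased onion domain name (the proposal does not fix what exactly is hashed). The function names `onion_md5` and `filter_banned` are hypothetical and only illustrate how a crawler could check a domain against the shared list without the list exposing the addresses in clear text.

```python
import hashlib


def onion_md5(domain: str) -> str:
    """Return the MD5 hex digest of a normalised onion domain.

    Assumption: the banned list stores hashes of the bare domain name,
    lower-cased and stripped of surrounding whitespace.
    """
    normalised = domain.strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def filter_banned(domains, banned_hashes):
    """Split domains into (allowed, banned) using only the hashed list."""
    allowed, banned = [], []
    for domain in domains:
        if onion_md5(domain) in banned_hashes:
            banned.append(domain)
        else:
            allowed.append(domain)
    return allowed, banned
```

The banned entries returned here are the ones that would then be deleted from the Elasticsearch index.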
Improve the existing Add page so that adding a website stores the data in the SQL database under '/onionsadded'. From there the crawler can crawl these websites once a day. Entries are removed after one week so that the list stays fresh.
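The store-then-expire flow could look roughly like the sketch below. It uses SQLite from the standard library purely for illustration; the table name `onions_added`, its columns, and the helper names are assumptions, not the existing Ahmia schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=7)  # entries older than one week are dropped


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS onions_added ("
        "onion TEXT PRIMARY KEY, added_at REAL NOT NULL)"
    )
    return conn


def add_onion(conn, onion, now=None):
    """Store a submitted onion; re-submitting refreshes its timestamp."""
    now = now or datetime.now(timezone.utc)
    conn.execute(
        "INSERT OR REPLACE INTO onions_added (onion, added_at) VALUES (?, ?)",
        (onion, now.timestamp()),
    )
    conn.commit()


def purge_old(conn, now=None):
    """Delete entries older than MAX_AGE and return how many were removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - MAX_AGE).timestamp()
    cur = conn.execute("DELETE FROM onions_added WHERE added_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def crawl_targets(conn):
    """Return the onions the daily crawl should visit."""
    rows = conn.execute("SELECT onion FROM onions_added ORDER BY onion")
    return [row[0] for row in rows]
```

A daily job would call `purge_old` first and then hand the result of `crawl_targets` to the crawler.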
● Data visualization
Graphs need to be plotted for various statistics on the Statistics page. Some examples include:
○ Linking structure between sites and keyword-based labeling for onions in the graph
○ Popularity of domains according to backlinks and search clicks
I plan to use either Google Charts or D3.js to plot these graphs.
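As a rough illustration of the data preparation behind the popularity graph, the sketch below counts backlinks per domain from a list of (source, target) link pairs and shapes the result as a header row plus data rows, which is the layout Google Charts accepts for a simple table. The input format and the function names are assumptions; the real data would come from the crawler's index.

```python
from collections import Counter


def backlink_counts(links):
    """Count distinct linking domains per target domain.

    `links` is an iterable of (source_domain, target_domain) pairs.
    Duplicate edges and self-links are ignored.
    """
    unique_edges = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_edges)


def to_chart_rows(counts, top=10):
    """Return a header row followed by the `top` most linked domains."""
    rows = [["Domain", "Backlinks"]]
    rows.extend([domain, n] for domain, n in counts.most_common(top))
    return rows
```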
As of now, the Ahmia crawler uses Polipo as an HTTP proxy to route Tor traffic. Since Polipo is no longer maintained and torsocks can provide better functionality, the crawler code needs to be updated to use torsocks. Modules like socksipy can be used to connect the crawler to torsocks.
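One possible shape for the Polipo replacement is sketched below. Note that it swaps in a slightly different technique from wrapping the process with torsocks: it points an HTTP client directly at Tor's local SOCKS port. The helper name `tor_proxies` is hypothetical, port 9050 is Tor's default SocksPort, and the `socks5h` scheme (understood by the `requests` library when its SOCKS extra is installed) makes the proxy resolve hostnames, which .onion addresses require.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a proxy mapping that sends HTTP and HTTPS through Tor's SOCKS port.

    Assumption: a local Tor daemon is listening on host:port.
    """
    url = f"socks5h://{host}:{port}"
    return {"http": url, "https": url}


# Hypothetical usage, not executed here because it needs a running Tor
# daemon and the third-party `requests[socks]` package:
#   import requests
#   requests.get("http://<some-onion-address>/", proxies=tor_proxies())
```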
Ahmia settings should be adjusted to support Elasticsearch 5.x. The upgrade will require a full cluster restart, since rolling upgrades are not supported across major versions. Upgrading includes replacing Groovy scripts with Painless, a sandboxed scripting language built specifically for Elasticsearch, which replaced Groovy as the default scripting language in Elasticsearch 5.0.0.
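To give an idea of what the Groovy to Painless migration involves, the sketch below builds the body of an update request that increments a click counter, once in the old style and once in the new. The field name `clicks` and both function names are hypothetical. The main visible difference is that Painless reads parameters through the `params` map instead of as bare variables.

```python
def groovy_click_update(count):
    """Update body in the pre-5.0 Groovy style (parameter used as a bare name)."""
    return {
        "script": {
            "lang": "groovy",
            "inline": "ctx._source.clicks += count",
            "params": {"count": count},
        }
    }


def painless_click_update(count):
    """Equivalent body for Elasticsearch 5.x using Painless.

    The script text sits under the "inline" key here; the key name may
    differ in later releases, so check the documentation for the exact
    target version.
    """
    return {
        "script": {
            "lang": "painless",
            "inline": "ctx._source.clicks += params.count",
            "params": {"count": count},
        }
    }
```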
● Detailed documentation and updated software dependencies
Detailed documentation will be provided at ahmia.fi[0] as well as on the GitHub page[1].
● Advanced search options
The advanced search options listed below can be incorporated into the search bar to allow more customisable searches.
○ Double quotes (""): returns pages that contain exactly "term" (case sensitive).
○ AND operator (&&): logical AND, i.e. it returns pages that contain all of the queries separated by ‘&&’.
○ OR operator (||): logical OR, i.e. it returns pages that contain any of the queries separated by ‘||’.
This is one of the optional tasks I have included. If any of the features mentioned above is not completed in the given timeline, this feature will be dropped and priority will be given to the uncompleted task.
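The intended semantics of the three search operators can be sketched in plain Python as follows. This is only an illustration of the matching rules against a page's text, not the planned Elasticsearch implementation. It assumes that ‘&&’ binds more tightly than ‘||’, that unquoted terms match case-insensitively, and that quoted phrases do not themselves contain an operator. The function names are hypothetical.

```python
def _term_matches(term, text):
    """Match one term: quoted terms are exact and case sensitive."""
    term = term.strip()
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return term[1:-1] in text
    return term.lower() in text.lower()


def matches(query, text):
    """Return True if `text` satisfies `query`.

    The query is read as an OR of AND-groups: it is split on '||' first,
    and each alternative is then split on '&&'.
    """
    for alternative in query.split("||"):
        terms = [t for t in alternative.split("&&") if t.strip()]
        if terms and all(_term_matches(t, text) for t in terms):
            return True
    return False
```

For example, with the page text "Tor Hidden Service directory", the query `"Hidden Service"` matches, `"hidden service"` does not, and `bitcoin || directory` matches through its second alternative.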