Hello everyone!
I am Pushkar, one of the students accepted for GSoC 2017. I am a third-year undergraduate at the International Institute of Information Technology, Hyderabad (India). I have been working on 'Ahmia - Hidden Service Search' [0][1] for some time now and will be extending my contribution through GSoC this summer. I am being mentored by Juha Nurmi (numes) and George (asn).
Ahmia is a search engine that indexes, searches, and catalogs content published on Tor Hidden Services. Furthermore, it is a medium to share meaningful insights, statistics, and news about the Tor network itself. There are several improvements and upgrades required in Ahmia.
Tasks
● Automate Blacklisting
Fetch a list of child abuse media sites and remove these sites from Elasticsearch. Also add MD5 checksums of the child abuse websites to the banned database so that others can check against it.
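The checksum step could look roughly like the sketch below. It assumes the MD5 is taken over the normalized onion domain name; the function names and the set of banned hashes are hypothetical and only illustrate the idea.

```python
import hashlib


def onion_md5(domain):
    """Return the MD5 hex digest of a normalized onion domain (assumed input)."""
    normalized = domain.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def filter_banned(domains, banned_hashes):
    """Keep only the domains whose MD5 digest is not in the banned set."""
    return [d for d in domains if onion_md5(d) not in banned_hashes]
```

Publishing only the digests lets others check a domain against the list without the list itself naming the sites in clear text.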
● Add Hidden Services page
Improve the existing Add page so that adding a website stores the data in an SQL database under '/onionsadded'. From there the crawler can crawl these websites once a day. Entries are removed after one week so that the list stays fresh.
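The one-week expiry can be sketched as follows. This is only an illustration on plain Python tuples; the real implementation would run against the SQL database, and the names used here are hypothetical.

```python
from datetime import datetime, timedelta

# Entries older than this are dropped from the list (one week, as planned).
MAX_AGE = timedelta(weeks=1)


def fresh_entries(entries, now):
    """Return the (onion, added_at) pairs added within the last week."""
    cutoff = now - MAX_AGE
    return [(onion, added_at) for onion, added_at in entries if added_at >= cutoff]
```

Passing `now` in explicitly keeps the function easy to test and lets a daily job use a single timestamp for the whole run.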
● Data visualization
Graphs need to be plotted for various statistics on the Statistics page. Some examples include:
○ Linking structure between sites, with keyword-based labeling for onions in the graph
○ Popularity of domains according to backlinks and search clicks
I plan to use either Google Charts or D3.js to plot these graphs.
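As a rough sketch of how the backlink popularity data could be prepared before plotting, the snippet below counts distinct linking domains per target. The input format (pairs of source and target domains) is an assumption, not the current Ahmia data model.

```python
from collections import Counter


def backlink_counts(links):
    """Count distinct linking domains per target domain.

    links: iterable of (source_domain, target_domain) pairs (assumed format).
    Self-links and duplicate pairs are ignored.
    """
    unique = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique)
```

The resulting counts can be serialized to JSON and handed to whichever charting library is chosen.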
● Replace Polipo with Tor Socks5 proxy in ahmia-crawler
As of now, the Ahmia crawler uses Polipo as an HTTP proxy to direct Tor traffic. Since Polipo is no longer maintained and torsocks can provide better functionality, the crawler code needs to be updated to use torsocks. Modules like socksipy can be used to connect the crawler to torsocks.
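To illustrate what talking to Tor over SOCKS5 directly means, here is a small sketch that builds a proxy mapping in the style accepted by the requests library (with SOCKS support installed). This is a swapped-in illustration rather than the socksipy approach mentioned above, and it assumes Tor is listening on its default SOCKS port 9050 on localhost. The function name is hypothetical.

```python
def tor_proxies(host="127.0.0.1", port=9050):
    """Build a requests-style proxy mapping that points at Tor's SOCKS5 port.

    The 'socks5h' scheme asks the proxy to resolve hostnames, which is
    needed for .onion addresses. Host and port defaults are assumptions.
    """
    url = "socks5h://{}:{}".format(host, port)
    return {"http": url, "https": url}
```

With this in place there is no separate HTTP proxy process between the crawler and Tor.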
● Upgrade support from Elastic 2.4.0 to 5.X
Ahmia's settings should be adjusted to support Elastic 5.X. The upgrade will require a full cluster restart, since rolling upgrades are not supported across major versions. Upgrading also includes replacing Groovy scripts with Painless, a sandboxed scripting language designed for Elasticsearch that replaced Groovy as the default in Elastic 5.0.0.
● Detailed documentation and updated software dependencies
Write detailed documentation for ahmia.fi [0] as well as for the GitHub page [1], and update the software dependencies.
● Advanced search options
Advanced search options, as listed below, can be incorporated into the search bar to allow more customisable searches.
○ Double quotes (""): Returns pages that contain exactly "term" (case sensitive).
○ AND operator (&&): Logical AND, i.e. it returns the pages that contain all of the queries separated by '&&'.
○ OR operator (||): Logical OR, i.e. it returns the pages that contain any of the queries separated by '||'.
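One possible way to support these operators is Elasticsearch's `query_string` query, whose syntax already understands quoted phrases, `&&` and `||`. The sketch below only builds the request body; whether Ahmia ends up using `query_string`, and the field name `content`, are assumptions.

```python
def build_search_body(user_query, field="content"):
    """Build an Elasticsearch request body that passes the user's query,
    including quotes, && and ||, to the query_string parser.

    The default field name is a hypothetical placeholder.
    """
    return {
        "query": {
            "query_string": {
                "query": user_query,
                "default_field": field,
            }
        }
    }
```

For example, `build_search_body('"hidden wiki" && mirror')` would ask for pages that contain the phrase and the extra term.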
This is one of the optional tasks I have included. If any of the features mentioned above is not completed within the given timeline, this feature will be dropped and priority will be given to the unfinished task.
Timeline
Week 1 - Automating blacklisting of onions with child abuse content
Week 2 - Tweaking 'Add' page to save the added onion under '/onionsadded'
Week 3 - Replace Polipo with Tor Socks5 proxy in ahmia-crawler
Week 4 - 1st Evaluation
Week 5+6 - Data visualization of statistics
Week 7 - Upgrade support from Elastic 2.4.0 to Elastic 5.X
Week 8 - 2nd Evaluation
Week 9 - Updating dependencies and documentation
Week 10 - Adding advanced search options like "", || and &&
Week 11 - Catch up and bug fixes
I will be mailing a biweekly status report to this list. Feel free to contact me if you have any suggestions or questions.
IRC: mdhash
I would like to thank Juha and the Tor team for their constant support and guidance. It has been a great experience for me to contribute to the Tor Project, and I look forward to becoming a core member of the community.
Thanks,
Pushkar Pathak
[0]: https://ahmia.fi
[1]: https://github.com/ahmia