[tor-project] GSoC17 Student Introduction

Pushkar Pathak pushkarpathak21 at gmail.com
Fri May 12 18:47:00 UTC 2017


Hello everyone!

I am Pushkar, one of the accepted GSoC17 students. I am a third-year
undergraduate at the International Institute of Information Technology,
Hyderabad (India). I have been working on 'Ahmia - Hidden Service Search'
[0][1] for some time now and will be extending my contribution through GSoC
this summer. I am being mentored by Juha Nurmi (numes) and George (asn).

Ahmia is a search engine that indexes, searches, and catalogs content
published on Tor Hidden Services. Furthermore, it is a medium to share
meaningful insights, statistics, and news about the Tor network itself.
Ahmia needs several improvements and upgrades.

>>Tasks
●  Automate Blacklisting

Fetch a list of child abuse media sites and remove these sites from the
Elasticsearch index. Also add MD5 checksums of the child abuse websites to
the banned database so that others can check against it.
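
A minimal sketch of the checksum and removal steps, under assumptions: it
hashes the normalised onion domain (the exact input to MD5 is not fixed
above), and the field name `domain` and the sample address are hypothetical.
It only builds the query body that a delete-by-query request could carry;
it does not contact Elasticsearch.

```python
import hashlib


def onion_md5(onion_domain):
    """Return the MD5 hex digest of a normalised onion domain."""
    normalised = onion_domain.strip().lower()
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def build_delete_query(banned_domains, field="domain"):
    """Build a query body matching every document from a banned domain.

    The field name is an assumption; the real index mapping may differ.
    """
    unique = sorted({d.strip().lower() for d in banned_domains})
    return {"query": {"terms": {field: unique}}}


# Hypothetical input: the same address twice, with different case/spacing.
banned = ["ExampleBannedOnion.onion", "examplebannedonion.onion "]
digests = sorted({onion_md5(d) for d in banned})
query = build_delete_query(banned)
```

Normalising before hashing means the same address always yields the same
checksum, so others can compare against the published list without seeing
the addresses themselves.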



●  Add Hidden Services page

Improve the existing Add page so that adding a website stores the data in an
SQL database under '/onionsadded'. From there, the crawler can crawl these
websites once a day. Remove the entries after one week so that the list
stays fresh.
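
The store-and-expire behaviour could look roughly like the sketch below. It
uses an in-memory SQLite database purely for illustration; the table name
`onions_added`, the column names, and the sample addresses are hypothetical,
and the real site may use a different database and schema.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=7)  # entries older than one week are removed


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS onions_added ("
        "onion TEXT PRIMARY KEY, added_at INTEGER NOT NULL)"
    )


def add_onion(conn, onion, now):
    """Store an onion address with the Unix time at which it was added."""
    conn.execute(
        "INSERT OR REPLACE INTO onions_added (onion, added_at) VALUES (?, ?)",
        (onion, int(now.timestamp())),
    )


def prune_old(conn, now):
    """Delete entries older than RETENTION and return how many were removed."""
    cutoff = int((now - RETENTION).timestamp())
    cur = conn.execute("DELETE FROM onions_added WHERE added_at < ?", (cutoff,))
    return cur.rowcount


def list_onions(conn):
    """Return the addresses the crawler would pick up on its daily run."""
    rows = conn.execute("SELECT onion FROM onions_added ORDER BY onion")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_table(conn)
now = datetime(2017, 5, 12, tzinfo=timezone.utc)
add_onion(conn, "old.onion", now - timedelta(days=8))
add_onion(conn, "fresh.onion", now - timedelta(days=1))
removed = prune_old(conn, now)
remaining = list_onions(conn)
```

Running `prune_old` from the same daily job that feeds the crawler would
keep the list fresh without a separate cleanup task.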



●  Data visualization

Graphs need to be plotted for various statistics on the Statistics page. Some
examples include:

○  Linking structure between sites and keyword-based labeling for onions in
the graph

○  Popularity of domains according to backlinks and search clicks.

I plan to use either Google Charts or D3.js to plot these graphs.
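
To illustrate the backlink-popularity statistic, the sketch below counts
distinct incoming links per domain from a list of (source, target) pairs and
serialises the result as rows with a header, a shape that a charting library
could consume. The link data and the function name are made up for the
example; how Ahmia actually stores link edges is not assumed here.

```python
import json
from collections import Counter


def backlink_counts(links):
    """Count distinct incoming links per target domain.

    Duplicate edges and self-links are ignored so that a site cannot
    inflate its own popularity.
    """
    seen = set()
    counts = Counter()
    for source, target in links:
        if source == target or (source, target) in seen:
            continue
        seen.add((source, target))
        counts[target] += 1
    return counts


# Hypothetical link edges between onion domains.
links = [
    ("a.onion", "b.onion"),
    ("c.onion", "b.onion"),
    ("a.onion", "b.onion"),  # duplicate edge
    ("b.onion", "c.onion"),
    ("c.onion", "c.onion"),  # self-link
]
counts = backlink_counts(links)

# Most-linked domains first; ties are broken alphabetically.
ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
rows = [["Domain", "Backlinks"]] + [[domain, n] for domain, n in ordered]
chart_json = json.dumps(rows)
```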

●  Replace Polipo with Tor SOCKS5 proxy in ahmia-crawler

As of now, the Ahmia crawler uses Polipo as an HTTP proxy to route traffic
through Tor. Since Polipo is no longer maintained and torsocks can provide
better functionality, the crawler code needs to be updated to use torsocks.
Modules like SocksiPy can be used to connect the crawler to torsocks.
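
For context, this is roughly what a SOCKS5 client library does when it talks
to Tor's SOCKS port. The sketch builds the two client messages defined in
RFC 1928 using only the standard library and does not open a connection. The
function names and the sample hostname are illustrative; the crawler itself
would more likely rely on an existing module than on hand-built packets.

```python
import struct

SOCKS_VERSION = 5
CMD_CONNECT = 0x01
ATYP_DOMAIN = 0x03  # address is sent as a hostname, not an IP
NO_AUTH = 0x00


def socks5_greeting():
    """Client greeting offering a single method: no authentication."""
    return bytes([SOCKS_VERSION, 1, NO_AUTH])


def socks5_connect_request(hostname, port):
    """CONNECT request that leaves hostname resolution to the proxy.

    Sending the hostname unresolved matters for .onion addresses, which
    only Tor itself can resolve.
    """
    host = hostname.encode("ascii")
    if not 0 < len(host) <= 255:
        raise ValueError("hostname must be 1 to 255 bytes long")
    if not 0 < port <= 65535:
        raise ValueError("port out of range")
    header = bytes([SOCKS_VERSION, CMD_CONNECT, 0x00, ATYP_DOMAIN, len(host)])
    return header + host + struct.pack(">H", port)


greeting = socks5_greeting()
request = socks5_connect_request("example.onion", 80)
```

A client would send `greeting`, read the proxy's method choice, then send
`request` to the local Tor SOCKS port (127.0.0.1:9050 in a default Tor
configuration; this is an assumption about the deployment).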



●  Upgrade support from Elastic 2.4.0 to 5.X

Ahmia settings should be adjusted to support Elastic 5.X. The upgrade will
require a full cluster restart, since rolling upgrades are not supported
across major versions. Upgrading also includes replacing Groovy scripts with
Painless, a sandboxed scripting language built for Elasticsearch that
replaced Groovy as the default in Elastic 5.0.0.
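
As a hedged illustration of the Groovy-to-Painless change, the sketch below
builds the body of a scripted update in both styles. The `clicks` field and
the function names are hypothetical and not taken from the Ahmia code. The
visible difference is that Painless reads parameters through `params`; the
`inline` key reflects the 5.x request format as I understand it and should
be checked against the documentation for the exact target version.

```python
def groovy_update_body(count):
    """Scripted update body in the Elasticsearch 2.x (Groovy) style."""
    return {
        "script": {
            "lang": "groovy",
            "inline": "ctx._source.clicks += count",
            "params": {"count": count},
        }
    }


def painless_update_body(count):
    """Equivalent body in the Elasticsearch 5.x (Painless) style.

    Parameters are accessed as params.<name> instead of bare names.
    """
    return {
        "script": {
            "lang": "painless",
            "inline": "ctx._source.clicks += params.count",
            "params": {"count": count},
        }
    }


old_body = groovy_update_body(1)
new_body = painless_update_body(1)
```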



●  Detailed Documentation and update software dependencies

Write detailed documentation for ahmia.fi[0] as well as for the GitHub
page[1], and update the software dependencies.

●  Advanced search options

   Advanced search options as mentioned below can be incorporated in the
   search bar to allow more customisable searches.

   ○ Double quotes (""): Returns pages that contain exactly "term" (case
   sensitive).

   ○ AND operator (&&): Logical AND, i.e. it returns all the pages that
   contain all queries separated by ‘&&’.

   ○ OR operator (||): Logical OR, i.e. it returns all the pages that
   contain at least one of the queries separated by ‘||’.

   This is one of the optional tasks I have included. If any of the
   features mentioned above is not completed in the given timeline, this
   feature will be dropped and priority will be given to the uncompleted task.
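
One possible way to map these operators onto an Elasticsearch bool query is
sketched below. The field name `content` and the function names are
assumptions, and the splitting is deliberately naive: an operator inside a
quoted phrase is not handled. Note also that `match_phrase` on an analysed
field is usually not case sensitive, so strict case matching would need a
differently mapped field.

```python
def parse_term(term, field="content"):
    """Turn one search term into a match or match_phrase clause."""
    term = term.strip()
    if len(term) >= 2 and term[0] == '"' and term[-1] == '"':
        return {"match_phrase": {field: term[1:-1]}}
    return {"match": {field: term}}


def build_query(user_query, field="content"):
    """Translate a query using && and || into a bool query.

    '||' separates alternatives (should), and '&&' joins terms that must
    all be present within one alternative (must).
    """
    or_groups = []
    for group in user_query.split("||"):
        and_terms = [parse_term(t, field) for t in group.split("&&") if t.strip()]
        if and_terms:
            or_groups.append({"bool": {"must": and_terms}})
    return {"query": {"bool": {"should": or_groups, "minimum_should_match": 1}}}


query = build_query('"hidden wiki" && forum || market')
```

Here `&&` binds more tightly than `||`, so the sample query reads as
("hidden wiki" AND forum) OR market.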

>>Timeline

Week 1 - Automating blacklisting of onions with child abuse content
Week 2 - Tweaking 'Add' page to save the added onion under '/onionsadded'
Week 3 - Replace Polipo with Tor SOCKS5 proxy in ahmia-crawler
Week 4 - 1st Evaluation
Week 5+6 - Data visualization of statistics
Week 7 - Upgrade support from Elastic 2.4.0 to Elastic 5.X
Week 8 - 2nd Evaluation
Week 9 - Updating dependencies and documentation
Week 10 - Adding advanced search options like "", || and &&
Week 11 - Catch up and bug fixes

I will be mailing a biweekly status report to this list. Feel free to
contact me if you have any suggestions or questions.

IRC: mdhash

I would like to thank Juha and the Tor team for their constant support and
guidance. It has been a great experience for me to contribute to the Tor
Project and I look forward to being a core member of the community.

Thanks,
Pushkar Pathak

[0]: https://ahmia.fi
[1]: https://github.com/ahmia