[tor-dev] Hidden service search engine (GSoC)

Rémi remi.py at yandex.com
Fri Mar 7 15:01:11 UTC 2014


Hi,

I am currently a master's student with a focus on natural language
processing, machine learning, information retrieval and data mining.

The Tor website lists a bunch of ideas, one of which is "Search Engine
for Hidden Services"[1]. This project suits me well given my education
and skill set and I would really enjoy it.
Does tor-dev think this would be a good project? There are already many
hidden service search engines, although none are open source.

I have done two smaller information retrieval projects in university
this year, and I have a strong background in search engine algorithms.
The components of the system that I am currently thinking of are:
- index and features in a nosql database (possibly CodernityDB)
- hidden service crawler
- simple search using BM25, while recording click-through data and many
features other than BM25.
- basic front-end.
- A component for 'Learning to rank' based on more features, which
should be used once there is significant click-through data. This should
be an easy-to-use program that optimizes the search engine's ranking.
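
To make the BM25 baseline in the list above concrete, here is a minimal
sketch of the scoring step. It is only an illustration under some
assumptions: documents are already tokenized into lists of terms, the
collection is non-empty, and the parameters k1=1.2 and b=0.75 are common
defaults rather than tuned values. The function name bm25_scores is
hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`.

    Assumes `docs` is a non-empty list of token lists (hypothetical input
    format; a real index would not keep whole documents in memory).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["hidden", "service", "search"],
    ["tor", "relay", "list"],
    ["search", "engine", "search", "index"],
]
scores = bm25_scores(["search"], docs)
```

In this toy collection the second document does not contain the query
term and gets a score of zero, while the third document ranks first
because the term occurs twice in it.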

Click-through data is recorded so that the ranking can be learned and
improved over time. This is important because there is no known search
ranker that will give excellent results out of the box. Click-through
recording can be done by only recording feature weights.
I would work in Python because I am very comfortable working with it.
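
As a rough sketch of how click-through data could feed a learned ranker,
the code below logs only the feature vectors of the results shown and the
position that was clicked, then fits a linear weight vector with a simple
perceptron-style pairwise update. This is one possible reading of the
idea, not a fixed design: the record layout, the function names log_click
and train_pairwise, and the assumption that a clicked result should
outrank the results shown above it are all illustrative.

```python
def log_click(log, shown_features, clicked_index):
    """Append one record: feature vectors of the shown results plus the
    position that was clicked. No query text or URL is stored."""
    log.append({
        "features": [list(f) for f in shown_features],
        "clicked": clicked_index,
    })


def train_pairwise(log, n_features, epochs=10, lr=0.1):
    """Learn linear ranking weights from logged clicks.

    Assumption: a clicked result is preferred over every result that was
    ranked above it but skipped. Each violated preference nudges the
    weights towards the clicked result's features.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for rec in log:
            clicked = rec["features"][rec["clicked"]]
            for skipped in rec["features"][:rec["clicked"]]:
                diff = [c - s for c, s in zip(clicked, skipped)]
                margin = sum(wi * di for wi, di in zip(w, diff))
                if margin <= 0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def score(w, features):
    """Linear score of one result under weights `w`."""
    return sum(wi * fi for wi, fi in zip(w, features))


log = []
# Two results were shown; the user skipped the first and clicked the second.
log_click(log, [[1.0, 0.0], [0.5, 1.0]], clicked_index=1)
weights = train_pairwise(log, n_features=2)
```

After training on this single record, the learned weights score the
clicked result above the skipped one. A real system would need far more
data and a more robust learner.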


What are your thoughts?

R.

P.S.
I would also love to do the traffic confirmation attack, but as far as I
understand there is no good data set that is readily available, and
building and using one would be beyond the scope of GSoC.

[1] https://www.torproject.org/getinvolved/volunteer.html.en#Coding

