Hi,
I am currently a master's student with a focus on natural language processing, machine learning, information retrieval and data mining.
The Tor website lists a number of project ideas, one of which is "Search Engine for Hidden Services"[1]. This project suits my education and skill set well, and I would really enjoy working on it. Does tor-dev think this would be a good project? There are already many search engines for hidden services, although none are open source.
I have done two smaller information retrieval projects at university this year, and I have a strong background in search engine algorithms. The components of the system that I am currently thinking of are:
- An index and feature store in a NoSQL database (possibly CodernityDB).
- A hidden service crawler.
- Simple search using BM25, while recording click-through and many features other than BM25.
- A basic front-end.
- A component for 'learning to rank' based on more features, to be used once there is significant click-through data. This should be an easy-to-use program that optimizes the search engine's ranking.
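To make the "simple search using BM25" component more concrete, here is a minimal sketch of the scoring step. It assumes documents are already tokenized into lists of terms and held in memory; the function name `bm25_scores` and the defaults `k1=1.2` and `b=0.75` are illustrative choices of mine, not part of any existing code, and a real system would read term statistics from the index instead.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25).

    `docs` is a list of token lists. Names and defaults are illustrative.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Smoothed IDF variant that stays non-negative for common terms.
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["tor", "hidden", "service"],
    ["onion", "market"],
    ["tor", "tor", "relay"],
]
scores = bm25_scores(["tor"], docs)
```

The per-term contributions computed here could also be stored as features for the later learning-to-rank component.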
Click-through is recorded so that the system can learn to rank better. This is important because there is no known search ranker that will give excellent results out of the box. Click-through recording can be done by recording only feature weights. I would work in Python because I am very comfortable working with it.
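One possible reading of "recording only feature weights" can be sketched as follows. This is an assumption about the design, not a settled plan: each click event stores the feature vector of the clicked result and of the results skipped above it, and deliberately stores no query text or URL. The function `record_click` and the feature name `bm25` are hypothetical.

```python
import json


def record_click(log, ranked_features, clicked_rank):
    """Append one click event to `log` as feature vectors only.

    `ranked_features` is the list of per-result feature dicts in the order
    shown to the user; `clicked_rank` is the 0-based position clicked.
    No query text or URL is stored (hypothetical design).
    """
    if not 0 <= clicked_rank < len(ranked_features):
        raise ValueError("clicked_rank out of range")
    event = {
        "clicked": ranked_features[clicked_rank],
        # Results ranked above the click that the user passed over.
        "skipped": ranked_features[:clicked_rank],
    }
    log.append(json.dumps(event, sort_keys=True))
    return event


log = []
event = record_click(log, [{"bm25": 2.0}, {"bm25": 1.5}], clicked_rank=1)
```

Pairs of clicked and skipped vectors like these are the usual input for pairwise learning-to-rank methods, which is why the skipped results are kept alongside the click.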
What are your thoughts?
R.
P.S. I would also love to work on the traffic confirmation attack, but as far as I understand there is no good data set readily available, and building and using one would be beyond the scope of GSoC.
[1] https://www.torproject.org/getinvolved/volunteer.html.en#Coding