[tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

Praveen Kumar praveen97uma at gmail.com
Wed Apr 10 21:44:44 UTC 2013


I am Praveen Kumar from India. I want to work on the project "Searchable
Tor descriptor and Metrics data archive". I have participated in past
editions of GSoC with Melange and e-cidadania, and have extensive
experience developing in Python.

For the search application, I propose using Django with MongoDB as a NoSQL
database backend. We have 100GB+ of data, which grows every day, so a NoSQL
backend should help ensure that our application scales well with the
increase in data as well as user
The application will have several components:
1) Data Updater: This component will connect to the metrics website and
retrieve data periodically via rsync. It will also be responsible for
pre-processing the data to a suitable format as our search application
2) Storage End: A relay descriptor can be searched by nickname,
fingerprint, IP address and various other attributes that define a relay
descriptor. So we can preprocess all of the data, extract the attributes
that define a descriptor and then save them in an appropriate MongoDB
model. Since lookups on indexed attributes are fast in a NoSQL datastore,
our searches should be fast as well.
3) Search Front End: This will be exposed to the user, who enters a search
query here.
4) Search Query Processor: This component will process the user's query and
determine its type, e.g. whether the query is an IP address or a nickname.
It will then query the Storage End and return the matching data to the
Search Front End.
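
To make the Storage End and the Search Query Processor a bit more concrete,
here is a minimal sketch in Python. It is only an illustration under stated
assumptions, not a finished design: the function names
(descriptor_to_document, classify_query, build_filter) are hypothetical, the
sample values are made up, and the returned dict is merely shaped like a
MongoDB filter; no database is contacted. It assumes that a server
descriptor has a "router <nickname> <address> <ORPort> ..." line and a
"fingerprint" line with 40 hex characters in groups of four, and that
nicknames are 1 to 19 alphanumeric characters.

```python
import ipaddress
import re

# Assumed formats: 40 hex characters for a fingerprint,
# 1-19 alphanumeric characters for a nickname.
FINGERPRINT_RE = re.compile(r"[0-9A-Fa-f]{40}")
NICKNAME_RE = re.compile(r"[A-Za-z0-9]{1,19}")


def descriptor_to_document(descriptor_text):
    """Extract a few searchable attributes from a descriptor into a dict."""
    doc = {}
    for line in descriptor_text.splitlines():
        parts = line.split()
        # Older descriptors prefix some keywords with "opt"; skip it.
        if parts and parts[0] == "opt":
            parts = parts[1:]
        if not parts:
            continue
        if parts[0] == "router" and len(parts) >= 4:
            doc["nickname"] = parts[1]
            doc["address"] = parts[2]
            doc["or_port"] = int(parts[3])
        elif parts[0] == "fingerprint":
            # Join the space-separated groups into one 40-character string.
            doc["fingerprint"] = "".join(parts[1:]).upper()
    return doc


def classify_query(query):
    """Return (field, normalized_value) for a raw search string."""
    text = query.strip()
    try:
        return "address", str(ipaddress.ip_address(text))
    except ValueError:
        pass  # not an IP address, try the other types
    compact = text.replace(" ", "")
    if FINGERPRINT_RE.fullmatch(compact):
        return "fingerprint", compact.upper()
    if NICKNAME_RE.fullmatch(text):
        return "nickname", text
    return "unknown", text


def build_filter(query):
    """Build a MongoDB-style filter dict, or None if the type is unknown."""
    field, value = classify_query(query)
    if field == "unknown":
        return None
    return {field: value}
```

With documents stored in this shape, build_filter("192.0.2.1") would give
{"address": "192.0.2.1"}, which could be passed to the datastore's find
call. The IP check runs first because a valid address can never be mistaken
for a fingerprint or nickname under the assumed formats.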

The above is a very high-level view of my approach to this project. We can
also use Django Haystack as a search application framework (I did some
research on existing search frameworks). I can implement this app in an
object-oriented way in Python. Python being such a beautiful and easy to
understand language, others will be able to understand and change the
application quickly.

I would like to know whether I am thinking in the right direction, and what
Karsten has to say about this.

