[tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

Thu Apr 11 08:20:41 UTC 2013

On 4/10/13 11:44 PM, Praveen Kumar wrote:
> I am Praveen Kumar from India. I want to work on the project "Searchable
> Tor descriptor and Metrics data archive". I have participated in past
> instances of GSoC with Melange and e-cidadania, and have extensive
> experience in development with Python.
> 
> For the search application, I propose using Django with MongoDB as a NoSQL
> database backend. We have 100GB+ of data, which grows every day, so using
> a NoSQL backend will ensure that our application scales well with the
> increase in data as well as user traffic.
> The application will have various interfaces such as:
> 1) Data Updater: This end will connect to the metrics website and
> retrieve data from it periodically via rsync. It will also be responsible
> for pre-processing the data into the format our search application
> needs.
> 2) Storage End: A relay descriptor can be searched by nickname,
> fingerprint, IP address and various other attributes that define a relay
> descriptor. So we can preprocess the whole data set, extract the attributes
> that define a descriptor and then save them in an appropriate model that
> MongoDB provides. Since queries are very fast in a NoSQL datastore, our
> searches will be very fast.
> 3) Search Front End: This will be exposed to the user, who enters a
> search query here.
> 4) Search Query Processor: This end will process the user's query and
> determine its type, e.g. whether the query is an IP address or a nickname.
> It will then connect to our Storage End and return the appropriate
> data to the Search Front End.
> 
> Above is a very high level view of my approach to this project. We can also
> use Django Haystack as a search application framework (I did some research
> on existing search frameworks). I can implement this app in an object
> oriented way in Python. Since Python is such a beautiful and easy to
> understand language, it will be easy for others to understand and make
> changes to the application in the least amount of time.
> 
> I would like to know if I am thinking in the right direction and would like
> to know what Karsten has to say about this.

Hi Praveen!

Glad to see that you're interested in this project!

Your high-level description makes sense to me.  I guess the point where
I'd expect more details in a GSoC application is where you say: "Since
queries are very fast in a NoSQL datastore, our searches will be very fast."

See also the last paragraph in the project idea: "Applications for this
project should come with a design of the proposed search application,
ideally with a proof-of-concept based on a subset of the available data
to show that it will be able to handle the 100G+ of data."  I'd like to
understand why you think MongoDB will handle searches sufficiently fast.

An alternative to relying on NoSQL databases doing magic is to
investigate Django Haystack and other existing search application
frameworks.

Note that I don't know what the best tool or design is here.  But I ran
into too many pitfalls in the past when I thought a database design was
fast enough to provide data for an Internet-facing service.  That's why
I'd like to see a convincing design first, bonus points if it comes with
a proof of concept.
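
As one small piece of such a proof of concept, the query-type detection
from item 4 of the proposal could be sketched in Python roughly as
follows.  This is only an illustration under assumptions: the function
name classify_query and the matching rules (40 hex characters for a
fingerprint, 1 to 19 alphanumeric characters for a nickname) are
assumptions made for this sketch, not part of the proposal or of any
existing metrics code.

```python
import ipaddress
import re

# Assumption: a relay fingerprint is 40 hex characters, optionally
# prefixed with "$" or written in space-separated groups.
FINGERPRINT_RE = re.compile(r"^\$?[0-9A-Fa-f]{40}$")
# Assumption: a relay nickname is 1 to 19 alphanumeric characters.
NICKNAME_RE = re.compile(r"^[A-Za-z0-9]{1,19}$")


def classify_query(query):
    """Return 'ip', 'fingerprint', 'nickname', or 'unknown'."""
    text = query.strip()
    # Try IP addresses first; ip_address() accepts IPv4 and IPv6 strings.
    try:
        ipaddress.ip_address(text)
        return "ip"
    except ValueError:
        pass
    # Fingerprints may be typed with spaces between groups of hex digits.
    if FINGERPRINT_RE.match(text.replace(" ", "")):
        return "fingerprint"
    if NICKNAME_RE.match(text):
        return "nickname"
    return "unknown"


if __name__ == "__main__":
    # Hypothetical example inputs, not taken from real descriptors.
    for example in ["192.0.2.1", "A" * 40, "ExampleRelay01", "not a relay!"]:
        print(example, "->", classify_query(example))
```

The result would only decide which attribute the storage backend is
asked to search; whether that lookup is fast enough on the full archive
is the part that still needs to be measured.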

Best,
Karsten