[tor-dev] Searchable Tor descriptor and Metrics data archive - GSoC 2013

Karsten Loesing karsten at torproject.org
Tue Apr 16 06:04:46 UTC 2013


On 4/14/13 1:26 PM, Praveen Kumar wrote:
> Hi Karsten,

Hi Praveen,

> I am so sorry for replying late. I had a seminar presentation on Friday and
> have another on Monday, so I was a little busy studying for them.

No worries!

> I downloaded about 1 GB of server descriptors from the metrics website. I
> thought of generating some performance metrics for a search application
> with a MySQL backend and with a MongoDB backend in Django. So, I
> implemented two basic apps in Django, one with each backend. I processed
> each file and extracted router_nickname, router_ip, tor_version and
> platform_os as searchable fields for each server descriptor file. At the
> time of writing this email, I have processed around 330,000 files for MySQL
> and have the data of 670,000 files in MongoDB. I cannot process all the
> files, as that 1 GB of data is composed of millions of files and processing
> is slow on my system.
> My aim is to issue the same queries to both apps and see which one performs
> better. Both databases are indexed on the same fields. I will send you the
> metrics the day after tomorrow, i.e., on Tuesday.

Sounds like a fine start.  Be sure to include results of this
performance comparison in your GSoC application!
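
For what it's worth, a comparison like this could be driven by a small
backend-agnostic timing helper.  The following is only a rough sketch:
the helper name time_query, the table name and the sample rows are made
up for illustration, and an in-memory SQLite database stands in for the
MySQL and MongoDB backends so that the snippet runs on its own.  Only the
four field names come from the description above.

```python
import sqlite3
import statistics
import time


def time_query(run_query, repeats=5):
    """Return (median_seconds, row_count) for a zero-argument query callable."""
    timings = []
    row_count = 0
    for _ in range(repeats):
        start = time.perf_counter()
        rows = run_query()
        timings.append(time.perf_counter() - start)
        row_count = len(rows)
    return statistics.median(timings), row_count


# Stand-in backend (assumption): in-memory SQLite with synthetic rows,
# not real server descriptors and not the MySQL/MongoDB apps themselves.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE descriptor ("
    "router_nickname TEXT, router_ip TEXT, tor_version TEXT, platform_os TEXT)"
)
conn.executemany(
    "INSERT INTO descriptor VALUES (?, ?, ?, ?)",
    [
        (
            "relay%d" % i,
            "10.0.%d.%d" % (i // 256, i % 256),
            "0.2.3.25" if i % 2 else "0.2.4.11-alpha",
            "Linux",
        )
        for i in range(10000)
    ],
)
# Index the searched field, mirroring "indexed on the same fields".
conn.execute("CREATE INDEX idx_nickname ON descriptor (router_nickname)")


def sqlite_lookup():
    """Exact-match lookup on an indexed field; returns all matching rows."""
    cur = conn.execute(
        "SELECT router_ip FROM descriptor WHERE router_nickname = ?",
        ("relay4242",),
    )
    return cur.fetchall()


median_seconds, row_count = time_query(sqlite_lookup)
print("sqlite stand-in: %d row(s), median %.6f s" % (row_count, median_seconds))
```

The same helper could wrap one callable per backend, so that both apps
are measured with identical queries and repeat counts.  Reporting the
median rather than a single run makes the numbers less sensitive to
caching effects on the first query.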

> But, theoretically speaking, MongoDB is fast because every document is
> stored in JSON, it is schemaless and doesn't have to perform any joins etc.
> The indexes that are built are based on B-trees, which have a worst-case
> time complexity of O(log(n)) for insertion, lookup and deletion. MongoDB
> also keeps the indexes in RAM as required, for faster searches and to
> reduce disk reads. MongoDB also has the capability of scaling efficiently.

Well, performance of MongoDB vs. MySQL really depends on the problem
you're trying to solve.  For example, we'll have to perform joins when
storing a network status consensus that references 0..n server
descriptors each of which references 0..1 extra-info descriptors.  See
the descriptor formats page for details:

https://metrics.torproject.org/formats.html
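
To make the join requirement a bit more concrete, here is a minimal
relational sketch.  The table and column names are hypothetical and the
rows are invented; SQLite is used only so that the example is
self-contained.  It models a consensus that references several server
descriptors by digest, each of which may reference at most one extra-info
descriptor.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: names are illustrative, not an actual metrics schema.
conn.executescript("""
CREATE TABLE extra_info (
    digest TEXT PRIMARY KEY,
    nickname TEXT
);
CREATE TABLE server_descriptor (
    digest TEXT PRIMARY KEY,
    nickname TEXT,
    extra_info_digest TEXT  -- NULL when no extra-info descriptor is referenced
);
CREATE TABLE consensus_entry (
    valid_after TEXT,        -- identifies the consensus
    descriptor_digest TEXT   -- one row per referenced server descriptor
);
""")

conn.execute("INSERT INTO extra_info VALUES ('e1', 'relayA')")
conn.executemany(
    "INSERT INTO server_descriptor VALUES (?, ?, ?)",
    [("d1", "relayA", "e1"), ("d2", "relayB", None)],
)
# 'd3' is referenced by the consensus but missing from the descriptor table.
conn.executemany(
    "INSERT INTO consensus_entry VALUES (?, ?)",
    [
        ("2013-04-16 06:00:00", "d1"),
        ("2013-04-16 06:00:00", "d2"),
        ("2013-04-16 06:00:00", "d3"),
    ],
)

# LEFT JOINs keep consensus entries even when a referenced descriptor
# or extra-info descriptor is absent (the 0..n and 0..1 cases).
rows = conn.execute(
    """
    SELECT c.descriptor_digest, s.nickname, e.digest
    FROM consensus_entry AS c
    LEFT JOIN server_descriptor AS s ON s.digest = c.descriptor_digest
    LEFT JOIN extra_info AS e ON e.digest = s.extra_info_digest
    WHERE c.valid_after = ?
    ORDER BY c.descriptor_digest
    """,
    ("2013-04-16 06:00:00",),
).fetchall()

for row in rows:
    print(row)
```

In a document store, the same lookup would need either embedded copies
of the referenced descriptors or several round trips from application
code, which is the kind of trade-off worth measuring.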

Also, with respect to scaling, the plan would be to run this application
on a single server along with other services.

So, in general, I'd be careful with "MongoDB is fast because"
statements.  Some of them may be correct in this specific case.  But
there may also be cases where good old SQL has performance advantages
over shiny new NoSQL.

> I am now, somewhat, in favor of Django Haystack with Solr as the search
> engine. Using MongoDB will require us to spend considerable time developing
> the search interface, which will be responsible for handling complicated
> queries, and then creating appropriate indices to handle those queries.

Sounds good!  You should include your preliminary results in your GSoC
application, too.

Best,
Karsten


