[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Kostas Jakeliunas kostas at jakeliunas.com
Sat Aug 10 19:28:44 UTC 2013


Another busy benchmarking + profiling period for database querying, but
this time more rigorous and awesome.

  * wrote a generic query analyzer which logs query statements together
with their EXPLAIN / EXPLAIN ANALYZE output, and spots and flags queries
that yield inefficient query plans (a minimal sketch of the idea follows
this list);
  * wrote a very simple but rather exhaustive profiler (using Python's
cProfile) which logs query times, function calls, etc.; its output shows
which parts of e.g. the backend are slow during API calls, and can easily
be used to construct a general query 'profile' for a particular database
(see the second sketch after this list); [1]
  * benchmarked lots of different queries using these tools, recorded query
times, and was able to observe deviations/discrepancies;
  * uploaded the whole database to an Amazon EC2 m2.2xlarge instance and
ran brief benchmarks there;
  * concluded that, provided there is enough memory to pull *and keep* the
indexes in cache, query times are good;
  * in particular, tested the following query scheme extensively: [2] (see
the comments there as well if curious); concluded that it runs well;
  * opted to test raw SQL queries (from within Flask/Python) for now - so
far, translating them into ORM queries (even carefully) has resulted in
degraded performance; if we end up having to use raw SQL, I will create a
way to encapsulate the statements nicely (one possible approach is
sketched after this list);
  * made sure data importing is not slowed down and remains a quick-enough
operation;
  * researched PostgreSQL internals, especially its two-layer caching; I
now have an understanding of the way PostgreSQL caches things in memory
and how statistics on index usage are gathered and used for maintaining
the shared buffer cache; the searchable metrics archive will work best
when all of its indexes are kept in memory;
  * to this end, looked into buffer cache hibernation [3], etc.; I think
pg_prewarm [4, 5] would serve our purpose well. (Apparently many
business/enterprise setups do find cache prewarming relevant - a pity it's
not supported in stock PostgreSQL.)
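
To give a flavor of the query analyzer (first bullet above): a minimal
sketch, assuming SQLAlchemy engine events with psycopg2 underneath; the
threshold value and logger name are mine, and the actual tool does more
(ANALYZE output, plan inspection):

    import logging
    import time

    from sqlalchemy import event
    from sqlalchemy.engine import Engine

    log = logging.getLogger("query_analyzer")
    SLOW = 0.5  # seconds; arbitrary cut-off for flagging a query

    @event.listens_for(Engine, "before_cursor_execute")
    def _start_timer(conn, cursor, statement, parameters, context,
                     executemany):
        context._query_start = time.time()

    @event.listens_for(Engine, "after_cursor_execute")
    def _log_query(conn, cursor, statement, parameters, context,
                   executemany):
        elapsed = time.time() - context._query_start
        log.info("%.4fs %s", elapsed, statement)
        if elapsed > SLOW and not statement.startswith("EXPLAIN"):
            # bind the parameters (psycopg2-specific), then EXPLAIN on a
            # separate connection so the caller's cursor is untouched;
            # plain EXPLAIN only plans the query, it does not re-run it
            sql = cursor.mogrify(statement, parameters)
            if isinstance(sql, bytes):
                sql = sql.decode()
            with conn.engine.connect() as extra:
                plan = extra.execute("EXPLAIN " + sql)
                log.warning("possibly inefficient plan:\n%s",
                            "\n".join(row[0] for row in plan))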
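
The cProfile-based profiler is roughly the following shape - a simplified
sketch of what is in the repo [1]; the decorator name and the top-20
cut-off are mine:

    import cProfile
    import io  # Python 3 spelling; StringIO.StringIO on 2.x
    import logging
    import pstats
    from functools import wraps

    log = logging.getLogger("profiler")

    def profiled(view):
        """Profile a single call of a (e.g. Flask view) function."""
        @wraps(view)
        def wrapper(*args, **kwargs):
            profiler = cProfile.Profile()
            result = profiler.runcall(view, *args, **kwargs)
            out = io.StringIO()
            stats = pstats.Stats(profiler, stream=out)
            stats.sort_stats("cumulative").print_stats(20)  # hottest calls
            log.info("profile for %s:\n%s", view.__name__, out.getvalue())
            return result
        return wrapper

Decorating an API endpoint with @profiled then yields per-request call
statistics, which can be aggregated into the per-database query 'profile'
mentioned above.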
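
And if raw SQL stays, the encapsulation I have in mind is roughly this:
named, parameterized statements kept in one module and executed through
SQLAlchemy's text() construct (a sketch; the table and column names are
illustrative, not the actual schema):

    from sqlalchemy import text

    QUERIES = {
        # bound parameters (:fingerprint, :limit), never string formatting
        'status_entries_for_relay': text("""
            SELECT validafter, nickname
            FROM statusentry
            WHERE fingerprint = :fingerprint
            ORDER BY validafter DESC
            LIMIT :limit
        """),
    }

    def run_query(engine, name, **params):
        """Execute a named raw-SQL query and return all rows."""
        with engine.connect() as conn:
            return conn.execute(QUERIES[name], params).fetchall()

    # e.g. run_query(engine, 'status_entries_for_relay',
    #                fingerprint=some_fingerprint, limit=50)

This keeps the performance of hand-written SQL while giving the backend
one obvious place to look for (and test) every statement.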

The caching point above means that:
  * I don't think we can avoid using certain PostgreSQL extensions (if
only this one) - which means that deployment will always take more than
apt-get && pip install, but I believe it is needed;
  * next on my agenda is testing pg_prewarm on EC2 and, hopefully, putting
our beloved database bottleneck problem to rest; a sketch of the
prewarming step follows below.
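
The prewarming step itself should then be simple; a sketch of what I
intend to test, assuming pg_prewarm [4] is installed (pg_prewarm(regclass)
reads a relation into shared_buffers and returns the number of blocks it
loaded):

    import psycopg2

    def prewarm_indexes(dsn):
        """Pull every public-schema index into the buffer cache."""
        conn = psycopg2.connect(dsn)
        cur = conn.cursor()
        cur.execute("SELECT indexname FROM pg_indexes "
                    "WHERE schemaname = 'public'")
        for (index_name,) in cur.fetchall():
            cur.execute("SELECT pg_prewarm(%s)", (index_name,))
            print("%s: %s blocks" % (index_name, cur.fetchone()[0]))
        conn.close()

Running something like this right after a restart (or a bulk import)
should mean that the very first real queries already hit warm indexes.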

I had planned to open up the EC2 instance for public tor-dev inquiry (and
ended up delaying this status report yet again in the process), but I'll
have to do that separately - it is quite possible, though. Sorry for the
delayed report.


More generally,

I'm happy with my queer queries [2] now;
the two constraints/goals of

  * being able to run Onionoo-like queries on the whole descriptor / status
entry database
  * being able to get a list of status entries for a particular relay

will hopefully both be met very soon. The former is done, provided I
have no trouble setting up the database index precaching system (which
will ensure that all queries following the same syntax/scheme run quickly
enough).

Overall, I'm spending a bit too much time on one specific problem, but at
least I now have a more intimate, lower-level knowledge of PostgreSQL,
which turns out to be very relevant to this project. I hope to soon move
on to extending Onionoo support and providing a clean API for getting
lists of consensuses in which a particular relay was present. And maybe
start on the frontend. :)


[2]: https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
[5]: http://raghavt.blogspot.com/2012/04/caching-in-postgresql.html