[tor-dev] [GSoC '13] Status report - Searchable metrics archive
kostas at jakeliunas.com
Fri Sep 20 16:18:38 UTC 2013
TL;DR of TL;DR: all good as far as GSoC scope goes; work will continue; the
live torsearch backend has been updated; will update API doc today.
- updated database / imported data up until current date
- hourly cronjob continuously rsync's latest archival data, imports to DB
(+ tested and deployed):
* had some trouble ensuring no duplication but also no excessive DB
querying (the intermediary 'fingerprint' table (that basically solves the
DB bottleneck problem) was previously updated in semi-manual-mode); solved
with a kind of an 'upsert'; works well;
* made switching from archival data import process to the rsync'ed
::metrics-recent import seamless (data can overlap, importer will take care
of it; import speed is good)
- Onionoo API improvements:
* date range queries for all three document types currently provided
* the network status entry document was expanded/improved, as per
previous discussion and Karsten's feedback
* yet to finish off IP address and nickname summarization - for now,
providing two document versions - original and 'condensed' (valid-after
* updating Onionoo API doc soon (sorry - had meant to update it by now.)
- caching:
* tried some things, settled for a semi-vanilla python-level-caching
solution for the time being, some testing was done
* have yet to work out some snags, not deploying for now
* more (pure json response dumping to disk, etc.) can be done later on
- other things:
* wrote a working stub / minimal version for a
DB-maintenance-from-within-torsearch component (e.g. for a full VACUUM with
no DB access)
* other small things I may have forgotten to mention
General picture: the metrics archive / torsearch is in decent shape and an
appropriate (to GSoC) stage of development. Development will continue as
we're not done here yet, but the archive is in a functioning state,
delivering the minimum functionality required. I will finish cleaning up
before the hard pencils down date; and then continue development (at
whatever pace). :) The things that were blocking before are no longer
blocking, so ironically, I think I'll now be able to write more code i.r.l.
to torsearch than over the summer. Anyway -
I've been working on things falling within the GSoC scope, finishing the
most urgent matters that were left re: GSoC milestone - quoting from my
previous status report re: what to do next (see the bullet point list above
for a concise no-ramble-mode version of all this):
> update the database to the latest consensuses and descriptors
> import 2008 year data, etc.
I had parts of 2009 data in previously while working on DB bottlenecks, but
presently, the statuses still start at 2010; this got pushed down the
priority lane and takes a lot of (passive) time. Will be able to
batch-import now and confirm.
> turn on the cronjob for rsync and import of newest archives
Done. [0, 1] What kept me a bit busy was integrating the third
helper/mediating 'fingerprint' table (we query this one before extracting
things from the massive 'statusentry' table (which contains all network
status entries)) into the batch_import_consensuses process used for import.
I had some problems making sure data in the fingerprint table is not
duplicated and contains latest info (last valid-after etc), while
minimizing the amount of work and queries needed to test for row/entry
existence, etc. Solved with an actually simple hybrid 'upsert' approach;
all is well (did some testing: OK, but not extensive.)
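
To illustrate the kind of 'upsert' meant here, a minimal sketch follows. It
is not the torsearch code: the column names and the function name are
hypothetical, and Python's stdlib sqlite3 stands in for the real DB server.
The idea is to try an UPDATE first and fall back to an INSERT only when no
row was touched, so no separate existence query is needed.

```python
import sqlite3

# Hypothetical, simplified schema for the helper 'fingerprint' table.
FINGERPRINT_SCHEMA = """
CREATE TABLE fingerprint (
    fingerprint TEXT PRIMARY KEY,
    nickname    TEXT,
    last_va     TEXT  -- last valid-after seen, as 'YYYY-MM-DD HH:MM:SS'
)
"""

def upsert_fingerprint(conn, fingerprint, nickname, valid_after):
    """Keep exactly one row per fingerprint, holding the newest info."""
    # Step 1: update the row, but only if the incoming entry is newer.
    cur = conn.execute(
        "UPDATE fingerprint SET nickname = ?, last_va = ? "
        "WHERE fingerprint = ? AND last_va < ?",
        (nickname, valid_after, fingerprint, valid_after),
    )
    if cur.rowcount == 0:
        # Step 2: nothing was updated, so either the row does not exist yet
        # (insert it) or it already holds newer-or-equal data (the primary
        # key makes INSERT OR IGNORE a no-op in that case).
        conn.execute(
            "INSERT OR IGNORE INTO fingerprint "
            "(fingerprint, nickname, last_va) VALUES (?, ?, ?)",
            (fingerprint, nickname, valid_after),
        )
```

Since the timestamps are stored in a fixed-width format, plain string
comparison orders them correctly, which keeps the sketch short.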
Improved the archival importer so that it can handle duplicate consensus
documents arriving from different directories (these bypass Stem's
persistence file check, so we need to take care of duplicate processing
ourselves) - this might happen in production when, after
massively importing data from the downloaded archives, we switch to rsync's
::metrics-recent/relay-descriptors folder for import. Consensuses may
overlap (they overlapped when I did the switch.) When passed down a
consensus document by Stem (so after checking for duplication in Stem's
persistence file), we simply check whether we already store this consensus,
using a separate 'consensus' table, which only stores consensus document
validity ranges (this is fast and works nicely). 
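
A rough sketch of that check, under assumptions: the table layout, the
function names and the 'process' callback are all hypothetical, and stdlib
sqlite3 is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical layout: one small row per consensus, keyed by valid-after.
CONSENSUS_SCHEMA = """
CREATE TABLE consensus (
    valid_after TEXT PRIMARY KEY,
    valid_until TEXT
)
"""

def already_stored(conn, valid_after):
    """Return True if a consensus with this valid-after was imported."""
    row = conn.execute(
        "SELECT 1 FROM consensus WHERE valid_after = ?", (valid_after,)
    ).fetchone()
    return row is not None

def import_once(conn, valid_after, valid_until, process):
    """Call process() for a consensus only the first time it is seen."""
    if already_stored(conn, valid_after):
        return False  # duplicate, e.g. from an overlapping directory
    process()  # the expensive part: importing all the status entries
    conn.execute(
        "INSERT INTO consensus (valid_after, valid_until) VALUES (?, ?)",
        (valid_after, valid_until),
    )
    return True
```

The lookup touches only the small 'consensus' table, never the massive
'statusentry' one, which is why a check like this stays cheap.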
> [from an older report] Onionoo date range queries
Done. Works for all three document types (details, summary, statuses);
can be used together with offset+limit, etc.
Smart/decent datetime parsing - can pass '2011-07-28 12:00:00', or
'2011-07', or '2011', etc.
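
One simple way to get this kind of lenient parsing, sketched here as an
assumption rather than as what torsearch actually does (the function name
and the list of accepted formats are made up for illustration): try a list
of formats from most to least specific.

```python
from datetime import datetime

# Accepted formats, most specific first (an illustrative selection).
_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d",
    "%Y-%m",
    "%Y",
)

def parse_datetime(text):
    """Parse a full or partial timestamp; missing parts default to the
    earliest value (month 1, day 1, 00:00:00)."""
    text = text.strip()
    for fmt in _FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # does not match this format, try the next one
    raise ValueError("unrecognized date/time: %r" % (text,))
```

With this, parse_datetime('2011-07') gives the start of July 2011, which
suits the lower bound of a range; an upper bound would need rounding up
instead, which is left out here.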
For caching, tried all kinds of options, and since I wanted to have
something decent and stable working for now, opted for a very simplistic
stock python-level caching approach (no onionoo-like 'JSON documents to
disk' thing yet.) But I've yet to work out some things, so I have delayed
deploying the code to the live version.
> hopefully later integration with the bandwidth and weights documents in
Out of current scope; hopefully will be able to work on this next / soon.
> [from an older report] Documentation, specification/implementation
Not finished yet - not very good. I should try more of this "publish early
and often" thing.
> expand the list of fields contained in the three [...] documents
Tried things, but e.g. providing assigned flag data was pushed to later.
Current live backend not changed in regards to this.
> [from an older report] rewrite/improve the network status entry document
/ improve the 'statuses' API point
Done (deployed changes, will update API doc - this is not a nice thing
to do (i.e. doc being not updated), but since it's not production yet,
For now, we are providing two network status document versions - the
original one (+ 'relays_published'), and a condensed one (?condensed=true).
The latter basically zips up all the valid-after values into ranges, as per
Karsten's suggestion, basically telling the client where any gaps in
consensuses were present (which may turn out to be a rather useful thing,
by the way.) This works well together with from..to date ranges, offset,
I'm yet to finish off IP address and nickname summarization - for now, when
in 'condensed' presentation mode, each range contains its last addresses
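
The 'zipping up' of valid-after values can be sketched roughly as below.
This is an illustration, not the deployed code: the function name is
hypothetical, and it assumes that consecutive consensuses are one hour
apart, so any larger step counts as a gap.

```python
from datetime import datetime, timedelta

def condense(valid_afters, step=timedelta(hours=1)):
    """Collapse valid-after timestamps into (first, last) ranges.

    A new range starts whenever the distance to the previous timestamp
    exceeds `step`, so the spaces between ranges are exactly the gaps.
    """
    ranges = []
    for va in sorted(valid_afters):
        if ranges and va - ranges[-1][1] <= step:
            ranges[-1][1] = va  # extends the current range
        else:
            ranges.append([va, va])  # first entry, or a gap: new range
    return [tuple(r) for r in ranges]
```

A relay seen at 00:00, 01:00, 02:00, 05:00 and 06:00 would come out as two
ranges, 00:00-02:00 and 05:00-06:00, with the gap between them left for the
client to interpret.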
Also, as torsearch might one day be deployed on a machine such that the
maintainer won't have direct access to the DB server, we'll need to do
VACUUM FULL et al. from within torsearch. It is advisable to do a full
VACUUM or even a REINDEX now and then, especially after a ton of entries is
added (say, after a massive/bulk data batch-import.) Wrote a working stub
for a 'torsearch/DB maintenance' component (for vacuum full, for now.) 
Other maintenance things may be added to this separate file/component later
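
For a feel of what such a component boils down to, here is a small sketch.
It uses Python's stdlib sqlite3 as a stand-in (VACUUM FULL is PostgreSQL
syntax; SQLite only has plain VACUUM), and the function name is
hypothetical. The point it shows carries over: PostgreSQL refuses to run
VACUUM inside a transaction block, so the maintenance connection has to be
in autocommit mode rather than the driver's usual implicit-transaction
mode.

```python
import sqlite3

def run_maintenance(db_path, commands=("VACUUM", "REINDEX")):
    """Run maintenance commands on a dedicated autocommit connection."""
    # isolation_level=None switches the sqlite3 module to autocommit, so
    # no transaction is opened implicitly around the statements.
    conn = sqlite3.connect(db_path, isolation_level=None)
    done = []
    try:
        for command in commands:
            conn.execute(command)
            done.append(command)
    finally:
        conn.close()
    return done  # the commands that actually completed
```

Using a separate connection keeps the maintenance run from interfering with
whatever transaction state the application's regular connections are in.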
I think that's all for now. Hopefully I'm past the 'experiment with all the
things!' stage (I've still been trying and tinkering with stuff / different
approaches to problems; but for what it's worth, the current codebase and
the live backend are in decent shape.)
Comments / suggestions / barbaric yawps?
: see e.g. http://ts.mkj.lt:5555/details
# there is no 3. It's.. gone!
# more numbers have mysteriously disappeared! [API doc commit would go here]
Kostas (wfn on #tor-dev)
0x0e5dce45 @ pgp.mit.edu