[tor-dev] [GSoC 2013] Status report - Searchable metrics archive
kostas at jakeliunas.com
Fri Aug 23 12:54:55 UTC 2013
Updating on my Searchable Tor metrics archive project. (As is very evident)
I'm very open for naming suggestions. :)
To the best of my understanding and current satisfaction, I solved the
database bottlenecks, or at least I am, as of now, satisfied with the
current output from my benchmarking utility. Things may change, but I am
confident (and have support to argue) that the whole thing runs swell at
least on amazon m2.2xlarge instances.
For fun and profit, a part of the database (which, has, for now, status
entries (only) in the range [2010-01-01 00:00:00, 2013-05-31 23:00:00]),
namely, what is currently used by the Onionoo-like API is now available
online (not on EC2, though) - will now write a separate email so that
everyone can inspect it.
I should now move on with implementing / extending the Onionoo API, in
particular, working on date range queries, and refining/rewriting the "list
status entries" API point (see below). Need to carefully plan some things,
and always keep an updated API document. (Also need to update and publish a
separate, more detailed specification document.)
More concrete report points:
- re-examined my benchmarking approach, and wrote a rather simple but
effective set of benchmarking tools (more like a simple script)  that
can be hopefully used outside this project as well; at the very least,
together with the profiling and the query_info tools, it is powerful (but
also simple) enough to be used to test all kinds of bottlenecks in ORMs and
- used this tool to generate benchmark reports on EC2 and on the (less
powerful) dev server, and with different schema settings (usually rather
minor schema changes that do not require re-importing all the data)
- came up with a triple table schema that proves to render our queries
quickly: we first do a search (using whatever criteria (e.g. nickname,
fingerprint, address, running), if any) on a table which has a column with
unique fingerprints; extract the relevant fingerprints; JOIN with the main
status entry table, which is much larger; and get the final results.
Benchmarked using this schema.
<details> If we are only extracting a list of the latest status entries
(with distinct on fingerprint), we can do LIMITs and OFFSETs already on the
fingerprint table, before the JOIN. This helps us quite a bit. On the other
hand, nickname searches etc. are also efficient. As of now, I have
re-enabled nickname+address+fingerprint substring search (not from the
middle (LIKE %substring%), but from the beginning of a substring (LIKE
substring%), which is still nice), and all is well. Updated the
higher-level ORM to reflect this new table  (I've yet to change some
column names, though - but these are cosmetics.) </details>
- found a way to generate the SQL queries that I need to generate using
the higher-level SQLAlchemy SQL API using various SQLAlchemy-provided
primitives, and always observing the resulting query statements. This is
good, because everything becomes more modular: much easier to shape the
query depending on the query parameters received, etc. (while still
retaining it in sane order.)
- hence (re)wrote a part of the Onionoo-like API that uses the new
schema and the SQLAlchemy primitives. Extended the API a bit. 
- wrote a very hacky API point for getting a list of status entries for
a given fingerprint. I simply wanted a way (for myself and people) to query
this kind of a relation easily and externally. It now works as part of the
API. This part will probably need some discussion.
- wrote a (kind of a stub) document explaining the current Onionoo-like
API, what can be queried, what can be returned, what kinds of parameters
work.  Will extend this later on.
while writing the doc and rewriting part the API, stumbled upon a few
things that make clear that I've made some shortcuts that may hurt later
on. Will be happy to elaborate on them later on / separately. I need to
carefully plan a few things, and then try rewriting the Onionoo API yet
again, this time including more parameters and fields returned.
TL;DR yay, a working database backend!
I might give *one* more update detailing things I might have forgotten
about soon re: this report - I don't want to make a habit of delaying
reports (which I have consistently done), so reporting what I have now.
Kostas (wfn on OFTC)
0x0e5dce45 @ pgp.mit.edu
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the tor-dev