[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Kostas Jakeliunas kostas at jakeliunas.com
Fri Aug 23 12:54:55 UTC 2013


Hello!

An update on my searchable Tor metrics archive project. (As is very
evident, I'm very open to naming suggestions. :)

To the best of my understanding, I have solved the database bottlenecks,
or at least I am, as of now, satisfied with the output of my benchmarking
utility. Things may change, but I am confident (and have the numbers to
support the claim) that the whole thing runs swell, at least on Amazon
m2.2xlarge instances.

For fun and profit, part of the database (which, for now, contains only
status entries, in the range [2010-01-01 00:00:00, 2013-05-31 23:00:00]),
namely the part currently used by the Onionoo-like API, is now available
online (not on EC2, though). I will write a separate email so that
everyone can inspect it.

I should now move on to implementing / extending the Onionoo API, in
particular working on date range queries and refining/rewriting the "list
status entries" API point (see below). I need to plan some things
carefully and always keep the API document up to date. (I also need to
update and publish a separate, more detailed specification document.)

More concrete report points:

   - re-examined my benchmarking approach and wrote a rather simple but
   effective set of benchmarking tools (more like a simple script) [1] that
   can hopefully be used outside this project as well; at the very least,
   together with the profiling and query_info tools, it is powerful (yet
   simple) enough to test for all kinds of bottlenecks in ORMs and
   elsewhere (a sketch of the general idea follows this list).

   - used this tool to generate benchmark reports on EC2 and on the (less
   powerful) dev server, and with different schema settings (usually rather
   minor schema changes that do not require re-importing all the data)

   - came up with a three-table schema that renders our queries quickly:
   we first do a search (using whatever criteria are given, e.g. nickname,
   fingerprint, address, running) on a table that has a column of unique
   fingerprints; extract the relevant fingerprints; JOIN with the main
   status entry table, which is much larger; and get the final results
   (see the sketch after this list). All benchmarks were run against this
   schema.

   <details> If we are only extracting a list of the latest status entries
   (with distinct on fingerprint), we can do LIMITs and OFFSETs already on the
   fingerprint table, before the JOIN. This helps us quite a bit. On the other
   hand, nickname searches etc. are also efficient. As of now, I have
   re-enabled nickname+address+fingerprint substring search (not from the
   middle (LIKE %substring%), but from the beginning of a substring (LIKE
   substring%), which is still nice), and all is well. Updated the
   higher-level ORM to reflect this new table [2] (I've yet to change some
   column names, though - but these are cosmetics.) </details>

   - found a way to generate the SQL queries I need via the higher-level
   SQLAlchemy API, using various SQLAlchemy-provided primitives and always
   inspecting the resulting query statements. This is good because
   everything becomes more modular: it is much easier to shape the query
   depending on the parameters received, etc., while still keeping it in a
   sane order (see the composition sketch after this list).

   - hence (re)wrote a part of the Onionoo-like API that uses the new
   schema and the SQLAlchemy primitives. Extended the API a bit. [3]

   - wrote a very hacky API point for getting a list of status entries for
   a given fingerprint; I simply wanted a way (for myself and others) to
   query this kind of relation easily and externally. It now works as part
   of the API (a rough sketch follows this list). This part will probably
   need some discussion.

   - wrote a (kind of stub) document explaining the current Onionoo-like
   API: what can be queried, what can be returned, and what kinds of
   parameters work. [4] I will extend it later on.
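
To give a rough idea of the benchmark script: it boils down to timing a
set of queries against a live database and printing the durations. A
minimal sketch of that approach (this is not the actual benchmark.py; the
DSN and the table/column names here are made up):

    import time
    from sqlalchemy import create_engine

    engine = create_engine('postgresql:///torsearch')  # hypothetical DSN

    def benchmark(name, sql, runs=3):
        """Run a raw SQL query several times, report wall-clock timings."""
        timings = []
        for _ in range(runs):
            start = time.time()
            engine.execute(sql).fetchall()  # force full query execution
            timings.append(time.time() - start)
        print('%s: min %.3fs, avg %.3fs'
              % (name, min(timings), sum(timings) / len(timings)))

    benchmark('nickname prefix search',
              "SELECT * FROM statusentries "
              "WHERE nickname LIKE 'moria%' LIMIT 100")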
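To illustrate the two-step lookup (only two of the three tables are
shown; the model and column names are illustrative, not the actual
torsearch models):

    from sqlalchemy import (Column, Integer, String, Boolean, DateTime,
                            ForeignKey, create_engine)
    from sqlalchemy.orm import sessionmaker
    from sqlalchemy.ext.declarative import declarative_base

    Base = declarative_base()

    class Fingerprint(Base):  # one row per unique relay fingerprint
        __tablename__ = 'fingerprints'
        id = Column(Integer, primary_key=True)
        fingerprint = Column(String(40), index=True)
        nickname = Column(String(19), index=True)
        address = Column(String(15), index=True)
        running = Column(Boolean)

    class StatusEntry(Base):  # the much larger per-consensus table
        __tablename__ = 'statusentries'
        id = Column(Integer, primary_key=True)
        fingerprint_id = Column(Integer, ForeignKey('fingerprints.id'),
                                index=True)
        published = Column(DateTime, index=True)

    engine = create_engine('postgresql:///torsearch')  # hypothetical DSN
    session = sessionmaker(bind=engine)()

    # step 1: search plus LIMIT/OFFSET on the small fingerprint table only
    fps = (session.query(Fingerprint.id)
           .filter(Fingerprint.nickname.like('moria%'))  # prefix search
           .order_by(Fingerprint.nickname)
           .limit(50)
           .subquery())

    # step 2: JOIN the pre-filtered fingerprints with the big table
    entries = (session.query(StatusEntry)
               .join(fps, StatusEntry.fingerprint_id == fps.c.id)
               .order_by(StatusEntry.published.desc())
               .all())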
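And what I mean by shaping the query modularly, reusing the illustrative
models above (the parameter names are made up):

    def build_query(session, params):
        """Assemble the fingerprint-table query from request parameters."""
        q = session.query(Fingerprint.id)
        if 'search' in params:  # prefix search, as described above
            q = q.filter(Fingerprint.nickname.like(params['search'] + '%'))
        if 'running' in params:
            q = q.filter(Fingerprint.running ==
                         (params['running'] == 'true'))
        if 'limit' in params:
            q = q.limit(int(params['limit']))
        return q

    # printing a Query compiles it, so the resulting SQL is easy to observe
    print(build_query(session, {'search': 'moria', 'limit': '10'}))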
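The "status entries for a fingerprint" API point, roughly (assuming a
Flask-style app; the route and the JSON shape here are made up, not what
torsearch actually returns):

    from flask import Flask, jsonify

    app = Flask(__name__)

    @app.route('/statuses/<fingerprint>')
    def list_statuses(fingerprint):
        entries = (session.query(StatusEntry)
                   .join(Fingerprint,
                         StatusEntry.fingerprint_id == Fingerprint.id)
                   .filter(Fingerprint.fingerprint == fingerprint)
                   .order_by(StatusEntry.published.desc())
                   .limit(500)  # cap the response size
                   .all())
        return jsonify(entries=[e.published.isoformat() for e in entries])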


While writing the doc and rewriting part of the API, I stumbled upon a
few things which make it clear that I've taken some shortcuts that may
hurt later on. I will be happy to elaborate on them later / separately. I
need to plan a few things carefully, and then try rewriting the Onionoo
API yet again, this time including more parameters and returned fields.

TL;DR yay, a working database backend!

I might soon give *one* more update detailing things I may have forgotten
in this report. I don't want to make a habit of delaying reports (which I
have consistently done), so I'm reporting what I have now.

[1]: https://github.com/wfn/torsearch/blob/master/torsearch/benchmark.py
[2]: https://github.com/wfn/torsearch/blob/master/torsearch/models.py
[3]: https://github.com/wfn/torsearch/blob/master/torsearch/onionoo_api.py
[4]: https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md

--

Kostas (wfn on OFTC)

0x0e5dce45 @ pgp.mit.edu