[tor-dev] Python ExoneraTor

Karsten Loesing karsten at torproject.org
Tue Jun 17 07:13:09 UTC 2014


Hi Kostas,

On 11/06/14 04:48, Kostas Jakeliunas wrote:
> Hi all!
> 
> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <karsten at torproject.org> wrote:
>> On 09/06/14 01:26, Damian Johnson wrote:
>>> Oh, and another quick thought - you once mentioned that a descriptor
>>> search service would make ExoneraTor obsolete, and in looking it over
>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>> only reason it requires such a huge database is because it's storing a
>>> copy of every descriptor ever made.
>>>
>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>> all, but rather develop a new service that can be queried for this
>>> descriptor data. That would make for a *much* more worthwhile project.
>>>
>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>>
>> I agree, that was the idea behind Kostas' GSoC project last year.  And I
>> still think it's a good idea.  It's just not trivial to get right.
> 
> Indeed, not trivial at all!
> 
> I'll use this space to mention the running metrics archive backend
> modulo ExoneraTor stuff / what could be sorta-relevant here.
> 
> fwiw, the onionoo-like backend is still running at an obscure address:port:
> http://ts.mkj.lt:5555/

Would you want to put the summary you wrote here at that link?

And would you want me to add a sentence or two about your service
together with a link to the CollecTor page?

https://collector.torproject.org/#references

What would I write?

> TL;DR "what can I do with that" is: look at:
> 
> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md
> 
> In particular, regarding ExoneraTor-like queries (incl. arbitrary
> subnet / part-of-ip lookups):
> 
> https://github.com/wfn/torsearch/blob/master/docs/use_cases_examples.md#exonerator-type-relay-participation-lookup
> 
> Not sure if it's worth discussing all the weaknesses of this archive
> backend in this thread, but the short relevant version is that the
> ExoneraTor-like functionality does mostly work, but I would need to
> look into it to see how reliable the results are ("is this relay ip
> address field really the one we should be using?", etc.)
> 
> But what's nice is that it is possible to do arbitrary queries on all
> consensuses since ~2008, with no date specified (if you don't want
> to.) (Which is to say, "it's possible", not necessarily "this is the
> right way to do the solution for the problems in this thread")
> 
> So e.g. this is the ip address where moria runs, and we want to see
> what relays have ever run on it:
> 
> http://ts.mkj.lt:5555/details?search=128.31.0.34
> 
> Take the fingerprint of the one that is currently running (moria1),
> and look up its last 500 statuses (in a kind of condensed/summary
> form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
> 
> "from", "to" date ranges can be specified as e.g. 2009, 2009-02,
> 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc.
> specified here:
> https://github.com/wfn/torsearch/blob/master/docs/onionoo_api.md
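
A minimal sketch of how such a lookup URL could be assembled, using only the
host, the "lookup", "condensed", "from" and "to" parameters, and the date
formats quoted above. The helper name statuses_url is hypothetical, and
whether the backend accepts the URL-encoded form of a date with a time part
is an assumption; the API document linked above is the authority. The code
only builds a string and does not contact the server.

```python
from urllib.parse import urlencode

# Address of the onionoo-like backend as quoted in this thread.
BASE_URL = "http://ts.mkj.lt:5555"


def statuses_url(fingerprint, date_from=None, date_to=None, condensed=True):
    """Build a /statuses lookup URL (hypothetical helper, no network access)."""
    params = {"lookup": fingerprint}
    if condensed:
        params["condensed"] = "true"
    # Dates may be given at any precision mentioned in the thread:
    # "2009", "2009-02", "2009-02-10" or "2009-02-10 02:00:00".
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return BASE_URL + "/statuses?" + urlencode(params)


url = statuses_url(
    "9695DFC35FFEB861329B9F1AB04C46397020CE31",
    date_from="2009-02",
    date_to="2009-02-10 02:00:00",
)
print(url)
```
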
> 
> (Descriptors/digests aren't currently included (I think they used to),
> but they can be, etc.)
> 
> The point is probably mostly about "this is some evidence that it can be done."
> ("But there are nuances, things are imperfect, time is needed, etc.")
> 
> The question really is regarding the actual scope of this rewrite, I suppose.
> 
> I'd probably agree with Karsten that just doing a port of the
> ExoneraTor functionality as it currently is on
> exonerator.torproject.org would be the safe bet. See how that goes,
> venture into more exotic lands later on maybe, etc. (That doesn't mean
> that I wouldn't be excited to put the current backend to good use,
> and/or use the knowledge I gained to help you folks in some way!)
> 
>>
>> Regarding your comment about storing a copy of every descriptor ever
>> made, I believe that users trust ExoneraTor's results more if they see
>> the actual descriptors that lead to results.  Of course, I'm saying that
>> without knowing what ExoneraTor users actually want.  But let's not drop
>> descriptor copies from the database easily.
>>
>> And, heh, when you say that the search functionality ExoneraTor provides
>> is trivial, a little part of me is dying.  It's the part that spent a
>> few weeks on getting the search functionality fast enough for
>> production.  That was not at all trivial.  The oraddress24, oraddress48,
>> and exitaddress24 fields as well as the indexes are the result of me
>> running lots and lots of sample queries and wondering about Postgres'
>> EXPLAIN ANALYZE results.  Just saying that it's not going to be trivial
>> to generalize the search functionality towards other fields than IP
>> addresses and dates.
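
To illustrate the idea behind fields like oraddress24 and oraddress48: a
search for "any relay in the same network" can become an equality match on a
precomputed prefix column, which an ordinary index can serve. The sketch
below derives such a key in Python. The hex encoding and the function name
prefix_key are assumptions made for illustration only; the SQL file linked
further down is the authority on what the columns really contain.

```python
import ipaddress


def prefix_key(address):
    """Return the /24 (IPv4) or /48 (IPv6) network prefix as a hex string.

    Assumption: the prefix is stored as hex of the leading address bytes
    (3 bytes for IPv4, 6 bytes for IPv6).
    """
    ip = ipaddress.ip_address(address)
    prefix_bytes = 3 if ip.version == 4 else 6
    return ip.packed[:prefix_bytes].hex()


# Two addresses in the same /24 map to the same key, so "same network"
# turns into a plain equality comparison.
print(prefix_key("128.31.0.34"))
print(prefix_key("128.31.0.39"))
print(prefix_key("2001:db8::1"))
```
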
> 
> Hear hear, I can only imagine! These things, and the ExoneraTor stuff,
> are not easy to do in a way that provides **consistently**
> good/great performance.
> 
> I also spent some days last summer looking at EXPLAIN ANALYZE
> results, and eventually things started making sense (it was a great
> feeling to start to understand what they mean and how I can make them
> better). (And when they do, I also get that same feeling that NoSQL
> stuff doesn't magically solve things.)
> 
>>
>> If others want to follow, here's the SQL code I'm talking about:
>>
>> https://gitweb.torproject.org/exonerator.git/blob/HEAD:/db/exonerator.sql
>>
>> So, I'm happy to talk about writing a searchable descriptor archive.  It
>> could _start_ with ExoneraTor's functionality (minus the target address
>> and port thing discussed in that other email), and then we could
>> consider adding more searches.
> 
> fwiw, imho this sounds like a sane plan to me. (Of course it could
> also be possible to work on the onionoo-like archive backend (or fork
> it, or smash it into parts and steal some of them, etc.), but I can see
> why this might yield unclear deliverables, etc.) (So a short document
> of "what is wanted" would help, yeah.)
> 
>>
>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>> let me make one remark about optimizing Postgres defaults: I wrote quite
>> a few database queries in the past, and some of them perform horribly
>> (relay search) whereas others perform really well (ExoneraTor).  I
>> believe that the majority of performance gains can be achieved by
>> designing good tables, indexes, and queries.  Only as a last resort
>> should we consider optimizing the Postgres defaults.
> 
> Ha, at this point I probably have a sort of "premature optimizer"
> label in your mind, Karsten. :) (And I kinda deserved it by at one
> point focusing on very-low-level postgres caching mechanisms last
> summer, etc etc.)
> 
> I've actually come to really appreciate good schema and query
> design[1] and the wonders that they do. That being said, I'd actually
> be curious to know how large the indexes of relay-search and current
> exonerator are.[2] I (still) bet increasing postgres' shared_buffers
> and effective_cache_size (totally normal practice!) might help! (Oh,
> is this one of those vim-vs-emacs things? If it is, sorry.)

I just deleted most of the database contents behind the relay-search
service a few days ago.  But I might even have agreed there that some
PostgreSQL tweaking would have helped.  It was a bad database design,
mostly because it was built for a different purpose (data aggregation
for metrics website), so it's a bad example.

But let me give you some numbers on current ExoneraTor (I manually
deleted the parts of the output that we don't care about here):

exonerator=> \dt+
     Name      |  Size
---------------+--------
 consensus     | 16 GB
 descriptor    | 31 GB
 exitlistentry | 558 MB
 statusentry   | 50 GB
(4 rows)

exonerator=> \di+
                  Name                   |     Table     |  Size
-----------------------------------------+---------------+---------
 consensus_pkey                          | consensus     | 1280 kB
 descriptor_pkey                         | descriptor    | 1930 MB
 exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
 exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
 exitlistentry_pkey                      | exitlistentry | 173 MB
 statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
 statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
 statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
 statusentry_pkey                        | statusentry   | 10 GB
(9 rows)
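
To relate these figures to the shared_buffers question, a small sketch that
adds up the index sizes per table. The numbers are copied from the \di+
output above; the code assumes psql's kB/MB/GB are powers of 1024, and since
the printed sizes are already rounded, the totals are approximate. The
helper names are made up for this example.

```python
# Bytes per unit, assuming psql's size units are 1024-based.
UNIT_BYTES = {"kB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

# (index name, table, size) as printed by \di+ in this mail.
INDEXES = [
    ("consensus_pkey", "consensus", "1280 kB"),
    ("descriptor_pkey", "descriptor", "1930 MB"),
    ("exitlistentry_exitaddress24_scanneddate", "exitlistentry", "82 MB"),
    ("exitlistentry_exitaddress_scanneddate", "exitlistentry", "82 MB"),
    ("exitlistentry_pkey", "exitlistentry", "173 MB"),
    ("statusentry_oraddress24_validafterdate", "statusentry", "5470 MB"),
    ("statusentry_oraddress48_validafterdate", "statusentry", "4629 MB"),
    ("statusentry_oraddress_validafterdate", "statusentry", "5509 MB"),
    ("statusentry_pkey", "statusentry", "10 GB"),
]


def to_bytes(size):
    """Convert a size string such as '82 MB' to a byte count."""
    value, unit = size.split()
    return int(value) * UNIT_BYTES[unit]


def index_bytes_per_table(indexes):
    """Sum index sizes in bytes, grouped by the table they belong to."""
    totals = {}
    for _name, table, size in indexes:
        totals[table] = totals.get(table, 0) + to_bytes(size)
    return totals


totals = index_bytes_per_table(INDEXES)
for table, total in sorted(totals.items()):
    print(f"{table:14s} {total / 1024 ** 3:6.2f} GB of indexes")
```

By this count the statusentry indexes alone come to about 25 GB, roughly
half the size of the 50 GB table they index, and all indexes together to
roughly 27 GB.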

Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
run.

If we're going to optimize the ExoneraTor database, should we move this
discussion to a ticket?

All the best,
Karsten


> But the point is that (1) there is no free lunch (to invoke a cliche), and
> (2) PostgreSQL can really do wonders and scale well when used right.
> 
>>
>> You realize that a searchable descriptor archive focuses much more on
>> database optimization than the ExoneraTor rewrite from Java to Python
>> (which would leave the database untouched)?
>>
> 
> "leaving database untouched" probably implies (very) significantly
> less work, so it would be a nice/clear starting point. (caveat, i may
> be lacking context, etc.)
> 
> 
> [1]: also, fun things like "sometimes indexes won't be used because a
> sequential read will be faster: if the parts of the indexes to be used
> are scattered across the disk (and not all of them are in memory),
> random seek + read a bit into memory + repeat is slower than 'just
> read a lot of contiguous data into memory'", etc. etc.
> 
> [2]: if you're feeling adventurous, you can run this on each of the
> Postgres databases to see how large the indexes (among all other
> things) are, and which parts of them are loaded into memory:
> https://github.com/wfn/torsearch/blob/master/misc/buffercache.sql
> 
> --
> 
> Kostas.
> 
> 0x0e5dce45 @ pgp.mit.edu
> 


