[tor-dev] Python ExoneraTor
karsten at torproject.org
Tue Jun 17 07:13:09 UTC 2014
On 11/06/14 04:48, Kostas Jakeliunas wrote:
> Hi all!
> On Mon, Jun 9, 2014 at 10:22 AM, Karsten Loesing <karsten at torproject.org> wrote:
>> On 09/06/14 01:26, Damian Johnson wrote:
>>> Oh, and another quick thought - you once mentioned that a descriptor
>>> search service would make ExoneraTor obsolete, and in looking it over
>>> I agree. The search functionality ExoneraTor provides is trivial. The
>>> only reason it requires such a huge database is because it's storing a
>>> copy of every descriptor ever made.
>>> I suspect the actual right solution isn't to rewrite ExoneraTor at
>>> all, but rather develop a new service that can be queried for this
>>> descriptor data. That would make for a *much* more worthwhile project.
>>> ExoneraTor? Nice to have. Descriptor archive service? Damn useful. :)
>> I agree, that was the idea behind Kostas' GSoC project last year. And I
>> still think it's a good idea. It's just not trivial to get right.
> Indeed, not trivial at all!
> I'll use this space to mention the running metrics archive backend
> modulo ExoneraTor stuff / what could be sorta-relevant here.
> fwiw, the onionoo-like backend is still running at an obscure address:port:
Would you want to put the summary you wrote here to that link?
And would you want me to add a sentence or two about your service
together with a link to the CollecTor page?
What would I write?
> TL;DR "what can I do with that" is: look at:
> In particular, regarding ExoneraTor-like queries (incl. arbitrary
> subnet / part-of-ip lookups):
> Not sure if it's worth discussing all the weaknesses of this archive
> backend in this thread, but the short relevant version is that the
> ExoneraTor-like functionality does mostly work, but I would need to
> look into it to see how reliable the results are ("is this relay ip
> address field really the one we should be using?", etc.)
> But what's nice is that it is possible to do arbitrary queries on all
> consensuses since ~2008, with no date specified (if you don't want
> to.) (Which is to say, "it's possible", not necessarily "this is the
> right way to do the solution for the problems in this thread")
> So e.g. this is the ip address where moria runs, and we want to see
> what relays have ever run on it:
> Take the fingerprint of the one that is currently running (moria1),
> and look up its last 500 statuses (in a kind of condensed/summary
> form): http://ts.mkj.lt:5555/statuses?lookup=9695DFC35FFEB861329B9F1AB04C46397020CE31&condensed=true
> "from", "to" date ranges can be specified as e.g. 2009, 2009-02,
> 2009-02-10, 2009-02-10 02:00:00. limit/offset/parameters/etc.
> specified here:
> (Descriptors/digests aren't currently included (I think they used to),
> but they can be, etc.)
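A minimal sketch of how such a query URL could be assembled, assuming the "from" and "to" date ranges are passed as ordinary query-string keys next to "lookup" and "condensed". The helper name build_statuses_url is hypothetical, and the sketch only builds the URL; it does not contact the backend.

```python
from urllib.parse import urlencode

# Address of the statuses endpoint as quoted in this thread.
BASE_URL = "http://ts.mkj.lt:5555/statuses"


def build_statuses_url(lookup, condensed=True, date_from=None, date_to=None):
    """Assemble a query URL; optional parameters are left out when unset.

    Hypothetical helper: the parameter names "from" and "to" follow the
    description in the thread, everything else is an assumption.
    """
    params = {"lookup": lookup}
    if condensed:
        params["condensed"] = "true"
    if date_from is not None:
        params["from"] = date_from
    if date_to is not None:
        params["to"] = date_to
    # urlencode percent-encodes values, so a timestamp with spaces is safe.
    return BASE_URL + "?" + urlencode(params)


url = build_statuses_url(
    "9695DFC35FFEB861329B9F1AB04C46397020CE31",
    date_from="2009-02",
    date_to="2009-02-10 02:00:00",
)
print(url)
```

Dates of any of the granularities mentioned above (year, month, day, full timestamp) go through the same two parameters.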
> The point is probably mostly about "this is some evidence that it can be done."
> ("But there are nuances, things are imperfect, time is needed, etc.")
> The question really is regarding the actual scope of this rewrite, I suppose.
> I'd probably agree with Karsten that just doing a port of the
> ExoneraTor functionality as it currently is on
> exonerator.torproject.org would be the safe bet. See how that goes,
> venture into more exotic lands later on maybe, etc. (That doesn't mean
> that I wouldn't be excited to put the current backend to good use,
> and/or use the knowledge I gained to help you folks in some way!)
>> Regarding your comment about storing a copy of every descriptor ever
>> made, I believe that users trust ExoneraTor's results more if they see
>> the actual descriptors that lead to results. Of course, I'm saying that
>> without knowing what ExoneraTor users actually want. But let's not drop
>> descriptor copies from the database easily.
>> And, heh, when you say that the search functionality ExoneraTor provides
>> is trivial, a little part of me is dying. It's the part that spent a
>> few weeks on getting the search functionality fast enough for
>> production. That was not at all trivial. The oraddress24, oraddress48,
>> and exitaddress24 fields as well as the indexes are the result of me
>> running lots and lots of sample queries and wondering about Postgres'
>> EXPLAIN ANALYZE results. Just saying that it's not going to be trivial
>> to generalize the search functionality towards other fields than IP
>> addresses and dates.
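To illustrate the idea behind prefix fields such as oraddress24 and oraddress48, here is a minimal sketch that derives a /24 prefix from an IPv4 address and a /48 prefix from an IPv6 address, so that a subnet lookup becomes an equality match on an indexed column. The hex encoding and the function name address_prefix_hex are assumptions made for illustration; the actual ExoneraTor schema may store these values differently.

```python
import ipaddress


def address_prefix_hex(address):
    """Return the hex-encoded /24 prefix (IPv4) or /48 prefix (IPv6).

    Hypothetical helper; the storage format is an assumption.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return ip.packed[:3].hex()  # first 3 bytes = first 24 bits
    return ip.packed[:6].hex()      # first 6 bytes = first 48 bits


# Two addresses in the same /24 map to the same prefix value.
print(address_prefix_hex("192.0.2.17"))
print(address_prefix_hex("192.0.2.200"))
print(address_prefix_hex("2001:db8:1234:5678::1"))
```

Precomputing the prefix at import time trades some storage for queries that can use a plain index on (prefix, date) instead of a range or pattern match on the full address.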
> Hear hear, I can only imagine! These things and the exonerator stuff
> are not easy to do in a way that would provide **consistently**
> good/great performance.
> I spent some days of the last summer also looking at EXPLAIN ANALYZE
> results (it was a great feeling to start to understand what they mean
> and how I can make them better), but eventually things start making
> sense. (And when they do, I also get that same feeling that NoSQL
> stuff doesn't magically solve things.)
>> If others want to follow, here's the SQL code I'm talking about:
>> So, I'm happy to talk about writing a searchable descriptor archive. It
>> could _start_ with ExoneraTor's functionality (minus the target address
>> and port thing discussed in that other email), and then we could
>> consider adding more searches.
> fwiw, imho this sounds like a sane plan to me. (Of course it could
> also be possible to work on the onionoo-like archive backend (or fork
> it, or smash it into parts and steal some of them, etc.), but I can see
> why this might yield unclear deliverables, etc.) (So a short document
> of "what is wanted" would help, yeah.)
>> Pretty sure that Kostas is reading this (in fact, I just cc'ed him), so
>> let me make one remark about optimizing Postgres defaults: I wrote quite
>> a few database queries in the past, and some of them perform horribly
>> (relay search) whereas others perform really well (ExoneraTor). I
>> believe that the majority of performance gains can be achieved by
>> designing good tables, indexes, and queries. Only as a last resort
>> should we consider optimizing the Postgres defaults.
> Ha, at this point I probably have a sort of "premature optimizer"
> label in your mind, Karsten. :) (And I kinda deserved it by at one
> point focusing on very-low-level postgres caching mechanisms last
> summer, etc etc.)
> I've actually come to really appreciate good schema and query
> design and the wonders that they do. That being said, I'd actually
> be curious to know how large the indexes of relay-search and current
> exonerator are. I (still) bet increasing postgres' shared_buffers
> and effective_cache_size (totally normal practice!) might help! (Oh,
> is this one of those vim-vs-emacs things? If it is, sorry.)
I just deleted most of the database contents behind the relay-search
service a few days ago. But I might even have agreed there that some
PostgreSQL tweaking would have helped. It was a bad database design,
mostly because it was built for a different purpose (data aggregation
for metrics website), so it's a bad example.
But let me give you some numbers on current ExoneraTor (manually deleted
part of the output which we don't care about here):
Name          | Size
--------------+-------
consensus     | 16 GB
descriptor    | 31 GB
exitlistentry | 558 MB
statusentry   | 50 GB

Name                                    | Table         | Size
----------------------------------------+---------------+--------
consensus_pkey                          | consensus     | 1280 kB
descriptor_pkey                         | descriptor    | 1930 MB
exitlistentry_exitaddress24_scanneddate | exitlistentry | 82 MB
exitlistentry_exitaddress_scanneddate   | exitlistentry | 82 MB
exitlistentry_pkey                      | exitlistentry | 173 MB
statusentry_oraddress24_validafterdate  | statusentry   | 5470 MB
statusentry_oraddress48_validafterdate  | statusentry   | 4629 MB
statusentry_oraddress_validafterdate    | statusentry   | 5509 MB
statusentry_pkey                        | statusentry   | 10 GB
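For the question of how much memory the indexes would need, the numbers above can be added up per table. This is a rough sketch only: it assumes the sizes use binary units (1 GB = 1024 MB), and rounded figures such as "10 GB" make the totals approximate.

```python
# Index sizes copied from the listing above, converted to MB.
# Assumption: binary units (1 GB = 1024 MB, 1 MB = 1024 kB).
INDEX_SIZES_MB = {
    "consensus": [1280 / 1024],
    "descriptor": [1930],
    "exitlistentry": [82, 82, 173],
    "statusentry": [5470, 4629, 5509, 10 * 1024],
}

# Table sizes from the first listing, also in MB.
TABLE_SIZES_MB = {
    "consensus": 16 * 1024,
    "descriptor": 31 * 1024,
    "exitlistentry": 558,
    "statusentry": 50 * 1024,
}

# Sum the index sizes for each table.
index_totals = {table: sum(sizes) for table, sizes in INDEX_SIZES_MB.items()}

for table, total in index_totals.items():
    ratio = total / TABLE_SIZES_MB[table]
    print(f"{table:14s} indexes: {total:9.1f} MB ({ratio:.0%} relative to listed table size)")

all_indexes_gb = sum(index_totals.values()) / 1024
print(f"all indexes: {all_indexes_gb:.1f} GB")
```

The statusentry indexes dominate the total, so that table is where cache-related settings would matter most, if they matter at all.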
Happy to run some EXPLAIN ANALYZE queries for you if you tell me what to
If we're going to optimize the ExoneraTor database, should we move this
discussion to a ticket?
All the best,
> But the point is that (to invoke a cliche) there is no free lunch, and
> (2) postgresql can really do wonders and scale well when used right.
>> You realize that a searchable descriptor archive focuses much more on
>> database optimization than the ExoneraTor rewrite from Java to Python
>> (which would leave the database untouched)?
> "leaving database untouched" probably implies (very) significantly
> less work, so it would be a nice/clear starting point. (caveat, i may
> be lacking context, etc.)
> : also, fun things like "sometimes indexes won't be used because a
> sequential read will be faster, because if parts of indexes to be used
> are in various parts across the disk (not all of them are in memory),
> random seek + read a bit into memory + repeat is slower than 'just
> read a lot of continuous data into memory'", etc etc.)
> : if you're feeling adventurous, you can run this on each of the
> postgres databases, to see how large the indexes (among all other
> things) are, and which parts of them are loaded into memory
> 0x0e5dce45 @ pgp.mit.edu