[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Karsten Loesing karsten at torproject.org
Wed Aug 14 10:33:22 UTC 2013


On 8/13/13 3:17 PM, Kostas Jakeliunas wrote:
> On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <karsten at torproject.org>wrote:
>>
>> I suggest putting pg_prewarm on the future work list.  I sense there's a
>> lot of unused potential in stock PostgreSQL.  Tweaking the database at
>> this point has the word "premature optimization" written on it in big
>> letters for me.
> 
> 
>> Also, to be very clear here, a tool that requires custom tweaks to
>> PostgreSQL has minimal chances of running on torproject.org machines in
>> the future.  The current plan is that we'll have a dedicated database
>> machine operated by our sysadmins that not even the service operator
>> will have shell access to.
> 
> 
> Oh, understood then, OK, no extensions (at least) for now.
> 
> Apropos: as of my current (limited) understanding, it might be difficult to
> support, for example, nickname sub-string searches without a (supported,
> official) extension. One such extension is pg_trgm [1], which is in the
> contrib/ directory in 9.1, and is just one make install away. But for now,
> I'll assume this is not possible / we should avoid this.

Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
likely that we can run something requiring this extension on a
torproject.org machine.  Still, requiring extensions should be the last
resort if no other solution can be found.  Leaving out searches for
nickname substrings is a valid solution for now.

>> So, why do you join descriptors and network statuses in the search
>> process?  At the Munich dev meeting I suggested joining the tables
>> already in the import process.  What do you think about that idea?
> 
> 
> Yes, I had made a half-hearted attempt to normalize the two tables some
> time ago, for a small amount of descriptors and status entries; I'll be
> trying out this scheme in full (will need to re-import a major part of the
> data (which I didn't do then) to be able to see if it scales well) after I
> try something else.

Okay.

> (Namely, using a third table of unique fingerprints
> (the statusentry table currently holds ~170K unique fingerprints vs. ~67M
> rows in total) and (non-unique) nicknames for truly quick fingerprint
> lookup and nickname search; I did experiment with this as well, but I
> worked with a small subset of overall data in that case, too; and I think I
> can do a better job now.)
> 
> It had seemed to me that the bottleneck was in having to sort a too large
> number of rows, but now I understand (if only just a bit) more about the
> 'explain analyze' output to see that the 'Nested Loop' procedure, which is
> what does the join in the join query discussed, is expensive and is part of
> the bottleneck so to speak. So I'll look into that after properly
> benchmarking stuff with the third table.

Sounds like another fine thing to test, I agree.

> (By the way, for future reference,
> we do have to test out different ideas on a substantial subset of overall
> data, as the scale function is not, so to say, linear.) :)

Good point.  Yes, we should keep this in mind for the future.

>>> https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql
>>>
>>> We use the following indexes while executing that query:
>>>
>>>  * lower(nickname) on descriptor
>>>
>>>  * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on
>> statusentry
>>
>> Using only the first 12 characters sounds like a fine approach to speed
>> up things.  But why 12?  Why not 10 or 14?  This is probably something
>> you should annotate as parameter to find a good value for later in the
>> process.  (I'm not saying that 12 is a bad number.  It's perfectly fine
>> for now, but it might not be the best number.)
>>
> 
> Yes, this is as unscientific as it gets. As of now, we're using a raw SQL
> query, but I'll be encapuslating them properly soon (so we can easily
> attach different WHERE clauses, etc.), at which point I'll make it into a
> parameter. I did do some tests, but nothing extensive; just made sure the
> indexes can fit into memory whole, which was the main constraint. Will do
> some tests.
> 
> 
>> Also, would it keep indexes smaller if you took something else than
>> base16 encoding for fingerprints?  What about base64?  Or is there a
>> binary type in PostgreSQL that works fine for indexes?
>>
> 
> Re: latter, no binary type for B-Trees (which is the default index type in
> pgsql) as far as I can see. But it's a good idea / approach, so I'll look
> into it, thanks! On the whole though, as long as all the indexes occupy
> only a subset of pgsql's internal buffers, there shouldn't be a problem /
> that's not the problem, afaik. But, if we're making a well-researched
> ORM/database design, I should look into it.
> 
> 
>> Do you have a list of searches you're planning to support?
> 
> 
> These are the ones that should *really* be supported:
> 
>    - ?search=nickname
>    - ?search=fingerprint
>    - ?lookup=fingerprint
>    - ?search=address [done some limited testing, currently not focusing on
>    this]

The lookup parameter is basically the same as search=fingerprint with
the additional requirement that fingerprint must be 40 characters long.
 So, this is the current search parameter.

I agree, these would be good to support.

You might also add another parameter ?address=address for ExoneraTor.
That should, in theory, be just a subset of the search parameter.

>    - ?running=<boolean>

This one is tricky.  So far, Onionoo looks only at the very latest
consensus or bridge status to decide if a relay or bridge is running or not.

But now you're adding archives to Onionoo, so that people can search for
a certain consensus or certain bridge status in the past, or they can
search for a time interval of consensuses or bridge statuses.  How do
you define that a relay or bridge is running, or more importantly
included as not running?

>    - ?flag=flag [every kind of clause which further narrows down the query
>    is not bad; the current db model supports all the flags that Stem does, and
>    each flag has its own column]

I'd say leave this one out until there's an actual use case.

>    - ?first_seen_days=range
>    - ?last_seen_days=range
> 
> As per the plan, the db should be able to return a list of status entries /
> validafter ranges (which can be used in {first,last}_seen_days) given some
> fingerprint.

Oh, I think there's a misunderstanding of these two fields.  These
fields are only there to search for relays or bridges that have first
appeared or were last seen on a given day.

You'll need two new parameters, say, from=datetime and to=datetime (or
start=datetime and end=datetime) to define a valid-after range for your
search.

Hope this makes sense.

All the best,
Karsten



More information about the tor-dev mailing list