[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Kostas Jakeliunas kostas at jakeliunas.com
Mon Aug 12 12:58:12 UTC 2013


Karsten,

this won't be a very short email, but I honestly swear I did revise it a
couple of times. :) This is not urgent by any measure, so whenever you find
time to reply will be fine. ctrl+f to "observe:" for some precise data /
support for my plan re: using the pg_prewarm extension.

On Mon, Aug 12, 2013 at 2:16 PM, Karsten Loesing <karsten at torproject.org>wrote:

> On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
> >   * I don't think we can avoid using certain postgresql extensions (if
> only
> > one) - which means that deploying will always take more than apt-get &&
> pip
> > install, but I believe it is needed;
>
> Can you give an example of a query that won't be executed efficiently
> without this extension and just fine with it?  Maybe we can tweak that
> query somehow so it works fine on a vanilla PostgreSQL.  Happy to give
> that some thoughts.
>
> I'd really want to avoid using stuff that is not in Debian.  Or rather,
> if we really need to add non-standard extensions, we need more than
> thinking and believing that it's unavoidable. :)


First off, the general idea. I know this might not sound convincing (see
below re: this), but any query that uses an index will take significantly
longer to execute if it needs to load parts of the index from disk. More
precisely, query time deviation and max(query_time) inversely correlates
with the percentage of the index in question in memory. The larger the
index, the more difficult it is to 'prep' it into cache, the more
unpredictable query exec time gets.

Take a look at the query used to join descriptors and network statuses
given some nickname (could be any other criterion, e.g. fingerprint or IP
address):

https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql

We use the following indexes while executing that query:

 * lower(nickname) on descriptor

 * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
(this one is used to efficiently join descriptor table with statusentry:
(fingerprint, descriptor) pair is completely unique in the descriptor
table, and it is fairly unique in the statusentry table (whereas a
particular fingerprint usually has lots and lots of rows in statusentry));
this index uses only substrings because otherwise, it will hog memory on my
remote development machine (not EC2), leaving not much for other indexes;
this composite substring index still takes ~2.5GB for status entries (only)
in the range between [2010-01; 2013-05] as of now

 * validafter on statusentry (the latter *must* stay in memory, as we use
it elsewhere as well; for example, when not given a particular search
criterion, we want to return a list of status entries (with distinct
fingerprints) sorted by consensus validafter in descending order)

We also want to keep a fingerprint index on the descriptor table because we
want to be able to search / look up by fingerprint.

I'm thinking of a way to demonstrate the efficiency of having the whole
index in memory. For now, let me summarize what I have observed, intersect
with what is relevant now: running the aforementioned query on some
nickname that we haven't queried for since the last restart of postgresql,
it might take, on average, about 1.5 to 3 seconds to execute on EC2, and
considerably longer on my development db if it is a truly popular nickname
(otherwise, more or less the same amount of time); sometimes a bit longer -
up to ~4s (ideally it should be rather uniform since the indexes are
*balanced* trees, but.. and autovacuum is enabled.)

Running that same query later on (after we've run other queries after that
first one), it will take <= 160ms to execute and return results (this is a
conservative number, usually it's much faster (see below)). Running EXPLAIN
(ANALYZE, BUFFERS) shows that what happened was that there was no "[disk]
read" next to index operations - only "buffer hit". This means that there
was no need to read from disk during all the sorting - only when we knew
which rows to return did we need to actually read them from disk. (There
are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1],
which I haven't tried yet - there might be some pleasant surprises re:
query time. Last I checked, Debian 9.0 repository contains postgresql
9.1.9.)

Observe:

1a. Run that query looking for 'moria2' for the first time since postgresql
restart - relay is an old one, only one distinct fingerprint, relatively
few status entries: http://sprunge.us/cEGh

1b. Run that same query later on: http://sprunge.us/jiPg (notice: no reads,
only hits; notice query time)

2a. Run query on 'gabelmoo' (a ton of status entries) for the first time
(development machine, query time is rather insane indeed):
http://sprunge.us/fQEK

2b. Run that same query on 'gablemoo' later on: http://sprunge.us/fDDV

PostgresSQL is rather clever: it will keep the parts of indexes more often
used in cache. What pg_prewarm simply does is:

 * load all (or critical for us) indexes to memory (and load them whole),
which is possible given a large enough cache;
 * load them to shared_buffers, which is an internal pgsql cache. pgsql
uses two types of cache - its internal one (the memory for which is
preallocated on daemon start), and OS cache - the latter is faster, but
pgsql has not much weight in determining what gets kept there, and what
gets flushed out.

If the shared_buffers value is large enough (say, 8GB) and all our relevant
indexes fit into it nicely, postgresql does not flush them out from its
internal cache - there's no need for it.

The cached indexes make a lot of sense when we make very non-discriminatory
queries. What I mean is, if we query for status entries WHERE
lower(nickname) = 'default' or 'unnamed', lots of index scanning will take
place. EC2 executes such queries very nicely if the part of lower(nickname)
index for, e.g., 'default' is loaded into cache. For that, we need enough
memory and we need the indexes to be loaded into it. It doesn't happen
automatically.

It might be possible to manually prepare the indexes by extracting, for
example, the most common nicknames, and then simply running queries on
them, thus pushing those parts of indexes into memory. As far as I
understand, however, there is no guarantee that those parts will be put
into the internal shared_buffers cache, which means that they might get
pushed out of cache by the underlying OS depending on
$some_uncontrollable_factor. This is my present understanding, but I can
look into this more if there is a need for it.

Regarding the particular query at hand, I will agree it does look hairy.
See the comments there - I can try to rephrase them and format them into a
more readable form maybe? Looking up a particular nickname on the
statusentry table is an expensive operation because, well, there are so
many rows featuring it (usually). The descriptor table is much smaller.
Removing the inner DISTINCT ON (descriptor.fingerprint) will result in a
less hairy-looking EXPLAIN, but will take considerably longer to execute.
So first, we narrow down the fingerprints using the descriptor table,
extract a list of unique fingerprints (this particular operation is not
expensive), then get a list of status entries featuring (fingerprint,
descriptor) pairs, and again extract unique fingerprints (there multiple
status entries to a particular (fp, descriptor) pair of course), sorting by
validafter desc.

Regarding installing the extension in question / preparing a debian
instance for this whole task, from a sysadmin perspective, the whole
preparation can be reduced to a single chef script / makefile / whatnot; it
will be a single command. But I understand that using additional things
means an increased potential edge case and maintenance surface, so to
speak. Note that the patch for pg_prewarm in question [2] is rather clean
extension in the sense that it doesn't patch 'into' any postgresql core
files; it creates new files and adds itself into an extension listing;
standard stuff for extensions. Still, if this is something that sounds too
risky (re: future integration/support/maintenance), I completely understand.

I plan to run some stuff on EC2 again soon, and install pg_prewarm there,
and record query times, EXPLAIN BUFFERS, etc. Let me know what I should do
to further prove/disprove my point.

> Overall, I'm spending a bit too much time on a specific problem, but at
> > least I have a more intimate lower-level knowledge of PostgreSQL, which
> > turns out to be very relevant to this project. I hope to be able to soon
> > move to extending Onionoo support and providing a clean API for getting
> > lists of consensuses in which a particular relay was present. And maybe
> > start with the frontend. :)
> Sounds like a plan, except for the frontend part.  Scratch that, I'd
> say, and focus on the API.  Making it shiny can come once everything
> works as expected.


Understood and makes sense, OK. I won't start thinking about the frontend
until there's a jolly awesome backend running.

Cheers
Kostas.

[1]:
http://wiki.postgresql.org/wiki/What's_new_in_PostgreSQL_9.2#Index-only_scans
[2]:
http://www.postgresql.org/message-id/attachment/12593/pg_prewarm_v1.patch
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20130812/950a7ac8/attachment.html>


More information about the tor-dev mailing list