this won't be a very short email, but I honestly swear I did revise it a couple of times. :) This is not urgent by any measure, so whenever you find time to reply will be fine. ctrl+f to "observe:" for some precise data / support for my plan re: using the pg_prewarm extension.
On 8/10/13 9:28 PM, Kostas Jakeliunas wrote:
> * I don't think we can avoid using certain postgresql extensions (if only
> one) - which means that deploying will always take more than apt-get && pip
> install, but I believe it is needed;
Can you give an example of a query that won't be executed efficiently
without this extension and just fine with it? Maybe we can tweak that
query somehow so it works fine on a vanilla PostgreSQL. Happy to give
that some thoughts.
I'd really want to avoid using stuff that is not in Debian. Or rather,
if we really need to add non-standard extensions, we need more than
thinking and believing that it's unavoidable. :)
First off, the general idea. I know this might not sound convincing (see below re: this), but any query that uses an index will take significantly longer to execute if it needs to load parts of the index from disk. More precisely, query time deviation and max(query_time) inversely correlates with the percentage of the index in question in memory. The larger the index, the more difficult it is to 'prep' it into cache, the more unpredictable query exec time gets.
Take a look at the query used to join descriptors and network statuses given some nickname (could be any other criterion, e.g. fingerprint or IP address):
We use the following indexes while executing that query:
* lower(nickname) on descriptor
* (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry
(this one is used to efficiently join descriptor table with statusentry: (fingerprint, descriptor) pair is completely unique in the descriptor table, and it is fairly unique in the statusentry table (whereas a particular fingerprint usually has lots and lots of rows in statusentry));
this index uses only substrings because otherwise, it will hog memory on my remote development machine (not EC2), leaving not much for other indexes; this composite substring index still takes ~2.5GB for status entries (only) in the range between [2010-01; 2013-05] as of now
* validafter on statusentry (the latter *must* stay in memory, as we use it elsewhere as well; for example, when not given a particular search criterion, we want to return a list of status entries (with distinct fingerprints) sorted by consensus validafter in descending order)
We also want to keep a fingerprint index on the descriptor table because we want to be able to search / look up by fingerprint.
I'm thinking of a way to demonstrate the efficiency of having the whole index in memory. For now, let me summarize what I have observed, intersect with what is relevant now: running the aforementioned query on some nickname that we haven't queried for since the last restart of postgresql, it might take, on average, about 1.5 to 3 seconds to execute on EC2, and considerably longer on my development db if it is a truly popular nickname (otherwise, more or less the same amount of time); sometimes a bit longer - up to ~4s (ideally it should be rather uniform since the indexes are *balanced* trees, but.. and autovacuum is enabled.)
Running that same query later on (after we've run other queries after that first one), it will take <= 160ms to execute and return results (this is a conservative number, usually it's much faster (see below)). Running EXPLAIN (ANALYZE, BUFFERS) shows that what happened was that there was no "[disk] read" next to index operations - only "buffer hit". This means that there was no need to read from disk during all the sorting - only when we knew which rows to return did we need to actually read them from disk. (There are some nuances, but at least this will be true for PostgreSQL >= 9.2 [1], which I haven't tried yet - there might be some pleasant surprises re: query time. Last I checked, Debian 9.0 repository contains postgresql 9.1.9.)
Observe:
1a. Run that query looking for 'moria2' for the first time since postgresql restart - relay is an old one, only one distinct fingerprint, relatively few status entries:
http://sprunge.us/cEGh
2a. Run query on 'gabelmoo' (a ton of status entries) for the first time (development machine, query time is rather insane indeed):
http://sprunge.us/fQEK
PostgresSQL is rather clever: it will keep the parts of indexes more often used in cache. What pg_prewarm simply does is:
* load all (or critical for us) indexes to memory (and load them whole), which is possible given a large enough cache;
* load them to shared_buffers, which is an internal pgsql cache. pgsql uses two types of cache - its internal one (the memory for which is preallocated on daemon start), and OS cache - the latter is faster, but pgsql has not much weight in determining what gets kept there, and what gets flushed out.
If the shared_buffers value is large enough (say, 8GB) and all our relevant indexes fit into it nicely, postgresql does not flush them out from its internal cache - there's no need for it.
The cached indexes make a lot of sense when we make very non-discriminatory queries. What I mean is, if we query for status entries WHERE lower(nickname) = 'default' or 'unnamed', lots of index scanning will take place. EC2 executes such queries very nicely if the part of lower(nickname) index for, e.g., 'default' is loaded into cache. For that, we need enough memory and we need the indexes to be loaded into it. It doesn't happen automatically.
It might be possible to manually prepare the indexes by extracting, for example, the most common nicknames, and then simply running queries on them, thus pushing those parts of indexes into memory. As far as I understand, however, there is no guarantee that those parts will be put into the internal shared_buffers cache, which means that they might get pushed out of cache by the underlying OS depending on $some_uncontrollable_factor. This is my present understanding, but I can look into this more if there is a need for it.
Regarding the particular query at hand, I will agree it does look hairy. See the comments there - I can try to rephrase them and format them into a more readable form maybe? Looking up a particular nickname on the statusentry table is an expensive operation because, well, there are so many rows featuring it (usually). The descriptor table is much smaller. Removing the inner DISTINCT ON (descriptor.fingerprint) will result in a less hairy-looking EXPLAIN, but will take considerably longer to execute. So first, we narrow down the fingerprints using the descriptor table, extract a list of unique fingerprints (this particular operation is not expensive), then get a list of status entries featuring (fingerprint, descriptor) pairs, and again extract unique fingerprints (there multiple status entries to a particular (fp, descriptor) pair of course), sorting by validafter desc.
Regarding installing the extension in question / preparing a debian instance for this whole task, from a sysadmin perspective, the whole preparation can be reduced to a single chef script / makefile / whatnot; it will be a single command. But I understand that using additional things means an increased potential edge case and maintenance surface, so to speak. Note that the patch for pg_prewarm in question [2] is rather clean extension in the sense that it doesn't patch 'into' any postgresql core files; it creates new files and adds itself into an extension listing; standard stuff for extensions. Still, if this is something that sounds too risky (re: future integration/support/maintenance), I completely understand.
I plan to run some stuff on EC2 again soon, and install pg_prewarm there, and record query times, EXPLAIN BUFFERS, etc. Let me know what I should do to further prove/disprove my point.
> Overall, I'm spending a bit too much time on a specific problem, but at
> least I have a more intimate lower-level knowledge of PostgreSQL, which
> turns out to be very relevant to this project. I hope to be able to soon
> move to extending Onionoo support and providing a clean API for getting
> lists of consensuses in which a particular relay was present. And maybe
> start with the frontend. :)
Sounds like a plan, except for the frontend part. Scratch that, I'd
say, and focus on the API. Making it shiny can come once everything
works as expected.
Understood and makes sense, OK. I won't start thinking about the frontend until there's a jolly awesome backend running.
Cheers
Kostas.