On Tue, Aug 13, 2013 at 2:15 PM, Karsten Loesing <span dir="ltr"><<a href="mailto:karsten@torproject.org" target="_blank">karsten@torproject.org</a>></span> wrote:<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


<span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">I suggest putting pg_prewarm on the future work list.  I sense there's a<br></span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">lot of unused potential in stock PostgreSQL.  Tweaking the database at<br>


</span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">this point has the word "premature optimization" written on it in big<br></span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">letters for me.</span> </blockquote>


<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><br style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">


<span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">Also, to be very clear here, a tool that requires custom tweaks to<br></span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">PostgreSQL has minimal chances of running on </span><a href="http://torproject.org/" target="_blank" style="color:rgb(17,85,204);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">torproject.org</a><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)"> machines in<br>


</span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">the future.  The current plan is that we'll have a dedicated database<br></span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">machine operated by our sysadmins that not even the service operator<br>


</span><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:13px;background-color:rgb(255,255,255)">will have shell access to.</span></blockquote><div><br></div><div>Oh, understood then, OK, no extensions (at least) for now.</div>


<div><br></div><div>Apropos: as of my current (limited) understanding, it might be difficult to support, for example, nickname sub-string searches without a (supported, official) extension. One such extension is pg_trgm [1], which is in the contrib/ directory in 9.1, and is just one make install away. But for now, I'll assume this is not possible / we should avoid this.</div>


<div><br><div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">So, why do you join descriptors and network statuses in the search</div>


process?  At the Munich dev meeting I suggested joining the tables<br>

already in the import process.  What do you think about that idea?</blockquote><div><br></div><div>Yes, I had made a half-hearted attempt to normalize the two tables some time ago, for a small amount of descriptors and status entries; I'll be trying out this scheme in full (will need to re-import a major part of the data (which I didn't do then) to be able to see if it scales well) after I try something else. (Namely, using a third table of unique fingerprints (the statusentry table currently holds ~170K unique fingerprints vs. ~67M rows in total) and (non-unique) nicknames for truly quick fingerprint lookup and nickname search; I did experiment with this as well, but I worked with a small subset of overall data in that case, too; and I think I can do a better job now.)</div>


<div><br></div><div>It had seemed to me that the bottleneck was in having to sort a too large number of rows, but now I understand (if only just a bit) more about the 'explain analyze' output to see that the 'Nested Loop' procedure, which is what does the join in the join query discussed, is expensive and is part of the bottleneck so to speak. So I'll look into that after properly benchmarking stuff with the third table. (By the way, for future reference, we do have to test out different ideas on a substantial subset of overall data, as the scale function is not, so to say, linear.) :)</div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="im">

> <a href="https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql" target="_blank">https://github.com/wfn/torsearch/blob/master/misc/nested_join.sql</a><br>

><br>

> We use the following indexes while executing that query:<br>

><br>

>  * lower(nickname) on descriptor<br>

><br>

>  * (substr(fingerprint, 0, 12), substr(lower(digest), 0, 12)) on statusentry<br>

<br>

</div>Using only the first 12 characters sounds like a fine approach to speed<br>

up things.  But why 12?  Why not 10 or 14?  This is probably something<br>

you should annotate as parameter to find a good value for later in the<br>

process.  (I'm not saying that 12 is a bad number.  It's perfectly fine<br>

for now, but it might not be the best number.)<br></blockquote><div><br></div><div>Yes, this is as unscientific as it gets. As of now, we're using a raw SQL query, but I'll be encapuslating them properly soon (so we can easily attach different WHERE clauses, etc.), at which point I'll make it into a parameter. I did do some tests, but nothing extensive; just made sure the indexes can fit into memory whole, which was the main constraint. Will do some tests.</div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Also, would it keep indexes smaller if you took something else than<br>

base16 encoding for fingerprints?  What about base64?  Or is there a<br>

binary type in PostgreSQL that works fine for indexes?<br></blockquote><div><br></div><div>Re: latter, no binary type for B-Trees (which is the default index type in pgsql) as far as I can see. But it's a good idea / approach, so I'll look into it, thanks! On the whole though, as long as all the indexes occupy only a subset of pgsql's internal buffers, there shouldn't be a problem / that's not the problem, afaik. But, if we're making a well-researched ORM/database design, I should look into it.</div>


<div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Do you have a list of searches you're planning to support?</blockquote><div><br></div><div>These are the ones that should *really* be supported: </div>


<div><ul><li>?search=nickname</li><li>?search=fingerprint</li><li>?lookup=fingerprint</li><li>?search=address [done some limited testing, currently not focusing on this]</li><li>?running=<boolean></li><li>?flag=flag [every kind of clause which further narrows down the query is not bad; the current db model supports all the flags that Stem does, and each flag has its own column]</li>


<li>?first_seen_days=range</li><li>?last_seen_days=range</li></ul><div>As per the plan, the db should be able to return a list of status entries / validafter ranges (which can be used in {first,last}_seen_days) given some fingerprint.</div>


</div><div><br></div><div>Thanks for your feedback and reply!</div><div><br></div><div>Kostas.</div><div><br></div><div><br></div><div>[1]: <a href="http://www.postgresql.org/docs/9.1/static/pgtrgm.html">http://www.postgresql.org/docs/9.1/static/pgtrgm.html</a></div>


</div></div></div>