<div dir="ltr"><div class="gmail_extra"><div><div dir="ltr"><div>On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing <span dir="ltr">&lt;<a href="mailto:karsten@torproject.org" target="_blank">karsten@torproject.org</a>&gt;</span> wrote:<br>


</div></div></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex"><div class="im">


On 8/23/13 3:12 PM, Kostas Jakeliunas wrote:<br>

> [snip]</div><div class="im">

<br>

</div>Hi Kostas,<br>

<br>

I finally managed to test your service and take a look at the<br>

specification document.</blockquote><div><div><br></div><div>Hey Karsten!</div><div><br></div><div>Awesome, thanks a bunch!</div><div class="gmail_extra"><div dir="ltr"><div><br></div></div></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


The few tests I tried ran pretty fast! I didn't hammer the service, so<br>

maybe there are still bottlenecks that I didn't find. But AFAICS, you<br>

did a great job there!<br></blockquote><div><br></div><div><div>Thanks for doing some poking! There is probably room for quite a bit more concurrent-load benchmarking (not sure of the term), but at least in principle (and from what I've observed / benchmarked so far), if a single query runs in good time, it's rather safe to assume that scaling to multiple simultaneous queries will not be a big problem. There's always a limit of course, which I haven't yet observed (and which I would do well to find, ideally). This is, in any case, one of the strengths of PostgreSQL: it scales very nicely with concurrent queries. Of course, since the queries are more or less always disk I/O-bound, there could still be sneaky hidden bottlenecks; that is very true.</div>


<div class="gmail_extra"><div dir="ltr"></div></div></div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


Thanks for writing down the specification.<br>

<br>

So, would it be accurate to say that you're mostly not touching summary,<br>

status, bandwidth, and weights resources, but that you're adding a new<br>

fifth resource statuses?<br>

<br>

In other words, does the attached diagram visualize what you're going to<br>

add to Onionoo? Some explanations:<br>

<br>

- summary and details documents contain only the last known information<br>

about a relay or bridge, but those are on a pretty high detail level (at<br>

least for details documents). In contrast to the current Onionoo, your<br>

service returns summary and details documents for relays that didn't run<br>

in the last week, so basically since 2007. However, you're not going to<br>

provide summary or details for arbitrary points in time, right? (Which<br>

is okay, I'm just asking if I understood this correctly.)<br></blockquote><div><br></div><div>(Nice diagram, useful!) Responding to particular points / nuances:</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


summary and details documents contain only the last known information<br>about a relay or bridge, but those are on a pretty high detail level (at<br>least for details documents)</blockquote><div><br></div><div>This is true: the summary/details documents (just like in Onionoo proper) deal with the *last* known info about relays. That is how it works now, anyway.</div>


<div><br></div><div>As per our subsequent IRC chat, we will now assume this is how it is intended to be. The way I see it from the perspective of my original project goals etc., the summary and details (+ bandwidth and weights) documents are meant for Onionoo {near-, full-}compatibility; they must stay Onionoo-like. The new network status document is the "olden archive browse and info extract" part: it is one of the ways of exposing an interface to the whole database (after all, we do store all the flags and nicknames and IP addresses for *all* the network statuses.)</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">However, you're not going to<br>provide summary or details for arbitrary points in time, right? (Which<br>


is okay, I'm just asking if I understood this correctly.)</blockquote><div><br></div><div>There is no reason why this wouldn't be possible. (I experimented with new search parameters, but haven't pushed them to master / changed the backend instance that is currently running.)</div>


<div>&nbsp;</div><div>A query involving date ranges could, for example, be something akin to:</div><div><br></div><div>"get a listing of details documents for relays which match this $nickname / $address / $fingerprint, and which have run (been listed in consensuses dated) from $startDate to $endDate." (This would use the new ?from=.., ?to=.. parameters, which you've mentioned / clarified earlier.)</div>


<div><br></div><div>As per our IRC chat, I will add these parameters / query options not only to the network status document, but also to the summary and details documents.</div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


- bandwidth and weights documents always contain information covering<br>

the whole lifetime of a relay or bridge, where recent events have higher<br>

detail level. Again, you're not going to change anything here besides<br>

providing these documents for relays and bridges that are offline for<br>

more than a week.<br>

<br>

- statuses have the same level of detail for any time in the past.<br>

These documents are new. They're designed for the relay search service<br>

and for a simplified version of ExoneraTor (which doesn't care about<br>

exit policies and doesn't provide original descriptor contents). There<br>

are no statuses documents for bridges, right?<br></blockquote><div><br></div><div>Yes &amp; yes. No documents for bridges, for now. I'm not sure how high a priority including bridges is; it would sure be awesome to have them as well. For now, I assume that everything else should be finished (the protocol, the final scalable database schema/setup, etc.) before embarking on this.</div>


<div><br></div><div>The status entry API endpoint is indeed about getting info from the whole archives, at the same detail level for any portion of the archives.</div><div><br></div><div>(I should have articulated this / put it into a design doc before, but this important nuance is still fresh in my mind. It seems that now it's all finally falling into place (including in my mind).)</div>


<div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">[The new network status documents are] designed for the relay search service<br>


and for a simplified version of ExoneraTor (which doesn't care about<br>exit policies and doesn't provide original descriptor contents).<br></blockquote><div><br></div><div>By the way, just as a general note, it is always possible to reconstruct any descriptor, and any network status entry, in principle. I point this out because, for one, I recall Damian mentioning that it would be nice if the torsearch system could be used as part of other apps - it would be able to reconstruct original Stem instances/objects for any descriptor / network status entry in question. (The focus for now, though, is Onionoo and database, of course.)</div>


<div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

If this is correct (and please tell me if it's not), this seems like a<br>

plausible extension of Onionoo.<br></blockquote><div><br></div><div>Thanks for taking a close look at the protocol description and thanks for the feedback; everything is correct as far as I can see!</div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


A few ideas on statuses documents: how about you change the format of<br>

statuses, so that there's no more one document per relay and valid-after<br>

time, but exactly one document per relay? That document could then<br>

contain an array of status objects saying when the relay was contained<br>

in the network status, together with information about its addresses.<br></blockquote><div><br></div><div>This makes a lot of sense. (I've been juggling these ideas as well, but at the end of the day I wasn't sure which way to go, so I will do this instead.)</div>


<div><br></div><div>The nickname for a given relay (identified by a fingerprint) can change through time as well. So the status object would ideally include the date of containment in network status / consensus, addresses, and nickname. (This is where a listing of flags would go in as well, I suppose.) I think that would make sense?</div>


<div><br></div><div>Since we know that there will only be one relay document, its fields could be made top-level (so not {relays: [ {"fingerprint" : "$fingerprint", ..., "entries": [ { ... }, { ... }, ... ]} ]} but rather (hopefully the indentation is not garbled):</div>


<div><br></div><div>{</div><div>&nbsp;&nbsp;"fingerprint": "$fingerprint",</div><div>&nbsp;&nbsp;... # first_seen, last_seen, for example</div><div>&nbsp;&nbsp;"entries": [</div><div>&nbsp;&nbsp;&nbsp;&nbsp;{ ... },</div><div>&nbsp;&nbsp;&nbsp;&nbsp;{ ... },</div>


<div>&nbsp;&nbsp;&nbsp;&nbsp;...</div><div>&nbsp;&nbsp;]</div><div>}</div><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

It might be useful to group consecutive valid-after times when all<br>

addresses and other relevant information about a relay stayed the same.<br>

So, rather than adding "valid_after", put in "valid_after_from" and<br>

"valid_after_to".</blockquote><div>&nbsp;</div><div>Yes, thought about this as well! This would be ideal. It would indeed, I think, require that we</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


[...] could even generate these statuses documents in advance once</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


per hour and store them as JSON documents in the database, similar to<br>

what's the plan for the other document types? That might reduce<br>

database load a lot, though you'll still need most of your database foo<br>

for the search part.<br></blockquote><div><br></div><div>Some kind of caching at some level would be needed for sure, inevitably. Preprocessing/preparing JSON documents (the way Onionoo does it, I suppose) makes sense.</div>


<div><br></div><div>I'm not sure of scale, however. Ideally torsearch would be able to keep track of outdated JSON documents / which ones need changing. Again, there already are around 170K unique fingerprints in the current online database as of now.</div>


<div><br></div><div>I'll think about this. Lots of things can be done at the PostgreSQL level (you're probably thinking about this as well).</div><div><br></div><div>Also:</div><div><br></div><div>If it was OK (it would be a bit odd, maybe) to involve result pagination at this level as well, the API could be told to, say,</div>


<div><br></div><div>"group the last `min(limit, UPPER_LIMIT)` [e.g. 500] status entries for this fingerprint into a status object / valid-after range summary." => produce status entry objects, each featuring addresses, nickname, valid_after_from, and valid_after_to.</div>


<div><br></div><div>As a rule of thumb, the count of status objects returned would be (much) less than (say) 500, of course. A client would then append the parameters ?offset=500[&amp;limit=500] (or whatnot) to get a status entry summary (a summary in the sense that it does not reduce the amount of actual useful information returned) for the next 500 network statuses of this relay.</div>
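<div><br></div><div>A minimal sketch of how such grouping with limit/offset might look, in Python. This is not torsearch code: the function name group_statuses, the UPPER_LIMIT value, the input layout (a newest-first list of per-consensus entries), and the "one hour apart means consecutive" check are all assumptions made for illustration.</div>

```python
from datetime import datetime, timedelta

UPPER_LIMIT = 500  # hypothetical server-side cap on entries per request


def group_statuses(entries, limit=UPPER_LIMIT, offset=0):
    """Collapse per-consensus entries into valid-after ranges.

    `entries` is a list of dicts with "valid_after" (datetime), "nickname"
    and "addresses", sorted newest first. Only the page selected by
    offset/limit is grouped.
    """
    page = entries[offset:offset + min(limit, UPPER_LIMIT)]
    ranges = []
    for entry in reversed(page):  # walk the page oldest to newest
        key = (entry["nickname"], tuple(entry["addresses"]))
        last = ranges[-1] if ranges else None
        # Assumption: consensuses are hourly, so "consecutive" means
        # exactly one hour after the end of the current range.
        consecutive = (
            last is not None
            and last["key"] == key
            and entry["valid_after"] - last["valid_after_to"] == timedelta(hours=1)
        )
        if consecutive:
            last["valid_after_to"] = entry["valid_after"]
        else:
            ranges.append({
                "key": key,
                "nickname": entry["nickname"],
                "addresses": list(entry["addresses"]),
                "valid_after_from": entry["valid_after"],
                "valid_after_to": entry["valid_after"],
            })
    for r in ranges:
        del r["key"]  # internal comparison key, not part of the output
    return ranges
```

<div>With this shape, a page of 500 entries for a relay that never changed its nickname or addresses collapses into a single range object, and a request with offset=500 would group the next 500 entries independently of the first page.</div>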


<div><br></div><div>It would be great if this kind of protocol querying approach made sense. But if it's a bit strange / suboptimal (from the perspective of a client querying the DB), let me know.</div><div><br></div>

<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

And maybe you can compress information even more by<br>putting all relevant IP addresses in a list and refer to them by list<br>index. Compare this to bandwidth and weights documents which are<br>optimized for size, too.</blockquote>


<div><br></div><div>Yeah, this would be great, actually. I'll think about all of these &amp; about practical caching / JSON document generation options. I'm unsure of feasibility (it's definitely doable in the end, but I'm not sure of the scope), but I hope to be able to accomplish all this. Might follow up later on / tomorrow, etc.</div>
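<div><br></div><div>To make the address-list idea concrete, here is a rough Python sketch. It is only an illustration under assumptions: the function name compress_addresses and the field name "address_refs" are hypothetical, and the input is assumed to be a list of status range objects that each carry an "addresses" list.</div>

```python
def compress_addresses(status_ranges):
    """Replace each range's address list with indexes into a shared list."""
    addresses = []   # unique addresses, in order of first appearance
    index_of = {}    # address to its position in `addresses`
    entries = []
    for r in status_ranges:
        refs = []
        for addr in r["addresses"]:
            if addr not in index_of:
                index_of[addr] = len(addresses)
                addresses.append(addr)
            refs.append(index_of[addr])
        # Copy every field except the full address list.
        compact = {k: v for k, v in r.items() if k != "addresses"}
        compact["address_refs"] = refs  # hypothetical field name
        entries.append(compact)
    return {"addresses": addresses, "entries": entries}
```

<div>Each address string is then stored once per document, and a relay that alternates between a few addresses only repeats small integers in its entries.</div>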


<div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">

Happy to chat more about these ideas on IRC.<br>

<div class="im"><br>

> Please report any inconsistencies / errors / time-outs / anything that<br>

> takes a few seconds or more to execute. I'm logging the queries (together<br>

> with IP addresses for now - for shame!), so will be able to later correlate<br>

> activity with database load, which will hopefully provide some realistic<br>

> semi-benchmark-like data.<br>

<br>

</div>I could imagine that you'll get more testers if you provide instructions<br>

for using your service as relay search or ExoneraTor replacement. Maybe<br>

you could write down the five most common searches that people could<br>

perform to search for a relay or find out whether an IP address was a<br>

Tor relay at a given time? If you want, I can link to such a page from<br>

the relay search and the ExoneraTor page.<br></blockquote><div><br></div><div>Indeed, I was thinking lately that it should be made more explicit that, for example, this present system already encompasses ExoneraTor use cases, and so on. I was planning to eventually write up something of the kind (with lots of examples and clearly articulated use cases, etc.) of course, but maybe I should do this sooner. OK.</div>


<div><br></div><div>I also already have a way of constantly updating the database (using cron -> rsync & torsearch import), but it's a bit of a hack, still. Hopefully soon I will ramp up the DB to actually have the latest consensuses in Reality(tm).</div>


<div><br></div><div>Once I have the latter running nicely,</div><div><br></div><div>> If you want, I can link to such a page from</div>> the relay search and the ExoneraTor page.</div><div class="gmail_quote"><br></div>


<div class="gmail_quote">we can think of doing this!<br><div>&nbsp;</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-color:rgb(204,204,204);border-left-style:solid;padding-left:1ex">


All in all, great work! Nice!<br>

<br>

Thanks,<br>

Karsten<br></blockquote><div><br></div><div>Thanks for your great feedback, as always, Karsten :)</div><div><br></div><div>Kostas.</div></div></div></div>