[tor-dev] Searchable metrics archive - Onionoo-like API available online for probing

Kostas Jakeliunas kostas at jakeliunas.com
Mon Sep 2 16:39:56 UTC 2013


On Mon, Sep 2, 2013 at 2:20 PM, Karsten Loesing <karsten at torproject.org> wrote:

> On 8/23/13 3:12 PM, Kostas Jakeliunas wrote:
> > [snip]
>
> Hi Kostas,
>
> I finally managed to test your service and take a look at the
> specification document.


Hey Karsten!

Awesome, thanks a bunch!

> The few tests I tried ran pretty fast!  I didn't hammer the service, so
> maybe there are still bottlenecks that I didn't find.  But AFAICS, you
> did a great job there!
>

Thanks for doing some poking! There is probably room for quite a bit more
parallel load testing, but at least in principle (and from what I've
observed / benchmarked so far), if a single query runs in good time, it's
rather safe to assume that scaling to multiple concurrent queries will not
be a big problem. There's always a limit, of course, which I haven't hit
yet (and which I should, ideally, try to find.) This is one of PostgreSQL's
strengths in any case: it handles many concurrent queries nicely. Of
course, since the queries are more or less always disk-I/O-bound, there
could still be sneaky hidden bottlenecks, that is very true.


> Thanks for writing down the specification.
>
> So, would it be accurate to say that you're mostly not touching summary,
> status, bandwidth, and weights resources, but that you're adding a new
> fifth resource statuses?
>
> In other words, does the attached diagram visualize what you're going to
> add to Onionoo?  Some explanations:
>
> - summary and details documents contain only the last known information
> about a relay or bridge, but those are on a pretty high detail level (at
> least for details documents).  In contrast to the current Onionoo, your
> service returns summary and details documents for relays that didn't run
> in the last week, so basically since 2007.  However, you're not going to
> provide summary or details for arbitrary points in time, right?  (Which
> is okay, I'm just asking if I understood this correctly.)
>

(Nice diagram, by the way; very useful.) Responding to particular points / nuances:

> summary and details documents contain only the last known information
> about a relay or bridge, but those are on a pretty high detail level (at
> least for details documents)


This is true: the summary/details documents (just like in Onionoo proper)
deal with the *last* known info about relays. That is how it works now,
anyway.

As per our subsequent IRC chat, we will now assume this is how it is
intended to be. The way I see it from the perspective of my original
project goals, the summary and details (+ bandwidth and weights) documents
are meant for Onionoo {near-, full-}compatibility; they must stay
Onionoo-like. The new network status document is the "browse the older
archives and extract info" part: it is one of the ways of exposing an
interface to the whole database (after all, we do store all the flags,
nicknames and IP addresses for *all* the network statuses.)

> However, you're not going to
> provide summary or details for arbitrary points in time, right?  (Which
> is okay, I'm just asking if I understood this correctly.)


There is no reason why this wouldn't be possible. (I experimented with new
search parameters, but haven't pushed them to master / changed the backend
instance that is currently running.)

A query involving date ranges could, for example, be something akin to:

"get a listing of details documents for relays which match this $nickname /
$address / $fingerprint, and which have run (been listed in consensuses
dated) from $startDate to $endDate." (This would use the new ?from=..,
?to=.. parameters, which you mentioned / clarified earlier.)

As per our IRC chat, I will add these parameters / query options not only
to the network status document, but also to the summary and details
documents.
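
For example (the parameter names are not final, and the endpoint shape here
just mirrors Onionoo), such a query might look like:

  GET /details?search=$nickname&from=$startDate&to=$endDate
  GET /statuses?search=$address&from=$startDate&to=$endDate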


> - bandwidth and weights documents always contain information covering
> the whole lifetime of a relay or bridge, where recent events have higher
> detail level.  Again, you're not going to change anything here besides
> providing these documents for relays and bridges that are offline for
> more than a week.
>
> - statuses have the same level of detail for any time in the past.
> These documents are new.  They're designed for the relay search service
> and for a simplified version of ExoneraTor (which doesn't care about
> exit policies and doesn't provide original descriptor contents).  There
> are no statuses documents for bridges, right?
>

Yes & yes. No documents for bridges, for now. I'm not sure of the priority
of the task of including bridges - it would sure be awesome to have bridges
as well. For now, I assume that everything else should be finished (the
protocol, the final scalable database schema/setup, etc.) before embarking
on this point.

The status entry API endpoint is indeed about getting info from the whole
archives, at the same detail level for any portion of the archives.

(I should have articulated this / put it into a design doc before, but this
important nuance is still fresh in my mind. It seems that now it's all
finally falling into place (including in my mind.))

> [The new network status documents are] designed for the relay search service
> and for a simplified version of ExoneraTor (which doesn't care about
> exit policies and doesn't provide original descriptor contents).
>

By the way, just as a general note, it is always possible to reconstruct
any descriptor, and any network status entry, in principle. I point this
out because, for one, I recall Damian mentioning that it would be nice if
the torsearch system could be used as part of other apps - it would be able
to reconstruct original Stem instances/objects for any descriptor / network
status entry in question. (The focus for now, though, is the Onionoo
protocol and the database, of course.)


> If this is correct (and please tell me if it's not), this seems like a
> plausible extension of Onionoo.
>

Thanks for taking a close look at the protocol description, and thanks for
the feedback - everything is correct as far as I can see!


> A few ideas on statuses documents: how about you change the format of
> statuses, so that there's no more one document per relay and valid-after
> time, but exactly one document per relay?  That document could then
> contain an array of status objects saying when the relay was contained
> in the network status, together with information about its addresses.
>

This makes a lot of sense. (I've been juggling these ideas as well, but at
the end of the day couldn't settle on one - so I will do this instead.)

The nickname for a given relay (identified by a fingerprint) can change
over time as well. So each status object would ideally include the date(s)
when the relay was contained in a network status / consensus, its
addresses, and its nickname. (This is where a listing of flags would go as
well, I suppose.) I think that would make sense?

Since we know that there will only be one relay document, its fields could
be made top-level - so not {relays: [ {"fingerprint" : "$fingerprint",
..., "entries": [ { ... }, { ... }, ... ]} ]}, but rather (hopefully the
indentation doesn't get garbled):

{
  "fingerprint": "$fingerprint",
  ... # first_seen, last_seen, for example
  "entries": [
    { ... },
    { ... },
    ...
  ]
}
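
To sketch it out a bit further (the field names here are just illustrative,
nothing final), a single item in "entries" could then carry the
per-consensus info mentioned above:

  {
    "valid_after": "2013-08-15 12:00:00",
    "nickname": "$nickname",
    "addresses": ["$address"],
    "flags": ["Running", "Fast", "Valid"]
  }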


> It might be useful to group consecutive valid-after times when all
> addresses and other relevant information about a relay stayed the same.
>  So, rather than adding "valid_after", put in "valid_after_from" and
> "valid_after_to".


Yes, I thought about this as well! This would be ideal. It would, I think,
tie in directly with your next point, that we

> [...] could even generate these statuses documents in advance once
> per hour and store them as JSON documents in the database, similar to
> what's the plan for the other document types?  That might reduce
> database load a lot, though you'll still need most of your database foo
> for the search part.
>

Some kind of caching at some level would inevitably be needed.
Pre-processing / preparing the JSON documents (the way Onionoo does it, I
suppose) makes sense.

I'm not sure about the scale, however. Ideally torsearch would be able to
keep track of outdated JSON documents / which ones need regenerating. For
reference, there are already around 170K unique fingerprints in the current
online database.

I'll think about this. Lots of things can be done at the Postgres level
(you're probably thinking about this as well.)

Also:

If it were OK (it would be a bit queer maybe) to involve result pagination
at this level as well, the API could be told to, say,

"group the last `min(limit, UPPER_LIMIT)` [e.g. 500] status entries for
this fingerprint into a status object / valid-after range summary." =>
produce status entry objects, each featuring addresses, nickname,
valid_after_from, and valid_after_to.

As a rule of thumb, the count of status objects returned would be (much)
less than (say) 500, of course. A client would then append the parameters
?offset=500[&limit=500] (or whatnot) to get a status entry summary (a
summary in the sense that it does not reduce the amount of actually useful
information returned) for the next 500 network statuses of this relay.
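
As a rough sketch (using an Onionoo-style lookup-by-fingerprint; all
parameter and field names still tentative), the exchange could look like:

  GET /statuses?lookup=$fingerprint&limit=500
  GET /statuses?lookup=$fingerprint&offset=500&limit=500

with each grouped status object in the response being something along the
lines of:

  {
    "valid_after_from": "2013-06-01 00:00:00",
    "valid_after_to": "2013-06-14 23:00:00",
    "nickname": "$nickname",
    "addresses": ["$address"]
  }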

It would be great if this kind of protocol / querying approach made sense.
But if it's a bit strange / suboptimal (from the perspective of a client
querying the DB), let me know.

> And maybe you can compress information even more by
> putting all relevant IP addresses in a list and refer to them by list
> index.  Compare this to bandwidth and weights documents which are
> optimized for size, too.


Yeah, this would be great, actually. I'll think about all of these and
about the practical caching / JSON document generation options. I'm unsure
of the scope (it's definitely doable in the end), but I hope to be able to
accomplish all this. Might follow up later on / tomorrow, etc.
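
For instance (purely illustrative field names again), the per-relay
document could keep a single list of addresses and have the entries refer
back to it by index:

  {
    "fingerprint": "$fingerprint",
    "addresses": ["$address1", "$address2"],
    "entries": [
      {"valid_after_from": "...", "valid_after_to": "...", "address_index": 0},
      {"valid_after_from": "...", "valid_after_to": "...", "address_index": 1}
    ]
  }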


> Happy to chat more about these ideas on IRC.
>
> > Please report any inconsistencies / errors / time-outs / anything that
> > takes a few seconds or more to execute. I'm logging the queries (together
> > with IP addresses for now - for shame!), so will be able to later
> > correlate
> > activity with database load, which will hopefully provide some realistic
> > semi-benchmark-like data.
>
> I could imagine that you'll get more testers if you provide instructions
> for using your service as relay search or ExoneraTor replacement.  Maybe
> you could write down the five most common searches that people could
> perform to search for a relay or find out whether an IP address was a
> Tor relay at a given time?  If you want, I can link to such a page from
> the relay search and the ExoneraTor page.
>

Indeed, I had been thinking lately that it should be made more explicit
that, for example, the present system already covers the ExoneraTor use
cases, and so on. I was planning to eventually write up something of the
kind (with lots of examples and clearly articulated use cases, etc.), but
maybe I should do this sooner. OK.
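
A first stab at that list could be a handful of example queries, e.g.
(parameter names as above, still subject to change):

  "was this IP address a Tor relay around a given date?"
    GET /statuses?search=$address&from=$startDate&to=$endDate

  "which relays have ever used this nickname?"
    GET /details?search=$nickname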

I also already have a way of constantly updating the database (using cron
-> rsync & torsearch import), but it's a bit of a hack, still. Hopefully
soon I will ramp up the DB to actually have the latest consensuses in
Reality(tm).

Once I have the latter running nicely,

> If you want, I can link to such a page from
> the relay search and the ExoneraTor page.

we can think of doing this!


> All in all, great work!  Nice!
>
> Thanks,
> Karsten
>

Thanks for your as always great feedback, Karsten :)

Kostas.

