[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Kostas Jakeliunas kostas at jakeliunas.com
Thu Aug 15 19:50:59 UTC 2013

On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing <karsten at torproject.org>wrote:

> Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
> likely that we can run something requiring this extension on a
> torproject.org machine.  Still, requiring extensions should be the last
> resort if no other solution can be found.  Leaving out searches for
> nickname substrings is a valid solution for now.

Got it.

 >> Do you have a list of searches you're planning to support?
> >
> >
> > These are the ones that should *really* be supported:
> >
> >    - ?search=nickname
> >    - ?search=fingerprint
> >    - ?lookup=fingerprint
> >    - ?search=address [done some limited testing, currently not focusing
> on
> >    this]
> The lookup parameter is basically the same as search=fingerprint with
> the additional requirement that fingerprint must be 40 characters long.
>  So, this is the current search parameter.
> I agree, these would be good to support.
> You might also add another parameter ?address=address for ExoneraTor.
> That should, in theory, be just a subset of the search parameter.

Oh yes, makes a lot of sense, OK.

By the way: I considered having the last consensus (all the data for at
least the /summary document, or /details as well) be stored in memory (this
is possible) (probably as a hashtable where key = fingerprint, value = all
the fields we'd need to return) so that when the backend is queried without
any search criteria, it would be possible to avoid hitting the database
(which is always nice), and just dump the last consensus. (There's also
caching of course, which we could discuss at a (probably quite a bit) later

> >    - ?running=<boolean>
> This one is tricky.  So far, Onionoo looks only at the very latest
> consensus or bridge status to decide if a relay or bridge is running or
> not.
> But now you're adding archives to Onionoo, so that people can search for
> a certain consensus or certain bridge status in the past, or they can
> search for a time interval of consensuses or bridge statuses.  How do
> you define that a relay or bridge is running, or more importantly
> included as not running?

Agree, this is not clear. (And whatever ends up being done, this should be
well documented and clearly articulated (of course.))

For me at least, 'running' implies the clause whether a given relay/bridge
is running *right now*, i.e. whether it is present in the very last
consensus. (Here's where that hashtable (with fingerprints as keys) in
memory might be able to help: no need to run a separate query / do an inner
join / whatnot; it would depend on whether there's a LIMIT involved though,

I'm not sure which one is more useful (intuitively for me, the "whether it
is running *right now*" is more useful.) Do you mean that it might make
sense to have a field (or have "running" be it) indicating whether a given
relay/bridge was present in the last consensus in the specified date range?
If this is what you meant, then the "return all that are/were not running"
clause would indeed be kind of..peculiar (semantically - it wouldn't be
very obvious what's it doing.)

Maybe it'd be simpler to first answer, what would be the most useful case?

> How do you define that a relay or bridge [should be] included as not

Could you rephrase maybe? Do you mean that it might be difficult to
construct sane queries to check for this condition? Or that the situation

   - a "from..to" date range is specified
   - ?running=false is specified

would be rather confusing ('exclude those nodes which are running *right
now* ('now' possibly having nothing to do with the date range)?

 >    - ?flag=flag [every kind of clause which further narrows down the
> query
> >    is not bad; the current db model supports all the flags that Stem
> does, and
> >    each flag has its own column]
> I'd say leave this one out until there's an actual use case.

Ok, I won't focus on these now; just wanted to say that these should be
possible without much ado/problems.

> >    - ?first_seen_days=range
> >    - ?last_seen_days=range
> >
> > As per the plan, the db should be able to return a list of status
> entries /
> > validafter ranges (which can be used in {first,last}_seen_days) given
> some
> > fingerprint.
> Oh, I think there's a misunderstanding of these two fields.  These
> fields are only there to search for relays or bridges that have first
> appeared or were last seen on a given day.
> You'll need two new parameters, say, from=datetime and to=datetime (or
> start=datetime and end=datetime) to define a valid-after range for your
> search.

Ah! I wasn't paying attention here. :) Ok, all good.

Thanks as always!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20130815/7478623d/attachment.html>

More information about the tor-dev mailing list