[tor-dev] [GSoC 2013] Status report - Searchable metrics archive

Karsten Loesing karsten at torproject.org
Mon Aug 19 12:49:50 UTC 2013


Hi Kostas,

On 8/15/13 9:50 PM, Kostas Jakeliunas wrote:
> On Wed, Aug 14, 2013 at 1:33 PM, Karsten Loesing <karsten at torproject.org> wrote:
> 
>>
>> Looks like pg_trgm is contained in postgresql-contrib-9.1, so it's more
>> likely that we can run something requiring this extension on a
>> torproject.org machine.  Still, requiring extensions should be the last
>> resort if no other solution can be found.  Leaving out searches for
>> nickname substrings is a valid solution for now.
> 
> 
> Got it.
> 
>>>> Do you have a list of searches you're planning to support?
>>>
>>>
>>> These are the ones that should *really* be supported:
>>>
>>>    - ?search=nickname
>>>    - ?search=fingerprint
>>>    - ?lookup=fingerprint
>>>    - ?search=address [done some limited testing, currently not focusing on
>>>    this]
>>
>> The lookup parameter is basically the same as search=fingerprint with
>> the additional requirement that fingerprint must be 40 characters long.
>>  So, this is the current search parameter.
>>
>> I agree, these would be good to support.
>>
>> You might also add another parameter ?address=address for ExoneraTor.
>> That should, in theory, be just a subset of the search parameter.
>>
> 
> Oh yes, makes a lot of sense, OK.
> 
> By the way: I considered having the last consensus (all the data for at
> least the /summary document, or /details as well) be stored in memory (this
> is possible) (probably as a hashtable where key = fingerprint, value = all
> the fields we'd need to return) so that when the backend is queried without
> any search criteria, it would be possible to avoid hitting the database
> (which is always nice), and just dump the last consensus. (There's also
> caching of course, which we could discuss at a (probably quite a bit) later
> point.)

Okay.
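
For illustration, the in-memory idea quoted above could look roughly like
the following. This is only a sketch under assumptions: the function names
(build_last_consensus, summary), the entry fields, and the query_db callback
are hypothetical and not part of the actual backend.

```python
from typing import Callable, Dict, List, Optional


def build_last_consensus(entries: List[dict]) -> Dict[str, dict]:
    """Index the entries of the most recent consensus by fingerprint.

    Each entry is assumed (hypothetically) to be a dict that carries at
    least a "fingerprint" key plus whatever fields /summary would return.
    """
    return {entry["fingerprint"].upper(): entry for entry in entries}


def summary(
    cache: Dict[str, dict],
    search: Optional[str] = None,
    query_db: Optional[Callable[[str], List[dict]]] = None,
) -> List[dict]:
    """Answer a /summary request.

    Without search criteria, dump the cached last consensus and never
    touch the database. With a search term, fall through to the
    (hypothetical) database query function.
    """
    if search is None:
        return sorted(cache.values(), key=lambda entry: entry["fingerprint"])
    if query_db is None:
        raise ValueError("a search term needs a database query function")
    return query_db(search)


if __name__ == "__main__":
    cache = build_last_consensus([
        {"fingerprint": "A" * 40, "nickname": "relayone"},
        {"fingerprint": "B" * 40, "nickname": "relaytwo"},
    ])
    for entry in summary(cache):
        print(entry["nickname"])
```

The cache would have to be rebuilt whenever a newer consensus is imported;
that part is left out here.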

>>>    - ?running=<boolean>
>>
>> This one is tricky.  So far, Onionoo looks only at the very latest
>> consensus or bridge status to decide if a relay or bridge is running or
>> not.
>>
>> But now you're adding archives to Onionoo, so that people can search for
>> a certain consensus or certain bridge status in the past, or they can
>> search for a time interval of consensuses or bridge statuses.  How do
>> you define that a relay or bridge is running, or more importantly
>> included as not running?
>>
> 
> Agree, this is not clear. (And whatever ends up being done, this should be
> well documented and clearly articulated (of course.))
> 
> For me at least, 'running' means whether a given relay/bridge
> is running *right now*, i.e. whether it is present in the very last
> consensus. (Here's where that hashtable (with fingerprints as keys) in
> memory might be able to help: no need to run a separate query / do an inner
> join / whatnot; it would depend on whether there's a LIMIT involved though,
> etc.)
> 
> I'm not sure which one is more useful (intuitively for me, the "whether it
> is running *right now*" is more useful.) Do you mean that it might make
> sense to have a field (or have "running" be it) indicating whether a given
> relay/bridge was present in the last consensus in the specified date range?
> If this is what you meant, then the "return all that are/were not running"
> clause would indeed be kind of... peculiar (semantically, it wouldn't be
> very obvious what it's doing.)
> 
> Maybe it'd be simpler to first answer, what would be the most useful case?
> 
>> How do you define that a relay or bridge [should be] included as not
>> running?
> 
> Could you rephrase maybe? Do you mean that it might be difficult to
> construct sane queries to check for this condition? Or that the situation
> where
> 
>    - a "from..to" date range is specified
>    - ?running=false is specified
> 
> would be rather confusing ('exclude those nodes which are running *right
> now*', with 'now' possibly having nothing to do with the date range)?

I was referring to the situation you describe.  But yes, I agree that
your definition of whether a relay or bridge is running *right now* can
work here.  So, never mind my question/concern, this looks fine!
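
A minimal sketch of that definition, assuming "running" always means
"present in the very last consensus", independent of any from..to range
used to select the results. The function name and the shape of the result
entries are hypothetical.

```python
from typing import Iterable, List, Optional, Set


def filter_running(
    results: Iterable[dict],
    last_consensus_fingerprints: Set[str],
    running: Optional[bool] = None,
) -> List[dict]:
    """Apply the ?running=<boolean> parameter to already-selected results.

    results: entries matched by the other criteria (possibly from a past
        date range); each is assumed to carry a "fingerprint" key.
    last_consensus_fingerprints: fingerprints listed in the very last
        consensus, which is what "running right now" is taken to mean.
    running: None leaves the results untouched; True/False keeps only
        entries that are/are not in the last consensus.
    """
    if running is None:
        return list(results)
    return [
        entry
        for entry in results
        if (entry["fingerprint"] in last_consensus_fingerprints) == running
    ]


if __name__ == "__main__":
    latest = {"A" * 40}
    past_results = [{"fingerprint": "A" * 40}, {"fingerprint": "B" * 40}]
    # Relays seen in the past range that are not running right now.
    print(filter_running(past_results, latest, running=False))
```

With a set (or the fingerprint-keyed hashtable) held in memory, the check
is a plain membership test and needs no extra query or join.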

>>>    - ?flag=flag [every kind of clause which further narrows down the query
>>>    is not bad; the current db model supports all the flags that Stem does, and
>>>    each flag has its own column]
>>
>> I'd say leave this one out until there's an actual use case.
>>
> 
> Ok, I won't focus on these now; just wanted to say that these should be
> possible without much ado/problems.

Okay.

>>>    - ?first_seen_days=range
>>>    - ?last_seen_days=range
>>>
>>> As per the plan, the db should be able to return a list of status entries /
>>> validafter ranges (which can be used in {first,last}_seen_days) given some
>>> fingerprint.
>>
>> Oh, I think there's a misunderstanding of these two fields.  These
>> fields are only there to search for relays or bridges that have first
>> appeared or were last seen on a given day.
>>
>> You'll need two new parameters, say, from=datetime and to=datetime (or
>> start=datetime and end=datetime) to define a valid-after range for your
>> search.
>>
> 
> Ah! I wasn't paying attention here. :) Ok, all good.

Okay.
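
To make the from/to suggestion concrete, here is a rough sketch of turning
the two parameters into a valid-after range. The parameter names follow the
suggestion above; the "YYYY-MM-DD hh:mm:ss" datetime format and the function
names are assumptions, since the format has not been decided in this thread.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

# Assumed datetime format; the actual API may settle on a different one.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_valid_after_range(
    params: Dict[str, str],
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Read optional 'from' and 'to' parameters as a valid-after range.

    A missing parameter leaves that side of the range open (None).
    Raises ValueError on a malformed datetime or a reversed range.
    """
    start = end = None
    if "from" in params:
        start = datetime.strptime(params["from"], DATETIME_FORMAT)
    if "to" in params:
        end = datetime.strptime(params["to"], DATETIME_FORMAT)
    if start is not None and end is not None and start > end:
        raise ValueError("'from' must not be later than 'to'")
    return start, end


def in_range(
    valid_after: datetime,
    start: Optional[datetime],
    end: Optional[datetime],
) -> bool:
    """Check whether a status entry's valid-after time falls in the range."""
    if start is not None and valid_after < start:
        return False
    if end is not None and valid_after > end:
        return False
    return True


if __name__ == "__main__":
    start, end = parse_valid_after_range(
        {"from": "2013-08-01 00:00:00", "to": "2013-08-15 00:00:00"}
    )
    print(in_range(datetime(2013, 8, 14, 13, 0, 0), start, end))
```

In a real backend the range would more likely become a WHERE clause on the
valid-after column; in_range only spells out the intended semantics.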

I wonder, is there a document describing the new API somewhere?  If not,
do you mind creating one?

All the best,
Karsten


