[metrics-team] OnionStats - roadmap?

Karsten Loesing karsten at torproject.org
Fri Aug 12 15:21:03 UTC 2016


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi Anathema,

On 08/08/16 17:50, Anathema wrote:
> On 08/08/2016 15:13, Karsten Loesing wrote:
>> 
>> Maybe an example helps here: assume you have 55 relays and 45
>> bridges and ask for offset=50 and limit=10.  Your implementation
>> will return relays 51 to 55, the current implementation will also
>> return bridges 1 to 5.
> 
> Oh I see it. Now a question: how is it possible that by specifying 
> offset=50 and limit=10, you have bridges from 1 to 5? Since you're 
> skipping 50 results and you have only 45 bridges, makes sense to me
> to return nothing.
> 
> The logic then is to treat nodes' data as a tape? When I hit the
> border (the limit) the system should start from the beginning?
> 
> After clarified that I'll change it to make it
> backward-compatible.

Here's an actual example:

The current network contains around 8.4k relays and 5.1k bridges:

$ wget -q -O - https://onionoo.torproject.org/summary | grep -c "\"f\""
8398
$ wget -q -O - https://onionoo.torproject.org/summary | grep -c "\"h\""
5095

When you skip the first 8k results and ask for 2k results, you'll get
the remaining 0.4k relays and the first 1.6k bridges:

$ wget -q -O - "https://onionoo.torproject.org/summary?offset=8000&limit=2000" | grep -c "\"f\""
398
$ wget -q -O - "https://onionoo.torproject.org/summary?offset=8000&limit=2000" | grep -c "\"h\""
1602

But it's not a tape.  If you skip the first 13k results, you'll only
get the remaining 0.5k results, which are all bridges:

$ wget -q -O - "https://onionoo.torproject.org/summary?offset=13000" | grep -c "\"f\""
0
$ wget -q -O - "https://onionoo.torproject.org/summary?offset=13000" | grep -c "\"h\""
493
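
To illustrate the behavior with the smaller numbers from my earlier
example (55 relays, 45 bridges), here is a minimal Python sketch.  It
assumes that offset and limit are applied to one combined list of
relays followed by bridges; the function and variable names are made
up for illustration and are not taken from the Onionoo code.

```python
def paginate(relays, bridges, offset=0, limit=None):
    """Apply offset and limit to relays followed by bridges as one list."""
    # Assumption: relays always come before bridges in the combined list.
    combined = [("relay", r) for r in relays] + [("bridge", b) for b in bridges]
    end = None if limit is None else offset + limit
    page = combined[offset:end]  # slicing never wraps around ("not a tape")
    page_relays = [item for kind, item in page if kind == "relay"]
    page_bridges = [item for kind, item in page if kind == "bridge"]
    return page_relays, page_bridges


relays = [f"relay{i}" for i in range(1, 56)]     # relay1 .. relay55
bridges = [f"bridge{i}" for i in range(1, 46)]   # bridge1 .. bridge45

# offset=50, limit=10: relays 51 to 55 plus bridges 1 to 5
page_relays, page_bridges = paginate(relays, bridges, offset=50, limit=10)

# offset=95 without limit: only the last 5 bridges, no wrap-around
tail_relays, tail_bridges = paginate(relays, bridges, offset=95)
```

Skipping more results than exist simply yields two empty lists.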

>> We'll have to specify how things are sorted anyway, because we
>> can't just say "however ElasticSearch sorts it".  Rather than
>> adding all fields at once we could start with the ones that need
>> little to no discussion and then move on to the more complex
>> ones.  I mean, if you want to go through all of them and specify
>> sorting orders, okay.
>> 
> 
> here are some information on how ElasticSearch sorts: 
> https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html
>
>  It covers all the scenarios outlined in the previous email.
> 
> However, I think I'm missing the point here: when you say "sorting 
> orders" you mean that if you specify multiple sorting fields, who
> goes first?

I'm not talking about multiple fields here, but about different values
of a single field.  And this doesn't have to be complicated; we just
need to specify how values are sorted.  In other words, you'd have to
write this down for the protocol specification, without linking to how
ElasticSearch sorts, because Onionoo users shouldn't even have to
notice that there's ElasticSearch behind Onionoo.

>> I guess the real issue is that we're looking at this from two 
>> different angles.  I'm looking at the existing protocol whether
>> it can be implemented using ElasticSearch, and you're looking at
>> your existing ElasticSearch implementation and thinking how to
>> make it implement the current protocol.  However, the user won't
>> care how the protocol is implemented, they'll just happily notice
>> that there are extensions that haven't been there before.  And on
>> the other hand, if we change the protocol without good reason,
>> current Onionoo client developers will ask WTF we're thinking.
>> Hope that makes sense.
>> 
> Yeah totally agree. Compatibility is the main goal here. I'm just
> trying to understand the current protocol better so the new
> protocol can be 99% (even 100% :) backward-compatible.
> 
> Also, this discussion is helping (I hope it will) in drafting new 
> features/improvements we want to see in the next version of the
> Onionoo protocol.

Sure!

>> Did you read the part in the specification describing the
>> differences?
>> 
> Yes sir, I've kept reading that page almost 5 times a day, and I
> asked because I didn't get it.
> 
> So maybe with an example would be better. Let's assume the
> fingerprint ABBFB8C728A482536A8B599B51EDE48A4621D0A2
> 
> It's a relay, so it's not hashed. Using the 'fingerprint'
> parameter works as expected. About 'lookup', as the protocol say,
> "Fingerprints should always be hashed using SHA-1". However, using
> the 'lookup' parameter with that same string, it returns the same
> data as with the 'fingerprint' parameter.

Yes, it says "should", not "must".  The following two queries return
the same relay, where the second fingerprint is the SHA-1 value of the
first fingerprint:

lookup=ABBFB8C728A482536A8B599B51EDE48A4621D0A2  <- OK

lookup=7B7CD45661EE5074B84B2953BDA6D7109764CA4D  <- OK

> Same for bridges.

lookup=15C75BACC5DA4FA59FDE086E0EEA40428B788186  <- OK

lookup=0FD37C58780020BDF934CBB0A5922475BF0E942D  <- OK

> So I'm assuming that the fingerprints are already hashed in the
> dataset. In this case, lookup and fingerprint work the same way. Is
> that correct?

No.  Only the first of the following two queries works:

fingerprint=ABBFB8C728A482536A8B599B51EDE48A4621D0A2  <- OK

fingerprint=7B7CD45661EE5074B84B2953BDA6D7109764CA4D  <- NOTHING!
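
The difference between the two parameters can be sketched in Python as
follows.  This is only an illustration under two assumptions: that the
hashed fingerprint is the SHA-1 digest of the fingerprint's binary
(hex-decoded) form, and that matching is a plain string comparison.
The helper names are hypothetical.

```python
import hashlib


def sha1_hex(fingerprint_hex):
    """Hash a 40-character hex fingerprint.

    Assumption: the digest is computed over the decoded bytes,
    not over the ASCII hex string.
    """
    return hashlib.sha1(bytes.fromhex(fingerprint_hex)).hexdigest().upper()


def matches_lookup(stored, value):
    """'lookup' accepts the stored fingerprint or its SHA-1 hash."""
    value = value.upper()
    return value == stored or value == sha1_hex(stored)


def matches_fingerprint(stored, value):
    """'fingerprint' accepts only the stored fingerprint itself."""
    return value.upper() == stored


stored = "ABBFB8C728A482536A8B599B51EDE48A4621D0A2"
hashed = sha1_hex(stored)

print(matches_lookup(stored, stored), matches_lookup(stored, hashed))
print(matches_fingerprint(stored, stored), matches_fingerprint(stored, hashed))
```

For a bridge, "stored" would already be the hashed fingerprint, so
"lookup" would accept that value and its hash in the same way.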

> The only difference is in "(2) the response will contain any
> matching relay or bridge regardless of whether they have been
> running in the past week" (in 'fingerprint') which means that I've
> to select, for the 'lookup' parameter, only the bridges/relays that
> have been running in the past 7 days. Is it correct?

Yes, but you'll have to do that for _all_ parameters except for
"fingerprint".

>> What we might do, though that would be a backward-incompatible
>> change, is to add the space-separated fingerprint to the search
>> index as a single string rather than ten strings.  Example: Tonga
>> has fingerprint 4A0CCD2DDC7995083D73F5D667100C8A5831F16D, which
>> we add to the search index as "4A0C CD2D DC79 9508 3D73 F5D6 6710
>> 0C8A 5831 F16D", and when we receive a search for "Tonga 4A0C
>> CD2D DC79", we find that both nickname and beginning of the
>> space-separated fingerprint match that query.  But I'm not sure
>> how much that improves.
>> 
> 
> mmm. currently, 'fingerprint' is not compatible with 'search'. If
> we enable it, we can do something like
> search=Tonga&fingerprint="4A0C CD2D DC79". But not sure if it's
> worth the change. What do you think?

No, that's not what I had in mind here.  Most clients will just use
the "search" parameter.  My idea was rather to implement the "search"
parameter differently.  Right now, we're splitting the parameter value
at the spaces and checking whether we can find all parts in the search
index.  And obviously, none of the values in the search index contains
spaces.  But we could do something more complex to require the
"4A0C CD2D DC79" part of the search to belong together and not return
a relay with a fingerprint that starts with, say, DC79 CD2D 4A0C...
But!  Let's not go into the details here.  This was just a quick idea,
and it's not one of the most pressing problems we need to solve.
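
Just to make the idea concrete, here is a rough Python sketch of both
variants.  It assumes that search terms are matched as
case-insensitive prefixes of index entries, which is a simplification,
and all function names as well as the second fingerprint are made up
for illustration.

```python
def chunks(fingerprint):
    """Split a 40-character fingerprint into ten 4-character strings."""
    return [fingerprint[i:i + 4] for i in range(0, len(fingerprint), 4)]


def search_split(nickname, fingerprint, query):
    """Current idea: every term must match some entry in the index."""
    index = [nickname.lower()] + [c.lower() for c in chunks(fingerprint)]
    return all(any(entry.startswith(term) for entry in index)
               for term in query.lower().split())


def search_grouped(nickname, fingerprint, query):
    """Alternative: fingerprint terms must match together, in order."""
    spaced = " ".join(chunks(fingerprint)).lower()
    # Terms matching the nickname are consumed; the rest must form
    # the beginning of the space-separated fingerprint.
    rest = [t for t in query.lower().split()
            if not nickname.lower().startswith(t)]
    return spaced.startswith(" ".join(rest))


TONGA = "4A0CCD2DDC7995083D73F5D667100C8A5831F16D"
# Hypothetical fingerprint with the first three groups reordered.
SHUFFLED = "DC79CD2D4A0C" + TONGA[12:]

print(search_split("Tonga", TONGA, "Tonga 4A0C CD2D DC79"))
print(search_split("Other", SHUFFLED, "4A0C CD2D DC79"))
print(search_grouped("Tonga", TONGA, "Tonga 4A0C CD2D DC79"))
print(search_grouped("Other", SHUFFLED, "4A0C CD2D DC79"))
```

Only the last call rejects the relay with the reordered fingerprint;
the split-based variant accepts it.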

>> Can you sketch out your suggestion as far as it concerns the
>> protocol level?  (I can't look at code, or I'll spend half an
>> hour there and even more people will wonder why their emails are
>> left unanswered.)
>> 
> 
> (It was not exactly code but a description of the bug and how I
> solved it).
> 
> To report here, basically I'm assuming that the maximum number of 
> results is 10000 (10k)
> 
> So if you specify offset=10, the maximum data returned is 9990
> (10000 - 10). If you specify offset=9000, it will return 1000
> results, etc etc.
> 
> I've made a stub to overcome the 10k limit (it leverages ES'
> scroll keyword) and benchmarked a little bit. While we don't have
> any limit on data, the queries are slower (by around 2s). Also, as
> you may imagine, the more results we get, more RAM will be used (in
> this case, we can try to fix the issue by using varnish, but I
> didn't test it so I don't have numbers).

Using the numbers above, there are currently 13.5k relays and bridges
that have been running in the past week.  We should be able to return
them all, though most responses will likely contain fewer than that.

All the best,
Karsten

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJXrelfAAoJEC3ESO/4X7XBZD4H/A+CQquQqnY10zFUUwB13lZ1
kqiO7ttSxjeBuksAUlgpUjHrCu7jxnOG44JYm3BnHugfZxi5rGSuvTvh+VvkNd0E
McXoNCWbpcGZJ+FRjqGdceA+ctoiy9mMLJxMgwBP4MYDDvgRhqF4cVk8hg2Ml/R2
h8bQZ31NCnrG+gZjYGUDERQDzli3si1smkcb40M5VMJjtEF8+KCMGno/lTehOBX6
Dn8vk/ODOl4QWk7mufpukVUETE5qJgeO64J1zAwNm6a2uW6HXzDhsgg2C1tvXuRs
8WCULJfgv00wqFY2h6oMvpHYy0JfLpwWOeAHTACqx0OFif1mZ06P/Joeqm+eBno=
=svgZ
-----END PGP SIGNATURE-----

