[metrics-team] OnionStats - roadmap?

Anathema anathema at anche.no
Mon Aug 8 15:50:14 UTC 2016


On 08/08/2016 15:13, Karsten Loesing wrote:
> 
> Maybe an example helps here: assume you have 55 relays and 45 bridges
> and ask for offset=50 and limit=10.  Your implementation will return
> relays 51 to 55, the current implementation will also return bridges 1
> to 5.

Oh, I see it. Now a question: how is it possible that by specifying
offset=50 and limit=10 you get bridges 1 to 5? Since you're skipping
50 results and there are only 45 bridges, it makes sense to me to
return nothing.

Is the logic then to treat the nodes' data as a tape, so that when I
hit the border (the end of the relays) the system continues from the
beginning of the bridges?
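
In other words, something like this toy sketch of my current reading
(plain Python, with made-up data just to mirror your example)?

    # Relays and bridges treated as one concatenated list; offset and
    # limit apply to the combined list (my reading of your example).
    relays = ["relay-%d" % i for i in range(1, 56)]    # 55 relays
    bridges = ["bridge-%d" % i for i in range(1, 46)]  # 45 bridges

    def paginate(offset, limit):
        combined = relays + bridges
        return combined[offset:offset + limit]

    print(paginate(50, 10))  # relays 51..55 followed by bridges 1..5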

Once that's clarified, I'll change my implementation to make it
backward-compatible.

> 
> We'll have to specify how things are sorted anyway, because we can't
> just say "however ElasticSearch sorts it".  Rather than adding all
> fields at once we could start with the ones that need little to no
> discussion and then move on to the more complex ones.  I mean, if you
> want to go through all of them and specify sorting orders, okay.
> 

Here is some information on how ElasticSearch sorts:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html

It covers all the scenarios outlined in the previous email.
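
For instance, a multi-field sort could look roughly like this (a sketch
with the Python client; the index and field names are placeholders, not
our actual mapping):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Sort by nickname ascending, then by consensus weight descending;
    # the order of the entries in "sort" decides which field is
    # compared first.
    result = es.search(index="onionoo", body={
        "query": {"match_all": {}},
        "sort": [
            {"nickname": {"order": "asc"}},
            {"consensus_weight": {"order": "desc"}},
        ],
    })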

However, I think I'm missing the point here: when you say "sorting
orders", do you mean deciding which field is compared first when
multiple sorting fields are specified?

> I guess the real issue is that we're looking at this from two
> different angles.  I'm looking at the existing protocol whether it can
> be implemented using ElasticSearch, and you're looking at your
> existing ElasticSearch implementation and thinking how to make it
> implement the current protocol.  However, the user won't care how the
> protocol is implemented, they'll just happily notice that there are
> extensions that haven't been there before.  And on the other hand, if
> we change the protocol without good reason, current Onionoo client
> developers will ask WTF we're thinking.  Hope that makes sense.
> 
Yeah, totally agree. Compatibility is the main goal here. I'm just
trying to understand the current protocol better so that the new
protocol can be 99% (or even 100% :) backward-compatible.

Also, this discussion is helping (I hope it will) to draft the new
features/improvements we want to see in the next version of the
Onionoo protocol.

> 
> Did you read the part in the specification describing the differences?
>
Yes sir, I've been reading that page almost five times a day, and I
asked because I didn't get it.

So maybe an example would be better. Let's assume the fingerprint
ABBFB8C728A482536A8B599B51EDE48A4621D0A2.

It's a relay, so it's not hashed. Using the 'fingerprint' parameter
works as expected. About 'lookup': as the protocol says, "Fingerprints
should always be hashed using SHA-1". However, using the 'lookup'
parameter with that same string returns the same data as the
'fingerprint' parameter.

Same for bridges.

So I'm assuming that the fingerprints are already hashed in the dataset.
In this case, lookup and fingerprint work the same way. Is that correct?

The only difference is "(2) the response will contain any matching
relay or bridge regardless of whether they have been running in the
past week" (for 'fingerprint'), which means that, for the 'lookup'
parameter, I have to select only the bridges/relays that have been
running in the past 7 days. Is that correct?
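
For reference, this is how I'd compute the hashed variant locally to
compare against what 'lookup' matches (just a sketch; I'm assuming the
hash is SHA-1 over the binary fingerprint, printed as uppercase hex):

    import binascii
    import hashlib

    fingerprint = "ABBFB8C728A482536A8B599B51EDE48A4621D0A2"
    # Assumption: hashed fingerprint = SHA-1 of the binary fingerprint,
    # shown as 40 uppercase hex characters.
    hashed = hashlib.sha1(binascii.unhexlify(fingerprint)).hexdigest()
    print(hashed.upper())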

> What we might do, though that would be a backward-incompatible change,
> is to add the space-separated fingerprint to the search index as a
> single string rather than ten strings.  Example: Tonga has fingerprint
> 4A0CCD2DDC7995083D73F5D667100C8A5831F16D, which we add to the search
> index as "4A0C CD2D DC79 9508 3D73 F5D6 6710 0C8A 5831 F16D", and when
> we receive a search for "Tonga 4A0C CD2D DC79", we find that both
> nickname and beginning of the space-separated fingerprint match that
> query.  But I'm not sure how much that improves.
> 

Mmm, currently 'fingerprint' is not compatible with 'search'. If we
enable it, we could do something like search=Tonga&fingerprint="4A0C
CD2D DC79". But I'm not sure it's worth the change. What do you think?

> 
> Can you sketch out your suggestion as far as it concerns the protocol
> level?  (I can't look at code, or I'll spend half an hour there and
> even more people will wonder why their emails are left unanswered.)
> 

(It was not exactly code but a description of the bug and how I solved it).

To report it here: basically, I'm assuming that the maximum number of
results is 10000 (10k).

So if you specify offset=10, the maximum number of results returned is
9990 (10000 - 10). If you specify offset=9000, it will return at most
1000 results, and so on.
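
In code, that cap is simply the following (assuming the 10000 comes
from ES' default index.max_result_window; this is a hypothetical
helper, not the actual implementation):

    MAX_RESULTS = 10000  # assumed ES default index.max_result_window

    def max_returned(offset):
        return max(MAX_RESULTS - offset, 0)

    print(max_returned(10))    # 9990
    print(max_returned(9000))  # 1000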

I've made a stub to overcome the 10k limit (it leverages ES' scroll
keyword) and benchmarked it a little bit. While we no longer have any
limit on the data, the queries are slower (by around 2s). Also, as you
may imagine, the more results we fetch, the more RAM is used (we could
try to mitigate that with varnish, but I haven't tested it so I don't
have numbers).
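
For the record, the stub is roughly along these lines (a simplified
sketch with the Python client; the index name and query are
placeholders, not the actual OnionStats code):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    # Walk past the 10k window with the scroll API, 1000 hits per batch.
    resp = es.search(index="onionoo", scroll="2m", size=1000,
                     body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        for hit in hits:
            pass  # process hit["_source"] here
        resp = es.scroll(scroll_id=scroll_id, scroll="2m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
    es.clear_scroll(scroll_id=scroll_id)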

-- 
Anathema

+------------------------------------------------------------------+
   GPG/PGP KeyID: CFF94F0A available on http://pgpkeys.mit.edu:11371/
   Fingerprint: 80CE EC23 2D16 143F 6B25  6776 1960 F6B4 CFF9 4F0A

   https://keybase.io/davbarbato
+------------------------------------------------------------------+

