[metrics-team] Fwd: Re: OnionStats

Anathema anathema at anche.no
Thu Jun 2 22:13:51 UTC 2016


Sorry, I just realized I sent this email to Karsten and not to the list.

I apologize


-------- Forwarded Message --------
Subject: Re: [metrics-team] OnionStats
Date: Tue, 31 May 2016 21:15:52 +0100
From: Anathema <anathema at anche.no>
To: Karsten Loesing <karsten at torproject.org>

On 31/05/2016 16:30, Karsten Loesing wrote:

> As you may have seen there won't be a meeting this week.  So maybe you
> can make it to the meeting next week?  But in any case, presenting
> your project in this list is a good idea, because there's only so much
> time at the IRC meeting for each topic, so that is best spent if
> people are prepared.  Thanks for sharing your project here.
> 

Unfortunately I won't be able to make it next week either.

> 
> Do you mean the limit of returning no more than 40 results?  If so,
> yes, that's unfortunate.
> 

Yes

>> - its query language doesn't allow complex and combined queries
> 
> Agreed.  Again, Onionoo is to blame here.

Yes, if you send the search query directly to Onionoo, which, given its
slowness, I don't think is a good idea (that's why I implemented
MongoDB + Elasticsearch).
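
For context, mirroring Onionoo's `details` documents into a local
Elasticsearch index is mostly a matter of building a `_bulk` payload.
A minimal stdlib-only sketch (the index name and the use of the relay
fingerprint as document id are my assumptions, not something OnionStats
necessarily does):

```python
import json

def to_bulk_payload(relays, index="onionoo"):
    """Build an Elasticsearch _bulk NDJSON payload from a list of
    Onionoo-style `details` documents, keyed by fingerprint."""
    lines = []
    for relay in relays:
        # action line, then the document itself, one JSON object per line
        lines.append(json.dumps({"index": {"_index": index,
                                           "_id": relay["fingerprint"]}}))
        lines.append(json.dumps(relay))
    return "\n".join(lines) + "\n"
```

The resulting string can be POSTed to the cluster's `_bulk` endpoint as
newline-delimited JSON.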

> 
> Uhm, I may be overlooking something, but it always tells me 0 results.
>  Is that because of recent Onionoo troubles, or am I doing something
> wrong?  Can you give a sample query that should return results?
>

Did you check the datepicker gadget? By default it only looks at data
with first_seen set to today. If you go back in time by, say, 6 months,
you should see more data.

For example, using OnionStats I found out that the default nickname for
Tor relays on Windows is 'default' :) (except for a box in Russia that
runs Linux, but it only showed up yesterday, so I suspect the nickname
was entered manually)

You can verify this by selecting "Last 6 months" in the datepicker and
searching for:

nickname:default

Then you can use the datatable to filter for "linux", or run the query:

nickname:default platform:linux


> 
> Yep, feel free to decrease that to once per hour or even more often,
> as long as you set the `If-Modified-Since` header in your request.  In
> that case you're not wasting any bandwidth or cycles if the data has
> not changed.  You could even set that to every 5 minutes, but please
> please make sure you have set that header.
> 

Wow, I was not aware of this little trick: I'll implement it soon, thanks!
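
The trick is simple to implement. A minimal stdlib-only sketch
(persisting the `Last-Modified` value between runs is left out, and the
injectable `opener` is just there to make the function testable):

```python
import urllib.request
import urllib.error

def fetch_if_modified(url, last_modified=None,
                      opener=urllib.request.urlopen):
    """Fetch `url`, sending If-Modified-Since when we have a previous
    Last-Modified value.  Returns (body, last_modified); body is None
    when the server answers 304 Not Modified."""
    req = urllib.request.Request(url)
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    try:
        with opener(req) as resp:
            return resp.read(), resp.headers.get("Last-Modified",
                                                 last_modified)
    except urllib.error.HTTPError as e:
        if e.code == 304:  # data unchanged: keep what we already have
            return None, last_modified
        raise
```

On a 304 the server sends no body at all, which is where the bandwidth
saving comes from.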

> And just to be clear, whenever you fetch new data you're throwing out
> the old data, right?  That is, you're not creating an archive of
> Onionoo data?  I think I mentioned this concern before: there exist no
> archives of Onionoo data, just archives of the underlying Tor
> descriptors, so whenever you'd set up your service would be the
> earliest date in your own archive.  And whenever your service
> temporarily fails is when your archive has a gap.  So don't do that.
> Just kill the old data when you're getting new data and rely on
> Onionoo to be your archive.  Alternatively, go for original
> descriptors provided by DescripTor, but that's a very different
> project then.
> 

Yes, I'm deleting the old data. When we spoke about that, we realized
it might be worth trying to expose the archived data that CollecTor
presents as a zip file.

So I was also wondering whether it's worth kicking off another project
(or integrating this into an existing one) that does essentially the
same thing Onionoo does but includes the old data as well, because I
think we should be able to query that archived data easily instead of
downloading the tarball, parsing it, and "doing things" with it.

What do you think? It shouldn't take too much time: just a quick module
to automatically fetch the archive, unzip it, parse it, and put the
result into a DBMS (someone said "mongodb"? :)
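
A rough sketch of what such a module could do, assuming the archive
contains Tor server descriptors (a real importer should use Stem's
descriptor parsers rather than this toy subset, and `collection` stands
for a hypothetical pymongo collection):

```python
import tarfile

def parse_descriptor(text):
    """Toy parser for a Tor server descriptor: pull out only the fields
    we would index (nickname, address, platform, published)."""
    doc = {}
    for line in text.splitlines():
        parts = line.split(" ", 1)
        if parts[0] == "router":
            # "router <nickname> <address> <ORPort> <SOCKSPort> <DirPort>"
            fields = parts[1].split()
            doc["nickname"], doc["address"] = fields[0], fields[1]
        elif parts[0] in ("platform", "published"):
            doc[parts[0]] = parts[1]
    return doc

def import_archive(path, collection):
    """Walk a CollecTor tarball and insert one document per descriptor."""
    with tarfile.open(path) as tar:
        for member in tar:
            if member.isfile():
                text = tar.extractfile(member).read().decode("utf-8",
                                                             "replace")
                collection.insert_one(parse_descriptor(text))
```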

Honestly, I think this service could easily be integrated into Onionoo,
but I'd like to hear your thoughts on that.

> 
> Preference is either GitHub or Bitbucket, and I have seen more people
> use the former than the latter, but that's really up to you.
>
I'll push the code to GitHub since that's my main code host.

> 
> The part that I'd be most interested in is a performance evaluation.
> The question is whether your Elasticsearch stack would handle the load
> that the current Onionoo server handles (or fails to handle at times).
>  Here are some recent statistics from Onionoo:
> 
> Request statistics (2016-05-31 14:50:00, 3600 s):
> Total processed requests: 896066
> Most frequently requested resource: details (894217), summary (1671),
> bandwidth (90)
> Most frequently requested parameter combinations: [lookup, fields]
> (891349), [flag, type, running] (1970), [] (1537)
> Matching relays per request: .500<2, .900<2, .990<2, .999<16384
> Matching bridges per request: .500<1, .900<1, .990<1, .999<8192
> Written characters per response: .500<256, .900<512, .990<512,
> .999<2097152
> Milliseconds to handle request: .500<8, .900<128, .990<2048, .999<4096
> Milliseconds to build response: .500<4, .900<64, .990<1024, .999<8192
> 
> Would you be able to set this up on a powerful and probably not as
> cheap VPS and hammer it with loads of requests?
> 

Well, Elasticsearch can scale _a lot_. The problem is that you need
multiple servers to do that, e.g. one for MongoDB plus two for ES.

So I'm not concerned about whether the stack can handle that traffic
load, because I know it can handle even more (my current employer uses
the same stack for many terabytes of data and the results come back in
almost real time, but they scale horizontally).

I don't know if I can set up a better server; I'll investigate.

But what do you think about trying to stress-test my current setup?
The server is currently the following:

https://www.hetzner.de/en/hosting/produkte_vserver/cx20

Do you have a tool to stress-test it? If not, I can develop one (and
maybe reuse it for testing current and future services).
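
If we end up writing our own tool, a minimal sketch could look like
this (the `fetch` callable would wrap an HTTP GET against the
OnionStats endpoint in real use; injecting it keeps the tool testable,
and the percentile helper mirrors the .500/.900/.990/.999 style of the
Onionoo statistics above):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def hammer(fetch, n_requests=1000, concurrency=50):
    """Fire `n_requests` calls to `fetch` from `concurrency` threads
    and return the observed latencies in milliseconds, sorted."""
    def timed(_):
        start = time.monotonic()
        fetch()
        return (time.monotonic() - start) * 1000.0
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sorted(pool.map(timed, range(n_requests)))

def percentile(latencies, p):
    """Nearest-rank percentile over an already-sorted latency list."""
    return latencies[min(len(latencies) - 1, int(p * len(latencies)))]
```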

> 
> Another question would be whether you'd want to make your query
> language available to other websites, so that people who enjoy
> front-end coding more than you and I can build their own website using
> your service.  Basically, you can think of Onionoo and Atlas currently
> implementing three layers:
> 
>  1. Process Tor descriptors to Onionoo documents (Onionoo)
>  2. Respond to requests by Onionoo clients (Onionoo)
>  3. Provide a search field and display results (Atlas)
> 
> You're implementing layers 2 and 3 there, which is perfectly fine.  My
> ask is to enable others to use your layer 2 and implement their own
> layer 3.

Yeah, I was thinking the same. At the moment the backend has kind-of
POST APIs.

I can make them more RESTful by implementing the GET method, but
exposing the ES query language through an API call (within the GET
request) may take some time.
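
A cheap middle ground would be to accept only the simple `field:value`
syntax over GET and translate it server-side into an ES bool query,
instead of exposing the full ES query DSL. A sketch of that translation
(the lowercasing assumes the indexed fields are analyzed/lowercased,
which is my assumption about the mapping):

```python
def to_es_query(q):
    """Translate a whitespace-separated list of field:value pairs
    (e.g. "nickname:default platform:linux") into an Elasticsearch
    bool query.  Bare words fall back to a query_string match."""
    must = []
    for token in q.split():
        if ":" in token:
            field, value = token.split(":", 1)
            must.append({"term": {field: value.lower()}})
        else:
            must.append({"query_string": {"query": token}})
    return {"query": {"bool": {"must": must}}}
```

This keeps the public surface tiny while still letting other front
ends reuse the backend.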

If we want it, we can do it, but at this point why not create a
centralized service that can handle both historical data (CollecTor)
and Onionoo data (which in turn queries CollecTor)?

I mean, I see a lot of "doubled" services where we could collapse
things instead and make them more efficient.

So my proposal is to focus our efforts on the following:

- create a service that exposes historical and current Tor data
(through CollecTor), or integrate the historical-data part into Onionoo

- create a single service to query that data (Atlas/OnionStats)

- integrate the metrics graphs of metrics.torproject.org into
Atlas/OnionStats (that was one of my main goals, but then I got
distracted :)

We just need to create and test a "scalable" MongoDB + ES stack and
we're sorted.

What do you think? Does it make sense?

Enjoy the rest of the day,
Regards

-- 
Anathema

+--------------------------------------------------------------------+
|GPG/PGP KeyID: CFF94F0A available on http://pgpkeys.mit.edu:11371/  |
|Fingerprint: 80CE EC23 2D16 143F 6B25  6776 1960 F6B4 CFF9 4F0A     |
|                                                                    |
|https://keybase.io/davbarbato                                       |
+--------------------------------------------------------------------+

