[metrics-team] OnionStats

Karsten Loesing karsten at torproject.org
Tue May 31 15:30:55 UTC 2016



On 31/05/16 02:17, Anathema wrote:
> Hi everyone,

Hi Anathema,

> I participated in the last metrics team meeting on IRC, where I
> "presented" a new tool. Since I won't be able to attend the next
> meeting, I'm going to present the full project here.

As you may have seen, there won't be a meeting this week.  So maybe
you can make it to the meeting next week?  But in any case, presenting
your project on this list is a good idea: there's only so much time at
the IRC meeting for each topic, and that time is best spent if people
come prepared.  Thanks for sharing your project here.

> A little bit of background: I was trying to find out some
> information related to Tor node stats, like "how many nodes from
> country X have been activated in the last Y months" or "how many
> hosts with hostname X on platform Y there are" and so on. I found
> Atlas really helpful in some aspects but not so good in others;
> mainly, I was not able to answer the above questions. Plus, there
> are some cons:
> - it's slow

To be fair, it's not Atlas that is slow but Onionoo.  I hope to fix
that in the next few weeks when we move Onionoo to a new host.  Let's see.

> - it returns limited results

Do you mean the limit of returning no more than 40 results?  If so,
yes, that's unfortunate.

> - its query language doesn't allow complex and combined queries

Agreed.  Again, Onionoo is to blame here.

> So I started writing a tool that fulfills my requirements. When it
> was almost finished, I thought: "well, this is cool, maybe the Tor
> community could be interested in it". And here we are.
> 
> What's all this about? OnionStats.
>
> First, a note: the software described below can be integrated into
> Atlas. I created one from scratch because it was easier for me, but
> if we don't want to use two services we can think about integrating
> mine into Atlas or Atlas into mine.
> 
> So the name is OnionStats because, you know, Tor, Onionoo, onions.
> (It was TorStats, but then Karsten suggested a better name :)
> 
> The software stack is as follows:
> - Semantic-UI + jQuery as the frontend
> - Tornado as the backend
> - MongoDB as the DBMS
> - Elasticsearch as the search engine
> 
> Here is the link to a live instance: http://138.201.90.124:8080
> (it's a cheap VPS so it may be slow due to a lack of resources -
> please be gentle and don't hammer it).

Uhm, I may be overlooking something, but it always tells me 0 results.
Is that because of recent Onionoo troubles, or am I doing something
wrong?  Can you give a sample query that should return results?

> How things work: Basically, there is a Python script that runs in
> the background (cron) every 12h and fetches the node information
> using the Onionoo protocol, then saves it into the MongoDB schema.
> mongodb-collector automatically pushes the data into Elasticsearch.
> When you search through the web UI, the backend makes an
> Elasticsearch query and returns the data back to the web UI, which
> displays it.
> 
> Easy, clean, fast.
> 
> Pros:
> - it's fast. Really.
> - huge results cap: I've hardcoded a limit of 2000 results per query
>   for testing, but it can easily be increased in production with
>   better hardware.
> - easy to audit: Atlas is made of AngularJS, which is great, but for
>   someone who doesn't know anything about it, it's a big learning
>   curve. I think that's a little bit overkill. My code is just plain
>   jQuery and DataTables. That's all I needed.
> - complex queries: it's possible to leverage almost all of the
>   Elasticsearch syntax features. More information in the "Syntax"
>   section.
> 
> Cons:
> - updates data every 12h. In the IRC meeting someone told me that I
>   can decrease the sleep time, so it may be possible to reduce that
>   to every 6h or maybe 1h?

Yep, feel free to decrease that to once per hour or even more often,
as long as you set the `If-Modified-Since` header in your request.  In
that case you're not wasting any bandwidth or cycles if the data has
not changed.  You could even set that to every 5 minutes, but please,
please make sure you set that header.
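
Just as an illustration, something like this would do.  It's an
untested sketch: the Onionoo URL is the real one, but how you persist
the timestamp between runs is entirely up to you.

    # Untested sketch: fetch new Onionoo details only if they changed
    # since the last run; storing last_modified between runs is left out.
    import requests

    last_modified = "Tue, 31 May 2016 12:00:00 GMT"  # from the previous run
    response = requests.get(
        "https://onionoo.torproject.org/details",
        headers={"If-Modified-Since": last_modified})
    if response.status_code == 304:
        pass  # nothing changed; no bandwidth or cycles wasted on either side
    elif response.ok:
        relays = response.json()["relays"]  # replace the old data with this
        last_modified = response.headers.get("Last-Modified", last_modified)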

And just to be clear, whenever you fetch new data you're throwing out
the old data, right?  That is, you're not creating an archive of
Onionoo data?  I think I mentioned this concern before: there exist no
archives of Onionoo data, just archives of the underlying Tor
descriptors, so the date you set up your service would be the earliest
date in your own archive, and whenever your service temporarily fails
your archive gets a gap.  So don't do that.  Just kill the old data
when you're getting new data and rely on Onionoo to be your archive.
Alternatively, go for the original descriptors provided by DescripTor,
but that's a very different project then.

> There are a few HTML glitches, so I apologize; I'm not a frontend
> coder and I'll try to fix them ASAP.
> 
> I didn't push the code to my <github|bitbucket> repository, but if
> you want to take a look at the code I'm more than welcome to publish
> it. Just let me know which of the two you prefer, or if you prefer
> another way of sharing (like a link to a tarball on the server).

My preference is either GitHub or Bitbucket, and I have seen more
people use the former than the latter, but that's really up to you.

> Hope you like it. I'd be more than happy to help integrate it into
> Atlas or integrate some of Atlas' features into OnionStats (and
> maybe find a better name :)

The part that I'd be most interested in is a performance evaluation.
The question is whether your Elasticsearch stack would handle the load
that the current Onionoo server handles (or fails to handle at times).
Here are some recent statistics from Onionoo; the numbers are
percentiles, so for example ".900<128" means that 90% of requests
stayed below 128:

Request statistics (2016-05-31 14:50:00, 3600 s):
Total processed requests: 896066
Most frequently requested resource: details (894217), summary (1671),
bandwidth (90)
Most frequently requested parameter combinations: [lookup, fields]
(891349), [flag, type, running] (1970), [] (1537)
Matching relays per request: .500<2, .900<2, .990<2, .999<16384
Matching bridges per request: .500<1, .900<1, .990<1, .999<8192
Written characters per response: .500<256, .900<512, .990<512,
.999<2097152
Milliseconds to handle request: .500<8, .900<128, .990<2048, .999<4096
Milliseconds to build response: .500<4, .900<64, .990<1024, .999<8192

Would you be able to set this up on a more powerful (and probably not
as cheap) VPS and hammer it with loads of requests?
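
If it helps, even a crude load generator along these lines would give
a first impression.  Everything below is made up (the host, the query,
the concurrency, the request count); a realistic test should mirror
the request mix from the statistics above.

    # Very rough load-test sketch; endpoint, query, and numbers are
    # placeholders, not OnionStats' actual API.
    import concurrent.futures
    import requests

    def one_request(_):
        r = requests.get("http://example.org:8080/api/search",  # hypothetical
                         params={"q": "flag:Running", "limit": 10})
        return r.elapsed.total_seconds()

    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as pool:
        timings = sorted(pool.map(one_request, range(10000)))
    print("median:", timings[len(timings) // 2], "max:", timings[-1])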


Another question is whether you'd want to make your query language
available to other websites, so that people who enjoy front-end coding
more than you and I do can build their own website on top of your
service.  Basically, you can think of Onionoo and Atlas as currently
implementing three layers:

 1. Process Tor descriptors to Onionoo documents (Onionoo)
 2. Respond to requests by Onionoo clients (Onionoo)
 3. Provide a search field and display results (Atlas)

You're implementing layers 2 and 3 there, which is perfectly fine.  My
ask is that you enable others to use your layer 2 and implement their
own layer 3.
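
Just to make that concrete: another front end could then query your
service directly.  The endpoint, parameters, and response shape in
this sketch are all invented, since your API isn't published yet.

    # Hypothetical third-party client using OnionStats' search API as
    # "layer 2"; none of these names are real yet.
    import requests

    response = requests.get(
        "http://onionstats.example/api/search",         # made-up endpoint
        params={"q": "platform:Linux AND country:de",   # Elasticsearch
                "limit": 100})                          # query-string syntax
    for relay in response.json().get("results", []):    # assumed shape
        print(relay["nickname"], relay["first_seen"])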

> Let me know what you think.

Very interesting stuff! :)

> Thank you, Regards

All the best,
Karsten


