[metrics-team] Onionoo: Availability and Performance

Karsten Loesing karsten at torproject.org
Thu Dec 17 16:44:23 UTC 2015



[Wow, I somehow managed to send my previous reply before finishing it,
and Thunderbird doesn't seem to care about the user hitting Cancel
while sending.  Sorry, please find the complete reply below.]


On 17/12/15 16:58, vi wrote:
> Hey all, I'm new.

Hello vi, and welcome!

> About a month ago I was talking with Virgil, and we discovered a
> mutual interest in improving Onionoo's availability. I was getting
> 503s from Globe on a (subjectively) routine basis, and presumably
> this coincided with harm to other downstream clients (Atlas,
> Roster, etc.)
> 
> I'd like to help improve this situation; this message is intended
> both as documentation of my (sporadic, marginal..) work and
> observations thus far (specifically: uptime monitoring, a patch I
> suspect will improve availability, basic runtime performance
> analysis, and patches for a few bugs I noticed along the way), as
> well as a request for assistance, information, and suggestions.
> 
> "It seems down often!" is likely less convincing than real
> downtime measurements, so to start with I rigged up a timer to make
> a cheap request every three minutes and note down problematic
> response codes:
> 
> curl -o /dev/null -s -w "$(date +%s) %{http_code}" \
>   "https://onionoo.torproject.org/summary?limit=1" \
>   | awk '$2 != "200"' >> $HOME/onionoo.bad-response
> 
> To my surprise, it's only observed two severe downtime events
> (read: more than two sequential 5xx responses) in the month that
> I've left it running (all times UTC..):
> 
> http://hack.rs/~vi/onionoo/outage-2015-11-24.svg sha256:
> 5e5d18300d1ec53083ce56369a0d6e0cb7d6fc98a80574a865aa2fe8d09bfe74 
> http://www.webcitation.org/6dl2xBpXP
> 
> http://hack.rs/~vi/onionoo/outage-2015-11-27.svg sha256:
> 0626c3f689794e33587b9c3f6e36cd5e43699cda8aed86d617a9004eb020f42a 
> http://www.webcitation.org/6dl2yKRHu
> 
> Which is great! But I think we can do even better. :-)
> 
> I don't know what those specific failures are accountable to
> (though I'm curious to know.. Karsten?) In the absence of information,
> I'd guess some variety of resource starvation (saturation leading
> requests to time out consistently, process failure leaving stale
> locks, and such things.)

Thanks for doing this investigation.  You're indeed right that Onionoo
had some serious load problems, but I think I solved them about a week
ago.  Look at this graph:

https://people.torproject.org/~karsten/volatile/onionoo-requests-2015-12-17.png

The issue was that we forgot to enable Apache disk caching after
migrating to another host.  As a result, Onionoo handled every request
itself, rather than having Apache respond to those where it already
knew the result.  Onionoo responses can be cached quite well, so not
doing so is silly.  Looking at these results, we have even considered
deploying one or more caches (Squid?) and doing DNS round-robin
between them and the real Onionoo host.  But given that the load
problems seem to be resolved for the moment, we're leaving everything
as it is for now.
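
For what it's worth, the caching only works because the responses are
safe to cache for a short while.  Here's a minimal sketch of the kind
of servlet-side header that makes a disk cache (mod_cache_disk, Squid)
useful; the class name and the freshness window are illustrative, not
Onionoo's actual code:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    /* Hypothetical fragment: Onionoo's data only changes when the
     * hourly updater run finishes, so telling the cache in front of
     * the application that a response stays fresh for a few minutes
     * lets it absorb most repeated requests. */
    public class CachedSummaryServlet extends HttpServlet {

      /* Assumed freshness window, much shorter than the update
       * interval. */
      private static final long MAX_AGE_SECONDS = 300L;

      @Override
      protected void doGet(HttpServletRequest request,
          HttpServletResponse response) throws IOException {
        response.setContentType("application/json");
        response.setCharacterEncoding("utf-8");
        /* Cache-Control is what the disk cache looks at before
         * deciding to serve a stored copy instead of hitting the
         * servlet again. */
        response.setHeader("Cache-Control",
            "public, max-age=" + MAX_AGE_SECONDS);
        response.getWriter().write("{\"relays_published\":\"...\"}");
      }
    }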

Now, what happened on November 24 and 27?  I can't say exactly, but my
guess is that requests kept the Onionoo front-end so busy that the
back-end had trouble updating its data, and then it ran into the
problem you identified below...

> Skimming the code, I noticed an interesting failure mode in
> the resource servlet; when the most recent sighting of a relay
> recorded on disk is older than six hours, the application will
> respond with 500s (and no useful content) to every subsequent valid
> request until this is righted (and picked up by a node indexer
> thread.) This seems like probable cause for prolonged downtime
> periods (Karsten: have you seen this happen? If so, how frequently,
> and how does the updater fail?)
> 
> This behaviour is especially interesting to me as it seems largely 
> avoidable: we could maintain a 5xx code (thus warning downstream 
> clients), but include outdated results in the body, letting
> clients decide whether they want to fail based on the
> {relays,bridges}_published fields in responses. This would turn an
> unavoidable downstream failure into a preference-based one. While
> any such change probably merits negotiation with client maintainers
> before further consideration, I do have a patch which institutes
> this proposal:
> 
> https://github.com/fmapfmapfmap/onionoo-dev/commit/7e1e340e7b18c07e63788138d1b892239d3b4043
> http://www.webcitation.org/6dqWAgzbY

You're absolutely right that this is a design bug.  Here's an open
ticket for it:

https://trac.torproject.org/projects/tor/ticket/16907

I'm not entirely sure which fix we should apply.  Would you want to
take a look at the ticket and post whether you like the suggested fix
there, and if not, propose the fix above and say why it's better?
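
Just to make sure I understand the approach, here's how I read your
patch, as a rough sketch (hypothetical names, not the code in the
commit above): keep signalling that something is wrong, but still hand
back the last known documents so that clients can decide for
themselves based on relays_published/bridges_published.

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    /* Hypothetical fragment: instead of answering every request with
     * an empty 500 once the on-disk data is older than six hours,
     * serve the stale documents and let clients check
     * relays_published / bridges_published to decide whether that is
     * acceptable for them. */
    public class StaleDataHandler {

      private static final long STALE_AFTER_MILLIS =
          6L * 60L * 60L * 1000L;

      void writeResponse(HttpServletResponse response, String cachedJson,
          long lastModifiedMillis) throws IOException {
        boolean stale = System.currentTimeMillis() - lastModifiedMillis
            > STALE_AFTER_MILLIS;
        if (stale) {
          /* A 503 keeps warning downstream tools that the back end is
           * behind... */
          response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        } else {
          response.setStatus(HttpServletResponse.SC_OK);
        }
        /* ...but the body carries the last known data either way. */
        response.setContentType("application/json");
        response.getWriter().write(cachedJson);
      }
    }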

> Resource consumption seems a worthwhile candidate for additional 
> investigation; it seems reasonable that sufficient front-end load
> might interrupt operation, and even should this fail to be the
> case, runtime performance improvements can be broadly beneficial.
> 
> I dispatched a load test from my host machine to a virtual machine
> (on the host, for lack of resources) running the most recent public
> revision on Onionoo (only recent descriptors, nothing from
> CollecTor). Target URLs were generated using a Haskell script
> deriving random valid request URLs from a type-level representation
> of the protocol document, with parameter values annotated by
> construction rules. I like to think of it as a fuzzer aspiring to
> find 200s :-). This approach yields request distributions likely
> inconsistent with those found in a production context, but has the
> advantage of likely traversing a greater majority of code paths:
> 
> http://hack.rs/~vi/onionoo/URLGenerator.hs sha256:
> cde747b7fc7d3008a9e69b2eff48d49145f60eece670770ef623368cc41a60f3 
> http://www.webcitation.org/6dldj0YeL 
> https://gist.github.com/fmap/6f44efe6f3ca3f79683b
> 
> siege(1) dispatched 77538 unique requests over an hour; third
> quartile response time was less than a second, with especially
> general requests (i.e. those which aren't filtered much within the
> request handler) responsible for outliers. I'm not sure how
> frequently these are found in production, and could someone derive
> an estimate from access logs, I'd be interested to hear about it.
> 
> Over this load, CPU appeared the most evidently saturated resource 
> (cf. memory use, NIC RX/TX, and storage device I/O), and so is the 
> only component I've profiled in depth so far. Linux performance 
> counters sampled stack traces over the lifetime of the load test,
> with translation of Java symbol tables provided by perf-map-agent.
> I coerced the result into a CPU flame graph, representing the most
> frequently traversed code paths over the duration of the sample:
> 
> http://hack.rs/~vi/onionoo/flame-56e0e01.svg sha256:
> a6044c127072c670e2565d8f88fa0391cec5079adddbac08917b533507bfb938 
> http://www.webcitation.org/6dl2vzqag
> 
> I suspect the opaque frames (e.g. the curiously displaced network
> calls on the left) are accountable to symbol movement over the
> duration of the profile, but haven't determined how best to
> accommodate them yet. Nonetheless, it appears evident to me that
> the majority of the server process' CPU time is spent inside the
> response builder, most of which consumption is in turn accountable
> to JSON serialisation (from in-memory documents to string) and
> de-serialisation (from flat files written to the disk by the
> updater, to memory). The bias is similar when making exclusively
> expensive requests. These seem prime candidates for optimisation.
> 
> I haven't yet determined how best to do so, and intend to focus on
> this problem pending further feedback.

Oh wow, this visualization is tough to follow.  Can you explain a bit
more how to read it?

Regarding your remark about JSON (de-)serialization, note that Onionoo
is already quite a bit optimized towards avoiding that if possible.
That works as follows: once per hour, the updater fetches new data and
in the end produces JSON-formatted strings that it writes to disk.
The servlet reads a (comparatively) small index to memory that it uses
to handle requests, and when it builds responses, it tries hard to
avoid (de-)serializing JSON.
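
In other words, for a request without the fields parameter, building a
response essentially boils down to concatenating the pre-serialized
documents.  Very roughly (simplified illustration, not the actual
ResponseBuilder code):

    import java.util.List;

    /* Simplified illustration of the fast path: the hourly updater
     * already wrote each document as a JSON string, so assembling the
     * "relays" array is plain string concatenation with no JSON
     * parsing involved. */
    public class ConcatenationSketch {

      static String buildRelaysArray(List<String> serializedDocuments) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < serializedDocuments.size(); i++) {
          if (i > 0) {
            sb.append(",");
          }
          sb.append(serializedDocuments.get(i));
        }
        return sb.append("]").toString();
      }
    }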

The only situation where this fails is when the request contains the
fields parameter.  Only in that case do we have to deserialize, pick
the fields we want, and serialize again.  I could imagine that this
shows up pretty badly in profiles, and I'd love to fix it, I just
don't know how.  Here's the relevant code:

https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/onionoo/server/ResponseBuilder.java#n175

If you have any ideas here, please open a ticket on Tor's bug tracker.
There's at least one active Onionoo maintainer who might not watch
this list as closely but who watches Trac.
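
One direction that might be worth exploring (just a sketch, untested
against the Onionoo code base): stream the stored document through
Gson's JsonReader/JsonWriter and copy only the requested top-level
fields, instead of building the full object tree first.  Whether that
actually beats the current code would need measuring.

    import com.google.gson.stream.JsonReader;
    import com.google.gson.stream.JsonWriter;
    import java.io.IOException;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.math.BigDecimal;
    import java.util.Set;

    /* Sketch: copy only the requested top-level fields from a stored
     * document without deserializing it into a full object tree. */
    public class FieldFilterSketch {

      static String filterFields(String documentJson,
          Set<String> wantedFields) throws IOException {
        StringWriter out = new StringWriter();
        try (JsonReader reader =
                new JsonReader(new StringReader(documentJson));
            JsonWriter writer = new JsonWriter(out)) {
          reader.beginObject();
          writer.beginObject();
          while (reader.hasNext()) {
            String name = reader.nextName();
            if (wantedFields.contains(name)) {
              writer.name(name);
              copyValue(reader, writer);
            } else {
              reader.skipValue();
            }
          }
          reader.endObject();
          writer.endObject();
        }
        return out.toString();
      }

      /* Recursively copies whatever value comes next, token by
       * token. */
      private static void copyValue(JsonReader reader, JsonWriter writer)
          throws IOException {
        switch (reader.peek()) {
          case BEGIN_OBJECT:
            reader.beginObject();
            writer.beginObject();
            while (reader.hasNext()) {
              writer.name(reader.nextName());
              copyValue(reader, writer);
            }
            reader.endObject();
            writer.endObject();
            break;
          case BEGIN_ARRAY:
            reader.beginArray();
            writer.beginArray();
            while (reader.hasNext()) {
              copyValue(reader, writer);
            }
            reader.endArray();
            writer.endArray();
            break;
          case STRING:
            writer.value(reader.nextString());
            break;
          case NUMBER:
            writer.value(new BigDecimal(reader.nextString()));
            break;
          case BOOLEAN:
            writer.value(reader.nextBoolean());
            break;
          case NULL:
            reader.nextNull();
            writer.nullValue();
            break;
          default:
            break;
        }
      }
    }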

> One neat side-effect of styling a URL generator after a fuzzer is
> that, between 400s and some clicking around, you can catch a few
> discrepancies between the protocol and its implementation. Here's a
> couple I've noticed, together with patches (these might belong on
> Trac..):
> 
> * The protocol states of the 'fields' parameter:
> 
>> Field names are case-insensitive.
> 
> While the implementation disagrees:
> 
> % for field in as_number AS_Number; do curl -s \
>     "https://onionoo.torproject.org/details?limit=1&fields=$field"
>   done | jq .relays[]
> { "as_number": "AS24940" }
> {}
> 
> Patch:
> 
> https://github.com/fmapfmapfmap/onionoo-dev/commit/1571e2c78486551278855c8d5fa6c82b219d8682
> http://www.webcitation.org/6dqWFMn4g

Good catch!  Will fix, though I think I'll move the toLowerCase() to
this.parseFieldsParameter(), as in:

    return parameter.toLowerCase().split(",");

The main reason is that I'm trying to avoid touching parameters before
they've gone through the regex.
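
In context, roughly what I have in mind (simplified; the regex and the
exception type here are illustrative stand-ins for what the request
handler actually uses):

    import java.util.regex.Pattern;

    /* Simplified sketch of the intended order of operations: validate
     * the raw parameter against the regex first, and only lower-case
     * it when splitting it into field names afterwards. */
    public class FieldsParameterSketch {

      /* Illustrative stand-in for the real validation pattern. */
      private static final Pattern FIELDS_PATTERN =
          Pattern.compile("^[0-9a-zA-Z_,]+$");

      static String[] parseFieldsParameter(String parameter) {
        if (!FIELDS_PATTERN.matcher(parameter).matches()) {
          throw new IllegalArgumentException(
              "Invalid fields parameter: " + parameter);
        }
        return parameter.toLowerCase().split(",");
      }
    }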

> * The protocol describes {first,last}_seen_days:
> 
>> A parameter value "x-y" with x >= y returns relays or bridges
>> that have _ been seen at least x and at most y days ago.
> 
> For "at least x and at most y days" to be inhabited where x /= y,
> a the relation must be x <= y, rather than y >= x. The
> implementation already agrees:
> 
> % curl -o /dev/null -w '%{http_code}\n' -s \
>     'https://onionoo.torproject.org/details?limit=1&first_seen_days=2-1'
> 400
> % curl -o /dev/null -w '%{http_code}\n' -s \
>     'https://onionoo.torproject.org/details?limit=1&first_seen_days=1-2'
> 200
> 
> Patch:
> 
> https://github.com/fmapfmapfmap/onionoo-dev/commit/71b7f57cb18ccb2cb36f9c6a50b2625b51f08aa2
> http://www.webcitation.org/6dqWH4Dys

True!  Will fix, too.
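
For completeness, here's the check that the corrected wording implies,
as a sketch (illustrative only; the real parser presumably also has to
handle the single-value and open-ended forms of these parameters,
which are left out here):

    /* Minimal sketch of the corrected range check: an "x-y" value is
     * only accepted when x <= y, which is what the implementation
     * already enforces. */
    public class DaysRangeSketch {

      static int[] parseDaysRange(String parameter) {
        String[] parts = parameter.split("-");
        if (parts.length != 2) {
          throw new IllegalArgumentException(
              "Not an x-y range: " + parameter);
        }
        int x = Integer.parseInt(parts[0]);
        int y = Integer.parseInt(parts[1]);
        if (x > y) {
          /* first_seen_days=2-1 is rejected,
           * first_seen_days=1-2 is accepted. */
          throw new IllegalArgumentException(
              "Invalid range: " + parameter);
        }
        return new int[] { x, y };
      }
    }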

> As you've no doubt discerned, I've made only moderate progress so 
> far, but I expect it'll prove beneficial to get this in front of
> more weathered eyes. I'm curious for your thoughts! Especially
> criticisms of methodology (I'm pretty new at this.) In the
> meanwhile, I'm going to see if I can get any profit out of those
> optimisation candidates.

This is very helpful!  I like the different angles you're taking to
approach the load problems.  Please keep digging!

By the way, I'm not sure if this is helpful, but here are the latest
statistics of the Onionoo frontend:

2015-12-17 16:14:29,860 INFO o.t.o.s.PerformanceMetrics:164 Request statistics (2015-12-17 16:14:28, 3600 s):
2015-12-17 16:14:29,890 INFO o.t.o.s.PerformanceMetrics:168   Total processed requests: 14063
2015-12-17 16:14:29,891 INFO o.t.o.s.PerformanceMetrics:170   Most frequently requested resource: details (13616), bandwidth (176), weights (170)
2015-12-17 16:14:29,892 INFO o.t.o.s.PerformanceMetrics:172   Most frequently requested parameter combinations: [lookup, fields] (12716), [lookup] (956), [search] (278)
2015-12-17 16:14:29,892 INFO o.t.o.s.PerformanceMetrics:174   Matching relays per request: .500<2, .900<2, .990<2, .999<16384
2015-12-17 16:14:29,892 INFO o.t.o.s.PerformanceMetrics:176   Matching bridges per request: .500<1, .900<1, .990<1, .999<4096
2015-12-17 16:14:29,892 INFO o.t.o.s.PerformanceMetrics:178   Written characters per response: .500<256, .900<512, .990<16384, .999<16777216
2015-12-17 16:14:29,893 INFO o.t.o.s.PerformanceMetrics:180   Milliseconds to handle request: .500<8, .900<16, .990<64, .999<256
2015-12-17 16:14:29,893 INFO o.t.o.s.PerformanceMetrics:182   Milliseconds to build response: .500<4, .900<32, .990<256, .999<32768

Please let me know if I left out something above.  Your email
contained a lot of potential issues, and I might have overlooked one
or more of them.  In that case, just let me know and I'll look closer.

Thanks!

> Cheers, vi

All the best,
Karsten


