[metrics-team] Onionoo: Availability and Performance

Karsten Loesing karsten at torproject.org
Thu Dec 17 16:34:50 UTC 2015


On 17/12/15 16:58, vi wrote:
> Hey all, I'm new.

Hello vi, and welcome!

> About a month ago I was talking with Virgil, and we discovered a mutual
> interest in improving Onionoo's availability. I was getting 503s from
> Globe on a (subjectively) routine basis, and presumably this coincided
> with harm to other downstream clients (Atlas, Roster, etc.)
> 
> I'd like to help improve this situation; this message is intended both
> as documentation of my (sporadic, marginal..) work and observations thus
> far (specifically: uptime monitoring, a patch I suspect will improve
> availability, basic runtime performance analysis, and patches for a
> few bugs I noticed along the way), as well as a request for assistance,
> information, and suggestions.
> 
> "It seems down often!" is likely less convincing than real downtime
> measurements, so to start with I rigged up a timer to make a cheap
> request every three minutes and note down problematic response codes:
> 
>   curl -o /dev/null -s -w "$(date +%s) %{http_code}" \
>     "https://onionoo.torproject.org/summary?limit=1" \
>   | awk '$2 != "200"' >> $HOME/onionoo.bad-response
> 
> To my surprise, it's only observed two severe downtime events (read:
> more than two sequential 5xx responses) in the month that I've left
> it running (all times UTC..):
> 
>   http://hack.rs/~vi/onionoo/outage-2015-11-24.svg
>   sha256: 5e5d18300d1ec53083ce56369a0d6e0cb7d6fc98a80574a865aa2fe8d09bfe74
>   http://www.webcitation.org/6dl2xBpXP
> 
>   http://hack.rs/~vi/onionoo/outage-2015-11-27.svg
>   sha256: 0626c3f689794e33587b9c3f6e36cd5e43699cda8aed86d617a9004eb020f42a
>   http://www.webcitation.org/6dl2yKRHu
> 
> Which is great! But I think we can do even better. :-)
> 
> I don't know what those specific failures are attributable to (though I'm
> curious to know.. Karsten?) In the absence of information, I'd guess some
> variety of resource starvation (saturation leading requests to time out
> consistently, process failure leaving stale locks, and such things.)

Thanks for doing this investigation.  You're indeed right that Onionoo
had some serious load problems, but I think I solved them about a week
ago.  Look at this graph:

https://people.torproject.org/~karsten/volatile/onionoo-requests-2015-12-17.png

The issue was that we forgot to enable Apache disk caching after
migrating to another host.  As a result, Onionoo itself handled every
request, rather than having Apache respond to those for which it already
knew the result.  Onionoo responses can be cached quite well, so not
doing so is silly.  Looking at these results, we have even considered
deploying one or more caches (Squid?) and doing DNS round-robin between
them and the real Onionoo host.  But given that the load problems seem
to be resolved for the moment, we're leaving everything as it is for now.
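(In case it helps anyone reading along: what makes this work is simply
that the servlet can declare its responses cacheable, so a fronting
cache like Apache's mod_cache or Squid can answer repeated identical
requests itself.  In servlet terms that's essentially the following; a
simplified sketch, not the actual Onionoo code, and the max-age value
is made up.)

    import javax.servlet.http.HttpServletResponse;

    /* Simplified sketch: declare a response cacheable so that a
     * fronting cache (Apache's mod_cache, Squid, ...) can answer
     * repeated identical requests itself instead of passing every one
     * of them through to the servlet.  The max-age value is made up. */
    static void markCacheable(HttpServletResponse response) {
      response.setHeader("Cache-Control", "public, max-age=300");
    }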

Now, what happened on November 24 and 27?  I can't say exactly, but my
guess is that requests kept the Onionoo front-end so busy that the
back-end had trouble updating its data, and then it ran into the problem
you identified below...

> Skimming the code, I noticed an interesting failure mode in the
> resource servlet: when the most recent sighting of a relay recorded on
> disk is older than six hours, the application will respond with 500s
> (and no useful content) to every subsequent valid request until this
> is righted (and picked up by a node indexer thread.) This seems like a
> probable cause for prolonged downtime periods (Karsten: have you seen
> this happen? If so, how frequently, and how does the updater fail?)
> 
> This behaviour is especially interesting to me as it seems largely
> avoidable: we could maintain a 5xx code (thus warning downstream
> clients), but include outdated results in the body, letting clients
> decide whether they want to fail based on the {relays,bridges}_published
> fields in responses. This would turn an unavoidable downstream failure
> into a preference-based one. While any such change probably merits
> negotiation with client maintainers before further consideration, I do
> have a patch which institutes this proposal:
> 
>   https://github.com/fmapfmapfmap/onionoo-dev/commit/7e1e340e7b18c07e63788138d1b892239d3b4043
>   http://www.webcitation.org/6dqWAgzbY

You're absolutely right that this is a design bug.  Here's an open
ticket for it:

https://trac.torproject.org/projects/tor/ticket/16907

I'm not entirely sure which fix we should apply.  Would you want to
take a look at the ticket and post whether you like the suggested fix,
and if not, suggest the fix above and explain why it's better?
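
If I read your proposal correctly, the behavior would be roughly the
following (a sketch with made-up names, not your actual patch or the
current servlet code):

    import java.io.IOException;
    import javax.servlet.http.HttpServletResponse;

    /* Sketch of the proposed behavior, with made-up names: if the data
     * on disk is older than six hours, keep signalling the problem with
     * a 5xx status (503 here), but still write the stale documents to
     * the body, so that clients can decide for themselves based on the
     * relays_published/bridges_published fields. */
    static void writePossiblyStaleResponse(HttpServletResponse response,
        long lastUpdatedMillis, String responseBody) throws IOException {
      long sixHoursMillis = 6L * 60L * 60L * 1000L;
      if (System.currentTimeMillis() - lastUpdatedMillis > sixHoursMillis) {
        response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
      }
      response.setContentType("application/json");
      response.getWriter().write(responseBody);
    }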

> Resource consumption seems a worthwhile candidate for additional
> investigation; it seems plausible that sufficient front-end load might
> interrupt operation, and even if that turns out not to be the case,
> runtime performance improvements would be broadly beneficial.
> 
> I dispatched a load test from my host machine to a virtual machine (on
> the host, for lack of resources) running the most recent public revision
> of Onionoo (only recent descriptors, nothing from CollecTor). Target
> URLs were generated using a Haskell script deriving random valid request
> URLs from a type-level representation of the protocol document, with
> parameter values annotated by construction rules. I like to think of
> it as a fuzzer aspiring to find 200s :-). This approach yields request
> distributions likely inconsistent with those found in a production
> context, but has the advantage of traversing a greater share of code
> paths:
>   
>   http://hack.rs/~vi/onionoo/URLGenerator.hs
>   sha256: cde747b7fc7d3008a9e69b2eff48d49145f60eece670770ef623368cc41a60f3
>   http://www.webcitation.org/6dldj0YeL
>   https://gist.github.com/fmap/6f44efe6f3ca3f79683b
> 
> siege(1) dispatched 77538 unique requests over an hour; third quartile
> response time was less than a second, with especially general requests
> (i.e. those which aren't filtered much within the request handler)
> responsible for outliers. I'm not sure how frequently these occur in
> production; if someone could derive an estimate from access logs, I'd
> be interested to hear about it.
> 
> Under this load, CPU appeared to be the most clearly saturated resource
> (cf. memory use, NIC RX/TX, and storage device I/O), and so is the
> only component I've profiled in depth so far. Linux performance
> counters sampled stack traces over the lifetime of the load test, with
> translation of Java symbol tables provided by perf-map-agent. I coerced
> the result into a CPU flame graph, representing the most frequently
> traversed code paths over the duration of the sample:
> 
>   http://hack.rs/~vi/onionoo/flame-56e0e01.svg
>   sha256: a6044c127072c670e2565d8f88fa0391cec5079adddbac08917b533507bfb938
>   http://www.webcitation.org/6dl2vzqag
> 
> I suspect the opaque frames (e.g. the curiously displaced network calls
> on the left) are attributable to symbol movement over the duration
> of the profile, but haven't determined how best to accommodate them
> yet. Nonetheless, it appears evident to me that the majority of the
> server process's CPU time is spent inside the response builder, most of
> which is in turn attributable to JSON serialisation (from in-memory
> documents to strings) and de-serialisation (from flat files written to
> disk by the updater, back into memory). The bias is similar when
> making exclusively expensive requests. These seem prime candidates for
> optimisation.
> 
> I haven't yet determined how best to do so, and intend to focus on this
> problem pending further feedback.

Oh wow, this visualization is tough to follow.  Can you explain a bit
more how to read it?

Regarding your remark about JSON (de-)serialization, note that Onionoo
is already optimized quite a bit towards avoiding that where possible.
That works as follows: once per hour, the updater fetches new data and
in the end produces JSON-formatted strings that it writes to disk.  The
servlet reads a (comparatively) small index into memory that it uses to
handle requests, and when it builds responses, it tries hard to avoid
(de-)serializing JSON.

The only situation where this fails is when the request contains the
fields parameter.  Only in that case do we have to deserialize, pick the
fields we want, and serialize again.  I could imagine that this shows up
pretty badly in profiles, and I'd love to fix it; I just don't know
how.  Here's the relevant code:

https://gitweb.torproject.org/onionoo.git/tree/src/main/java/org/torproject/onionoo/server/ResponseBuilder.java#n175
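
Conceptually, that expensive branch boils down to something like this
(a much simplified sketch using Gson, not the actual ResponseBuilder
code):

    import com.google.gson.Gson;
    import com.google.gson.JsonObject;

    /* Simplified sketch of the expensive path: when a "fields"
     * parameter is present, the pre-formatted document string has to be
     * parsed, reduced to the requested fields, and serialized again.
     * Without "fields", the stored string could be written out as-is. */
    static String filterFields(String documentJson, String[] fields) {
      Gson gson = new Gson();
      JsonObject full = gson.fromJson(documentJson, JsonObject.class);
      JsonObject reduced = new JsonObject();
      for (String field : fields) {
        if (full.has(field)) {
          reduced.add(field, full.get(field));
        }
      }
      return gson.toJson(reduced);
    }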

If you have any ideas here, please open a ticket on Tor's bug tracker.
There's at least one active Onionoo maintainer who might not watch this
list as closely but who watches Trac.

> One neat side-effect of styling a URL generator after a fuzzer is that,
> between 400s and some clicking around, you can catch a few discrepancies
> between the protocol and its implementation. Here are a couple I've
> noticed, together with patches (these might belong on Trac..):
>   
>   * The protocol states of the 'fields' parameter:
> 
>     > Field names are case-insensitive.
> 
>     While the implementation disagrees:
> 
>       % for field in as_number AS_Number; do curl -s \
>       %   "https://onionoo.torproject.org/details?limit=1&fields=$field"
>       % done | jq .relays[]
>       {
>         "as_number": "AS24940"
>       }
>       {}
> 
>     Patch: 
> 
>       https://github.com/fmapfmapfmap/onionoo-dev/commit/1571e2c78486551278855c8d5fa6c82b219d8682
>       http://www.webcitation.org/6dqWFMn4g

Good catch!  Will fix, though I think I'll move the toLowerCase() call
to where the fields parameter gets split:

    return parameter.split(",");
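
That is, roughly (a sketch of where I'd put it, not the committed fix):

    /* Lower-case the fields parameter once when splitting it, so that
     * "AS_Number" and "as_number" are treated the same downstream. */
    return parameter.toLowerCase().split(",");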


>   * The protocol describes {first,last}_seen_days:
> 
>     > A parameter value "x-y" with x >= y returns relays or bridges that
>     > have _ been seen at least x and at most y days ago.
> 
>     For "at least x and at most y days" to be inhabited where x /= y,
>     the relation must be x <= y, rather than x >= y. The implementation
>     already agrees:
> 
>       % curl -o /dev/null -w '%{http_code}\n' -s \
>       %   'https://onionoo.torproject.org/details?limit=1&first_seen_days=2-1'
>       400
>       % curl -o /dev/null -w '%{http_code}\n' -s \
>       %   'https://onionoo.torproject.org/details?limit=1&first_seen_days=1-2'
>       200
> 
>     Patch: 
> 
>       https://github.com/fmapfmapfmap/onionoo-dev/commit/71b7f57cb18ccb2cb36f9c6a50b2625b51f08aa2
>       http://www.webcitation.org/6dqWH4Dys

True!  Will fix, too.
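
For reference, the check the implementation effectively performs on an
"x-y" value is just that the lower bound does not exceed the upper
bound.  In rough Java terms, with made-up names and ignoring the other
accepted forms of the parameter (a sketch, not the actual parsing code):

    /* Sketch of the x-y days range check: "1-2" is accepted (x <= y),
     * while "2-1" is rejected with a 400, matching the curl examples
     * above. */
    static boolean isValidDaysRange(String range) {
      String[] parts = range.split("-");
      if (parts.length != 2) {
        return false;
      }
      try {
        int x = Integer.parseInt(parts[0]);
        int y = Integer.parseInt(parts[1]);
        return x >= 0 && x <= y;
      } catch (NumberFormatException e) {
        return false;
      }
    }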

> As you've no doubt discerned, I've made only moderate progress so
> far, but I expect it'll prove beneficial to get this in front of more
> weathered eyes. I'm curious to hear your thoughts! Especially criticisms
> of methodology (I'm pretty new at this.) In the meantime, I'm going to
> see if I can get any profit out of those optimisation candidates.
> 
> Cheers,
> vi
> 


