[metrics-team] Onionoo: Availability and Performance

vi tor at vikramverma.com
Thu Dec 17 15:58:30 UTC 2015


Hey all, I'm new.

About a month ago I was talking with Virgil, and we discovered a mutual
interest in improving Onionoo's availability. I was getting 503s from
Globe on a (subjectively) routine basis, and presumably other downstream
clients (Atlas, Roster, etc.) were affected in the same way.

I'd like to help improve this situation; this message is intended both
as documentation of my (sporadic, marginal..) work and observations thus
far (specifically: uptime monitoring, a patch I suspect will improve
availability, basic runtime performance analysis, and patches for a
few bugs I noticed along the way), and as a request for assistance,
information, and suggestions.

"It seems down often!" is likely less convincing than real downtime
measurements, so to start with I rigged up a timer to make a cheap
request every three minutes and note down problematic response codes:

  curl -o /dev/null -s -w "$(date +%s) %{http_code}" \
    "https://onionoo.torproject.org/summary?limit=1" \
  | awk '$2 != "200"' >> $HOME/onionoo.bad-response

To my surprise, it's only observed two severe downtime events (read:
more than two sequential 5xx responses) in the month that I've left
it running (all times UTC..):

  http://hack.rs/~vi/onionoo/outage-2015-11-24.svg
  sha256: 5e5d18300d1ec53083ce56369a0d6e0cb7d6fc98a80574a865aa2fe8d09bfe74
  http://www.webcitation.org/6dl2xBpXP

  http://hack.rs/~vi/onionoo/outage-2015-11-27.svg
  sha256: 0626c3f689794e33587b9c3f6e36cd5e43699cda8aed86d617a9004eb020f42a
  http://www.webcitation.org/6dl2yKRHu

Which is great! But I think we can do even better. :-)

I don't know what those specific failures are attributable to (though
I'm curious to know.. Karsten?). In the absence of information, I'd
guess some variety of resource starvation (saturation causing requests
to time out consistently, process failure leaving stale locks, and such
things.)

Skimming the code, I noticed an interesting failure mode in the
resource servlet: when the most recent sighting of a relay recorded on
disk is older than six hours, the application will respond with 500s
(and no useful content) to every subsequent valid request until this
is righted (and picked up by a node indexer thread.) This seems like
a probable cause of prolonged downtime periods (Karsten: have you seen
this happen? If so, how frequently, and how does the updater fail?)
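
As an aside, the relays_published field in responses makes it cheap to
watch for this condition from the outside before the cutoff is reached;
something along these lines would do (a sketch only: the five-hour
threshold is arbitrary, and it assumes GNU date and jq):

  # Warn while relays_published is approaching the servlet's six-hour
  # staleness cutoff; threshold and alerting are placeholders.
  published=$(curl -sf 'https://onionoo.torproject.org/summary?limit=1' \
    | jq -r .relays_published)
  age=$(( $(date +%s) - $(date -u -d "$published" +%s) ))
  [ "$age" -gt $((5 * 3600)) ] \
    && echo "relays_published is ${age}s old; 500s imminent" >&2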

This behaviour is especially interesting to me as it seems largely
avoidable: we could maintain a 5xx code (thus warning downstream
clients), but include outdated results in the body, letting clients
decide whether they want to fail based on the {relays,bridges}_published
fields in responses. This would turn an unavoidable downstream failure
into a preference-based one. While any such change probably merits
negotiation with client maintainers before further consideration, I do
have a patch which implements this proposal:

  https://github.com/fmapfmapfmap/onionoo-dev/commit/7e1e340e7b18c07e63788138d1b892239d3b4043
  http://www.webcitation.org/6dqWAgzbY
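
To sketch what client-side handling could look like under that scheme
(illustrative only; the one-day tolerance below is an arbitrary example,
and the staleness test is the same relays_published comparison as in the
earlier snippet):

  # Accept the proposed 5xx-with-body responses as long as the data
  # isn't too stale for this client's taste.
  code=$(curl -s -o details.json -w '%{http_code}' \
    'https://onionoo.torproject.org/details?limit=1')
  published=$(jq -r .relays_published details.json)
  age=$(( $(date +%s) - $(date -u -d "$published" +%s) ))
  if [ "$code" = 200 ] || [ "$age" -lt $((24 * 3600)) ]; then
    jq .relays details.json    # proceed, possibly with outdated data
  else
    echo "got $code with data ${age}s old; giving up" >&2
  fi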

Resource consumption seems a worthwhile candidate for further
investigation: it seems plausible that sufficient front-end load could
interrupt operation, and even if that turns out not to be the case,
runtime performance improvements are broadly beneficial.

I dispatched a load test from my host machine to a virtual machine (on
the host, for lack of resources) running the most recent public revision
of Onionoo (only recent descriptors, nothing from CollecTor). Target
URLs were generated using a Haskell script deriving random valid request
URLs from a type-level representation of the protocol document, with
parameter values annotated by construction rules. I like to think of
it as a fuzzer aspiring to find 200s :-). This approach yields request
distributions that probably differ from those found in a production
context, but has the advantage of traversing a larger share of the
code paths:
  
  http://hack.rs/~vi/onionoo/URLGenerator.hs
  sha256: cde747b7fc7d3008a9e69b2eff48d49145f60eece670770ef623368cc41a60f3
  http://www.webcitation.org/6dldj0YeL
  https://gist.github.com/fmap/6f44efe6f3ca3f79683b

siege(1) dispatched 77538 unique requests over an hour; third-quartile
response time was less than a second, with especially general requests
(i.e. those which aren't filtered much within the request handler)
responsible for the outliers. I'm not sure how frequently these occur in
production; if someone could derive an estimate from the access logs,
I'd be interested to hear about it.
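
(For the record, the invocation was of roughly this shape; the
concurrency level and file names below are illustrative rather than
exact, and it assumes the generator writes one URL per line to stdout:)

  # Generate a batch of random valid request URLs, then replay them
  # against the test instance with siege for an hour.
  runghc URLGenerator.hs > urls.txt
  siege -f urls.txt -c 25 -b -t 1H --log=siege.log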

Under this load, CPU appeared to be the most clearly saturated resource
(cf. memory use, NIC RX/TX, and storage device I/O), and so is the
only component I've profiled in depth so far. Linux performance
counters sampled stack traces over the lifetime of the load test, with
translation of Java symbol tables provided by perf-map-agent. I coerced
the result into a CPU flame graph, representing the most frequently
traversed code paths over the duration of the sample:

  http://hack.rs/~vi/onionoo/flame-56e0e01.svg
  sha256: a6044c127072c670e2565d8f88fa0391cec5079adddbac08917b533507bfb938
  http://www.webcitation.org/6dl2vzqag
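
(For anyone wanting to reproduce this, the pipeline was roughly the
following; the paths to perf-map-agent and Brendan Gregg's FlameGraph
scripts are placeholders, and accurate Java stacks generally want the
JVM started with -XX:+PreserveFramePointer:)

  # Sample on-CPU stacks of the Onionoo JVM at 99Hz, write out a symbol
  # map for the JIT-compiled frames, then fold the samples into an SVG.
  pid=$(pgrep -f onionoo)
  perf record -F 99 -g -p "$pid" -- sleep 3600
  ~/perf-map-agent/bin/create-java-perf-map.sh "$pid"
  perf script | ~/FlameGraph/stackcollapse-perf.pl \
    | ~/FlameGraph/flamegraph.pl > flame.svg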

I suspect the opaque frames (e.g. the curiously displaced network calls
on the left) are attributable to symbol movement over the duration
of the profile, but haven't determined how best to accommodate them
yet. Nonetheless, it appears evident to me that the majority of the
server process's CPU time is spent inside the response builder, most of
which is in turn attributable to JSON serialisation (from in-memory
documents to strings) and de-serialisation (from flat files written
to disk by the updater, back into memory). The bias is similar when
making exclusively expensive requests. These seem prime candidates for
optimisation.

I haven't yet determined how best to do so, and intend to focus on this
problem pending further feedback.

One neat side-effect of styling a URL generator after a fuzzer is that,
between 400s and some clicking around, you can catch a few discrepancies
between the protocol and its implementation. Here are a couple I've
noticed, together with patches (these might belong on Trac..):
  
  * The protocol states of the 'fields' parameter:

    > Field names are case-insensitive.

    While the implementation disagrees:

      % for field in as_number AS_Number; do curl -s \
      %   "https://onionoo.torproject.org/details?limit=1&fields=$field"
      % done | jq .relays[]
      {
        "as_number": "AS24940"
      }
      {}

    Patch: 

      https://github.com/fmapfmapfmap/onionoo-dev/commit/1571e2c78486551278855c8d5fa6c82b219d8682
      http://www.webcitation.org/6dqWFMn4g

  * The protocol describes {first,last}_seen_days:

    > A parameter value "x-y" with x >= y returns relays or bridges that
    > have _ been seen at least x and at most y days ago.

    For "at least x and at most y days" to be inhabited where x /= y, a
    the relation must be x <= y, rather than y >= x. The implementation
    already agrees:

      % curl -o /dev/null -w '%{http_code}\n' -s \
      %   'https://onionoo.torproject.org/details?limit=1&first_seen_days=2-1'
      400
      % curl -o /dev/null -w '%{http_code}\n' -s \
      %   'https://onionoo.torproject.org/details?limit=1&first_seen_days=1-2'
      200

    Patch: 

      https://github.com/fmapfmapfmap/onionoo-dev/commit/71b7f57cb18ccb2cb36f9c6a50b2625b51f08aa2
      http://www.webcitation.org/6dqWH4Dys

As you've no doubt discerned, I've made only moderate progress so
far, but I expect it'll prove beneficial to get this in front of more
weathered eyes. I'm curious to hear your thoughts! Especially criticisms
of methodology (I'm pretty new at this.) In the meantime, I'm going to
see if I can get any profit out of those optimisation candidates.

Cheers,
vi

