[metrics-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri Dec 6 16:55:54 UTC 2019


#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
 Reporter:  anarcat          |          Owner:  metrics-team
     Type:  defect           |         Status:  closed
 Priority:  Medium           |      Milestone:
Component:  Metrics/Onionoo  |        Version:
 Severity:  Normal           |     Resolution:  fixed
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------
Changes (by anarcat):

 * status:  merge_ready => closed
 * resolution:   => fixed


Description:

 hello!

 today i noticed that, since last friday (UTC) morning, there have been
 pretty big spikes on the internal network between the ganeti nodes, every
 hour. it looks like this, in grafana:

 [[Image(snap-2019.12.02-16.06.11.png, 700)]]

 We can clearly see a correlation between the two nodes' traffic, in
 reverse: one node's outbound matches the other's inbound. This was
 confirmed using `iftop` and `tcpdump` on the nodes during a surge.
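
 For reference, this kind of check can be done with something like the
 following on the nodes (the interface name and DRBD port range below are
 placeholders, not the cluster's actual values):

 {{{
 # watch per-connection bandwidth on the replication interface
 iftop -i eth1
 # capture only DRBD replication traffic (DRBD resources usually listen
 # on TCP ports starting at 7788, one port per resource)
 tcpdump -ni eth1 'tcp portrange 7788-7799'
 }}}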

 It seems this is due to onionoo-backend-01 hammering the disk and CPU for
 some reason. These are the disk I/O graphs for that host, which correlate
 pretty cleanly with the graphs above:

 [[Image(snap-2019.12.02-16.30.33.png)]]
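
 (To pin the I/O on a specific process from inside the instance, something
 like the following works too - just a generic sketch, not what was
 actually run here:)

 {{{
 # per-process disk read/write rates, sampled every 5 seconds (sysstat package)
 pidstat -d 5
 # show only processes currently doing I/O (iotop package)
 iotop -o
 }}}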

 This was confirmed by an inspection of `drbd`, the mechanism that
 synchronizes the disks across the network. It seems there's a huge surge
 of "writes" on the network every hour, lasting anywhere between 20 and
 30 minutes. This was (somewhat) confirmed by running:

 {{{
 watch -n 0.1 -d cat /proc/drbd
 }}}
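
 /proc/drbd also exposes cumulative per-device counters, so a rough way to
 see which minor is accumulating writes is to sample the "dw:" (disk write,
 in KiB) field and diff it over time - a sketch, assuming the DRBD 8.x
 output format:

 {{{
 # print the cumulative disk-write counter for each DRBD minor
 awk '/^ *[0-9]+:/ { minor = $1 }
      /dw:/       { for (i = 1; i <= NF; i++) if ($i ~ /^dw:/) print minor, $i }' /proc/drbd
 }}}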

 on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
 DRBD. 13 and 17 are the web nodes, so that's expected - probably log
 writes? But device ID 4 is onionoo-backend, which is what led me to the
 big traffic graph.
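
 (Mapping DRBD minors back to instances can be done with something along
 these lines - the exact output format varies between ganeti versions:)

 {{{
 # list each instance's name together with its /dev/drbdN devices
 gnt-instance info --all | egrep -i 'instance name|drbd'
 }}}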

 could someone from metrics investigate?

 can i just turn off this machine altogether, considering it's basically
 trying to murder the cluster every hour? :)

--

Comment:

 wow, that *is* a huge improvement! check this out:

 https://grafana.torproject.org/d/ER3U2cqmk/node-exporter-server-metrics?orgId=1&from=1575563766753&to=1575650166753&var-node=omeiense.torproject.org:9100&var-node=oo-hetzner-03.torproject.org

 in particular:

 [[Image(snap-2019.12.06-11.36.33.png, 700)]]

 large reduction in CPU and memory usage, significant reduction in load!

 [[Image(snap-2019.12.06-11.43.28.png, 700)]]

 also a *dramatic* reduction in disk utilization! in particular, all that
 writing was significantly reduced... but what i find most interesting
 is this:

 [[Image(snap-2019.12.06-11.49.27.png, 700)]]

 i.e. we write less, but we don't read more! even though we're computing
 all those checksums, that extra reading doesn't impose additional load on
 the disks, which is one thing i was worried about.

 but even if we did read more (which we don't), it would still be a
 worthwhile tradeoff because (1) we can cache those reads and (2) we
 (obviously) don't need to replicate reads across the cluster.
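
 (For illustration only - this is not Onionoo's actual code, just the
 general "check before you write" pattern that makes the tradeoff work,
 with made-up file names:)

 {{{
 # write the new output to a temporary file, then only replace the old one
 # -- i.e. only generate a write that DRBD has to replicate -- when the
 # content actually changed; the comparison is pure reads, which the page
 # cache absorbs and which never cross the network
 if ! cmp -s status.json.new status.json; then
     mv status.json.new status.json
 else
     rm status.json.new
 fi
 }}}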

 i can't confirm the effect on the actual ganeti cluster because irl
 (thankfully! :) has turned off those jobs on onionoo-backend-01, but i'm
 now confident the cluster will be happier with this work if/when we turn
 it back on.

 thank you so much for taking the extra time to fix this and take care of
 our hardware. sometimes it's easier to throw hardware at a problem, but
 this seemed like a case where we could improve our algorithms a little,
 and i'm glad it worked out. :)

 all in all, i think this can be marked as fixed - at least it is for me.
 i'll let other tickets cover the rest of the work on this onionoo stuff.
 from what i understand, there's extra work needed to bring that other
 backend online (or build a new one?), but i'll let you folks figure out
 the next steps. :)

 do ping me if you need help on that!

 cheers, and thanks again!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

