[metrics-bugs] #32660 [Metrics/Onionoo]: onionoo-backend is killing the ganeti cluster

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Dec 2 21:55:07 UTC 2019


#32660: onionoo-backend is killing the ganeti cluster
-----------------------------+------------------------------
 Reporter:  anarcat          |          Owner:  metrics-team
     Type:  defect           |         Status:  new
 Priority:  Medium           |      Milestone:
Component:  Metrics/Onionoo  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------

Old description:

> hello!
>
> today i noticed that, since last friday (UTC) morning, there have been
> pretty big spikes on the internal network between the ganeti nodes, every
> hour. it seems this is due to onionoo-backend-01 blasting the disk and
> CPU for some reason.
>
> could someone from metrics investigate? can i just turn off this machine
> altogether, considering it's basically trying to murder the cluster every
> hour? :)
>
> (will attach explanatory screenshots)

New description:

 hello!

 today i noticed that, since last friday (UTC) morning, there have been
 pretty big spikes on the internal network between the ganeti nodes, every
 hour. it looks like this, in grafana:

 [[Image(snap-2019.12.02-16.06.11.png)]]

 We can clearly see a correlation between the two nodes' traffic, in
 reverse. This was confirmed using `iftop` and `tcpdump` on the nodes
 during a surge.

 It seems this is due to onionoo-backend-01 blasting the disk and CPU for
 some reason. These are the disk I/O graphs for that host, which correlate
 pretty cleanly with the above graphs:

 [[Image(snap-2019.12.02-16.30.33.png)]]

 This was confirmed by an inspection of `drbd`, the mechanism that
 synchronizes the disks across the network. It seems there's a huge surge
 of "writes" on the network every hour which lasts anywhere between 20 and
 30 minutes. This was (somewhat) confirmed by running:

 {{{
 watch -n 0.1 -d cat /proc/drbd
 }}}

 on the nodes. The device IDs 4, 13 and 17 trigger a lot of changes in
 DRBD. 13 and 17 are the web nodes, so that's expected - probably log
 writes? But device ID 4 is onionoo-backend, which is what led me to the
 big traffic graph.
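
For anyone who wants something less eyeball-driven than `watch`, here is a rough sketch that takes two snapshots of `/proc/drbd` and ranks the devices by how much their network-send counter grew in between. It assumes the DRBD 8.x layout, where each device has a status line starting with its minor number followed by a counters line with `ns:` (KiB sent to the peer) and `dw:` (KiB written to disk). The function names (`parse_drbd`, `busiest_devices`, `sample`) are made up for this example and this is not what was actually run on the nodes.

```python
import re
import time

# Assumption: DRBD 8.x /proc/drbd layout, e.g.
#   " 4: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----"
#   "    ns:1234 nr:0 dw:5678 dr:910 al:12 bm:0 ..."
DEVICE_RE = re.compile(r"^\s*(\d+):\s+cs:")
COUNTER_RE = re.compile(r"\b(ns|nr|dw|dr):(\d+)")


def parse_drbd(text):
    """Return {minor: {"ns": int, "nr": int, "dw": int, "dr": int}}."""
    devices = {}
    current = None
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            # A new device section starts with "<minor>: cs:..."
            current = int(match.group(1))
            devices[current] = {}
        if current is not None:
            # Collect whichever of the four counters appear on this line.
            for name, value in COUNTER_RE.findall(line):
                devices[current][name] = int(value)
    return devices


def busiest_devices(before, after, counter="ns"):
    """Rank device minors by the growth of `counter` between two snapshots."""
    deltas = {
        minor: after[minor].get(counter, 0) - before[minor].get(counter, 0)
        for minor in after
        if minor in before
    }
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


def sample(path="/proc/drbd", interval=10.0):
    """Read `path` twice, `interval` seconds apart, and rank the devices."""
    with open(path) as handle:
        before = parse_drbd(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        after = parse_drbd(handle.read())
    return busiest_devices(before, after)
```

Calling `sample()` on a node during one of the hourly surges should return `(minor, delta)` pairs with the noisiest device first. Passing `counter="dw"` to `busiest_devices` ranks by local disk writes instead.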

 could someone from metrics investigate?

 can i just turn off this machine altogether, considering it's basically
 trying to murder the cluster every hour? :)

--

Comment (by anarcat):

 attached screenshots and further explanations.

 the TL;DR here is: can i shut down this backend?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32660#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

