Re: [tor-project] minutes from the sysadmin meeting

12 Jan 2022

An error crept into the Metrics of this month and last; see if you can
spot it:

On 2022-01-11 20:34:08, Antoine Beaupré wrote:
...
# Metrics of the month
* hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
 * number of Apache servers monitored: 27, hits per second: 185
 * number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
 * number of self-hosted nameservers: 6, mail servers: 8
 * pending upgrades: 7, reboots: 0
 * average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
 * disk free/total: 84.95 TiB/39.99 TiB
 * bytes sent: 325.45 MB/s, received: 190.66 MB/s
 * planned bullseye upgrades completion date: 2024-09-07
 * [GitLab tickets][]: 159 tickets including...
   * open: 2
   * icebox: 143
   * backlog: 8
   * next: 2
   * doing: 2
   * needs information: 2
   * (closed: 2573)
[Gitlab tickets]: https://gitlab.torproject.org/tpo/tpa/team/-/boards

Hint: it's about disk space...

anyone?

Credit to Roger, who figured it out: the disk free/total figures were
backwards. The correct figure should have read:

 * disk free/total: 38.28 TiB/84.95 TiB

... in this report. Future reports shouldn't have this error. It should
also be noted that these metrics should generally be taken with a grain
of salt. The disk query was introduced recently and, in particular,
counts the disk usage of the (huge) backup server (60 TiB), which by
definition keeps a copy of everything.
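
A simple sanity check would have caught the swapped figures, since free
space can never exceed total capacity. Here is a minimal sketch in
Python; the function name `format_disk_metric` is hypothetical and not
part of our actual reporting scripts:

```python
def format_disk_metric(free_tib: float, total_tib: float) -> str:
    """Format the "disk free/total" report line, rejecting impossible values."""
    if free_tib < 0 or total_tib <= 0:
        raise ValueError("disk sizes must be positive")
    if free_tib > total_tib:
        # Free space larger than capacity means the arguments are
        # probably swapped, as in the report above.
        raise ValueError(
            f"free ({free_tib} TiB) exceeds total ({total_tib} TiB)"
        )
    return f" * disk free/total: {free_tib:.2f} TiB/{total_tib:.2f} TiB"
```

With the corrected figures, `format_disk_metric(38.28, 84.95)` produces
the line above; with the arguments swapped, it raises `ValueError`
instead of printing a nonsensical report.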

The network metrics also probably overcount things as we simply do this:

    sum(rate(node_network_transmit_bytes_total[30d]))

... which, in the likely case that you are unfamiliar with Prometheus
and our network infrastructure, means some traffic may be counted twice.
It includes internal traffic between network mirrors, for example.
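
To illustrate the overcounting, here is a toy model in Python. The host
names and rates are invented, and this is not how the metric is actually
computed; it only shows why summing every host's transmit counter
includes internal traffic:

```python
# Hypothetical flows in MB/s: (sender, receiver, rate). "internet" stands
# for any host outside our infrastructure; all values are made up.
FLOWS = [
    ("origin", "mirror1", 10.0),    # internal: origin feeds the mirror
    ("mirror1", "internet", 10.0),  # external: mirror serves a user
]
OUR_HOSTS = {"origin", "mirror1"}


def naive_sent(flows, hosts):
    """What summing the transmit counters of all our hosts measures."""
    return sum(rate for src, _dst, rate in flows if src in hosts)


def external_sent(flows, hosts):
    """Traffic that actually leaves our infrastructure."""
    return sum(
        rate for src, dst, rate in flows
        if src in hosts and dst not in hosts
    )


print(naive_sent(FLOWS, OUR_HOSTS))     # 20.0
print(external_sent(FLOWS, OUR_HOSTS))  # 10.0
```

In this toy case the same 10 MB/s is counted once when the origin sends
it to the mirror and again when the mirror serves it to a user.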

I haven't yet figured out a good (AKA simple) way to fix those queries...

Cheers!

A.

-- 
Antoine Beaupré
torproject.org system administration