An error crept into the Metrics of this month and last; see if you can spot it:
On 2022-01-11 20:34:08, Antoine Beaupré wrote:
# Metrics of the month
- hosts in Puppet: 89, LDAP: 91, Prometheus exporters: 139
- number of Apache servers monitored: 27, hits per second: 185
- number of Nginx servers: 0, hits per second: 0, hit ratio: 0.00
- number of self-hosted nameservers: 6, mail servers: 8
- pending upgrades: 7, reboots: 0
- average load: 0.35, memory available: 4.01 TiB/5.13 TiB, running processes: 643
- disk free/total: 84.95 TiB/39.99 TiB
- bytes sent: 325.45 MB/s, received: 190.66 MB/s
- planned bullseye upgrades completion date: 2024-09-07
- [GitLab tickets][]: 159 tickets including...
  - open: 2
  - icebox: 143
  - backlog: 8
  - next: 2
  - doing: 2
  - needs information: 2
  - (closed: 2573)
hint: it's about disk space...
anyone?
Credit to roger, who figured it out: the disk free/total figures were swapped. The correct line should have read:
* disk free/total: 38.28 TiB/84.95 TiB
... in this report. Future reports shouldn't have this error. Note also that these metrics should generally be taken with a grain of salt. The disk query was introduced recently and, in particular, counts the disk usage of the (huge) backup server (60 TiB), which by definition keeps a copy of everything.
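A cheap guard against this kind of slip is to check that free space never exceeds total space before a report goes out. Here is a minimal Python sketch, assuming a hypothetical helper named `disk_line_is_plausible` (it is not part of the actual report tooling), fed with the two pairs of figures quoted above:

```python
def disk_line_is_plausible(free_tib: float, total_tib: float) -> bool:
    """Return True when the free/total pair is physically possible.

    Hypothetical helper for illustration only: free space can never
    be negative, and it can never exceed the total.
    """
    return 0 <= free_tib <= total_tib


# The line as originally published (free and total swapped).
print(disk_line_is_plausible(84.95, 39.99))

# The corrected line.
print(disk_line_is_plausible(38.28, 84.95))
```

The check only catches a swap; it says nothing about whether the backup server should be counted in the first place.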
The network metrics also probably overcount things as we simply do this:
    sum(rate(node_network_transmit_bytes_total[30d]))
... which, in case you are unfamiliar with Prometheus and our network infrastructure (the most likely case), sums the transmit rate of every network interface on every monitored host, and so may count traffic twice. It will count internal traffic between network mirrors, for example.
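To make the overcounting concrete, here is a toy model in Python. Every host name and number in it is made up for illustration and nothing comes from the real infrastructure. It compares what a sum over per-host transmit counters sees with the traffic that actually leaves the monitored fleet:

```python
# Toy model: all names and rates below are hypothetical.
# Each flow is (sender, receiver, bytes per second).
flows = [
    ("mirror-1", "mirror-2", 100.0),  # internal: mirror-to-mirror sync
    ("mirror-1", "internet", 50.0),   # external: served to a client
    ("mirror-2", "internet", 50.0),   # external: served to a client
]

# Hosts that run an exporter and therefore show up in the sum.
monitored = {"mirror-1", "mirror-2"}


def summed_transmit(flows, monitored):
    """What summing every monitored host's transmit rate reports."""
    return sum(rate for src, _dst, rate in flows if src in monitored)


def external_transmit(flows, monitored):
    """Traffic sent by monitored hosts to hosts outside the fleet."""
    return sum(
        rate
        for src, dst, rate in flows
        if src in monitored and dst not in monitored
    )


total = summed_transmit(flows, monitored)
external = external_transmit(flows, monitored)
print(f"summed: {total} B/s, external only: {external} B/s")
```

In this toy case the summed figure is twice the external one, because the mirror-to-mirror sync is included. The real counters carry no destination information, so this model only shows the effect and is not a way to correct the query.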
I haven't yet figured out a good (AKA simple) way to fix those queries...
Cheers!
A.