[tor-bugs] #29864 [Internal Services/Tor Sysadmin Team]: consider replacing nagios with prometheus

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Apr 15 17:54:07 UTC 2019


#29864: consider replacing nagios with prometheus
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  Low                                  |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------
Description changed by anarcat:

New description:

 As a followup to the Prometheus/Grafana setup started in #29681, I am
 wondering if we should also consider replacing the Nagios/Icinga server
 with Prometheus. I have done a little research on the subject and figured
 it might be good to at least document the current state of affairs.

 This would remove a complex piece of architecture we have at TPO that was
 designed before Puppet was properly deployed. Prometheus has an
 interesting federated design that allows it to scale to multiple machines
 easily, along with a high-availability component for the alertmanager that
 allows it to be more reliable than a traditional Nagios configuration. It
 would also simplify our architecture: the Nagios server automation is a
 complex mix of Debian packages and git hooks that has served us well, but is
 hard to comprehend and debug for new administrators. (I managed to wipe
 the entire Nagios config myself on my first week on the job by messing up
 a configuration file.) Having the monitoring server fully deployed by
 Puppet would be a huge improvement, even if it would be done with Nagios
 instead of Prometheus, of course.

 Right now the Nagios server is actually running Icinga 1.13, a Nagios
 fork, on a Hetzner machine (`hetzner-hel1-01`). It's doing its job
 generally well, although it feels a *little* noisy, but that's to be
 expected from Nagios servers. Reducing the number of alerts seems to be an
 objective, explicitly documented in #29410, for example.

 Both Grafana and Prometheus can do alerting, with various mechanisms and
 plugins. I haven't investigated those deeply, but in general that's not the
 hard part of alerting: you fire some script or hit an API and the rest
 happens. I suspect we could port the current Nagios alerting scripts to
 Prometheus fairly easily, although I haven't investigated our scripts in
 detail.

 The problem is reproducing the check scripts and their associated alert
 thresholds. In the Nagios world, when a check is installed, it *comes* with
 its own health thresholds ("OK", "WARNING", "CRITICAL"), and TPO has
 developed a wide variety of such checks. According to the current Nagios
 dashboard, it monitors 4612 services on 88 hosts (which is interesting
 considering LDAP thinks there are 78). That looks terrifying, but it's
 actually a set of 9 commands running on the Nagios server, including the
 complex `check_nrpe` system, which is basically a client-side Nagios that
 has its own set of checks. And that's where the "cardinality explosion"
 happens: on a typical host, there are 315 such checks implemented.
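
 To make the split concrete, here is a rough sketch of how that looks in
 standard Nagios/NRPE syntax (the service and command names here are
 illustrative, not copied from our actual configuration). On the Nagios
 server, one of the 9 commands is `check_nrpe`, parameterized per service:

 {{{
 # on the Nagios/Icinga server: one service definition per NRPE check
 define service {
     use                 generic-service
     host_name           example-host
     service_description disk space /
     check_command       check_nrpe!check_disk_root
 }
 }}}

 On each monitored host, the NRPE daemon then maps that command name to a
 local plugin invocation that carries its own thresholds, which is where
 the 315 per-host checks come from:

 {{{
 # on the monitored host, e.g. /etc/nagios/nrpe.d/checks.cfg
 command[check_disk_root]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /
 }}}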

 That's the hard part: converting those 324 checks (the 9 server-side
 commands plus the 315 per-host NRPE checks) into Prometheus alerts, one at
 a time. Unfortunately, there are no "built-in" or even "third-party"
 Prometheus alert rule sets that I could find in my
 [https://anarc.at/blog/2018-01-17-monitoring-prometheus/ original
 research], although that might have changed in the last year.

 Each check in Prometheus is basically a YAML rule describing a Prometheus
 query (a PromQL expression) that, when it evaluates to "true" (e.g. disk
 usage > 90%), sends an alert. It's not impossible to do that conversion;
 it's just a lot of work.
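
 For example, the disk-usage case above would look roughly like this as a
 Prometheus alerting rule. This is only a sketch: the rule name, thresholds,
 labels and annotations are made up for illustration, and the exact metric
 names depend on the node_exporter version (older versions use
 `node_filesystem_avail` instead of `node_filesystem_avail_bytes`):

 {{{
 groups:
   - name: disk
     rules:
       - alert: DiskAlmostFull
         # fire when less than 10% of space is left on any filesystem,
         # sustained for 15 minutes
         expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes < 10
         for: 15m
         labels:
           severity: warning
         annotations:
           summary: "disk almost full on {{ $labels.instance }} ({{ $labels.mountpoint }})"
 }}}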

 To do this progressively while allowing us to create new alerts in
 Prometheus instead of Nagios, I suggest we proceed the same way Cloudflare
 did, which is to establish a "Nagios to Prometheus" bridge: Nagios doesn't
 send the alerts on its own and instead forwards them to the Prometheus
 alertmanager, through a bridge they called
 [https://github.com/cloudflare/promsaint Promsaint].
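
 Whether we end up using Promsaint itself or a simpler shim, the general
 shape of such a bridge is a Nagios notification command that turns the
 usual notification macros into an HTTP POST against the alertmanager API.
 A minimal sketch of what that script could look like (the alertmanager
 address, label names and macro choices are assumptions for illustration,
 and a real bridge would need proper escaping and "resolved" handling):

 {{{
 #!/bin/sh
 # nagios-to-alertmanager (sketch): wired up as a Nagios notification
 # command with arguments $HOSTNAME$ $SERVICEDESC$ $SERVICESTATE$ $SERVICEOUTPUT$
 HOST="$1"; SERVICE="$2"; STATE="$3"; OUTPUT="$4"
 # push one notification to the alertmanager as an alert with a few labels
 curl -s -XPOST http://alertmanager.example.org:9093/api/v1/alerts -d "[{
   \"labels\": {
     \"alertname\": \"NagiosCheck\",
     \"instance\": \"$HOST\",
     \"service\": \"$SERVICE\",
     \"severity\": \"$STATE\"
   },
   \"annotations\": {\"summary\": \"$OUTPUT\"}
 }]"
 }}}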

 With the bridge in place, Nagios checks can be migrated into Prometheus
 alerts progressively without disruption. Note that Cloudflare documented
 their experience with Prometheus in [https://promcon.io/2017-munich/talks
 /monitoring-cloudflares-planet-scale-edge-network-with-prometheus/ this
 2017 PromCon talk]. Cloudflare also made an alert dashboard called
 [https://github.com/cloudflare/unsee unsee] (see also the fork called
 [https://github.com/prymitive/karma karma]) and an
 [https://github.com/cloudflare/alertmanager2es Elasticsearch integration],
 which might be good to investigate further.

 Another useful piece is this [https://www.robustperception.io/nagios-nrpe-
 prometheus-exporter NRPE to Prometheus exporter], which allows Prometheus
 to scrape NRPE targets directly. It doesn't ship Prometheus alerts and
 instead relies on a Grafana dashboard to show possible problems, so I
 don't think it's that useful as an alternative on its own. There's a
 [https://github.com/m-lab/prometheus-nagios-exporter similar approach
 using check_mk] instead.
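
 For reference, scraping NRPE through that exporter would look roughly like
 this in the Prometheus configuration. This assumes the exporter follows the
 usual multi-target exporter pattern (a `command` parameter plus a `target`
 parameter, with the exporter listening on port 9275); the exact parameter
 names and port should be checked against the exporter's README:

 {{{
 scrape_configs:
   - job_name: nrpe_check_load
     metrics_path: /export
     params:
       command: [check_load]     # one scrape job per NRPE command
     static_configs:
       - targets: ['example-host.torproject.org:5666']  # the NRPE daemon
     relabel_configs:
       # pass the NRPE daemon address as the "target" parameter...
       - source_labels: [__address__]
         target_label: __param_target
       - source_labels: [__param_target]
         target_label: instance
       # ...and actually scrape the exporter itself
       - target_label: __address__
         replacement: 127.0.0.1:9275
 }}}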

 Another possible approach is to send alerts from Nagios based on
 Prometheus checks, using the [https://github.com/prometheus/nagios_plugins
 Prometheus Nagios plugins]. This might allow us to get rid of NRPE
 everywhere, but it would probably only be useful if we want to keep
 Nagios in the long term and replace NRPE with the existing
 Prometheus exporters.
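
 In that model, a Nagios check would be a query against the Prometheus HTTP
 API with the warning/critical thresholds attached on the Nagios side,
 something like the following (the option names are from memory of the
 plugin's usage text and should be double-checked, and the query and
 thresholds are just examples):

 {{{
 # run from the Nagios server as a check command (sketch); thresholds,
 # options and query are illustrative, to be verified against the plugin
 /usr/lib/nagios/plugins/check_prometheus_metric.sh \
     -H http://prometheus.example.org:9090 \
     -q 'node_load1{instance="example-host.torproject.org:9100"}' \
     -w 5 -c 10 -n load1
 }}}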

 So, the battle plan is basically this:

  1. `apt install prometheus-alertmanager`
  2. reimplement the Nagios alerting commands
  3. send Nagios alerts through the alertmanager (see the sketch below)
  4. rewrite the (non-NRPE) commands (9) as Prometheus alerts
  5. optionally, scrape the NRPE metrics from Prometheus
  6. optionally, create a dashboard and/or alerts for the NRPE metrics
  7. rewrite NRPE commands (300+) as Prometheus alerts
  8. turn off the Nagios server
  9. remove all traces of NRPE on all nodes
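
 For steps 1 to 3, a minimal alertmanager configuration could look roughly
 like this (the addresses and receiver name are made up for illustration,
 and the real thing would presumably be templated through Puppet):

 {{{
 # /etc/prometheus/alertmanager.yml (sketch)
 global:
   smtp_smarthost: 'localhost:25'
   smtp_from: 'alertmanager@example.torproject.org'
   smtp_require_tls: false   # assuming a local relay

 route:
   receiver: admins
   group_by: ['alertname', 'instance']
   group_wait: 30s
   group_interval: 5m
   repeat_interval: 4h

 receivers:
   - name: admins
     email_configs:
       - to: 'admin@example.torproject.org'
 }}}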

--

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29864#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

