[tor-bugs] #31244 [Internal Services/Tor Sysadmin Team]: long term prometheus metrics

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Oct 22 17:55:33 UTC 2019


#31244: long term prometheus metrics
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  enhancement                          |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):

 * owner:  tpa => anarcat
 * status:  new => assigned


Comment:

 I've decided to postpone the creation of a secondary server and instead
 change the retention period and scrape interval on the current server to
 see if that fixes the reliability issues detailed in #31916. If, in 30
 days, we still have this problem, we can set up a secondary server to see
 if we can reproduce the problem there. After all, we don't need a
 redundant setup as long as we don't do alerting, for which we still use
 Nagios (#29864). See also the commit log for more details:

 {{{
 origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
 Author:     Antoine Beaupré <anarcat at debian.org>
 AuthorDate: Tue Oct 22 13:46:05 2019 -0400
 Commit:     Antoine Beaupré <anarcat at debian.org>
 CommitDate: Tue Oct 22 13:46:05 2019 -0400

 Parent:     91e379a5 make all mpm_worker parameters configurable
 Merged:     master sudo-ldap
 Contained:  master

 downgrade scrape interval on internal prometheus server (#31916)

 This is an attempt at fixing the reliability issues on the prometheus
 server detailed in #31916. The current theory is that ipsec might be
 the culprit, but it's also possible that Prometheus is overloaded
 and that's creating all sorts of other, unrelated problems.

 This is sidetracking the setup of a *separate* long term monitoring
 server (#31244), of course, but I'm not sure that's really necessary
 for now. Since we don't use prometheus for alerting (#29864), we don't
 absolutely /need/ redundancy here so we can afford a SPOF for
 Prometheus while we figure out this bug.

 If, in thirty days, we still have reliability problems, we will know
 this is not due to the retention period and can cycle back to the
 other solutions, including creating a secondary server to see if it
 reproduces the problem.

 1 file changed, 2 insertions(+), 1 deletion(-)
 modules/profile/manifests/prometheus/server/internal.pp | 3 ++-

 modified   modules/profile/manifests/prometheus/server/internal.pp
 @@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
      vhost_name          => $vhost_name,
      collect_scrape_jobs => $collect_scrape_jobs,
      scrape_configs      => $scrape_configs,
 -    storage_retention   => '30d',
 +    storage_retention   => '365d',
 +    scrape_interval     => '5m',
    }
    # expose our IP address to exporters so they can allow us in
    #

 }}}
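
 For a rough sense of the trade-off in the diff above: assuming the old
 setup scraped at Prometheus's default 15s interval (an assumption, the
 diff only shows the new 5m value), the longer retention combined with the
 slower scrape interval actually stores *fewer* samples per time series
 than before. A back-of-envelope sketch:

 {{{
 # Back-of-envelope: samples stored per time series under each config.
 # ASSUMPTION: the old setup used Prometheus's default 15s scrape interval.

 DAY = 86400  # seconds per day

 def samples_per_series(retention_days: int, scrape_interval_s: int) -> int:
     """Approximate number of samples retained for one time series."""
     return retention_days * DAY // scrape_interval_s

 old = samples_per_series(30, 15)    # 30d retention, 15s interval
 new = samples_per_series(365, 300)  # 365d retention, 5m interval

 print(old)  # 172800
 print(new)  # 105120
 }}}

 So the 365d/5m configuration holds roughly 40% fewer samples per series
 than 30d/15s, which is consistent with treating this as a load reduction
 rather than an increase.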

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31244#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
