[tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01

Tue Oct 22 17:53:35 UTC 2019

#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:
                                                 |  needs_review
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Blocker                              |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):

 * status:  assigned => needs_review

Comment:

 As I can't figure out the network issue, I'm trying another tack. I've
 extended the scrape_interval from 15s to 5m and raised the
 storage_retention from 30d to 365d. The retention change shouldn't take
 effect for 30 days, while the scrape interval change will finish
 converting the database within those same 30 days. If, after 30 days, we
 still have this problem, we'll know it is not caused by the aggressive
 scrape and retention settings, and we might want to consider setting up
 a secondary server (#31244) to see if it can reproduce the problem.

 Or, as the commit log says:

 {{{
 origin/master 7cda3928fe9c6bf83ee3e8977b74d58acbb7519a
 Author:     Antoine Beaupré <anarcat at debian.org>
 AuthorDate: Tue Oct 22 13:46:05 2019 -0400
 Commit:     Antoine Beaupré <anarcat at debian.org>
 CommitDate: Tue Oct 22 13:46:05 2019 -0400

 Parent:     91e379a5 make all mpm_worker paramaters configurable
 Merged:     master sudo-ldap
 Contained:  master

 downgrade scrape interval on internal prometheus server (#31916)

 This is an attempt at fixing the reliability issues on the prometheus
 server detailed in #31916. The current theory is that ipsec might be
 the culprit, but it's also possible that the prometheus is overloaded
 and that's creating all sorts of other, unrelated problems.

 This is sidetracking the setup of a *separate* long term monitoring
 server (#31244), of course, but I'm not sure that's really necessary
 for now. Since we don't use prometheus for alerting (#29864), we don't
 absolutely /need/ redundancy here so we can afford a SPOF for
 Prometheus while we figure out this bug.

 If, in thirty days, we still have reliability problems, we will know
 this is not due to the retention period and can cycle back to the
 other solutions, including creating a secondary server to see if it
 reproduces the problem.

 1 file changed, 2 insertions(+), 1 deletion(-)
 modules/profile/manifests/prometheus/server/internal.pp | 3 ++-

 modified   modules/profile/manifests/prometheus/server/internal.pp
 @@ -42,7 +42,8 @@ class profile::prometheus::server::internal (
      vhost_name          => $vhost_name,
      collect_scrape_jobs => $collect_scrape_jobs,
      scrape_configs      => $scrape_configs,
 -    storage_retention   => '30d',
 +    storage_retention   => '365d',
 +    scrape_interval     => '5m',
    }
    # expose our IP address to exporters so they can allow us in
    #

 }}}
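
 As a rough sanity check on the change above, the number of samples a
 single time series holds once retention is fully populated can be
 estimated from the two settings. This is only a sketch: the function
 name is hypothetical, and it assumes a constant number of series and
 ignores compression and series churn, so it says nothing definitive
 about actual disk or CPU usage on the server.

```python
# Back-of-the-envelope estimate, not a measurement from the server.
# Assumption: one sample per scrape per series, with no gaps.
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_series(scrape_interval_s: int, retention_days: int) -> int:
    """Samples one series holds once the retention window is full."""
    return retention_days * SECONDS_PER_DAY // scrape_interval_s


old = samples_per_series(15, 30)    # 15s interval, 30d retention
new = samples_per_series(300, 365)  # 5m interval, 365d retention

print(f"old: {old} samples per series")
print(f"new: {new} samples per series")
print(f"new/old ratio: {new / old:.2f}")
```

 Under these assumptions the new settings keep fewer samples per series
 than the old ones (105120 versus 172800), even though the retention is
 about twelve times longer, and the scrape rate itself drops by a factor
 of 20.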

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online