[tor-bugs] #32801 [- Select a component]: major outage: kvm4 down, affected: eugeni (mail, lists), alberti (ldap), pauli (puppet), rouyi (jenkins), etc

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Dec 18 21:51:16 UTC 2019


#32801: major outage: kvm4 down, affected: eugeni (mail, lists), alberti (ldap),
pauli (puppet), rouyi (jenkins), etc
--------------------------------------+--------------------
     Reporter:  anarcat               |      Owner:  (none)
         Type:  defect                |     Status:  new
     Priority:  Medium                |  Milestone:
    Component:  - Select a component  |    Version:
     Severity:  Normal                |   Keywords:
Actual Points:                        |  Parent ID:
       Points:                        |   Reviewer:
      Sponsor:                        |
--------------------------------------+--------------------
 During a security reboot today, kvm4.torproject.org did not return. All
 virtual machines on this host are down and unavailable.

 According to the Nextcloud spreadsheet (since LDAP is down), that
 includes:

 || host           || service                 || impact || mitigation ||
 || alberti        || LDAP, db.torproject.org || critical, no password change possible || read-only copies everywhere ||
 || build-x86-09   || buildbox                || redundant || N/A ||
 || eugeni         || incoming mail, lists    || critical, total outage || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
 || meronense      || metrics?                || unclear || ? ||
 || neriniflorum   || DNS                     || redundant, possible reduction in TTFB || possible to remove from rotation ||
 || oo-hetzner-03  || onionoo                 || redundant? unclear? || ? ||
 || pauli          || puppet                  || major, freezes configuration management changes || use `cumin`, local git copies ||
 || rouyi          || jenkins                 || critical, total outage, affects all builds || ? ||
 || web-hetzner-01 || web mirror              || redundant, no effect? || removed from rotation automatically ||
 || weissi         || build box               || no windows builds || N/A ||
 || woronowii      || build box               || no windows builds || N/A ||
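
 The eugeni mitigation (reading the Postfix `virtual` file to find where
 an alias forwards) could be done by hand, or with something like the
 sketch below. This is only an illustration: it assumes the file follows
 the standard Postfix virtual(5) text format, the function names
 `parse_virtual` and `lookup` are made up here, and the sample addresses
 are hypothetical, not real torproject.org aliases.

```python
def parse_virtual(text):
    """Parse Postfix virtual(5)-style text into {pattern: [targets]}.

    Assumes the plain-text map format: '#' comment lines, blank lines
    ignored, and lines starting with whitespace continue the previous
    logical line.
    """
    logical = []
    for raw in text.splitlines():
        if not raw.strip() or raw.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if raw[0].isspace() and logical:
            logical[-1] += " " + raw.strip()  # continuation line
        else:
            logical.append(raw.strip())

    aliases = {}
    for line in logical:
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # no right-hand side, nothing to forward to
        pattern, rhs = parts
        # targets may be separated by commas and/or whitespace
        aliases[pattern.lower()] = rhs.replace(",", " ").split()
    return aliases


def lookup(aliases, address):
    """Return forwarding targets for an address, or the @domain catch-all."""
    address = address.lower()
    if address in aliases:
        return aliases[address]
    domain = "@" + address.partition("@")[2]
    return aliases.get(domain, [])


# Hypothetical sample data, for illustration only.
SAMPLE = """\
# comment line
alice@example.org   alice@mail.example.net
team@example.org    bob@example.com,
    carol@example.com
"""

if __name__ == "__main__":
    table = parse_virtual(SAMPLE)
    print(lookup(table, "team@example.org"))
```

 In practice one would read the file from a local checkout of the
 tor-puppet repository instead of `SAMPLE`, since pauli itself is down.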

 I'll note that both Windows build boxes seem to be on the same
 machine, so even if Jenkins *were* able to dispatch builds, we
 wouldn't be able to run the Windows ones...

 A ticket was filed with Hetzner to try and rescue the server.

 Our disaster recovery plan so far is to wait for that rescue to succeed,
 which might take up to 24h but hopefully less.

 If that fails, I would suggest the following plan:

  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
 (we need those three to build new machines)
  2. build a new ganeti cluster (because we can't recover all of this on
 gnt-fsn)
  3. restore remaining machines on the new cluster
  4. decommission kvm4 officially

 This could take a few days of work. :(

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32801>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

