[tor-bugs] #32802 [Internal Services/Tor Sysadmin Team]: decommission kvm4

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Dec 19 01:22:05 UTC 2019


#32802: decommission kvm4
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 Here's the disaster recovery plan I made up on the fly in #32801, which is
 relevant to the discussion here:

 > According to the Nextcloud spreadsheet (since LDAP is down), [machines
 running on kvm4] include:
 >
 > || host           || service              || impact                      || mitigation ||
 > || alberti        || LDAP, db.tpo         || critical, no passwd change  || read-only copies everywhere ||
 > || build-x86-09   || buildbox             || redundant                   || N/A ||
 > || eugeni         || incoming mail, lists || critical, total outage      || peek at `tor-puppet/modules/postfix/files/virtual` and email people directly ||
 > || meronense      || metrics.tpo          || critical, total outage      || ? ||
 > || neriniflorum   || DNS                  || redundant, higher TTFB?     || possible to remove from rotation ||
 > || oo-hetzner-03  || onionoo              || redundant                   || ? ||
 > || pauli          || puppet               || major, no config management || use `cumin`, local git copies ||
 > || rouyi          || jenkins              || critical, total outage      || ? ||
 > || web-hetzner-01 || web mirror           || redundant, no effect?       || removed from rotation automatically ||
 > || weissi         || build box            || no windows builds           || N/A ||
 > || woronowii      || build box            || no windows builds           || N/A ||
 >
 > I'll note that both Windows build boxes seem to be on the same machine,
 so even if Jenkins *were* able to dispatch builds, we wouldn't be able to
 run the Windows ones...
 >
 > Our disaster recovery plan so far is to wait for that rescue to succeed,
 which might take up to 24h but hopefully less.
 >
 > If that fails, I would suggest the following plan:
 >
 >  1. recover eugeni, pauli, alberti from backups on gnt-fsn or elsewhere
 (we need those three to build new machines)
 >  2. build a new ganeti cluster (because we can't recover all of this on
 gnt-fsn)
 >  3. restore remaining machines on the new cluster
 >  4. decommission kvm4 officially
 >
 > This could take a few days of work. :(

 Out of that, I would outline the following plan:

  1. in the short term: migrate eugeni, pauli and alberti to an HA cluster,
 probably gnt-fsn (yes, that means it will be over-allocated even more)
  2. in parallel or after (January): add a node or two to the ganeti
 cluster
  3. migrate meronense, neriniflorum, oo-hetzner-03, and rouyi to the new
 cluster

 This would leave the following boxes on kvm4, with the following
 rationale:

  * build-x86-09 - highly redundant, not urgent
  * web-hetzner-01 - one web node already present in the gnt-fsn cluster,
 moving this will not bring us more redundancy
  * weissi - hard to migrate
  * woronowii - hard to migrate
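
 As a cross-check on that list, here is a minimal Python sketch that
 derives the leftover hosts from the table and the two migration steps
 above. The names `KVM4_HOSTS`, `TO_GNT_FSN`, `TO_NEW_CLUSTER` and
 `remaining_on_kvm4` are made up for illustration; nothing like this
 exists in our tooling as far as this ticket is concerned.

```python
# Hosts currently on kvm4, copied from the table in this comment.
KVM4_HOSTS = [
    "alberti", "build-x86-09", "eugeni", "meronense", "neriniflorum",
    "oo-hetzner-03", "pauli", "rouyi", "web-hetzner-01", "weissi",
    "woronowii",
]

# Step 1 of the plan: move to an HA cluster (probably gnt-fsn).
TO_GNT_FSN = ["eugeni", "pauli", "alberti"]

# Step 3 of the plan: move once the ganeti cluster has more nodes.
TO_NEW_CLUSTER = ["meronense", "neriniflorum", "oo-hetzner-03", "rouyi"]


def remaining_on_kvm4(all_hosts, *migrations):
    """Return the sorted hosts not covered by any migration step."""
    moved = set().union(*migrations)
    unknown = moved - set(all_hosts)
    if unknown:
        # Catch typos: a migration names a host that is not on kvm4.
        raise ValueError(f"not on kvm4: {sorted(unknown)}")
    return sorted(set(all_hosts) - moved)


if __name__ == "__main__":
    for host in remaining_on_kvm4(KVM4_HOSTS, TO_GNT_FSN, TO_NEW_CLUSTER):
        print(host)
```

 Running it prints the same four hosts as the list above, so the two
 migration steps and the leftover list together cover every machine in
 the table.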

 At that point we'd have the choice to migrate the two Windows VMs (ugh)
 and the build box to the ganeti cluster, and we'd probably decom
 web-hetzner-01 or move it to kvm5 or some other host, then decom kvm4.

 How does that sound for a plan?

 Tickets would need to be created for each one of those tasks.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32802#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

