[tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri Mar 13 19:21:24 UTC 2020


#33406: automate reboots
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  Low                                  |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:  tpa-roadmap-march                    |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Old description:

> in #31957 we have worked on automating upgrades, but that's only part of
> the problem. we also need to reboot in some situations.
>
> we have various mechanisms to do so right now:
>
>  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
> to be removed when we finish the kvm-ganeti migration
>  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
> useful to reboot a single machine
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=justdoit` in LDAP and reboot them with
> `torproject-reboot-many`
>  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
> with `rebootPolicy=rotation` in LDAP and reboot them with
> `torproject-reboot-many`, with a 30 minute delay between each host
>  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster
>
> There are various problems with all this:
>
>  * the `torproject-reboot-*` scripts do not take care of
> `rebootPolicy=manual` hosts
>  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
> is unbalanced
>  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
> each other over IPv6 somehow (see #33412)
>  * we have 5 different ways of performing reboots, we should have just
> one script that does it all
>  * reboot-{host,guest} do not check if hosts need reboot before rebooting
> (but the multi-tool does)
>
> In short, this is kind of a mess, and we should refactor this. We should
> consider using needrestart, which knows how to reboot individual hosts.
>
> I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
> feature request to the needrestart puppet module] to expose its knowledge
> as a puppet fact, so we can use that information from PuppetDB instead of
> SSH'ing in each host and calling the dsa-* tools.

New description:

 in #31957 we have worked on automating upgrades, but that's only part of
 the problem. we also need to reboot in some situations.

 we have various mechanisms to do so right now:

  * `tsa-misc/reboot-host` - reboot script for kvm boxes, kind of a mess,
 to be removed when we finish the kvm-ganeti migration
  * `tsa-misc/reboot-guest` - reboot a single host. kind of a hack, but
 useful to reboot a single machine
  * `misc/multi-tool/torproject-reboot-simple` - iterate over all hosts
 with `rebootPolicy=justdoit` in LDAP and reboot them with
 `torproject-reboot-many`
  * `misc/multi-tool/torproject-reboot-rotation` - iterate over all hosts
 with `rebootPolicy=rotation` in LDAP and reboot them with
 `torproject-reboot-many`, with a 30 minute delay between each host
  * `ganeti-reboot-cluster` - a tool to reboot the ganeti cluster

 There are various problems with all this:

  * the `torproject-reboot-*` scripts do not take care of
 `rebootPolicy=manual` hosts
  * the `ganeti-reboot-cluster` script has been known to fail if a cluster
 is unbalanced
  * the `ganeti-reboot-cluster` script currently fails when hosts talk to
 each other over IPv6 somehow (see #33412)
  * we have 5 different ways of performing reboots, we should have just one
 script that does it all
  * reboot-{host,guest} do not check if hosts need reboot before rebooting
 (but the multi-tool does)

 In short, this is kind of a mess, and we should refactor this. We should
 consider using needrestart, which knows whether an individual host needs a
 reboot.

 I also added a [https://github.com/xneelo/hetzner-needrestart/issues/23
 feature request to the needrestart puppet module] to expose its knowledge
 as a puppet fact, so we can use that information from PuppetDB instead of
 SSH'ing in each host and calling the dsa-* tools.

--

Comment (by anarcat):

 that prototype is now a library, in
 https://gitweb.torproject.org/admin/tsa-misc.git/tree/fabric_tpa/reboot.py

 it can be called with a wrapper script in
 https://gitweb.torproject.org/admin/tsa-misc.git/tree/reboot

 with something like:

 {{{
 ./reboot -H fsn-node-03.torproject.org,...
 }}}

 it handles ganeti nodes, but not libvirt nodes. it therefore replaces the
 following:

  * `tsa-misc/reboot-guest`
  * `ganeti-reboot-cluster`

 it *could* also replace the following, provided that (a) a host list is
 somehow generated out of band and (b) the operator stays online long
 enough for the job to complete:

  * `misc/multi-tool/torproject-reboot-simple`
  * `misc/multi-tool/torproject-reboot-rotation` - with an explicit
 30-minute delay

 The remaining script (`tsa-misc/reboot-host`) has been marked as
 deprecated, and will be removed once we get rid of the last KVM/libvirt
 server (#33084).

 So the remaining work here is to extend the reboot script to do an
 automatic inventory of the hosts requiring a reboot and to schedule them
 according to policy.
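
 As a rough illustration of what "schedule them according to policy" could
 mean, here is a minimal sketch. It is not the code in `fabric_tpa/reboot.py`:
 the `Host` class, the `schedule_reboots` function and the example host names
 are hypothetical, and the inventory (which hosts exist, their
 `rebootPolicy`, whether they need a reboot) is assumed to have been gathered
 elsewhere. Only the three policy names and the 30 minute rotation delay come
 from this ticket.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# Delay between two hosts under the "rotation" policy, as described above.
ROTATION_DELAY = timedelta(minutes=30)


@dataclass
class Host:
    """Hypothetical inventory record; not a real tsa-misc class."""
    name: str
    policy: str          # "justdoit", "rotation" or "manual"
    needs_reboot: bool   # assumed to be determined out of band


def schedule_reboots(
    hosts: Iterable[Host], start: datetime
) -> Tuple[List[Tuple[datetime, str]], List[str]]:
    """Return (plan, manual).

    plan is a time-ordered list of (when, hostname) pairs; manual lists
    hosts that need a reboot but must be handled by an operator.
    """
    plan: List[Tuple[datetime, str]] = []
    manual: List[str] = []
    next_rotation = start
    for host in hosts:
        if not host.needs_reboot:
            continue  # nothing to do for this host
        if host.policy == "justdoit":
            # reboot right away, no coordination needed
            plan.append((start, host.name))
        elif host.policy == "rotation":
            # one host at a time, spaced by ROTATION_DELAY
            plan.append((next_rotation, host.name))
            next_rotation += ROTATION_DELAY
        else:
            # "manual" and any unknown policy: never reboot automatically
            manual.append(host.name)
    plan.sort(key=lambda item: item[0])  # stable: keeps input order on ties
    return plan, manual
```

 Treating unknown policies like `manual` is a deliberately conservative
 choice in this sketch: a typo in the policy attribute then results in a
 host being reported to the operator rather than rebooted unattended.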

 We also don't check if a reboot is required at all right now, and we
 should do so. All those "TODO" items are documented in the tsa-misc source
 code listed above.
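
 One possible way to implement the "is a reboot required" check, sketched
 under assumptions: needrestart has a batch mode (`needrestart -b`) that
 prints `KEY: value` lines, including a `NEEDRESTART-KSTA` kernel status
 where, as far as I know, values above 1 mean a kernel upgrade is pending.
 The function name and the sample output below are made up for illustration,
 and the status semantics should be checked against the needrestart
 documentation before relying on them.

```python
def kernel_reboot_needed(batch_output: str) -> bool:
    """Guess from `needrestart -b` output whether a reboot is pending.

    Assumption: NEEDRESTART-KSTA is 0 (unknown), 1 (no pending upgrade),
    2 or 3 (some kernel upgrade pending). Unknown or missing status is
    treated as "no reboot needed" in this sketch.
    """
    for line in batch_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "NEEDRESTART-KSTA":
            try:
                return int(value.strip()) > 1
            except ValueError:
                return False  # unparseable status, do not reboot
    return False


# Made-up sample output, for illustration only.
SAMPLE = """\
NEEDRESTART-VER: 3.4
NEEDRESTART-KCUR: 4.19.0-8-amd64
NEEDRESTART-KEXP: 4.19.0-9-amd64
NEEDRESTART-KSTA: 3
"""
```

 With the sample above, `kernel_reboot_needed(SAMPLE)` returns `True`. The
 output would still have to be collected from each host, for example over
 SSH or through a Puppet fact as proposed in the description.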

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33406#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

