[tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri May 1 00:04:19 UTC 2020


#33406: automate reboots
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  project                              |         Status:
                                                 |  accepted
 Priority:  Low                                  |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:  tpa-roadmap-may                      |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------
Changes (by anarcat):

 * keywords:  tpa-roadmap-april => tpa-roadmap-may


Comment:

 I fixed the timeout error and did today's round of upgrades without too
 many problems. One issue that came up is that ganeti wasn't happy to
 chain-reboot machines: some instances needed `activate-disks` run on them
 so that they recognize their secondary. That has been added as a TODO in
 the code.

 I also experimented with feeding lists of hosts from LDAP as an argument
 to the reboot command, which also worked well. This, for example, rebooted
 the `rotation` hosts with a 10-minute delay:

 {{{
 ./reboot -H $(ssh alberti.torproject.org 'ldapsearch -h db.torproject.org -x -ZZ -b dc=torproject,dc=org -LLL "(&(hostname=*.torproject.org)(rebootPolicy=rotation))" hostname | awk "\$1 == \"hostname:\" {print \$2}" | sort') -v
 }}}
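
 The `awk` filter in that pipeline keeps the second field of every line
 whose first field is exactly `hostname:`. As a rough sketch of the same
 extraction in Python (the function name and the sample text are made up
 for illustration, and this is not the actual reboot script):

```python
def hostnames_from_ldif(ldif_text):
    """Return the sorted values of every `hostname:` attribute line."""
    names = []
    for line in ldif_text.splitlines():
        fields = line.split()
        # Same test as the awk filter: the first field must be "hostname:".
        # Lines with no value after the attribute name are skipped.
        if len(fields) >= 2 and fields[0] == "hostname:":
            names.append(fields[1])
    return sorted(names)


# Made-up sample in the shape that `ldapsearch -LLL ... hostname` prints.
SAMPLE = """\
dn: host=nutans,ou=hosts,dc=torproject,dc=org
hostname: nutans.torproject.org

dn: host=fallax,ou=hosts,dc=torproject,dc=org
hostname: fallax.torproject.org
"""

if __name__ == "__main__":
    # One space-separated list, suitable as a single command-line argument.
    print(" ".join(hostnames_from_ldif(SAMPLE)))
```

 Like the `awk` version, this does not handle base64-encoded values
 (`hostname:: ...`), which should not occur for plain hostnames.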

 I added a modified recipe to the upgrades page, which covers all cases.

 I also set the reboot policy on hosts that didn't have one, so that they
 are classified properly. They now have:

 manual:

 * moly (KVM, requires special handling)
 * kvm4 (KVM)
 * kvm5 (KVM)
 * scw-arm-par-01 (buggy buildbox, see #32920)
 * fsn-node-01 (ganeti, requires special handling)
 * fsn-node-02 (ganeti)
 * fsn-node-03 (ganeti)
 * weissii (windows buildbox, no ssh)
 * woronowii (windows buildbox, no ssh)
 * winklerianum (windows buildbox, no ssh)

 justdoit:

 * pauli (puppet)
 * rude (rt)
 * alberti (ldap)
 * eugeni (mail)
 * majus (translation)
 * rouyi (jenkins)
 * troodi (trac)
 * nevii (dns primary)
 * henryi (consensus-health)
 * vineale (gitweb)
 * gayi (svn)
 * polyanthum (bridges)
 * materculae (exonerator)
 * meronense (metrics.tpo)
 * colchicifolium (collector backend)
 * carinatum (DocTor)
 * build-x86-05 (buildbox)
 * build-x86-06 (buildbox)
 * build-x86-08 (buildbox)
 * build-x86-09 (buildbox)
 * perdulce (people.tpo)
 * staticiforme (static master)
 * forrestii (fpcentral)
 * subnotabile (survey)
 * crm-int-01 (CRM backend)
 * crm-ext-01 (CRM frontend)
 * submit-01 (mail)

 rotation:

 * fallax (DNS secondary)
 * omeiense (onionoo backend)
 * oo-hetzner-03 (onionoo backend)
 * neriniflorum (DNS secondary)
 * web-hetzner-01 (web frontend)
 * web-cymru-01 (web frontend)


 The following hosts were already configured as:

 rotation:

 * orestis (onionoo backend)
 * nutans (DNS secondary)
 * cdn-backend-sunet-01 (web frontend)
 * hetzner-hel1-02 (DNS secondary)
 * hetzner-hel1-03 (web frontend)
 * onionoo-backend-01 (onionoo backend)
 * web-fsn-01 (web frontend)
 * web-fsn-02 (web frontend)
 * onionoo-frontend-01 (onionoo frontend)
 * cache01 (cache frontend)
 * cache-02 (cache frontend)
 * onionoo-backend-02 (onionoo backend)

 justdoit:

 * corsicum (collector)
 * hetzner-hel1-01 (nagios)
 * bungei (backup storage)
 * hetzner-nbg1-01 (prometheus)
 * hetzner-nbg1-02 (prometheus)
 * archive-01 (non-redundant web frontend)
 * loghost01 (syslog)
 * static-master-fsn (static master)
 * bacula-director-01 (backup director)
 * gettor-01 (gettor)
 * onionbalance-01 (onionbalance)
 * chives (IRC)
 * build-arm-10 (buildbox)
 * tbb-nightlies-master (static master)
 * gitlab-02 (gitlab)
 * check-01 (check.tpo)

 manual:

 * mandos-01 (mandos, requires crypto)
 * fsn-node-04
 * fsn-node-05

 In other words, I made the following diff in LDAP:

 {{{
 --- policy-before       2020-04-30 19:48:50.158412413 -0400
 +++ policy-after        2020-04-30 19:54:15.209832522 -0400
 @@ -6,27 +6,35 @@

  dn: host=moly,ou=hosts,dc=torproject,dc=org
  host: moly
 +rebootPolicy: manual

  dn: host=pauli,ou=hosts,dc=torproject,dc=org
  host: pauli
 +rebootPolicy: justdoit

  dn: host=rude,ou=hosts,dc=torproject,dc=org
  host: rude
 +rebootPolicy: justdoit

  dn: host=alberti,ou=hosts,dc=torproject,dc=org
  host: alberti
 +rebootPolicy: justdoit

  dn: host=cupani,ou=hosts,dc=torproject,dc=org
  host: cupani
 +rebootPolicy: justdoit

  dn: host=fallax,ou=hosts,dc=torproject,dc=org
  host: fallax
 +rebootPolicy: rotation

  dn: host=eugeni,ou=hosts,dc=torproject,dc=org
  host: eugeni
 +rebootPolicy: justdoit

  dn: host=majus,ou=hosts,dc=torproject,dc=org
  host: majus
 +rebootPolicy: justdoit

  dn: host=listera,ou=hosts,dc=torproject,dc=org
  host: listera
 @@ -34,63 +42,83 @@

  dn: host=rouyi,ou=hosts,dc=torproject,dc=org
  host: rouyi
 +rebootPolicy: justdoit

  dn: host=palmeri,ou=hosts,dc=torproject,dc=org
  host: palmeri
 +rebootPolicy: justdoit

  dn: host=weissii,ou=hosts,dc=torproject,dc=org
  host: weissii
 +rebootPolicy: manual

  dn: host=troodi,ou=hosts,dc=torproject,dc=org
  host: troodi
 +rebootPolicy: justdoit

  dn: host=nevii,ou=hosts,dc=torproject,dc=org
  host: nevii
 +rebootPolicy: justdoit

  dn: host=henryi,ou=hosts,dc=torproject,dc=org
  host: henryi
 +rebootPolicy: justdoit

  dn: host=vineale,ou=hosts,dc=torproject,dc=org
  host: vineale
 +rebootPolicy: justdoit

  dn: host=gayi,ou=hosts,dc=torproject,dc=org
  host: gayi
 +rebootPolicy: justdoit

  dn: host=polyanthum,ou=hosts,dc=torproject,dc=org
  host: polyanthum
 +rebootPolicy: justdoit

  dn: host=materculae,ou=hosts,dc=torproject,dc=org
  host: materculae
 +rebootPolicy: justdoit

  dn: host=omeiense,ou=hosts,dc=torproject,dc=org
  host: omeiense
 +rebootPolicy: rotation

  dn: host=meronense,ou=hosts,dc=torproject,dc=org
  host: meronense
 +rebootPolicy: justdoit

  dn: host=colchicifolium,ou=hosts,dc=torproject,dc=org
  host: colchicifolium
 +rebootPolicy: justdoit

  dn: host=carinatum,ou=hosts,dc=torproject,dc=org
  host: carinatum
 +rebootPolicy: justdoit

  dn: host=build-x86-05,ou=hosts,dc=torproject,dc=org
  host: build-x86-05
 +rebootPolicy: justdoit

  dn: host=build-x86-06,ou=hosts,dc=torproject,dc=org
  host: build-x86-06
 +rebootPolicy: justdoit

  dn: host=perdulce,ou=hosts,dc=torproject,dc=org
  host: perdulce
 +rebootPolicy: justdoit

  dn: host=staticiforme,ou=hosts,dc=torproject,dc=org
  host: staticiforme
 +rebootPolicy: justdoit

  dn: host=woronowii,ou=hosts,dc=torproject,dc=org
  host: woronowii
 +rebootPolicy: manual

  dn: host=winklerianum,ou=hosts,dc=torproject,dc=org
  host: winklerianum
 +rebootPolicy: manual

  dn: host=orestis,ou=hosts,dc=torproject,dc=org
  host: orestis
 @@ -106,21 +134,27 @@

  dn: host=kvm4,ou=hosts,dc=torproject,dc=org
  host: kvm4
 +rebootPolicy: manual

  dn: host=oo-hetzner-03,ou=hosts,dc=torproject,dc=org
  host: oo-hetzner-03
 +rebootPolicy: rotation

  dn: host=forrestii,ou=hosts,dc=torproject,dc=org
  host: forrestii
 +rebootPolicy: justdoit

  dn: host=subnotabile,ou=hosts,dc=torproject,dc=org
  host: subnotabile
 +rebootPolicy: justdoit

  dn: host=kvm5,ou=hosts,dc=torproject,dc=org
  host: kvm5
 +rebootPolicy: manual

  dn: host=neriniflorum,ou=hosts,dc=torproject,dc=org
  host: neriniflorum
 +rebootPolicy: rotation

  dn: host=hetzner-hel1-01,ou=hosts,dc=torproject,dc=org
  host: hetzner-hel1-01
 @@ -132,12 +166,15 @@

  dn: host=build-x86-08,ou=hosts,dc=torproject,dc=org
  host: build-x86-08
 +rebootPolicy: justdoit

  dn: host=web-hetzner-01,ou=hosts,dc=torproject,dc=org
  host: web-hetzner-01
 +rebootPolicy: rotation

  dn: host=scw-arm-par-01,ou=hosts,dc=torproject,dc=org
  host: scw-arm-par-01
 +rebootPolicy: manual

  dn: host=hetzner-hel1-02,ou=hosts,dc=torproject,dc=org
  host: hetzner-hel1-02
 @@ -149,15 +186,19 @@

  dn: host=web-cymru-01,ou=hosts,dc=torproject,dc=org
  host: web-cymru-01
 +rebootPolicy: rotation

  dn: host=crm-int-01,ou=hosts,dc=torproject,dc=org
  host: crm-int-01
 +rebootPolicy: justdoit

  dn: host=crm-ext-01,ou=hosts,dc=torproject,dc=org
  host: crm-ext-01
 +rebootPolicy: justdoit

  dn: host=build-x86-09,ou=hosts,dc=torproject,dc=org
  host: build-x86-09
 +rebootPolicy: justdoit

  dn: host=bungei,ou=hosts,dc=torproject,dc=org
  host: bungei
 @@ -181,9 +222,11 @@

  dn: host=fsn-node-01,ou=hosts,dc=torproject,dc=org
  host: fsn-node-01
 +rebootPolicy: manual

  dn: host=fsn-node-02,ou=hosts,dc=torproject,dc=org
  host: fsn-node-02
 +rebootPolicy: manual

  dn: host=loghost01.torproject.org,ou=hosts,dc=torproject,dc=org
  host: loghost01
 @@ -243,6 +286,7 @@

  dn: host=fsn-node-03,ou=hosts,dc=torproject,dc=org
  host: fsn-node-03
 +rebootPolicy: manual

  dn: host=onionoo-backend-02,ou=hosts,dc=torproject,dc=org
  host: onionoo-backend-02
 }}}
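
 A listing like `policy-before` can also be checked mechanically for hosts
 that still lack a policy. The following is only a sketch: it assumes a
 simplified dump with one `host:` line and at most one `rebootPolicy:` line
 per entry, and the function names and sample hosts are hypothetical.

```python
def policies_by_host(ldif_text):
    """Map each `host:` value to its `rebootPolicy:` value (None if unset)."""
    policies = {}
    current = None
    for line in ldif_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            # A blank (or unparsable) line ends the current entry.
            current = None
            continue
        if key == "host":
            current = value
            policies[current] = None
        elif key == "rebootPolicy" and current is not None:
            policies[current] = value
    return policies


def hosts_without_policy(policies):
    """Return the sorted names of hosts that have no reboot policy."""
    return sorted(host for host, policy in policies.items() if policy is None)


# Made-up entries; the host names are placeholders, not real machines.
SAMPLE = """\
dn: host=example-01,ou=hosts,dc=torproject,dc=org
host: example-01
rebootPolicy: justdoit

dn: host=example-02,ou=hosts,dc=torproject,dc=org
host: example-02
"""

if __name__ == "__main__":
    print(hosts_without_policy(policies_by_host(SAMPLE)))
```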

 The policy is being interpreted here as:

  * manual: requires manual intervention or special tools (fabric in the
 case of ganeti, reboot-host in the case of KVM, nothing for windows boxes)
  * justdoit: can be rebooted with proper prior warning (10 minutes),
 possibly in parallel with each other
  * rotation: must not be rebooted together, longer warning (30 minutes)
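
 One way to write that interpretation down as data, as a sketch only (the
 class, field, and function names are invented here and do not come from
 the reboot script):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RebootPolicy:
    automated: bool                 # False: needs a human or special tools
    warning_minutes: Optional[int]  # advance warning before the reboot
    parallel: bool                  # may hosts be rebooted together?


# Values taken from the three policy descriptions in this comment.
POLICIES = {
    "manual": RebootPolicy(automated=False, warning_minutes=None, parallel=False),
    "justdoit": RebootPolicy(automated=True, warning_minutes=10, parallel=True),
    "rotation": RebootPolicy(automated=True, warning_minutes=30, parallel=False),
}


def policy_for(name):
    """Look up a policy by name, rejecting unknown values loudly."""
    try:
        return POLICIES[name]
    except KeyError:
        raise ValueError(f"unknown reboot policy: {name!r}") from None


if __name__ == "__main__":
    for name in POLICIES:
        print(name, policy_for(name))
```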

 I tried to update the "upgrades" docs to reflect this.

 I think the last steps here are:

  1. add LDAP support in the reboot script
  2. parallelize "justdoit" jobs
  3. turn ganeti hosts into "rotation" once we make this new procedure
 official
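
 As a rough sketch of what steps 1 and 2 could look like once policies
 come from LDAP: group the hosts by policy, reboot the "justdoit" hosts as
 one parallel batch, and give each "rotation" host a batch of its own.
 Everything here is an assumption for illustration: `plan_batches`,
 `run_batches`, and the `reboot_host` callable are hypothetical names, and
 a thread pool is just one possible way to parallelize.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_batches(policies):
    """Split a host -> policy mapping into reboot batches and skipped hosts.

    "justdoit" hosts share one batch, each "rotation" host gets its own
    batch, and everything else (such as "manual") is left to a human.
    """
    justdoit = sorted(h for h, p in policies.items() if p == "justdoit")
    rotation = sorted(h for h, p in policies.items() if p == "rotation")
    skipped = sorted(
        h for h, p in policies.items() if p not in ("justdoit", "rotation")
    )
    batches = ([justdoit] if justdoit else []) + [[h] for h in rotation]
    return batches, skipped


def run_batches(batches, reboot_host, max_workers=4):
    """Run batches one after another; hosts within a batch run in parallel."""
    results = {}
    for batch in batches:
        # Leaving the `with` block waits for the whole batch to finish.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for host, outcome in zip(batch, pool.map(reboot_host, batch)):
                results[host] = outcome
    return results


if __name__ == "__main__":
    policies = {
        "pauli": "justdoit",
        "rude": "justdoit",
        "fallax": "rotation",
        "nutans": "rotation",
        "moly": "manual",
    }
    batches, skipped = plan_batches(policies)
    # Stand-in for the real reboot: only reports what would be done.
    print(run_batches(batches, lambda host: f"would reboot {host}"))
    print("left for manual handling:", skipped)
```

 Keeping the planning step separate from the execution step makes it easy
 to print the plan for review before anything is rebooted.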

 This is therefore likely to be completed in May.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33406#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

