[tor-bugs] #33406 [Internal Services/Tor Sysadmin Team]: automate reboots

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Apr 2 22:53:36 UTC 2020


#33406: automate reboots
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  project                              |         Status:  new
 Priority:  Low                                  |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:  tpa-roadmap-march                    |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 I did more work on the reboot procedures today, and rebooted the Ganeti
 cluster using the reboot command. There were some issues with the initrd
 interfering with the `wait_for_boot` (now called `wait_for_ping`) checks,
 so I did some refactoring, but I'm still confused about the exception
 that's raised by Fabric in this case.

 the exception I got here is:

 {{{
     All instances migrated successfully.
     Shutdown scheduled for Thu 2020-04-02 18:30:55 UTC, use 'shutdown -c' to cancel.
     waiting 0 minutes for reboot to happen
     waiting up to 30 seconds for host to go down
     waiting 300 seconds for host to go up
     host fsn-node-01.torproject.org should be back online, checking uptime
     Traceback (most recent call last):
       File "./reboot", line 132, in <module>
         logging.getLogger(mod).setLevel('WARNING')
       File "./reboot", line 116, in main
         delay_up=args.delay_up,
       File "/usr/lib/python3/dist-packages/invoke/tasks.py", line 127, in __call__
         result = self.body(*args, **kwargs)
       File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/reboot.py", line 197, in shutdown_and_wait
         res = con.run('uptime', watchers=[responder], pty=True, warn=True)
       File "<decorator-gen-3>", line 2, in run
       File "/usr/lib/python3/dist-packages/fabric/connection.py", line 29, in opens
         self.open()
       File "/home/anarcat/src/tor/tsa-misc/fabric_tpa/__init__.py", line 106, in safe_open
         Connection.open_orig(self)
       File "/usr/lib/python3/dist-packages/fabric/connection.py", line 634, in open
         self.client.connect(**kwargs)
       File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in connect
         retry_on_signal(lambda: sock.connect(addr))
       File "/usr/lib/python3/dist-packages/paramiko/util.py", line 280, in retry_on_signal
         return function()
       File "/usr/lib/python3/dist-packages/paramiko/client.py", line 349, in <lambda>
         retry_on_signal(lambda: sock.connect(addr))
     TimeoutError: [Errno 110] Connection timed out
 }}}

 maybe the exception gets generated *above* our code, in the fabric task
 handler itself, in which case it might mean we shouldn't use a @task for
 this at all, at least in our code.
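
 for what it's worth, the traceback shows the `TimeoutError` coming out of
 paramiko's `sock.connect()` and passing through the `con.run('uptime', ...)`
 call in `shutdown_and_wait`, so one option might be to catch it at that
 call and retry until SSH answers. a minimal sketch follows, assuming the
 connection can simply be retried after a failed connect;
 `retry_on_timeout` and its parameters are hypothetical and not part of
 Fabric or `fabric_tpa`:

```python
import time


def retry_on_timeout(func, attempts=5, delay=10.0, sleep=time.sleep):
    """Call func() until it stops raising a timeout or connection error.

    Hypothetical helper for illustration only; not part of Fabric,
    paramiko, or fabric_tpa.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except (TimeoutError, ConnectionError) as error:
            # TimeoutError is the exception seen in the traceback;
            # ConnectionError also covers e.g. "connection refused"
            # while sshd has not started yet.
            last_error = error
            if attempt < attempts:
                sleep(delay)
    # Every attempt failed: re-raise the last error for the caller.
    raise last_error


# Possible use inside shutdown_and_wait (assumption, untested there):
#   res = retry_on_timeout(lambda: con.run('uptime', warn=True))
```

 this keeps the retry logic in our own code, so it should not matter
 whether the function is wrapped in a @task or not.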

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33406#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

