[tor-bugs] #33958 [Internal Services/Tor Sysadmin Team]: fsn VMs lost connectivity this morning

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Apr 22 07:42:34 UTC 2020


#33958: fsn VMs lost connectivity this morning
-----------------------------------------------------+-----------------
     Reporter:  weasel                               |      Owner:  tpa
         Type:  defect                               |     Status:  new
     Priority:  High                                 |  Milestone:
    Component:  Internal Services/Tor Sysadmin Team  |    Version:
     Severity:  Major                                |   Keywords:
Actual Points:                                       |  Parent ID:
       Points:                                       |   Reviewer:
      Sponsor:                                       |
-----------------------------------------------------+-----------------
 This morning several of our VMs at fsn were without network.

 The instances were still running, and `gnt-console` still got me a console
 that I could log into, but the machines were not reachable from the
 network, nor could they reach the network.  tcpdumping the bridge
 interface on the node did not show any network traffic for the instance.
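
 (For completeness, the check was roughly the following; this is a sketch,
 not a paste from my shell history, and the bridge name matches the config
 dumps below.)
 {{{
 # list which ports are attached to the bridge; the instance taps were missing
 ovs-vsctl list-ports br0
 # watch the bridge for any traffic; nothing from the instance showed up
 tcpdump -eni br0
 }}}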

 Migrating them brought them back online (tried with vineale, for
 instance).  Rebooting also helped (tried with everything else).
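
 (For the record, the workarounds above are the usual Ganeti operations;
 sketched below, with vineale being the only instance name taken from this
 ticket.)
 {{{
 # live-migrate the instance to another node; it came back online
 gnt-instance migrate vineale
 # or reboot the instance in place (instance name is a placeholder)
 gnt-instance reboot some-instance
 }}}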

 The running openvswitch config on a node whose instances had no network
 looked like this:
 {{{
 root@fsn-node-04:~# ovs-vsctl show
 ce[...]
     Bridge "br0"
         Port vlan-gntinet
             tag: 4000
             Interface vlan-gntinet
                 type: internal
         Port "eth0"
             Interface "eth0"
         Port "br0"
             Interface "br0"
                 type: internal
         Port vlan-gntbe
             tag: 4001
             Interface vlan-gntbe
                 type: internal
     ovs_version: "2.10.1"
 }}}

 When it's working, it should look more like this:
 {{{
 root@fsn-node-04:~# ovs-vsctl show
 ce[...]
     Bridge "br0"
         Port "tap3"
             tag: 4000
             trunks: [4000]
             Interface "tap3"
         Port vlan-gntinet
             tag: 4000
             Interface vlan-gntinet
                 type: internal
         Port "eth0"
             Interface "eth0"
         Port "tap4"
             tag: 4000
             trunks: [4000]
             Interface "tap4"
         Port "br0"
             Interface "br0"
                 type: internal
         Port "tap5"
             tag: 4000
             trunks: [4000]
             Interface "tap5"
         Port "tap1"
             tag: 4000
             trunks: [4000]
             Interface "tap1"
         Port vlan-gntbe
             tag: 4001
             Interface vlan-gntbe
                 type: internal
         Port "tap2"
             tag: 4000
             trunks: [4000]
             Interface "tap2"
         Port "tap0"
             tag: 4000
             trunks: [4000]
             Interface "tap0"
     ovs_version: "2.10.1"
 }}}
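
 If that diagnosis is right, the only missing pieces are the per-instance
 tap ports.  I assume (untested) they could also be re-attached by hand,
 along the lines of:
 {{{
 # untested sketch: re-attach a VM's tap to the bridge with the VLAN
 # settings the working config shows (tap name is an example)
 ovs-vsctl add-port br0 tap0 tag=4000 trunks=4000
 }}}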

 My first guess was that migrating somehow had screwed up the network
 config, but that's probably not what happened, as the issue happened again
 shortly afterwards when I was running upgrades.  So:

 My current working theory is that the following happened:
  - In the morning, once automatically and once manually, we ran package
 upgrades.
  - Today this included an openssl update, and openvswitch is linked
 against openssl (see the sketch after this list).
  - `needrestart` restarted openvswitch.
  - Restarting openvswitch does not restore the dynamically added VM taps
 to the bridge.
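
 To support that theory, one could look for daemons still mapping the old
 (now deleted) libssl and ask needrestart what it would restart; a rough
 sketch, with options from memory rather than from this incident:
 {{{
 # find processes that still map a deleted libssl after the upgrade
 grep -lE 'libssl.*\(deleted\)' /proc/[0-9]*/maps 2>/dev/null
 # list-only mode: show what needrestart thinks needs restarting
 needrestart -r l
 }}}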

 I propose we blacklist openvswitch from being restarted by needrestart.
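
 A minimal sketch of what that could look like, assuming needrestart's
 usual Perl-style override config (file name and pattern are placeholders,
 not a tested change):
 {{{
 # /etc/needrestart/conf.d/openvswitch.conf (hypothetical drop-in)
 # never let needrestart restart openvswitch; restart it manually instead
 $nrconf{override_rc}{qr(^openvswitch)} = 0;
 }}}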

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33958>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

