[tor-bugs] #31916 [Internal Services/Tor Sysadmin Team]: reliability issues with hetzner-nbg1-01

Wed Oct 2 19:45:02 UTC 2019

#31916: reliability issues with hetzner-nbg1-01
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Blocker                              |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by anarcat):

 i saw a case of a downtime, live. both server couldn't ping each other for
 a few minutes. this happened in the logs, on nbg1:

 {{{
 Oct  2 19:03:20 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 09[IKE]
 establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org{6}
 Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 08[IKE]
 95.216.141.241 is initiating an IKE_SA
 Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE]
 IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[54]
 established between
 195.201.139.202[195.201.139.202]...95.216.141.241[95.216.141.241]
 Oct  2 19:03:26 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 13[IKE]
 CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org{164} established with SPIs c0bc9534_i cf59efd5_o
 and TS 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 95.216.141.241/32
 2a01:4f9:c010:5f1::1/128
 }}}

 ... ie. a new session, negociated in 6 seconds. that's pretty slow, but
 tolerable I guess. the problem is that, from nagios' point of view, this
 was a much longer downtime:

 {{{
 Oct  2 18:57:31 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 10[IKE]
 establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org{6}
 Oct  2 19:00:16 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 16[IKE]
 initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org[44] to 195.201.139
 .202
 }}}

 ie. the session started 6 *minutes* earlier, and took 3 *more* minutes to
 get to the initiating stage. then nagios noticed the node was down,
 naturally:

 {{{
 Oct  2 19:01:11 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
 ALERT: hetzner-nbg1-01;process - apache2 -
 worker;CRITICAL;SOFT;1;CHECK_NRPE STATE CRITICAL: Sock
 et timeout after 50 seconds.
 Oct  2 19:01:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
 hetzner-nbg1-01;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
 }}}

 then, another three minutes later, ipsec figured it out and fixed the
 outage:

 {{{
 Oct  2 19:03:02 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 12[IKE]
 initiating IKE_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org[45] to 195.201.139.202
 Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 14[IKE]
 establishing CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org{6}
 Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE]
 IKE_SA hetzner-hel1-01.torproject.org-hetzner-nbg1-01.torproject.org[45]
 established between
 95.216.141.241[95.216.141.241]...195.201.139.202[195.201.139.202]
 Oct  2 19:03:26 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 08[IKE]
 CHILD_SA hetzner-hel1-01.torproject.org-hetzner-
 nbg1-01.torproject.org{139} established with SPIs cf59efd5_i c0bc9534_o
 and TS 95.216.141.241/32 2a01:4f9:c010:5f1::1/128 === 195.201.139.202/32
 2a01:4f8:c2c:1e17::1/128
 Oct  2 19:03:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
 hetzner-nbg1-01;UP;SOFT;4;PING OK - Packet loss = 0%, RTA = 26.98 ms
 }}}

 A similar problem occured when macrum rebooted. From nbg's point of view,
 it went down, for sure, but then it took forever to rekey, as it was stuck
 in this state:

 {{{
 hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: ESTABLISHED 47
 minutes ago,
 195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
 hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKEv2 SPIs:
 c1869d7083d456cd_i adcdb88009c5736e_r*, pre-shared key reauthentication in
 117 minutes
 hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: IKE proposal:
 AES_CBC_128/HMAC_SHA2_256_128/PRF_HMAC_SHA2_256/MODP_3072
 hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks queued:
 CHILD_REKEY
 hetzner-nbg1-01.torproject.org-macrum.torproject.org[50]: Tasks active:
 CHILD_REKEY
 hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}:  REKEYING,
 TUNNEL, reqid 2, expires in 12 minutes
 hetzner-nbg1-01.torproject.org-macrum.torproject.org{152}:
 195.201.139.202/32 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32
 138.201.212.224/28 172.30.133.0/24 2a01:4f8:172:39ca::2/128
 2a01:4f8:172:39ca:0:dad3::/96
 }}}

 At that point, the tunnel had already been down for a while - macrum was
 rebooted at 19:04 - but it took a full 20 minutes for ipsec to recover:

 {{{
 Oct  2 19:18:45 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 16[IKE]
 establishing CHILD_SA hetzner-nbg1-01.torproject.org-
 macrum.torproject.org{2}
 Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 14[IKE]
 initiating IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55]
 to 138.201.192.11
 Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 12[IKE]
 establishing CHILD_SA hetzner-nbg1-01.torproject.org-
 macrum.torproject.org{2}
 Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE]
 IKE_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org[55]
 established between
 195.201.139.202[195.201.139.202]...138.201.192.11[138.201.192.11]
 Oct  2 19:21:30 hetzner-nbg1-01/hetzner-nbg1-01 charon[963]: 10[IKE]
 CHILD_SA hetzner-nbg1-01.torproject.org-macrum.torproject.org{166}
 established with SPIs cfe1be1c_i c2ce0e71_o and TS 195.201.139.202/32
 2a01:4f8:c2c:1e17::1/128 === 138.201.192.11/32 138.201.212.224/28
 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
 }}}

 in comparison, nagios barely flinched when the server went down:

 {{{
 Oct  2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE]
 deleting IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[31]
 between 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
 Oct  2 19:04:00 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 05[IKE]
 initiating IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46]
 to 138.201.192.11
 Oct  2 19:04:31 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
 macrum;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
 Oct  2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
 ALERT: macrum;unwanted process - postgrey;CRITICAL;HARD;1;CHECK_NRPE STATE
 CRITICAL: Socket timeout after 50 seconds.
 Oct  2 19:05:21 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: SERVICE
 ALERT: macrum;SSL cert - host;CRITICAL;HARD;1;CHECK_NRPE STATE CRITICAL:
 Socket timeout after 50 seconds.
 Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 15[IKE]
 establishing CHILD_SA hetzner-hel1-01.torproject.org-
 macrum.torproject.org{2}
 Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE]
 IKE_SA hetzner-hel1-01.torproject.org-macrum.torproject.org[46]
 established between
 95.216.141.241[95.216.141.241]...138.201.192.11[138.201.192.11]
 Oct  2 19:05:30 hetzner-hel1-01/hetzner-hel1-01 charon[917]: 06[IKE]
 CHILD_SA hetzner-hel1-01.torproject.org-macrum.torproject.org{140}
 established with SPIs c93eb90f_i c8942f69_o and TS 95.216.141.241/32
 2a01:4f9:c010:5f1::1/128 === 138.201.192.11/32 138.201.212.224/28
 172.30.133.0/24 2a01:4f8:172:39ca::2/128 2a01:4f8:172:39ca:0:dad3::/96
 Oct  2 19:05:41 hetzner-hel1-01/hetzner-hel1-01 icinga[1469]: HOST ALERT:
 macrum;UP;SOFT;2;PING OK - Packet loss = 0%, RTA = 25.46 ms
 }}}

 19:04:00 is (basically) when macrum went down (syslog stopped writing at
 19:03:59), and it brought back up its strongswan service at 19:05:26, so
 it took less than 5 seconds for nagios to reconnect.

 so there's definitely something wrong with strongswan on that prometheus
 server. i've closed the ticket with hetzner as it now seems obvious they
 are not the cause of this problem.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31916#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online