[tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

Wed Jan 29 19:38:24 UTC 2020

#33098: fsn-node-03 disk problems
-------------------------------------------------+-------------------------
     Reporter:  anarcat                          |      Owner:  anarcat
         Type:  defect                           |     Status:  assigned
     Priority:  High                             |  Milestone:
    Component:  Internal Services/Tor Sysadmin   |    Version:
  Team                                           |
     Severity:  Blocker                          |   Keywords:
Actual Points:                                   |  Parent ID:
       Points:                                   |   Reviewer:
      Sponsor:                                   |
-------------------------------------------------+-------------------------
 For some reason, the HDD on fsn-node-03 is reporting SMART errors. I
 originally filed this ticket with Hetzner:

 > Yesterday, we got errors from the SMART daemon on this host, looking
 > like this:
 >
 > From: root <root at fsn-node-03.torproject.org>
 > Subject: SMART error (ErrorCount) detected on host: fsn-node-03
 > To: root at fsn-node-03.torproject.org
 > Date: Tue, 28 Jan 2020 23:35:35 +0000
 >
 > This message was generated by the smartd daemon running on:
 >
 >    host name:  fsn-node-03
 >    DNS domain: torproject.org
 >
 > The following warning/error was logged by the smartd daemon:
 >
 > Device: /dev/sda [SAT], ATA error count increased from 0 to 1
 >
 > Device info:
 > TOSHIBA MG06ACA10TEY, S/N:..., WWN:...., FW:0103, 10.0 TB
 >
 > For details see host's SYSLOG.
 >
 > You can also use the smartctl utility for further investigation.
 > Another message will be sent in 24 hours if the problem persists.
 >
 > Another such email was triggered an hour later.
 >
 > The RAID array the disk is on triggered a rebuild as well, somehow. The
 > following messages showed up in dmesg:
 >
 > [Jan28 20:44] md: resync of RAID array md2
 > [Jan28 22:20] ata2.00: exception Emask 0x50 SAct 0x4000 SErr 0x480900 action 0x6 frozen
 > [  +0.004419] ata2.00: irq_stat 0x08000000, interface fatal error
 > [  +0.001489] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
 > [  +0.000781] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000785] ata2.00: cmd 61/00:70:80:52:f6/05:00:ec:00:00/40 tag 14 ncq dma 655360 out
 >                        res 40/00:70:80:52:f6/00:00:ec:00:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001600] ata2.00: status: { DRDY }
 > [  +0.000801] ata2: hard resetting link
 > [  +0.310126] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 > [  +0.088155] ata2.00: configured for UDMA/133
 > [  +0.000031] ata2: EH complete
 > [Jan28 23:27] ata1.00: exception Emask 0x50 SAct 0x1c00 SErr 0x280900 action 0x6 frozen
 > [  +0.004338] ata1.00: irq_stat 0x08000000, interface fatal error
 > [  +0.001815] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
 > [  +0.000772] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000738] ata1.00: cmd 60/00:50:00:3b:b1/05:00:47:01:00/40 tag 10 ncq dma 655360 in
 >                        res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001512] ata1.00: status: { DRDY }
 > [  +0.000793] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000727] ata1.00: cmd 60/00:58:00:40:b1/05:00:47:01:00/40 tag 11 ncq dma 655360 in
 >                        res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001534] ata1.00: status: { DRDY }
 > [  +0.000769] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000720] ata1.00: cmd 60/00:60:00:45:b1/01:00:47:01:00/40 tag 12 ncq dma 131072 in
 >                        res 40/00:58:00:40:b1/00:00:47:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001453] ata1.00: status: { DRDY }
 > [  +0.000778] ata1: hard resetting link
 > [  +0.556198] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 > [  +0.001780] ata1.00: configured for UDMA/133
 > [  +0.000037] ata1: EH complete
 > [Jan28 23:32] perf: interrupt took too long (2518 > 2500), lowering kernel.perf_event_max_sample_rate to 79250
 > [Jan29 00:14] ata2.00: exception Emask 0x50 SAct 0x1c000000 SErr 0x480900 action 0x6 frozen
 > [  +0.004173] ata2.00: irq_stat 0x08000000, interface fatal error
 > [  +0.001996] ata2: SError: { UnrecovData HostInt 10B8B Handshk }
 > [  +0.000737] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000729] ata2.00: cmd 61/00:d0:00:62:0e/05:00:86:01:00/40 tag 26 ncq dma 655360 out
 >                        res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001486] ata2.00: status: { DRDY }
 > [  +0.000854] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000718] ata2.00: cmd 61/00:d8:00:67:0e/05:00:86:01:00/40 tag 27 ncq dma 655360 out
 >                        res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001478] ata2.00: status: { DRDY }
 > [  +0.000884] ata2.00: failed command: WRITE FPDMA QUEUED
 > [  +0.000736] ata2.00: cmd 61/00:e0:00:6c:0e/01:00:86:01:00/40 tag 28 ncq dma 131072 out
 >                        res 40/00:d0:00:62:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001453] ata2.00: status: { DRDY }
 > [  +0.000760] ata2: hard resetting link
 > [  +0.000011] ata1.00: exception Emask 0x50 SAct 0x10000000 SErr 0x280900 action 0x6 frozen
 > [  +0.000764] ata1.00: irq_stat 0x08000000, interface fatal error
 > [  +0.000725] ata1: SError: { UnrecovData HostInt 10B8B BadCRC }
 > [  +0.000712] ata1.00: failed command: READ FPDMA QUEUED
 > [  +0.000700] ata1.00: cmd 60/80:e0:00:6d:0e/04:00:86:01:00/40 tag 28 ncq dma 589824 in
 >                        res 40/00:e0:00:6d:0e/00:00:86:01:00/40 Emask 0x50 (ATA bus error)
 > [  +0.001426] ata1.0...

 I lost the original message because Hetzner trims replies. It also
 included the `smartctl -x` output of the drive, which is now lost.

 40 minutes later, the drive was replaced and the machine booted again.

 We had trouble with the `/dev/md2` array: for some reason it wouldn't
 autostart after the intervention. I started it by hand, rebuilt the initrd
 and rebooted, to no avail.

 I tried to repartition the new `sda` drive they added, then added it to
 the array, which started syncing.

 But after a while, the error came back:

 {{{
 [Jan29 18:30] ata1.00: exception Emask 0x50 SAct 0x80080 SErr 0x480900 action 0x6 frozen
 [  +0.000020] ata1.00: irq_stat 0x08000000, interface fatal error
 [  +0.000010] ata1: SError: { UnrecovData HostInt 10B8B Handshk }
 [  +0.000012] ata1.00: failed command: READ FPDMA QUEUED
 [  +0.000018] ata1.00: cmd 60/20:38:00:98:04/00:00:00:00:00/40 tag 7 ncq dma 16384 in
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000021] ata1.00: status: { DRDY }
 [  +0.000010] ata1.00: failed command: WRITE FPDMA QUEUED
 [  +0.000015] ata1.00: cmd 61/00:98:00:e2:ff/05:00:0e:01:00/40 tag 19 ncq dma 655360 out
                        res 40/00:98:00:e2:ff/00:00:0e:01:00/40 Emask 0x50 (ATA bus error)
 [  +0.000012] ata1.00: status: { DRDY }
 [  +0.000009] ata1: hard resetting link
 [  +0.311884] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
 [  +0.049673] ata1.00: configured for UDMA/133
 [  +0.000023] ata1: EH complete
 }}}

 and smartd sent us another email:

 {{{
 Device: /dev/sda [SAT], ATA error count increased from 0 to 1
 }}}
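
 To see which ports and commands are affected, the dmesg excerpts above
 can be tallied with a small script. This is only a sketch: the function
 name `tally_failed_commands` and the sample text are made up for
 illustration, and the regular expression assumes the
 "ataN.MM: failed command: ..." line format seen in the logs above.

```python
import re
from collections import Counter

# Matches lines such as "ata1.00: failed command: READ FPDMA QUEUED".
# Assumption: the format is the one seen in the dmesg excerpts in this ticket.
FAILED = re.compile(r"(ata\d+)\.\d+: failed command: (.+?)\s*$")


def tally_failed_commands(dmesg_text):
    """Count failed ATA commands per (port, command) pair."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FAILED.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Shortened sample based on the Jan29 18:30 excerpt (illustrative only).
sample = """\
[  +0.000012] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000010] ata1.00: failed command: WRITE FPDMA QUEUED
[  +0.000009] ata1: hard resetting link
"""

for (port, command), n in sorted(tally_failed_commands(sample).items()):
    print(port, command, n)
```

 Running it over the full `dmesg` output would show whether the errors
 stay on one port or appear on both `ata1` and `ata2`.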

 I reopened the ticket with Hetzner, who will visit the server again
 shortly. They also find it strange that the error came back, and suspect
 something might be wrong with the SATA cables.
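
 As a cross-check on the cable theory, the `SErr` values in the logs can
 be decoded by hand. The sketch below assumes the usual SATA SError
 register bit layout and uses the short labels the kernel prints. Only
 the four labels that appear in this ticket are confirmed by the logs;
 the rest of the table, and the helper name `decode_serror`, are
 assumptions for illustration.

```python
# Bit position -> label. Layout assumed from the SATA SError register
# definition; labels follow the short names seen in the kernel messages.
SERROR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",      # 10b-to-8b decode error
    20: "Dispar",     # disparity error
    21: "BadCRC",     # CRC error
    22: "Handshk",    # handshake error
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the labels of all bits set in an SError value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if value & (1 << bit)]


print(decode_serror(0x480900))  # ['UnrecovData', 'HostInt', '10B8B', 'Handshk']
print(decode_serror(0x280900))  # ['UnrecovData', 'HostInt', '10B8B', 'BadCRC']
```

 Both values decode to the same flags the kernel printed. The 10B8B,
 BadCRC and Handshk bits all describe errors on the SATA link itself
 rather than on the disk media, which would be consistent with a cabling
 or connector problem, though it does not prove one.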

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33098>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online