[tor-bugs] #33098 [Internal Services/Tor Sysadmin Team]: fsn-node-03 disk problems

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Jan 30 19:36:33 UTC 2020


#33098: fsn-node-03 disk problems
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Blocker                              |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by anarcat):

 and we got more SMART email messages about this server. now it's also sdb
 that's complaining.

 i also noticed that sdb has been complaining even before i opened that
 ticket with Hetzner. in fact, what triggered me to open that ticket is the
 second smartd email, which i mistakenly thought was caused by sda errors:
 the second email we got was about sdb! so changing sda wouldn't have
 solved that problem.
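
 to avoid mixing up the two disks again, the device name can be pulled out
 of the smartd notification mechanically. what follows is only a sketch:
 the pattern is modelled on the single smartd line quoted in this ticket,
 and the function name `parse_smartd_ata_error` is made up for
 illustration, it is not part of smartmontools.

```python
import re

# Pattern modelled on the one smartd line quoted in this ticket
# ("Device: /dev/sdb [SAT], ATA error count increased from 4 to 5").
# Other smartd message types are not covered (assumption).
ATA_ERROR_RE = re.compile(
    r"Device: (?P<device>/dev/\S+) \[[^\]]+\], "
    r"ATA error count increased from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_smartd_ata_error(body):
    """Return (device, old_count, new_count), or None if nothing matches."""
    match = ATA_ERROR_RE.search(body)
    if match is None:
        return None
    return match.group("device"), int(match.group("old")), int(match.group("new"))
```

 feeding it the body of each notification makes it obvious when two
 emails are about different devices.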

 i commented on hetzner's ticket with the following:

 > We're still having trouble with this server.
 >
 > After a full RAID-1 resync, I rebooted the box, but the new disk was
 > kicked out of the array, and not detected as having a RAID superblock:
 >
 > {{{
 > root@fsn-node-03:~# mdadm -E /dev/sda1
 > mdadm: No md superblock detected on /dev/sda1.
 > }}}
 >
 > When I started the array and readded the disk, it started a full resync
 > again:
 >
 > {{{
 > root@fsn-node-03:~# mdadm --run /dev/md2
 > mdadm: started array /dev/md/2
 > root@fsn-node-03:~# mdadm /dev/md2 -a /dev/sda1
 > mdadm: added /dev/sda1
 > root@fsn-node-03:~# cat /proc/mdstat
 > Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
 > md2 : active raid1 sda1[2] sdb1[1]
 >       9766302720 blocks super 1.2 [2/1] [_U]
 >       [>....................]  recovery =  0.0% (274048/9766302720) finish=593.9min speed=274048K/sec
 >       bitmap: 0/73 pages [0KB], 65536KB chunk
 >
 > md1 : active raid1 nvme0n1p3[0] nvme1n1p3[1]
 >       937026560 blocks super 1.2 [2/2] [UU]
 >       bitmap: 1/7 pages [4KB], 65536KB chunk
 >
 > md0 : active raid1 nvme0n1p2[1] nvme1n1p2[0]
 >       523264 blocks super 1.2 [2/2] [UU]
 >
 > unused devices: <none>
 > }}}
 >
 > Furthermore, I just noticed that we have received smartd notifications
 > about the *OTHER* hard drive (sdb):
 >
 > > Date: Wed, 29 Jan 2020 23:46:38 +0000
 > >
 > > [...]
 > >
 > > Device: /dev/sdb [SAT], ATA error count increased from 4 to 5
 > >
 > > Device info:
 > > TOSHIBA MG06ACA10TEY, S/N:[...], WWN:[...], FW:0103, 10.0 TB
 >
 > We have also seen errors from sdb, the second drive, *before* we opened
 > this ticket. That was my mistake: I thought the errors were both from
 > the same disk; I couldn't imagine both disks were giving out errors.
 >
 > At this point, I am wondering if it might not be better to just
 > commission a completely new machine rather than trying to revive this
 > one. I get the strong sense something is wrong with the disk controller
 > on that one. We have two other PX62 servers with an identical setup
 > (fsn-node-01/PX62-NVMe #[...], fsn-node-02/PX62-NVMe #[...]).
 > Both are in production and neither shows the same disk problems.
 >
 > In any case, I can't use the box like this: its (software) RAID array
 > doesn't survive reboots, which tells me there's something very wrong
 > with this machine.
 >
 > Could you look into this again please?

 So I think that, worst case, they just swap the machine and we reinstall.
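
 In the meantime, the state of the arrays can be checked after each reboot
 without eyeballing `/proc/mdstat`. Below is a rough sketch that flags any
 array whose status line shows a missing member (like the `[2/1] [_U]` on
 md2 above). The function name `degraded_arrays` is hypothetical, and the
 parsing only assumes the mdstat layout shown in this ticket.

```python
import re

# Array header line, e.g. "md2 : active raid1 sda1[2] sdb1[1]"
ARRAY_RE = re.compile(r"^(md\d+) : ")
# Member status, e.g. "[2/1] [_U]" (expected/active, then one flag per member)
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return {array_name: flags} for arrays that are missing a member."""
    degraded = {}
    current = None
    for raw_line in mdstat_text.splitlines():
        line = raw_line.strip()
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            expected, active = int(status.group(1)), int(status.group(2))
            flags = status.group(3)
            if active < expected or "_" in flags:
                degraded[current] = flags
            current = None  # only the first status line belongs to the array
    return degraded
```

 On the box itself this would be called as
 `degraded_arrays(open("/proc/mdstat").read())`. Note that an array that
 is still resyncing is also reported, since it shows `[2/1]` until the
 recovery completes.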

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33098#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

