[tor-bugs] #32692 [Internal Services/Tor Sysadmin Team]: meronense database server crashes on I/O errors

Tor Bug Tracker & Wiki blackhole at torproject.org
Sat Dec 7 17:05:27 UTC 2019


#32692: meronense database server crashes on I/O errors
-----------------------------------------------------+-----------------
     Reporter:  anarcat                              |      Owner:  tpa
         Type:  defect                               |     Status:  new
     Priority:  Medium                               |  Milestone:
    Component:  Internal Services/Tor Sysadmin Team  |    Version:
     Severity:  Normal                               |   Keywords:
Actual Points:                                       |  Parent ID:
       Points:                                       |   Reviewer:
      Sponsor:                                       |
-----------------------------------------------------+-----------------
 meronense crashed today; timestamps below are local time (UTC-5):

 {{{
 08:35:18 <nsa> tor-nagios: [meronense] system - filesystem check is
 WARNING: /dev/mapper/vg_meronense-srv (/home) has 2 errors
 09:03:53 <nsa> tor-nagios: [meronense] process - postgresql96 - master is
 CRITICAL: PROCS CRITICAL: 0 processes with UID = 112 (postgres), command
 name postgres, args
                /usr/lib/postgresql/9.6/bin/postgres
 09:08:33 <nsa> tor-nagios: [meronense] system - all services running is
 CRITICAL: CHECK_NRPE STATE CRITICAL: Socket timeout after 50 seconds.
 09:22:43 <nsa> tor-nagios: [meronense] system - all services running is
 WARNING: degraded
 }}}
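The "2 errors" figure in that warning most likely comes from the ext4 superblock error counter, which the kernel increments whenever it trips over on-disk corruption. A hedged sketch to read it directly; the device name is taken from the warning above, and `fs_error_count` is just an illustrative helper name:

```shell
# Hypothetical read-only helper: dump the ext4 state and error counters
# for the LV that nagios complained about. Needs root on the guest.
fs_error_count() {
    tune2fs -l /dev/mapper/vg_meronense-srv \
        | grep -iE 'filesystem state|error'
}
```

If the filesystem is healthy this should show `Filesystem state: clean` and no error-count lines.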

 I restarted postgres when I noticed:

 {{{
 10:11:50 <anarcat> restarted postgres on meronense, unknown reason crash
 }}}

 but it crashed again.

 the errors actually started here (UTC), in
 /var/log/postgresql/postgresql-9.6-main.log:

 {{{
 2019-12-07 12:09:30 GMT WARNING:  corrupted statistics file
 "pg_stat_tmp/global.stat"
 [repeats until]
 2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file
 "pg_stat_tmp/global.stat"
 2019-12-07 12:44:28 GMT LOG:  using stale statistics instead of current
 ones because stats collector is not responding
 2019-12-07 12:44:28 GMT WARNING:  corrupted statistics file
 "pg_stat_tmp/global.stat"
 2019-12-07 12:47:26 GMT PANIC:  could not fsync file "base/16401/17372":
 Input/output error
 2019-12-07 12:47:26 GMT LOG:  checkpointer process (PID 1284) was
 terminated by signal 6: Aborted
 2019-12-07 12:47:26 GMT LOG:  terminating any other active server
 processes
 2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash
 of another server process
 2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server
 process to roll back the current transaction and exit, because another
 server process exited abnormally and possibly
 corrupted shared memory.
 2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect
 to the database and repeat your command.
 2019-12-07 12:47:26 GMT LOG:  archiver process (PID 1288) exited with exit
 code 1
 2019-12-07 12:47:26 GMT WARNING:  terminating connection because of crash
 of another server process
 2019-12-07 12:47:26 GMT DETAIL:  The postmaster has commanded this server
 process to roll back the current transaction and exit, because another
 server process exited abnormally and possibly
 corrupted shared memory.
 2019-12-07 12:47:26 GMT HINT:  In a moment you should be able to reconnect
 to the database and repeat your command.
 2019-12-07 12:47:26 GMT CONTEXT:  SQL statement "UPDATE bwhist
             SET read = array_merge(read, insert_read),
                 written = array_merge(written, insert_written),
                 dirread = array_merge(dirread, insert_dirread),
                 dirwritten = array_merge(dirwritten, insert_dirwritten)
             WHERE fingerprint = insert_fingerprint AND date = insert_date"
         PL/pgSQL function
 insert_bwhist(character,date,bigint[],bigint[],bigint[],bigint[]) line 12
 at SQL statement
 2019-12-07 12:47:36 GMT LOG:  statistics collector process (PID 1289) was
 terminated by signal 9: Killed
 2019-12-07 12:47:36 GMT LOG:  all server processes terminated;
 reinitializing
 2019-12-07 12:47:58 GMT LOG:  database system was interrupted; last known
 up at 2019-12-07 11:02:32 GMT
 2019-12-07 12:48:09 GMT LOG:  could not fsync file
 "./base/16401/17372_fsm": Input/output error
 2019-12-07 12:48:09 GMT LOG:  could not fsync file
 "./base/16401/17372_vm": Input/output error
 2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17488":
 Input/output error
 2019-12-07 12:48:17 GMT LOG:  could not fsync file
 "./pg_stat_tmp/global.stat": Input/output error
 2019-12-07 12:48:17 GMT LOG:  database system was not properly shut down;
 automatic recovery in progress
 2019-12-07 12:48:17 GMT LOG:  redo starts at 3D7/57E470D8
 2019-12-07 12:48:17 GMT LOG:  redo done at 3D7/5BB04860
 2019-12-07 13:49:37 GMT PANIC:  could not fsync file "base/16401/17488":
 Input/output error
 2019-12-07 13:49:37 GMT LOG:  startup process (PID 20894) was terminated
 by signal 6: Aborted
 2019-12-07 13:49:37 GMT LOG:  aborting startup due to startup process
 failure
 2019-12-07 13:49:37 GMT LOG:  database system is shut down
 }}}
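To see which on-disk relations are affected, the distinct file paths can be pulled out of those fsync failures. A sketch, with three sample lines from the log above inlined so it is self-contained; on the host you would grep the real log file instead:

```shell
# Extract the distinct relfilenode paths from "could not fsync" messages.
# The heredoc stands in for /var/log/postgresql/postgresql-9.6-main.log.
paths=$(grep -oE 'could not fsync file "[^"]+"' <<'EOF' | sed -e 's/^[^"]*"//' -e 's/"$//' -e 's|^\./||' | sort -u
2019-12-07 12:47:26 GMT PANIC:  could not fsync file "base/16401/17372": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17372_fsm": Input/output error
2019-12-07 12:48:09 GMT LOG:  could not fsync file "./base/16401/17488": Input/output error
EOF
)
echo "$paths"
```

The `16401` directory is the database's OID; each filenode can then be mapped to a table from inside that database with e.g. `SELECT pg_filenode_relation(0, 17372);` (0 meaning the default tablespace).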

 here's my attempt at restarting it:

 {{{

 2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:33 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:34 GMT LOG:  could not stat file
 "./pg_stat_tmp/global.stat": Structure needs cleaning
 2019-12-07 15:11:34 GMT LOG:  database system was not properly shut down;
 automatic recovery in progress
 2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:34 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:35 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:36 GMT LOG:  redo starts at 3D7/57E470D8
 2019-12-07 15:11:36 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:37 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:38 GMT FATAL:  the database system is starting up
 2019-12-07 15:11:38 GMT LOG:  incomplete startup packet
 2019-12-07 15:11:46 GMT LOG:  redo done at 3D7/5BB04860
 2019-12-07 15:48:19 GMT PANIC:  could not fsync file "base/16401/17488":
 Input/output error
 2019-12-07 15:48:19 GMT LOG:  startup process (PID 27916) was terminated
 by signal 6: Aborted
 2019-12-07 15:48:19 GMT LOG:  aborting startup due to startup process
 failure
 2019-12-07 15:48:19 GMT LOG:  database system is shut down
 }}}
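"Structure needs cleaning" in that restart attempt is the kernel's EUCLEAN errno: ext4 itself has flagged on-disk corruption, so postgres cannot get past recovery until the filesystem is repaired. A hedged runbook sketch, defined but deliberately not executed; the device is assumed from the nagios warning, and it assumes the LV can be unmounted while postgres is stopped:

```shell
# Hypothetical repair sequence -- defined, not run. Device and service
# names are assumptions based on the ticket; adjust before use.
repair_srv_fs() {
    systemctl stop postgresql                     # quiesce the cluster
    umount /dev/mapper/vg_meronense-srv           # fsck needs it offline
    fsck.ext4 -f /dev/mapper/vg_meronense-srv     # full metadata check
    mount /dev/mapper/vg_meronense-srv            # remount via fstab entry
    systemctl start postgresql
}
```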

 Then I shut down the host from kvm4 and restarted it; after that, postgres
 was able to recover:

 {{{

 2019-12-07 16:54:17 GMT LOG:  incomplete startup packet
 2019-12-07 16:54:17 GMT LOG:  database system shutdown was interrupted;
 last known up at 2019-12-07 15:26:47 GMT
 2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:18 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:18 GMT LOG:  database system was not properly shut down;
 automatic recovery in progress
 2019-12-07 16:54:18 GMT LOG:  redo starts at 3D7/57E470D8
 2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:19 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:20 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:21 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:22 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:23 GMT FATAL:  the database system is starting up
 2019-12-07 16:54:23 GMT LOG:  incomplete startup packet
 2019-12-07 16:54:23 GMT LOG:  redo done at 3D7/5BB04860
 2019-12-07 16:57:28 GMT LOG:  MultiXact member wraparound protections are
 now enabled
 2019-12-07 16:57:28 GMT LOG:  autovacuum launcher started
 2019-12-07 16:57:28 GMT LOG:  database system is ready to accept
 connections

 }}}
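Redo completing is no guarantee the heap is intact after I/O errors, though; a full logical dump reads every live row and will error out on damaged pages. A hedged verification sketch, defined but not executed here; `metrics` is a guessed database name (suggested by the `bwhist`/`insert_bwhist` identifiers in the log), not confirmed by the ticket:

```shell
# Hypothetical integrity pass: dump everything to /dev/null just to
# force a full read of the heap. Database name is an assumption.
verify_cluster() {
    sudo -u postgres pg_dump metrics >/dev/null \
        && echo "full read OK"
}
```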


 disk arrays look fine according to mdadm on kvm4, the parent host:

 {{{
 root at kvm4:~# cat /proc/mdstat
 Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5]
 [raid4] [raid10]
 md2 : active raid1 sda[2] sdb[3]
       3906887488 blocks super 1.2 [2/2] [UU]
       bitmap: 12/30 pages [48KB], 65536KB chunk

 md1 : active raid1 nvme1n1p2[1] nvme0n1p2[0]
       479002944 blocks super 1.2 [2/2] [UU]
       bitmap: 3/4 pages [12KB], 65536KB chunk

 md0 : active raid1 nvme1n1p1[1] nvme0n1p1[0]
       20955136 blocks super 1.2 [2/2] [UU]

 unused devices: <none>
 }}}
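A clean `[UU]` only says both members are still in the array, though; it does not rule out latent sector errors on a member disk, which would surface exactly as guest-side I/O errors. A hedged sketch to interrogate the disks directly on kvm4, defined but not run; the device list is taken from the mdstat output above:

```shell
# Hypothetical member-disk checks on kvm4 -- defined, not executed.
check_md_members() {
    for dev in /dev/sda /dev/sdb /dev/nvme0n1 /dev/nvme1n1; do
        smartctl -H "$dev"                    # SMART health verdict
    done
    cat /sys/block/md2/md/mismatch_cnt        # nonzero => mirrors disagree
}
```

A full scrub of the array (`echo check > /sys/block/md2/md/sync_action`) would also force a read of every sector and update `mismatch_cnt`.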

 This is the second time I've seen this happen, and I'm not sure what's
 going on. I need to run now; another pair of eyes on this would be
 appreciated.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32692>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

