[tor-bugs] #29672 [Internal Services/Service - trac]: trac gets overwhelmed

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Apr 11 17:10:30 UTC 2019


#29672: trac gets overwhelmed
----------------------------------------------+--------------------------
 Reporter:  anarcat                           |          Owner:  qbi
     Type:  defect                            |         Status:  assigned
 Priority:  High                              |      Milestone:
Component:  Internal Services/Service - trac  |        Version:
 Severity:  Critical                          |     Resolution:
 Keywords:                                    |  Actual Points:
Parent ID:                                    |         Points:
 Reviewer:                                    |        Sponsor:
----------------------------------------------+--------------------------

Comment (by anarcat):

 Today trac hung badly: all requests returned 503 errors to clients, and
 the machine was maxing out its CPU and memory. I found this in the error log:

 {{{
 [Thu Apr 11 16:30:23.749569 2019] [wsgi:error] [pid 22934:tid
 140416296871680] (11)Resource temporarily unavailable: [client
 [REDACTED]:40900] mod_wsgi (pid=22934): Unable to connect to WSGI daemon
 process 'trac.torproject.org' on '/var/run/apache2/wsgi.2106.9.1.sock'
 after multiple attempts as listener backlog limit was exceeded.
 }}}
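
 To see when the backlog errors began and how quickly they piled up, a
 small script along these lines could tally them per minute. This is only a
 sketch: the function name and the regular expression are made up for
 illustration, and it assumes each entry sits on a single line in the real
 error log (the excerpt above is wrapped by the mailer).

```python
import re
from collections import Counter

# Leading Apache 2.4 timestamp, e.g. "[Thu Apr 11 16:30:23.749569 2019]".
# The capture group keeps it down to the minute: "Thu Apr 11 16:30".
TIMESTAMP = re.compile(r"^\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}):\d{2}\.\d+ \d{4}\]")

# Text taken from the mod_wsgi error quoted above.
BACKLOG_MARKER = "listener backlog limit was exceeded"


def backlog_errors_per_minute(lines):
    """Count mod_wsgi backlog errors per minute (hypothetical helper).

    `lines` is any iterable of log lines, such as an open file.
    Minutes appear in the result in the order they occur in the log.
    """
    counts = Counter()
    for line in lines:
        if BACKLOG_MARKER not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

 Calling it on an open file handle for the error log and printing the
 resulting items would give one count per minute, which can then be
 compared against the CPU and hit graphs.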

 The `trac.log` was full of:

 {{{
 IOError: Apache/mod_wsgi failed to write response data: Broken pipe
 }}}

 CPU and memory had been maxed out for more than two hours already when the
 outage started:

 [[Image(https://paste.anarc.at/snaps/snap-2019.04.11-12.53.48.png,700)]]

 Apache was also seeing more hits than usual:

 [[Image(https://paste.anarc.at/snaps/snap-2019.04.11-12.57.04.png,700)]]

 But I don't believe it was starved of resources:

 [[Image(https://paste.anarc.at/snaps/snap-2019.04.11-12.58.56.png,700)]]

 It's possible the pgsql database got overwhelmed. We don't have metrics
 for that in Prometheus because, ironically enough, I decided just
 yesterday that collecting them might be overkill. Maybe we should revisit
 that decision now.

 I wonder if our WSGI config could be tweaked. This is what we have right
 now:

 {{{
 WSGIDaemonProcess trac.torproject.org user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 display-name=wsgi-trac.torproject.org
 }}}
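
 For a sense of scale, `processes=6 threads=10` gives the daemon group 60
 worker threads in total; requests beyond that have to wait on the daemon
 socket, which fits the "listener backlog limit was exceeded" error. The
 sketch below does that arithmetic from a directive line. The helper names
 are hypothetical, and the fallback values (1 process, 15 threads) are what
 I understand the mod_wsgi defaults to be, so check them against the
 documentation for the installed version.

```python
def parse_daemon_options(directive):
    """Split a WSGIDaemonProcess line into (name, options).

    Hypothetical helper: it does a plain whitespace split and does not
    handle quoted values.
    """
    # Treat Apache backslash line continuations as plain whitespace.
    tokens = directive.replace("\\\n", " ").split()
    if len(tokens) < 2 or tokens[0] != "WSGIDaemonProcess":
        raise ValueError("not a WSGIDaemonProcess directive")
    options = {}
    for token in tokens[2:]:
        key, _, value = token.partition("=")
        options[key] = value
    return tokens[1], options


def max_concurrent_requests(options):
    """Worker threads across all daemon processes.

    Assumed defaults when an option is absent: 1 process, 15 threads.
    """
    processes = int(options.get("processes", 1))
    threads = int(options.get("threads", 15))
    return processes * threads


directive = (
    "WSGIDaemonProcess trac.torproject.org user=tracweb group=tracweb home=/ "
    "processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 "
    "umask=0007 display-name=wsgi-trac.torproject.org"
)
name, options = parse_daemon_options(directive)
print(name, max_concurrent_requests(options))
```

 If trac requests are slow (for instance because they wait on the
 database), those 60 slots can fill quickly during a traffic spike, so
 this number is worth keeping in mind when tuning.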

 I've decided to make more of those settings explicit to see if some tweaks
 might be useful:

 {{{
 WSGIDaemonProcess trac.torproject.org user=tracweb group=tracweb home=/ processes=6 threads=10 maximum-requests=5000 inactivity-timeout=1800 umask=0007 graceful-timeout=30 restart-interval=30 response-socket-timeout=10 display-name=wsgi-trac.torproject.org
 }}}

 Rebooting the server cleared the immediate problem; we'll see whether the
 tweaks above prevent it from happening again.

 Failing that, the next thing to check when this happens again is whether
 the database is overloaded. That would explain why the frontend falls over
 for no obvious reason, although it must be said that most of the CPU was
 taken by WSGI processes, not pgsql.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29672#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

