[tor-bugs] #24594 [Core Tor/Tor]: Protocol warning: Expiring stuck OR connection to fd...

Mon Dec 11 22:03:33 UTC 2017

#24594: Protocol warning: Expiring stuck OR connection to fd...
-------------------------+-------------------------------------------------
     Reporter:  dgoulet  |      Owner:  (none)
         Type:  defect   |     Status:  new
     Priority:  Medium   |  Milestone:  Tor: 0.3.3.x-final
    Component:  Core     |    Version:
  Tor/Tor                |
     Severity:  Normal   |   Keywords:  tor-sched, libevent, tor-connection
Actual Points:           |  Parent ID:
       Points:           |   Reviewer:
      Sponsor:           |
-------------------------+-------------------------------------------------
 In theory, this is only logged at the protocol warning level, so it
 shouldn't be too problematic, but I think it is worth looking at. I've been
 seeing many of these on a test relay I have (capped at 200KB/s) using the
 KIST scheduler (redacting the relay addr/port):

 {{{
 Expiring stuck OR connection to fd 380 (IP:PORT). (3747888 bytes to flush;
 3000 seconds since last write)
 }}}

 This is pretty big, 3.7MB stuck in the `outbuf` of a connection. The
 `3000` seconds since last write means that
 `connection_handle_write_impl()` hasn't been called which is *very*
 surprising in the first place.
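
 To put the numbers from that log line in perspective, here is a minimal
 sketch in C that parses the message and converts its fields. The struct and
 function names are hypothetical (they are not from the Tor source), and the
 format string is only inferred from the single message quoted above. It
 assumes "200KB/s" means 200 * 1024 bytes per second.

```c
#include <assert.h>
#include <stdio.h>

/* Fields carried by one "Expiring stuck OR connection" message.
 * Hypothetical type, for illustration only. */
struct stuck_conn_report {
  int fd;                      /* socket file descriptor */
  unsigned long bytes_pending; /* bytes still waiting in the outbuf */
  unsigned long idle_seconds;  /* seconds since the last write */
};

/* Parse one message. Whitespace in the format matches any run of
 * whitespace, so a message wrapped over two lines parses as well.
 * Returns 1 on success, 0 if the line does not match. */
int
parse_stuck_conn(const char *line, struct stuck_conn_report *out)
{
  assert(line && out);
  int n = sscanf(line,
                 "Expiring stuck OR connection to fd %d (%*[^)]). "
                 "(%lu bytes to flush; %lu seconds since last write)",
                 &out->fd, &out->bytes_pending, &out->idle_seconds);
  return n == 3;
}

/* Time since the last write, in minutes. */
double
idle_minutes(const struct stuck_conn_report *r)
{
  return r->idle_seconds / 60.0;
}

/* Seconds needed to drain the pending bytes at rate_bps bytes/second. */
double
drain_seconds(const struct stuck_conn_report *r, double rate_bps)
{
  return r->bytes_pending / rate_bps;
}
```

 Under those assumptions, the 3000 seconds come out to 50 minutes, while
 draining 3747888 bytes at the relay's cap would take under 20 seconds, so
 the bandwidth cap alone does not explain the stall.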

 There are currently two ways for the handle write function to be called:
 either through the libevent `write_event`, which fires every time the
 socket is *ready* to write (think of it as `POLLOUT` from poll()), or
 directly from the KIST scheduler when cells are put in the outbuf.
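
 For readers less familiar with the readiness model, the sketch below shows
 what "ready to write" means using plain POSIX poll() rather than libevent.
 It is not Tor code: both function names are hypothetical, and it only
 illustrates the event-driven path, where data is written once the kernel
 reports `POLLOUT` on the descriptor.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Return 1 if fd can accept a write within timeout_ms, 0 if it cannot,
 * and -1 if poll() itself failed. Hypothetical helper. */
int
fd_is_writable(int fd, int timeout_ms)
{
  struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
  int n = poll(&pfd, 1, timeout_ms);
  if (n < 0)
    return -1;
  return (n == 1 && (pfd.revents & POLLOUT)) ? 1 : 0;
}

/* Event-driven path: write only when the kernel reports the descriptor
 * as writable. Returns the number of bytes written, 0 if the descriptor
 * was not ready, or -1 on error. */
ssize_t
flush_when_writable(int fd, const char *buf, size_t len)
{
  assert(buf != NULL);
  int ready = fd_is_writable(fd, 0);
  if (ready <= 0)
    return ready;
  return write(fd, buf, len);
}
```

 If the descriptor never becomes writable, `flush_when_writable()` keeps
 returning 0 and the buffered data stays where it is, which is the
 situation the log message describes.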

 This is worrying because it means that KIST did in fact put 3.7MB of cells
 in the outbuf, thinking the socket's TCP buffer could take that data, but
 somehow none of it got written to the socket.

 One possibility is that KIST flushed cells to the connection's outbuf and
 then tried to write them to the network. That write didn't work, but the
 TCP information of the socket was still intact, and because KIST doesn't
 check for errors (#24449), nothing happened. Then, somehow, after those
 3.7MB were put in the outbuf, the channel was never scheduled again for a
 write because KIST had no idea that anything was left in the outbuf from
 the previous flush to the network.
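
 The hypothesis can be illustrated with a toy model. This is only a sketch
 of the failure mode described above, not the actual KIST code: every type
 and function name here is made up, and the "write" is a stand-in callback
 rather than a real socket call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy channel state. Hypothetical, not a Tor structure. */
struct toy_chan {
  size_t outbuf_len; /* bytes queued but not yet on the wire */
  int scheduled;     /* 1 if the scheduler will visit the channel again */
};

/* Stand-in for the network write: returns 0 on success, -1 on failure. */
typedef int (*toy_write_fn)(struct toy_chan *ch);

/* A write that drains the whole outbuf. */
int
toy_write_ok(struct toy_chan *ch)
{
  ch->outbuf_len = 0;
  return 0;
}

/* A write that fails and leaves the outbuf untouched. */
int
toy_write_fail(struct toy_chan *ch)
{
  (void) ch;
  return -1;
}

/* Flush as in the hypothesis: the write result is dropped, so a failure
 * goes unnoticed and the channel is never scheduled again. */
void
toy_flush_unchecked(struct toy_chan *ch, size_t bytes, toy_write_fn wr)
{
  assert(ch && wr);
  ch->outbuf_len += bytes;
  (void) wr(ch);
  ch->scheduled = 0;
}

/* Flush that checks the result and keeps the channel scheduled for as
 * long as the write failed or data remains in the outbuf. */
void
toy_flush_checked(struct toy_chan *ch, size_t bytes, toy_write_fn wr)
{
  assert(ch && wr);
  ch->outbuf_len += bytes;
  int rv = wr(ch);
  ch->scheduled = (rv < 0 || ch->outbuf_len > 0);
}
```

 With `toy_write_fail`, the unchecked variant ends with data in the outbuf
 and `scheduled == 0`, so nothing but a later write event could ever drain
 it. The checked variant leaves `scheduled == 1` in the same situation.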

 So it then comes down to the `write_event` to write those cells flushed by
 KIST. Without a `POLLOUT` event on the socket, nothing will happen, so the
 question I have is: how could this event never fire for 50 minutes? I kind
 of feel that the TCP timeout would have kicked in by then if there was
 really a problem...? But also, that is a _long_ time for an idle
 connection?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/24594>