I've seen this about three times, maybe four. Anyone have any idea what's going on? Version 0.2.6.10 on Linux. Showed up in 0.2.5.x and 0.2.4.x earlier.
1) seems to happen only once per relay start, after anywhere from a few hours up to a couple of days of uptime; weeks or months between occurrences
2) might be triggered by a dynamic config change, but I'm not certain (there was one this time, about an hour before the event)
3) might be related to / triggered by client activity -- I was simply browsing the regular Internet at the time
4) the main thread appears to bog down, but some traffic continues
5) log messages tied to event
:20:26 : Failed to find node for hop 0 of our path. Discarding this circuit.
:20:26 : Tried for 120 seconds to get a connection to [scrubbed]:443. Giving up.
:20:29 : Failed to find node for hop 0 of our path. Discarding this circuit.
:20:32 : Failed to find node for hop 0 of our path. Discarding this circuit.
:20:32 : Tried for 122 seconds to get a connection to [scrubbed]:443. Giving up. (waiting for circuit)
:20:34 : Failed to find node for hop 0 of our path. Discarding this circuit.
:21:06 : last message repeated 14 times
:21:45 : last message repeated 23 times
6) for several minutes, all outbound client connection attempts show <fail> in Vidalia; NEWNYM has no effect (a control-port sketch for NEWNYM follows this list)
7) (5) and (6) suggest rate-limiting by the guard, but AFAIK there was no burst of circuit requests that would trigger it
8) client functionality recovers in something like five minutes (a while after the notch in relayed traffic ends)
9) no increase in memory consumption, so apparently no buffering backlog
10) WAN clean, no glitches
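(For anyone who wants to poke at (6) by hand: NEWNYM can be sent over the control port roughly like this. Just a sketch, assuming ControlPort 9051 with no control-port authentication configured; adjust for cookie or password auth.)

$ printf 'AUTHENTICATE\r\nSIGNAL NEWNYM\r\nQUIT\r\n' | nc 127.0.0.1 9051

Each command should come back with a 250 reply if it is accepted.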
I suppose I should submit a bug report, but I just finished with a bug-from-hell and am in no mood to work another one, especially since the problem is rare and not fatal.
Strange: right after the event the reported number of "circuits" went up hugely -- I have never seen this before:
Heartbeat: Tor's uptime is 4 days 0:00 hours, with 5359 circuits open. . .
Heartbeat: Tor's uptime is 4 days 6:00 hours, with 17524 circuits open. . .
But the line count from "GETINFO orconn-status" is around 4100.
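(Roughly how such a count can be pulled straight from the control port; a sketch assuming ControlPort 9051 and no control-port authentication. The total includes a few protocol framing lines, so it is approximate:)

$ printf 'AUTHENTICATE\r\nGETINFO orconn-status\r\nQUIT\r\n' | nc 127.0.0.1 9051 | wc -l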
I performed a SIGUSR1 dump and the 17k number shows up there too, but both the "Conn" and "Channel" lists are around 4k:
==========
Dumping stats:
$ egrep '] Conn [0-9]+ is ' logfile | wc -l
4204
Dumping statistics about 4193 channels: 4193 are active, and 0 are done and waiting for cleanup
$ fgrep '] Channel ' logfile | wc -l
4194
---------------
Dumping memory information:
In buffers for 4208 connections: 59566 used/565248 allocated
For 4193 OR connections: 1002 used/458752 allocated
In 7482 live descriptors: 10293977 bytes.
In 14832 old descriptors: 20384003 bytes.
0 cells allocated on 17017 circuits. 2 cells leaked.
$ fgrep '] OR ' logfile | wc -l
11132
==========
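(For anyone wanting to reproduce the dump above: Tor writes these statistics to its log when it receives SIGUSR1, e.g.

$ pkill -USR1 -x tor    # or: kill -USR1 <pid of the tor process>

then grep the log as in the egrep/fgrep lines above.)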
The report is confusing: what is a "Conn"? A "Channel"? A "circuit"? The Tor specification does not discuss this, and a Google search turned up nothing.
Identified the issue:
The problem occurs when the client function in the relay daemon cannot establish circuits to the guard for any reason, whether because the guard is down or because the guard has activated DDoS/sniper-attack rate limiting.
This shows up most obviously when the guard(s) are manually configured, but even under otherwise normal operation the Vidalia graph shows traffic locking up for some large fraction of a second while circuit creation is under way.
Something is taking control away from the event loop for much longer than a correct implementation should allow.
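(Manually configuring the guard here means pinning it in torrc along the lines below -- the fingerprint is only a placeholder. With StrictNodes set, an unreachable guard should make the stall easy to reproduce.)

# placeholder fingerprint -- substitute the real guard's fingerprint
EntryNodes $ABCDEF0123456789ABCDEF0123456789ABCDEF01
# do not fall back to other guards if the pinned one is unreachable
StrictNodes 1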
Opened a Trac ticket:
#16585: relay stops forwarding all traffic when client function cannot establish circuit
https://trac.torproject.org/projects/tor/ticket/16585