[tor-bugs] #25347 [Core Tor/Tor]: Tor stops building circuits, and doesn't start when it has enough directory information

Mon Apr 2 20:50:45 UTC 2018

#25347: Tor stops building circuits, and doesn't start when it has enough directory
information
-------------------------------------------------+-------------------------
 Reporter:  teor                                 |          Owner:  asn
     Type:  defect                               |         Status:
                                                 |  needs_revision
 Priority:  Medium                               |      Milestone:  Tor:
                                                 |  0.3.3.x-final
Component:  Core Tor/Tor                         |        Version:  Tor:
                                                 |  0.3.0.6
 Severity:  Normal                               |     Resolution:
 Keywords:  031-backport, 032-backport,          |  Actual Points:
  033-must, tor-guard, tor-client, tbb-          |
  usability-website, tbb-needs,                  |
  033-triage-20180320, 033-included-20180320     |
Parent ID:  #21969                               |         Points:  1
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by s7r):

 Replying to [comment:10 asn]:
 > Looking at your logs, it seems like your guard rejected about 230 new
 circuit creations in 15 minutes with the excuse of `RESOURCELIMIT`. And
 your client just kept making more and more circuits to the same guard that
 were getting rejected... I've also noticed this exact same behavior on a
 client of mine recently.
 >

 I see that as well, and it happens fairly often, but normally Tor has no
 problem switching to guard 2/3 or even guard 3/3 to maintain
 functionality. This time (which is rare) it stayed stuck in this useless
 state.

 > My theory on why `RESOURCELIMIT` was used by your guard (given that you
 say that DoS patch was disabled) is that `assign_onionskin_to_cpuworker()`
 failed because `onion_pending_add()` failed because
 `have_room_for_onionskin()` failed. That means that the relay was
 overworked and had way too many cells to process at that time.
 Unfortunately, I can't see whether you are sending NTOR or TAP cells given
 your logs.
 >

 I know for sure the DoS patch is not related, because I triple-checked all
 3 primary guards and none of them was running a Tor version that includes
 the DoS patch we merged. I think I was using only NTOR cells, because I
 was only trying to reach clearnet websites (check.tpo and DuckDuckGo).
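
 The failure chain asn describes can be pictured with a small model. This
 is a hedged sketch in Python, not Tor's C code: the function names echo
 the ones named in the ticket, but the bodies, the `MAX_PENDING` limit, the
 `idle_workers` parameter and the string return values are assumptions made
 purely for illustration.

```python
from collections import deque

# Hypothetical cap on queued onionskins; Tor's real check is more involved.
MAX_PENDING = 3


class OnionQueue:
    """Toy model of a relay's queue of pending circuit-creation requests."""

    def __init__(self, max_pending=MAX_PENDING):
        self.max_pending = max_pending
        self.pending = deque()

    def have_room_for_onionskin(self):
        # Assumption: room is judged by queue length alone.
        return len(self.pending) < self.max_pending

    def onion_pending_add(self, circ_id):
        # Refuse to queue when the relay is already overloaded.
        if not self.have_room_for_onionskin():
            return False
        self.pending.append(circ_id)
        return True


def assign_onionskin_to_cpuworker(queue, circ_id, idle_workers):
    """Return "PROCESSING", "QUEUED", or "DESTROY(RESOURCELIMIT)"."""
    if idle_workers > 0:
        return "PROCESSING"
    if queue.onion_pending_add(circ_id):
        return "QUEUED"
    # Neither a free worker nor queue space: reject the circuit.
    return "DESTROY(RESOURCELIMIT)"


def simulate(num_requests, idle_workers=0):
    """Feed num_requests circuit creations into one fresh queue."""
    queue = OnionQueue()
    return [
        assign_onionskin_to_cpuworker(queue, circ_id, idle_workers)
        for circ_id in range(num_requests)
    ]
```

 In this model a client that keeps sending requests to a relay with a full
 queue gets nothing but rejections, which matches the pattern in the logs.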

 > Like you said, I think the most obvious misbehavior here is that you
 keep on hassling your guard even tho it's telling you to relax by sending
 you `RESOURCELIMIT` `DESTROY` cells. Perhaps one approach here would be
 to choose a different guard after a guard has sent us `RESOURCELIMIT`
 cells, in an attempt to unclog the guard and to get better service.
 '''Let's think about this some more:'''
 >
 > What's the best behavior here? Should we mark the guard as down after
 receiving a single `RESOURCELIMIT` cell, or should we hassle the guard a
 bit before giving up?
 >

 This is the most important part we need to take care of. I dislike the
 idea of removing the guard after receiving a single `RESOURCELIMIT` cell.
 At the very least we should retry it after some time, using the same
 exponential backoff we use when one of our primary guards is not running
 or not listed, and keep the same logic, timing and behavior so we don't
 have to maintain more branches.
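
 To make the two options concrete (give up after one cell, or tolerate a
 few and then back off), here is a minimal sketch in Python. It is not
 Tor's guard code: the `GuardBackoff` class, its method names, and the
 `threshold`, `base_delay` and `max_delay` values are all hypothetical.

```python
class GuardBackoff:
    """Toy tracker: mark a guard down after repeated RESOURCELIMIT cells."""

    def __init__(self, threshold=3, base_delay=60, max_delay=3600):
        # All three defaults are hypothetical, not values taken from Tor.
        self.threshold = threshold
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.rejections = 0    # rejections since the last mark-down
        self.times_down = 0    # consecutive times the guard was marked down
        self.retry_at = None   # time in seconds when a retry is allowed

    def is_usable(self, now):
        return self.retry_at is None or now >= self.retry_at

    def note_resourcelimit(self, now):
        """Record one rejection; return True if the guard was marked down."""
        self.rejections += 1
        if self.rejections < self.threshold:
            return False
        # The delay doubles on each consecutive mark-down, up to max_delay.
        delay = min(self.base_delay * 2 ** self.times_down, self.max_delay)
        self.times_down += 1
        self.rejections = 0
        self.retry_at = now + delay
        return True

    def note_success(self):
        """A circuit was built through the guard: reset all state."""
        self.rejections = 0
        self.times_down = 0
        self.retry_at = None
```

 With `threshold=1` this reduces to marking the guard down on the first
 `RESOURCELIMIT` cell, so the same sketch covers both behaviors under
 discussion.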

 > Most importantly, can we make sure that the `DESTROY` cell came from the
 guard and not from some other node in the path? If we can make sure that
 the `DESTROY` cell came from the guard, this seems to me like a pretty safe
 countermeasure since we should trust the guard to tell us whether it's
 overworked or not.
 >

 As far as I understand from arma's comment, the `DESTROY` cell can only
 come from the guard.

 > WRT timeline here, I think working on this countermeasure (mark guard as
 down when overworked to get better service) seems like a plausible goal
 for 033, but anything more involved will probably need to wait for 034.
 >
 > Would appreciate feedback from Nick or Tim here :)
 >
 > ----
 >
 > I still can't explain why you managed to bootstrap after hacking your
 state file tho. Perhaps a coincidence? Perhaps you were overworking your
 guard and when you stopped, it relaxed? Perhaps the hack worked
 differently than you imagine? Not sure.

 I sincerely hope so. But it makes me think: the guard appears overworked
 for many hours, yet when I delete my state file, restart, and then edit
 the new state file to put back the same 3/3 primary guards that were not
 letting me connect, Tor connects fine. I don't have any evidence that
 something was wrong with the state file, and I can't see what could have
 been wrong with it; it does not make any sense. This bug is very hard to
 reproduce or catch in the wild.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25347#comment:23>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online