-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

Hi Roger, I was hoping you'd get to this eventually. :)
Roger Dingledine:
On Sun, Oct 20, 2013 at 09:42:01AM -0700, Gordon Morehouse wrote:
With the slower computers, sometimes too many attempts to connect to the ORPort (I am almost positive as part of TAP circuit building, but not *really* sure) can eventually cause Tor to consume more physmem than available and cause the oom-killer to kill Tor. Also, depending on the crappiness of the user's router, it's effectively a SYN flood, and can crash or impair consumer routers.
This doesn't sound like circuit building. It sounds like TLS handshakes.
Very good to know.
You see, a new circuit handshake (TAP or NTor) is simply a 512-byte cell sent along an already established TCP connection. So if you're getting flooded by circuit handshakes, it will be traffic (which causes CPU load) but it won't be any new TCP connections.
If you're seeing a bunch of new TCP connections, that sounds like clients trying to establish a new OR connection with you. (And those TLS handshakes are done in the core Tor thread, so having a weak CPU while handling a lot of TLS handshakes will cause your other Tor operations to hiccup.)
This is what's going on, and it's often relatively soon after I get my Stable flag.
My solution, so far, is to define (through trial and error on a per-machine basis, since [1] is only officially supporting 3 SBCs right now) limits on how many SYNs may be sent to the ORPort and the DirPort per second. This is done with iptables. I experimented, tuned the parameters and watched traffic for weeks and came up with a pretty good set of limits for a 950MHz Raspberry Pi: 4 SYNs/sec burst 10. (For those about to say the Pi is thus too slow to be used as a relay, it's quite capable of relaying *at least* 2.5Mbps, but *not* when it's getting SYN flooded.)
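To make "4 SYNs/sec burst 10" concrete, here is a minimal Python sketch of a token bucket, assuming semantics like those of the iptables limit match: the bucket starts with 10 credits, each accepted SYN spends one, and credits refill at 4 per second up to the cap. The class and function names are hypothetical and this is a simulation only, not the actual rule set.

```python
class TokenBucket:
    """Token bucket: at most `burst` tokens, refilled at `rate` tokens/second."""

    def __init__(self, rate=4.0, burst=10):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # assumption: the bucket starts full
        self.last = None            # arrival time of the previous packet

    def allow(self, now):
        """Return True if a SYN arriving at time `now` (seconds) is accepted."""
        if self.last is not None:
            elapsed = now - self.last
            # Refill for the time that passed, never beyond the burst size.
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def simulate(arrival_times, rate=4.0, burst=10):
    """Return one accept/drop decision per SYN arrival time (sorted, seconds)."""
    bucket = TokenBucket(rate, burst)
    return [bucket.allow(t) for t in arrival_times]


if __name__ == "__main__":
    # 15 SYNs at t=0, then 5 more one second later.
    decisions = simulate([0.0] * 15 + [1.0] * 5)
    print(decisions.count(True), "accepted,", decisions.count(False), "dropped")
```

In this run the first 10 SYNs pass on the initial burst, the next 5 are dropped, and one second later only 4 of the 5 new SYNs pass, because exactly 4 credits have been refilled.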
My first question is whether this flood of client connections is coming from a few IP addresses or many, and whether it's coming from Tor relays or not.
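One way to approach the "few addresses or many" question is to tally SYNs per source address. The Python sketch below assumes the source addresses have already been extracted from whatever logs are available (no particular log format is implied) and that a set of known relay addresses can be obtained separately; `summarize_sources` and its parameters are hypothetical names, and the sample addresses come from documentation ranges.

```python
from collections import Counter


def summarize_sources(src_ips, relay_ips=frozenset(), top_n=3):
    """Summarize how concentrated a list of SYN source addresses is.

    src_ips   -- iterable of source address strings, one per observed SYN
    relay_ips -- set of addresses known to be Tor relays (assumed to be
                 supplied by the caller, e.g. taken from a consensus)
    """
    counts = Counter(src_ips)
    total = sum(counts.values())
    top = counts.most_common(top_n)
    from_relays = sum(n for ip, n in counts.items() if ip in relay_ips)
    return {
        "total_syns": total,
        "distinct_sources": len(counts),
        "top": top,
        # Fraction of all SYNs sent by the top_n busiest sources.
        "top_share": sum(n for _, n in top) / total if total else 0.0,
        # Fraction of all SYNs sent by known relays.
        "relay_share": from_relays / total if total else 0.0,
    }


if __name__ == "__main__":
    sample = ["192.0.2.1"] * 8 + ["192.0.2.2"] + ["198.51.100.7"]
    print(summarize_sources(sample, relay_ips={"192.0.2.2"}, top_n=1))
```

A high `top_share` with few distinct sources would point to a handful of aggressive hosts, while a low one would suggest many clients each connecting a little.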
I was lucky enough to catch a "storm" just starting a couple mornings ago, and am going to try to dissect the logs and my realtime observations and provide a report - I expect it'd be useful to more than just me and my single-board computer project.
After watching the data, I noticed that some hosts just try to connect once or twice, or try to connect (during overload conditions) at reasonable intervals of tens of seconds to a few minutes. Other hosts will quadruple-tap the ORPort with SYNs, four in a row, and otherwise be much more aggressive with sending SYNs.
Sounds like you are seeing variations in TCP implementations.
Yep, that's what I figured.
Currently, if a peer violates the 4/sec burst 10 SYN limit more than 5 times in 60 seconds, that peer will be banned for 90 seconds. I'm trying to trim this down to the minimum that will protect the relay, and 90 seconds is a guess given some of my fears, read on...
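The ban policy described above (more than 5 violations within 60 seconds leads to a 90-second ban) can be sketched as a sliding window per peer. This is a minimal Python illustration with hypothetical names, not the fail2ban or iptables configuration actually in use; in particular, clearing a peer's history when it is banned is an assumption.

```python
from collections import defaultdict, deque


class BanTracker:
    """Ban a peer that exceeds `max_violations` within `window` seconds."""

    def __init__(self, max_violations=5, window=60.0, ban_time=90.0):
        self.max_violations = max_violations
        self.window = window
        self.ban_time = ban_time
        self.violations = defaultdict(deque)  # peer -> recent violation times
        self.banned_until = {}                # peer -> time the ban expires

    def is_banned(self, peer, now):
        return self.banned_until.get(peer, float("-inf")) > now

    def record_violation(self, peer, now):
        """Record one rate-limit violation; return True if the peer is banned."""
        if self.is_banned(peer, now):
            return True
        times = self.violations[peer]
        times.append(now)
        # Forget violations that have fallen out of the sliding window.
        while times and now - times[0] > self.window:
            times.popleft()
        if len(times) > self.max_violations:  # "more than 5 times"
            self.banned_until[peer] = now + self.ban_time
            times.clear()  # assumption: history resets once a ban starts
            return True
        return False


if __name__ == "__main__":
    tracker = BanTracker()
    for t in range(6):
        banned = tracker.record_violation("192.0.2.1", float(t))
    print("banned after 6 violations:", banned)
```

With these numbers, a peer that violates the limit once every 20 seconds never has more than four violations inside the window and is never banned, while six violations in six seconds trigger a ban that lasts until 90 seconds after the sixth.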
That brings up a second question: if you *do* let them establish a TLS connection with you, do they stop hammering you? Or do they always want more? How long until they hang up on a connection that you allow to establish?
I'm not entirely sure yet, and I need to do some log-data crunching. Do you know offhand how long it will take Tor to give up on connecting to a peer if it seems down for a while?
First, during a SYN-flood-type overload, some peers which have *existing* circuits built through the relay, and are sending SYNs as normal traffic, will stochastically get "caught" in the filter and banned for a short time.
Wait, what? SYN packets are not part of normal traffic for an established connection.
I incorrectly assumed that new circuit requests began with a TCP handshake. However, *if* the relay were being flooded, and a peer that was already connected to the relay happened to send 4 SYN packets which arrived after other hosts had exceeded the limit for that given second, the unlucky peer would still get banned. David Serrano suggested an amendment to my iptables rules, which I've implemented, which *may* immunize ESTABLISHED connections from the fail2ban ban; he's helping me work out whether that actually works or not.
What would be good to know from you is how often already-connected peers would be TCP handshaking to a relay's ORPort or DirPort.
So here's the $64,000 question:
If a Tor relay has a circuit built through a peer, and the peer starts dropping 100% of packets, how long will it take before the relay with the circuit "gives up" on the circuit and tears it down?
That depends on the TCP implementation on both sides. I imagine the answer varies widely. Which probably isn't what you wanted to hear.
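To illustrate why the answer depends on the TCP implementation, here is a small Python sketch that adds up retransmission timeouts under exponential backoff. The default parameters resemble Linux values (200 ms minimum RTO, 120 s maximum RTO, 15 retransmissions of unacknowledged data) and are assumptions for illustration only; other stacks use different values, and a real connection's RTO depends on its measured round-trip time, so the result is best read as a rough lower bound.

```python
def total_give_up_time(initial_rto=0.2, max_rto=120.0, retries=15):
    """Seconds until TCP gives up on unacknowledged data.

    Models the original transmission plus `retries` retransmissions,
    waiting one timeout after each; the timeout doubles every time and
    is capped at `max_rto`.
    """
    total = 0.0
    rto = initial_rto
    for _ in range(retries + 1):
        total += min(rto, max_rto)
        rto *= 2
    return total


if __name__ == "__main__":
    print(round(total_give_up_time(), 1))            # Linux-like assumptions
    print(round(total_give_up_time(retries=5), 1))   # a less patient stack
```

Under these assumptions the connection is abandoned after roughly 924.6 seconds, a little over 15 minutes; lowering the retry count to 5 shrinks that to 12.6 seconds, which shows how widely the answer can vary between configurations.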
Is there not a piece in Tor's connect-to-peer code which says "try for N seconds, or P retries, then give up"?
Thanks much for your input.
- -Gordon M.