On 30 May (09:49:26), David Goulet wrote:
Greetings!
[snip]
Hi everyone,
I'm writing to give an update on where we are with the feature that rate limits introductions at the intro point.
The branch for #15516 (https://trac.torproject.org/15516), which implements a simple rate/burst combo for controlling the number of INTRODUCE2 cells relayed to the service, is ready to be merged upstream.
As previously detailed in this thread, the default values are a rate of 25 introductions per second and a burst of 200 per second. These values can be controlled by consensus parameters, meaning they can be changed network-wide.
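To make the rate/burst combo concrete, here is a minimal sketch of a generic token bucket using the defaults above. This is not the code from the #15516 branch; the class and method names are made up for illustration, and I'm assuming the bucket starts full and refills continuously.

```python
class IntroRateLimiter:
    """Generic token bucket (hypothetical names, not tor's implementation).

    Refills at `rate` tokens per second and never holds more than `burst`.
    """

    def __init__(self, rate=25, burst=200, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # assumption: the bucket starts full
        self.last = now

    def allow(self, now):
        """Return True if one INTRODUCE2 cell may be relayed at time `now` (seconds)."""
        if now > self.last:
            # Refill for the elapsed time, capped at the burst size.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.burst, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the intro point would answer with a NACK


limiter = IntroRateLimiter()
# A flood of 1000 requests in the same instant: only the burst gets through.
print(sum(limiter.allow(0.0) for _ in range(1000)))
# One second later, one second's worth of rate has been refilled.
print(sum(limiter.allow(1.0) for _ in range(100)))
```

With these defaults, a sustained flood is cut down to the configured rate once the initial burst has been spent.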
We first asked big service operators. I'm not going to detail the values they provided us in private, but from what they gave us, those defaults look large enough to sustain heavy traffic.
The second thing we did was experimental testing to see how CPU usage and availability are affected. We tested this with 3 _fast_ introduction points and then with 3 rate limited introduction points.
The good news is that once the attack stops, the rain of introduction requests to the service stops very quickly.
With the default rate/burst values, on an Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz (8 cores), the tor service's CPU usage doesn't go above ~60% (on a single core), and it drops almost to 0 as soon as the attack ends.
The bad news is that availability is _not_ improved. One of the big reasons is that the rate limit defenses, once engaged at the intro point, send back a NACK to the client. A vanilla tor client will stop using that introduction point for 120 seconds if it gets 3 NACKs from it. This leads to tor quickly giving up on trying to connect and thus telling the client that connecting to the .onion is impossible.
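As a rough illustration of why a vanilla client gives up so fast, here is a simplified model of the behavior described above (3 NACKs, then 120 seconds of avoiding that intro point). It is not tor's client code; the names are hypothetical, and resetting the counter after a block is my assumption.

```python
NACK_LIMIT = 3         # NACKs before an intro point is set aside (from the post)
BACKOFF_SECONDS = 120  # how long it is then avoided (from the post)


class IntroPointTracker:
    """Hypothetical per-client bookkeeping of NACKs per intro point."""

    def __init__(self):
        self.nacks = {}          # intro point id -> NACKs seen so far
        self.blocked_until = {}  # intro point id -> time it becomes usable again

    def record_nack(self, ip_id, now):
        count = self.nacks.get(ip_id, 0) + 1
        if count >= NACK_LIMIT:
            self.blocked_until[ip_id] = now + BACKOFF_SECONDS
            count = 0  # assumption: the counter restarts after a block
        self.nacks[ip_id] = count

    def usable(self, intro_points, now):
        """Intro points the client would still try at time `now`."""
        return [ip for ip in intro_points
                if self.blocked_until.get(ip, 0.0) <= now]


tracker = IntroPointTracker()
intro_points = ["ip-a", "ip-b", "ip-c"]
for ip in intro_points:
    for _ in range(NACK_LIMIT):
        tracker.record_nack(ip, now=0.0)
# With every intro point rate limiting, nothing is left to try.
print(tracker.usable(intro_points, now=1.0))
```

With only 3 intro points, 9 NACKs in total are enough for the client to run out of options, which matches the quick failure we saw.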
We've hacked a tor client to play along and keep retrying after NACKs instead of giving up, to see how much time it would take to reach the service. On average, a client needed roughly 70 seconds and more than 40 NACKs.
However, it varied a _lot_ during our experiments with many outliers from 8 seconds with 1 NACK up to 160 seconds with 88 NACKs. (For this, the SocksTimeout had to be bumped quite a bit).
One avenue of improvement here is to make the intro point send a specific NACK reason (like "Under heavy load" or ...), which would make the client treat it as "I should retry soon-ish" and thus possibly allow the client to connect after many seconds (or until the SocksTimeout).
More bad news there! We can't do that anytime soon because of this bug that basically crashes clients if an unknown status code (that is, a new NACK value) is sent back: https://trac.torproject.org/30454. So yeah... quite unfortunate there but also a superb reason for everyone out there to upgrade :).
One piece of good news is that having fast intro points instead of slow ones doesn't seem to change much in the overall load on the service, so for now our experiment shows it doesn't matter.
Overall, this rate limit feature does two things:
1. Reduce the overall network load.
Soaking the introduction requests at the intro point helps avoid the service creating pointless rendezvous circuits which makes it "less" of an amplification attack.
2. Keep the service usable.
The tor daemon doesn't go into massive CPU load and thus can actually be used properly during the attack.
The problem with (2) is the availability part: without lots of luck, it is close to impossible for a legit client running a vanilla tor to reach the service. However, if, let's say, the tor daemon were configured with 2 .onion addresses, one public and the other private with client authorization, then the second .onion would be totally usable because the tor daemon would not be CPU overloaded.
The third thing we did: in order to make this feature a bit more "malleable", we are working on https://trac.torproject.org/30924, which is proposal 305.
In short, torrc options are added so an operator can change the rate/burst that the intro points will use. We can do that using the ESTABLISH_INTRO cell that will have an extension to define the DoS defense parameters (proposal 305).
That way, a service operator can disable this feature, or turn the knobs on the rate/burst in order to basically adjust the defenses.
At this point in time, we don't have a good grasp of what happens in terms of CPU if the rate or the burst is bumped up, or even how availability is affected. During our experimentation, we did observe a "sort of" linear progression between CPU usage and rate. But we barely scratched the surface since the rate was only changed from 25 to 50 to 75 and that is it.
We would require much more experimentation which is something we want to avoid as much as possible on the real network.
Finally, many more changes are cooking. One in particular is https://trac.torproject.org/projects/tor/ticket/26294, which will make tor rotate its intro points only once the number of introduction requests reaches a random value between 150k and 300k; currently that value is between 16k and 32k. See the ticket for the benefits here, which mostly help with (1).
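For illustration only, the rotation rule can be sketched like this. The function names are hypothetical, and treating the range as inclusive with a uniform pick is my assumption, not something taken from the ticket.

```python
import random


def pick_rotation_threshold(low=150_000, high=300_000, rng=random):
    """Pick the number of introductions after which an intro point is rotated.

    Assumption: uniform and inclusive on both ends. The current range
    would be low=16_000, high=32_000.
    """
    return rng.randint(low, high)


def should_rotate(intro_count, threshold):
    """True once the intro point has handled `threshold` introductions."""
    return intro_count >= threshold


threshold = pick_rotation_threshold()
print(should_rotate(100_000, threshold))  # below the proposed range, so no rotation yet
```

At the default rate of 25 introductions per second, 16k requests take a bit under 11 minutes to reach, while 150k take 100 minutes, so rotations under attack would be far less frequent.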
There has been much talk about a client PoW (see the proposal 305 thread on this list) which in theory would help out with service availability.
We will also soon merge upstream this ticket, https://trac.torproject.org/24962, which goes one step further in denying single-hop connections to the HSDir/Intro in order to shut down, as much as possible, the Tor2web connections (or any attacker that speeds things up on their side by single hopping).
We are making progress here... This is really a non-trivial problem, and solutions for service availability are not that simple. Our priority is to protect the network as much as possible and then move to possible solutions for availability.
I'll stop for now. Huge thanks to everyone who provided service logs, ideas, and code review, and to future testers :).
Cheers! David