On 06 Jun (20:03:52), George Kadianakis wrote:
David Goulet dgoulet@torproject.org writes:
Greetings!
<snip>
Hello, I'm here to brainstorm about this suggested feature. I don't have a precise plan forward here, so I'm just talking.
Unfortunately, our circuit-level flow control does not apply to the service introduction circuit, which means that the intro point is allowed, by the Tor protocol, to send an arbitrarily large number of cells down the circuit. For the service, this means that even after the DoS has stopped, it will still receive massive amounts of cells, because some are either in flight on the circuit or queued at the intro point ready to be sent towards the service.
== SENDME vs. token bucket
So it seems like we are going with a token bucket approach (#15516) to rate-limit introduce cells, even though the rest of the Tor protocol uses SENDME cells. Are we reinventing the wheel here?
I see these as two different approaches.
Relying on the flow control protocol here is nice in practice because the intro point would not relay anything until the service asks for more data. But this is often influenced by circuit latency. It could be that the service could handle 10 times what it received, but because the SENDME takes 2 seconds to reach the intro point, we lose precious "work time".
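To put a rough number on the latency point (a back-of-the-envelope estimate, assuming the standard 1000-cell circuit window and 100-cell SENDME increments were applied to the intro circuit): once the intro point has 1000 unacknowledged cells outstanding it has to stall, and with roughly 2 seconds each way between service and intro point, the pipe tops out around 1000 cells per ~4 second round trip, i.e. on the order of 250 cells per second, no matter how much more the service could actually process locally.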
I think if we rely on flow control, it will severely impact very popular hidden services that have a nice OnionBalance setup and all. I have no numbers to back that up, but that is my intuition.
The token bucket approach is more flexible, _especially_ with the idea of having the ESTABLISH_INTRO cell carry parameters for the token bucket knobs.
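To make the token bucket idea concrete, here is a minimal sketch (in C) of the kind of accounting an intro point could do per introduction circuit, where rate_per_sec and burst are exactly the knobs an ESTABLISH_INTRO extension could carry. The names, the one-second refill granularity and the example values are assumptions for illustration, not the actual #15516 code:

    /* Minimal token-bucket sketch for rate-limiting INTRODUCE1 cells at
     * the intro point. Illustrative only; names and defaults are made up. */

    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    typedef struct intro_dos_bucket_t {
      uint32_t rate_per_sec;   /* Tokens added per second (e.g. 50). */
      uint32_t burst;          /* Maximum bucket size (e.g. 200). */
      uint32_t tokens;         /* Tokens currently available. */
      time_t   last_refill;    /* Last time the bucket was refilled. */
    } intro_dos_bucket_t;

    /* Refill the bucket according to elapsed time, capping at burst. */
    static void
    bucket_refill(intro_dos_bucket_t *b, time_t now)
    {
      time_t elapsed = now - b->last_refill;
      if (elapsed <= 0)
        return;
      uint64_t new_tokens =
        (uint64_t)b->tokens + (uint64_t)elapsed * b->rate_per_sec;
      b->tokens = (new_tokens > b->burst) ? b->burst : (uint32_t)new_tokens;
      b->last_refill = now;
    }

    /* Called for every INTRODUCE1 arriving at the intro point. Returns true
     * if the cell may be relayed to the service, false if it should be
     * refused (see the client behavior section below). */
    static bool
    intro_dos_cell_allowed(intro_dos_bucket_t *b, time_t now)
    {
      bucket_refill(b, now);
      if (b->tokens == 0)
        return false;
      b->tokens--;
      return true;
    }

The nice property compared to SENDME-style flow control is that nothing here depends on a round trip back to the service: the drop/relay decision is purely local to the intro point.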
That being all said, our short-term goal here is to add INTRODUCE2
rate-limiting (similar to the Guard DoS subsystem deployed early last year) *at* the intro point, but much simpler. The goal is to soak up the introduction load directly at the intro points, which would help reduce the load on the network overall and thus preserve its health.
== We need to understand the effects of this feature:
First of all, the main thing to note here is that this is a feature that primarily intends to improve network health against DoS adversaries. It achieves this by greatly reducing the amount of useless rendezvous circuits opened by the victim service, which in turn improves the health of guard nodes (when guard nodes break, circuits start retrying endlessly, and hell begins).
We don't know how this feature will impact the availability of an attacked service. Right now, my hypothesis is that even with this feature enabled, an attacked service will remain unusable. That's because an attacker who spams INTRO1 cells will always saturate the intro point, and innocent clients with a browser will be very unlikely to get service (kinda like sitting under a waterfall and trying to fill a glass with your spit). That said, with this defense the service won't be at 100% CPU, so perhaps innocent clients who manage to sneak in will get service, whereas right now they don't at all.
IMO, it's very important to understand exactly how this feature will impact the availability of the service: if it does not help the availability of the service, then victim operators will be incentivized to disable it (or crank up the limits), which means we will not improve the health of the network, which is our primary goal here.
This is an experiment we can easily run. Saturate a service's intro points (which we control) and run a client in a loop trying to reconnect, then look at the success rate. I'm expecting very, very low reachability, but who knows, we could be surprised, and at least we'll have data points.
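As a sketch of the client side of that experiment (assuming a local tor with its SOCKS port on 9050 and using libcurl; the onion address and attempt count are placeholders):

    /* Rough reachability probe: try to fetch the service N times through a
     * local tor SOCKS port and report the success rate. Illustrative only. */

    #include <stdio.h>
    #include <curl/curl.h>

    #define ATTEMPTS 100
    #define ONION_URL "http://yourhsaddress.onion/"  /* placeholder */

    int
    main(void)
    {
      int ok = 0;
      curl_global_init(CURL_GLOBAL_DEFAULT);

      for (int i = 0; i < ATTEMPTS; i++) {
        CURL *curl = curl_easy_init();
        if (!curl)
          break;
        curl_easy_setopt(curl, CURLOPT_URL, ONION_URL);
        /* socks5h:// so the .onion name is resolved by tor, not locally. */
        curl_easy_setopt(curl, CURLOPT_PROXY, "socks5h://127.0.0.1:9050");
        curl_easy_setopt(curl, CURLOPT_TIMEOUT, 60L);
        curl_easy_setopt(curl, CURLOPT_NOBODY, 1L); /* HEAD is enough. */
        if (curl_easy_perform(curl) == CURLE_OK)
          ok++;
        curl_easy_cleanup(curl);
      }

      printf("reachability: %d/%d\n", ok, ATTEMPTS);
      curl_global_cleanup();
      return 0;
    }

A real run would also want to isolate attempts from each other (for example by sending a NEWNYM signal on the control port between tries) so that retries don't simply reuse the same circuits and skew the success rate.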
== Why are we doing all this?
Another thing I wanted to mention here is the second-order effect we are facing. The only reason we are doing all this is that attackers are incentivized to attack onion services. Perhaps the best thing we could do here is to create tools that make denial of service attacks less effective against onion services, which would make attackers stop performing them, and hence we won't need to implement rate limits to protect the network in case they do. Right now the best things we have in that direction are the incomplete-but-plausible design of [0] and the inelegant 1b from [1].
This is especially true since, to get this rate-limiting feature deployed to the whole network, we need all relays (intro points) to upgrade to the new version, so we are looking at years in the future anyway.
https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
My two cents here are that all those features could complement each other over time. A proof-of-work scheme and rate limiting can work well together.
But at this juncture in time, what I want most to be fixed is the fact that services are used for an amplification attack. This was disastrous during the 2018 DDoS, saturating Guard nodes constantly. We fixed this by adding DoS defenses at the Guard level, which stopped the client madness, but not the service side of things.
Soaking up the huge loads at the intro point is a good, easy avenue for us to pursue and has a very direct impact on the health of the network. And it is always something we can disable with a consensus parameter if shit hits the fan with it.
One naive approach is to see how many cells an attacker can send towards a service. George and I have conducted an experiment where, with 10 *modified* tor clients bombarding a service at a much faster rate than 1 per second (what vanilla tor does if asked to connect a lot), we see ~15000 INTRODUCE2 cells at the service in 1 minute. This varies in the thousands depending on different factors, but overall that is a good average for our experiment.
This means that 15000/60 = 250 cells per second.
Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we could set a rate of, let's say, a fifth of that, meaning 50 cells per second, with a burst of 200.
Over the normal 3 intro points a service has, this means 150 introductions per second are allowed, with a burst of 600 in total. In other words, 150 clients can reach the service every second, up to a burst of 600 at once. This will probably ring alarm bells for very popular services that get 1000+ users a second, so please check the next section.
I'm not that excited about hardcoded network-wide values, which is why the next section is more exciting, but much more work for us!
Yes, I'm also very afraid of imposing network-wide values here. What happens to hypothetical onion services that outperform the hard limits we impose here, even when they are not DoSed? The limits above are extremely low compared to normal busy websites on the clearnet, so by activating them we are basically putting hard limits on the adoption of onion services.
Perhaps that's something we want to do anyway, because not knowing how many clients an onion service can support is also not ideal, but we should really think twice (and then again twice) before doing it, and also talk to some people who manage busy sites in the onion space and outside of it.
They need to be at least consensus parameters, so the entire network can adapt if the default values end up being very bad or, worse, ineffective.
The second thing is that I'm thinking more and more that this feature is not complete/useful without a way for the service operator to have control over those knobs. Fortunately, we have #30790 in the pipe for this.
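For illustration, the operator-facing knobs #30790 is after might end up looking something like this in a torrc (the DoS option names here are guesses at this point, not a committed interface; only HiddenServiceDir and HiddenServicePort are existing options):

    HiddenServiceDir /var/lib/tor/my_service
    HiddenServicePort 80 127.0.0.1:8080
    # Hypothetical per-service DoS knobs, mirroring the rate/burst above.
    HiddenServiceEnableIntroDoSDefense 1
    HiddenServiceEnableIntroDoSRatePerSec 50
    HiddenServiceEnableIntroDoSBurstPerSec 200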
== What about false positives?
Also, given that the rate limiting happens at the intro point layer here, how does a service learn that it's getting DoSed? Are we looking at a special IP->HS cell that says "we are throttling your clients"? How much do we want to overengineer here?
For now, it would go unnoticed by the operator, which I'm not that worried about. The likely scenario here is that users start complaining to the service operator that they can't reach it.
== What's the ideal client behavior when the limit gets hit?
So given that these hard limits can be hit quite easily by an attacker, what is the client behavior when they get hit? Will normal clients keep retrying intro points until they get service, continuously extending their circuits? This behavior is particularly important for the availability of the service under this feature.
The code right now, in #15516, will send a NACK. The reason is that we want legit clients to re-extend and not create a new intro circuit: it's more efficient and puts less pressure on the network.
After getting NACKed by all introduction points, the client will stop retrying. It will be allowed to retry when the "failure cache" cleans up, which right now is a 5 minute timeout, or if new intro points are found in a new descriptor.
I'm in favor of the re-extend option here, which is the normal behavior clients encounter in ordinary circumstances, and also the one that creates the least pressure on the network.
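As a rough model of that client-side behavior (a self-contained sketch with made-up names and a fixed 3 intro points, not the actual #15516 patch):

    #include <stdbool.h>
    #include <time.h>

    #define NUM_INTRO_POINTS 3
    #define FAILURE_CACHE_TIMEOUT (5 * 60) /* seconds */

    /* Per-intro-point record of the last NACK we got (0 = never NACKed). */
    static time_t last_nack[NUM_INTRO_POINTS];

    /* An intro point is usable if it has not NACKed us within the
     * failure cache window. */
    static bool
    intro_point_usable(int idx, time_t now)
    {
      return last_nack[idx] == 0 ||
             now - last_nack[idx] >= FAILURE_CACHE_TIMEOUT;
    }

    /* Called when intro point `idx` NACKs us. Returns the index of another
     * intro point to re-extend the existing circuit to, or -1 if every
     * intro point is in the failure cache and we should stop retrying
     * until the cache expires or a new descriptor lists fresh intro
     * points. */
    static int
    on_intro_nack(int idx, time_t now)
    {
      last_nack[idx] = now;
      for (int i = 0; i < NUM_INTRO_POINTS; i++) {
        if (intro_point_usable(i, now))
          return i;
      }
      return -1;
    }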
Cheers! David