David Goulet <dgoulet@torproject.org> writes:
Greetings!
<snip>
Hello, I'm here to brainstorm about this suggested feature. I don't have a precise plan forward here, so I'm just talking.
Unfortunately, our circuit-level flow control does not apply to the service introduction circuit, which means that the intro point is allowed, by the Tor protocol, to send an arbitrarily large number of cells down the circuit. For the service, this means that even after the DoS has stopped, it would still receive massive amounts of cells, because some are either in flight on the circuit or queued at the intro point, ready to be sent towards the service.
== SENDME VS Token bucket
So it seems like we are going with a token bucket approach (#15516) to rate-limit introduce cells, even though the rest of the Tor protocol uses SENDME cells. Are we reinventing the wheel here?
That being all said, our short-term goal here is to add INTRODUCE2 rate-limiting (similar to the Guard DoS subsystem deployed early last year) *at* the intro point but much simpler. The goal is to soak up the introduction load directly at the intro points which would help reduce the load on the network overall and thus preserve its health.
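To make the token bucket idea concrete, here is a minimal sketch of how an intro point could decide whether to relay or drop an introduction cell. This is not tor's implementation: the class name, the method name, and the rate/burst values are all hypothetical and only illustrate the general technique.

```python
class TokenBucket:
    """Generic token bucket (hypothetical sketch, not tor's code)."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate            # tokens refilled per second
        self.burst = burst          # maximum tokens the bucket can hold
        self.tokens = float(burst)  # start with a full bucket
        self.last = now             # timestamp of the last refill

    def allow(self, now):
        """Return True if one cell may be relayed at time `now` (seconds)."""
        elapsed = max(0.0, now - self.last)
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the cell would be dropped


# Illustrative values only: 50 cells/second with a burst of 200.
bucket = TokenBucket(rate=50, burst=200)
relayed_at_t0 = sum(bucket.allow(0.0) for _ in range(300))
relayed_at_t1 = sum(bucket.allow(1.0) for _ in range(100))
print(relayed_at_t0, relayed_at_t1)
```

In this sketch, a flood of 300 cells arriving at once only gets the burst through, and one second later only the refilled tokens are available, regardless of how many cells are queued.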
== We need to understand the effects of this feature:
First of all, the main thing to note here is that this is a feature that primarily intends to improve network health against DoS adversaries. It achieves this by greatly reducing the number of useless rendezvous circuits opened by the victim service, which then improves the health of guard nodes (when guard nodes break, circuits start retrying endlessly, and hell begins).
We don't know how this feature will impact the availability of an attacked service. Right now, my hypothesis is that even with this feature enabled, an attacked service will remain unusable. That's because an attacker who spams INTRODUCE1 cells will always saturate the intro point, and innocent clients with a browser will be very unlikely to get service (kinda like sitting under a waterfall and trying to fill a glass with your spit). That said, with this defense, the service won't be at 100% CPU, so perhaps innocent clients who manage to sneak in will get service, whereas now they don't anyhow.
IMO, it's very important to understand exactly how this feature will impact the availability of the service: If this feature does not help the availability of the service, then victim operators will be incentivized to disable the feature (or crank up the limits) which means that we will not improve the health of the network, which is our primary goal here.
---
== Why are we doing all this?
Another thing I wanted to mention here is the second-order effect we are facing. The only reason we are doing all this is because attackers are incentivized to attack onion services. Perhaps the best thing we could do here is to create tools to make denial of service attacks less effective against onion services, which would make attackers stop performing them, and hence we won't need to implement rate-limits to protect the network in case they do. Right now the best things we have in that direction are the incomplete-but-plausible design of [0] and the inelegant 1b from [1].
This is especially true since to get this rate-limiting feature deployed to the whole network we need all relays (intro points) to upgrade to the new version so we are looking at years in the future anyway.
[0]: https://lists.torproject.org/pipermail/tor-dev/2019-May/013849.html
     https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
[1]: https://lists.torproject.org/pipermail/tor-dev/2019-April/013790.html
One naive approach is to see how many cells an attack can send towards a service. George and I have conducted an experiment where, with 10 *modified* tor clients bombarding a service at a much faster rate than 1 per second (what vanilla tor does if asked to connect a lot), we see ~15000 INTRODUCE2 cells at the service in 1 minute. This varies by thousands depending on different factors, but overall that is a good average for our experiment.
This means that 15000/60 = 250 cells per second.
Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we can put a rate per second of, let's say, a fifth of that, meaning 50, and a burst of 200.
Over the normal 3 intro points a service has, it means 150 introductions per second are allowed, with a burst of 600 in total. Or in other words, 150 clients can reach the service every second, up to a burst of 600 at once. This will probably ring alarm bells for very popular services that probably get 1000+ users a second, so please check the next section.
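The arithmetic above can be double-checked with a short sketch. The function and variable names are hypothetical; the numbers are the ones quoted in the text (15000 cells in 60 seconds, a fifth of the observed rate, a burst of 200, and 3 intro points).

```python
def per_intro_limits(observed_cells, window_seconds, fraction, burst):
    """Derive a per-intro-point rate from an observed attack rate."""
    attack_rate = observed_cells // window_seconds  # cells per second
    rate = attack_rate // fraction                  # e.g. a fifth of the attack rate
    return attack_rate, rate, burst


def service_wide_limits(rate, burst, n_intro_points):
    """Aggregate the per-intro-point limits over all intro points."""
    return rate * n_intro_points, burst * n_intro_points


attack_rate, rate, burst = per_intro_limits(15000, 60, fraction=5, burst=200)
total_rate, total_burst = service_wide_limits(rate, burst, n_intro_points=3)

print("attack rate:", attack_rate, "cells/s")
print("per intro point:", rate, "cells/s, burst", burst)
print("service-wide:", total_rate, "cells/s, burst", total_burst)
```

Note that the service-wide totals scale linearly with the number of intro points, so a service configured with more intro points would see proportionally higher aggregate limits under the same per-intro-point values.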
I'm not that excited about hardcoded network-wide values so this is why the next section is more exciting but much more work for us!
Yes, I'm also very afraid of imposing network-wide values here. What happens to hypothetical onion services that outperform the hard limits we impose here, even when they are not DoSed? The limits above are extremely low compared to normal busy websites on the clearnet, so by activating them we are basically putting hard limits on the adoption of onion services.
Perhaps that's something we want to do anyway, because not knowing how many clients an onion service can support is also not ideal, but we should really think twice (and then again twice) before doing it and also talk to some people who manage busy sites in the onionspace and outside of it.
== What about false positives?
Also given that the rate limiting happens on the intro point layer here, how does a service learn that it's getting DoSed? Are we looking at a special IP->HS cell that says "we are throttling your clients"? How much to overengineer here?
== What's the ideal client behavior when the limit gets hit?
So given that these hard limits can be hit quite easily by an attacker, what is the client behavior when they get hit? Will normal clients keep on retrying intro points until they get service, and continuously extending their circuits? This behavior is particularly important for the availability of the service under this feature.
---
These are some thoughts I have about this. As you can see I'm also confused and thinking about this topic :)