[tor-dev] Onion Service - Intropoint DoS Defenses

Thu Jun 6 17:03:52 UTC 2019

David Goulet <dgoulet at torproject.org> writes:

> Greetings!
>
> <snip>
>

Hello, I'm here to brainstorm about this suggested feature. I don't have
a precise plan forward here, so I'm just talking.

> Unfortunately, our circuit-level flow control does not apply to the
> service introduction circuit which means that the intro point is
> allowed, by the Tor protocol, to send an arbitrary large amount of cells
> down the circuit.  This means for the service that even after the DoS
> has stopped, it would still receive massive amounts of cells because
> some are either inflight on the circuit or queued at the intro point
> ready to be sent (towards the service).
> 

== SENDME VS Token bucket

So it seems like we are going with a token bucket approach (#15516) to
rate-limit introduce cells, even though the rest of the Tor protocol is
using SENDME cells. Are we reinventing the wheel here?
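
For concreteness, here is a minimal sketch of what a token bucket at the
intro point could look like. This is not tor's actual code: the class
and function names are made up for illustration, and the defaults
(rate 50, burst 200) are simply the values floated in the quoted
proposal further down this thread.

```python
class TokenBucket:
    """Hypothetical INTRODUCE2 limiter for one intro circuit (illustration only)."""

    def __init__(self, rate, burst):
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum number of tokens the bucket can hold
        self.tokens = burst   # start full, so an initial burst is allowed
        self.last = 0.0       # time of the last refill, in seconds

    def allow(self, now):
        """Return True if a cell arriving at time `now` may be relayed."""
        elapsed = now - self.last
        self.last = now
        # Refill for the elapsed time, never exceeding the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # bucket empty: the cell would be dropped


def simulate(cells_per_second, seconds, rate=50, burst=200):
    """Feed evenly spaced cells into one bucket; return (relayed, total)."""
    bucket = TokenBucket(rate, burst)
    total = cells_per_second * seconds
    relayed = sum(bucket.allow(i / cells_per_second) for i in range(total))
    return relayed, total


if __name__ == "__main__":
    print(simulate(250, 60))  # sustained flood for one minute
    print(simulate(10, 10))   # light load, well under the rate
```

Under these assumptions, a sustained flood gets cut down to the initial
burst plus roughly `rate` cells per second, while a load below the rate
passes untouched. Note that the bucket only drops cells locally; unlike
SENDME it carries no feedback from the service.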

> That being all said, our short-term goal here is to add INTRODUCE2
> rate-limiting (similar to the Guard DoS subsystem deployed early last year)
> *at* the intro point but much simpler. The goal is to soak up the introduction
> load directly at the intro points which would help reduce the load on the
> network overall and thus preserve its health.
>

== We need to understand the effects of this feature

First of all, the main thing to note here is that this is a feature that
primarily intends to improve network health against DoS adversaries. It
achieves this by greatly reducing the amount of useless rendezvous
circuits opened by the victim service, which then improves the health of
guard nodes (when guard nodes break, circuits start retrying endlessly,
and hell begins).

We don't know how this feature will impact the availability of an
attacked service. Right now, my hypothesis is that even with this
feature enabled, an attacked service will remain unusable. That's
because an attacker who spams INTRODUCE1 cells will always saturate the
intro point and innocent clients with a browser will be very unlikely to
get service (kinda like sitting under a waterfall and trying to fill a
glass with your spit). That said, with this defense, the service won't
be pinned at 100% CPU, so perhaps innocent clients who manage to sneak
in will get service, whereas now they don't.

IMO, it's very important to understand exactly how this feature will
impact the availability of the service: If this feature does not help
the availability of the service, then victim operators will be
incentivized to disable the feature (or crank up the limits) which means
that we will not improve the health of the network, which is our primary
goal here.

---

== Why are we doing all this?

Another thing I wanted to mention here is the second-order effect we are
facing. The only reason we are doing all this is because attackers are
incentivized to attack onion services. Perhaps the best thing we
could do here is to create tools to make denial of service attacks less
effective against onion services, which would make attackers stop
performing them, and hence we won't need to implement rate-limits to
protect the network in case they do. Right now the best things we have
in that direction are the incomplete-but-plausible design of [0] and
the inelegant 1b from [1].

This is especially true since to get this rate-limiting feature deployed
to the whole network we need all relays (intro points) to upgrade to the
new version so we are looking at years in the future anyway.

[0]: https://lists.torproject.org/pipermail/tor-dev/2019-May/013849.html
     https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
[1]: https://lists.torproject.org/pipermail/tor-dev/2019-April/013790.html

>
> One naive approach is to see how much cells an attack can send towards a
> service. George and I have conducted experiment where with 10 *modified* tor
> clients bombarding a service at a much faster rate than 1 per-second (what
> vanilla tor does if asked to connect a lot), we see in 1 minute ~15000
> INTRODUCE2 cells at the service. This varies in the thousands depending on
> different factors but overall that is a good average of our experiment.
>
> This means that 15000/60 = 250 cells per second.
>
> Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we can
> put a rate per second of let say a fifth meaning 50 and a burst of 200.
>
> Over the normal 3 intro points a service has, it means 150 introduction
> per-second are allowed with a burst of 600 in total. Or in other words, 150
> clients can reach the service every second up to a burst of 600 at once. This
> probably will ring alarms bell for very popular services that probably gets
> 1000+ users a second so please check next section.
>
> I'm not that excited about hardcoded network wide values so this is why the
> next section is more exciting but much more work for us!
>

Yes, I'm also very afraid of imposing network-wide values here. What
happens to hypothetical onion services that outperform the hard limits
we impose here, even when they are not DoSed? The limits above are
extremely low compared to normal busy websites on the clearnet, so by
activating them we are basically putting hard limits on the adoption of
onion services.
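
To put the quoted numbers side by side, here is the arithmetic as a
small script. The constant and function names are mine, and
`served_fraction` is a deliberately crude steady-state bound that
ignores the burst allowance and assumes clients spread evenly over the
intro points.

```python
INTRO2_PER_MINUTE = 15000  # INTRODUCE2 cells seen in the quoted experiment
INTRO_POINTS = 3           # the normal number of intro points per service

attack_rate = INTRO2_PER_MINUTE // 60         # cells per second during the attack
per_ip_rate = attack_rate // 5                # "a fifth" of that, per intro point
per_ip_burst = 200                            # burst suggested in the quoted text
service_rate = per_ip_rate * INTRO_POINTS     # introductions per second, all IPs
service_burst = per_ip_burst * INTRO_POINTS   # burst across all IPs


def served_fraction(clients_per_second, limit=service_rate):
    """Rough upper bound on the share of introductions that get relayed."""
    return min(1.0, limit / clients_per_second)


if __name__ == "__main__":
    print(attack_rate, per_ip_rate, service_rate, service_burst)
    print(served_fraction(1000))  # the "1000+ users a second" case
```

So under these assumptions a service that legitimately sees 1000
introductions per second would have at most 15% of them relayed, with
no attacker involved at all.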

Perhaps that's something we want to do anyway, because not knowing how
many clients an onion service can support is also not ideal, but we
should really think twice (and then again twice) before doing it and
also talk to some people who manage busy sites in the onionspace and
outside of it.

== What about false positives?

Also given that the rate limiting happens on the intro point layer here,
how does a service learn that it's getting DoSed? Are we looking at a
special IP->HS cell that says "we are throttling your clients"? How much
do we want to overengineer here?

== What's the ideal client behavior when the limit gets hit?

So given that these hard limits can be hit quite easily by an attacker,
what is the client behavior when they get hit? Will normal clients keep
on retrying intro points until they get service, and continuously
extending their circuits? This behavior is particularly important for
the availability of the service under this feature.

---

These are some thoughts I have about this. As you can see I'm also
confused and thinking about this topic :)