[tor-dev] Onion Service - Intropoint DoS Defenses

Fri Jun 7 13:12:23 UTC 2019

On 06 Jun (20:03:52), George Kadianakis wrote:
> David Goulet <dgoulet at torproject.org> writes:
> 
> > Greetings!
> >
> > <snip>
> >
> 
> Hello, I'm here to brainstorm about this suggested feature. I don't have
> a precise plan forward here, so I'm just talking.
> 
> > Unfortunately, our circuit-level flow control does not apply to the
> > service introduction circuit which means that the intro point is
> > allowed, by the Tor protocol, to send an arbitrarily large amount of cells
> > down the circuit.  This means for the service that even after the DoS
> > has stopped, it would still receive massive amounts of cells because
> > some are either inflight on the circuit or queued at the intro point
> > ready to be sent (towards the service).
> > 
> 
> == SENDME VS Token bucket
> 
> So it seems like we are going with a token bucket approach (#15516) to
> rate-limit introduce cells, even tho the rest of the Tor protocol is
> using SENDME cells. Are we reinventing the wheel here?

I see these as two different approaches.

Relying on the flow control protocol here is nice in practice because the
intro point would not relay anything until the service asks for more data. But
the resulting throughput depends on the circuit latency. It could be that the
service could handle 10 times what it received, but because the SENDME takes 2
seconds to reach the intro point, we lose precious "work time".

I think if we rely on the flow control, it will severely impact very popular
hidden services that have a nice OnionBalance setup and all. I have no numbers
to back that up but that is my intuition.
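
To make the latency argument concrete, here is a rough sketch of the bound I
have in mind. It is a simplified model and not how tor computes anything: the
sender stops after a window of cells and resumes only when the acknowledgement
arrives. The 100-cell window is a made-up value for illustration; the 2 second
delay is the one from the paragraph above.

```python
def window_limited_rate(window_cells, ack_delay_seconds):
    """Upper bound on cells/second for a sender that stops after
    `window_cells` and resumes only once an acknowledgement arrives
    `ack_delay_seconds` later (simplified: sending itself takes no time)."""
    if window_cells <= 0 or ack_delay_seconds <= 0:
        raise ValueError("window and delay must be positive")
    return window_cells / ack_delay_seconds


# Hypothetical 100-cell window with the 2 second delay mentioned above.
slow = window_limited_rate(100, 2.0)
# Same hypothetical window on a circuit with a tenth of the delay.
fast = window_limited_rate(100, 0.2)
print(slow, fast)
```

In this model the service's own capacity never appears: the delay alone caps
the rate, and cutting the delay by ten raises the cap by about ten.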

The token bucket approach is more flexible, _especially_ with the idea of
having the ESTABLISH_INTRO cell carry parameters for the token bucket knobs.
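
For readers who have not met the idea, a minimal token bucket sketch follows.
It is an illustration only, not the code in #15516; the class and method names
are made up, and the rate of 50 with a burst of 200 are the example values
floated elsewhere in this thread.

```python
class TokenBucket:
    """Refills at `rate` tokens per second, holding at most `burst` tokens."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = float(now)

    def allow(self, now):
        """Return True if one cell may be relayed at time `now` (seconds)."""
        if now > self.last:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.burst, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit: the cell is not relayed


# Hypothetical usage: 250 cells arrive at once, then 100 more a second later.
bucket = TokenBucket(rate=50, burst=200)
first_wave = sum(bucket.allow(0.0) for _ in range(250))
second_wave = sum(bucket.allow(1.0) for _ in range(100))
print(first_wave, second_wave)
```

The two knobs are independent: `burst` decides how large a spike gets through
at once, `rate` decides the sustained throughput afterwards.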

> 
> > That being all said, our short-term goal here is to add INTRODUCE2
> > rate-limiting (similar to the Guard DoS subsystem deployed early last year)
> > *at* the intro point but much simpler. The goal is to soak up the introduction
> > load directly at the intro points which would help reduce the load on the
> > network overall and thus preserve its health.
> >
> 
> == We need to understand the effects of this feature: 
> 
> First of all, the main thing to note here is that this is a feature that
> primarily intends to improve network health against DoS adversaries. It
> achieves this by greatly reducing the amount of useless rendezvous
> circuits opened by the victim service, which then improves the health of
> guard nodes (when guard nodes break, circuits start retrying endlessly,
> and hell begins).
> 
> We don't know how this feature will impact the availability of an
> attacked service. Right now, my hypothesis is that even with this
> feature enabled, an attacked service will remain unusable. That's
> because an attacker who spams INTRO1 cells will always saturate the
> intro point and innocent clients with a browser will be very unlikely to
> get service (kinda like sitting under a waterfall and trying to fill a
> glass with your spit). That said, with this defense, the service won't
> be 100% CPU, so perhaps innocent clients who manage to sneak in will get
> service, whereas now they don't anyhow.
> 
> IMO, it's very important to understand exactly how this feature will
> impact the availability of the service: If this feature does not help
> the availability of the service, then victim operators will be
> incentivized to disable the feature (or crank up the limits) which means
> that we will not improve the health of the network, which is our primary
> goal here.

This is an experiment we can easily run. Saturate a service's intro points
(that we control) and run a client in a loop trying to reconnect, then look at
the success rate. I'm also expecting very, very low reachability but who knows,
we could be surprised, and at least we'll have data points.
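
The measuring side of that experiment could be as small as the sketch below.
Everything in it is an assumption for illustration: `attempt` stands for
whatever function makes one connection attempt to the service through a tor
client and returns True on success, and the stub used here only replays fixed
outcomes.

```python
def measure_reachability(attempt, attempts):
    """Call `attempt()` `attempts` times and return the fraction of calls
    that returned a true value. An exception counts as a failed attempt."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    successes = 0
    for _ in range(attempts):
        try:
            if attempt():
                successes += 1
        except Exception:
            pass  # connection error or timeout: treat as unreachable
    return successes / attempts


# Stub standing in for a real connection attempt (hypothetical outcomes).
outcomes = iter([True, False, False, False])
rate = measure_reachability(lambda: next(outcomes), attempts=4)
print(rate)
```

Running the same loop with and without the rate limit enabled at the intro
points would give the two data points to compare.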

> 
> ---
> 
> == Why are we doing all this?
> 
> Another thing I wanted to mention here is the second order effect we are
> facing. The only reason we are doing all this is because attackers are
> incentivized to attack onion services. Perhaps the best thing we
> could do here is to create tools to make denial of service attacks less
> effective against onion services, which would make attackers stop
> performing them, and hence we won't need to implement rate-limits to
> protect the network in case they do. Right now the best things we have
> towards that direction are the incomplete-but-plausible design of [0] and
> the inelegant 1b from [1].
> 
> This is especially true since to get this rate-limiting feature deployed
> to the whole network we need all relays (intro points) to upgrade to the
> new version so we are looking at years in the future anyway.
> 
> [0]: https://lists.torproject.org/pipermail/tor-dev/2019-May/013849.html
>      https://lists.torproject.org/pipermail/tor-dev/2019-June/013862.html
> [1]: https://lists.torproject.org/pipermail/tor-dev/2019-April/013790.html

My two cents here are that all those features could complement each other over
time. A proof-of-work scheme and a rate limit can work well together.

But at this juncture in time, what I want most to be fixed is the fact that
services are used for an amplification attack. This was disastrous during the
2018 DDoS, saturating Guard nodes constantly. We fixed this by adding DoS
defenses at the Guard level which stopped the client madness, but not the
service side of things.

Soaking up the huge loads at the intro point is a good, easy avenue for us to
pursue and would have a very direct impact on the health of the network. And
it is always something we can disable with a consensus parameter if shit hits
the fan with it.

> 
> >
> > One naive approach is to see how many cells an attack can send towards a
> > service. George and I have conducted an experiment where, with 10 *modified* tor
> > clients bombarding a service at a much faster rate than 1 per-second (what
> > vanilla tor does if asked to connect a lot), we see in 1 minute ~15000
> > INTRODUCE2 cells at the service. This varies in the thousands depending on
> > different factors but overall that is a good average of our experiment.
> >
> > This means that 15000/60 = 250 cells per second.
> >
> > Considering that this is an absurd amount of INTRODUCE2 cells (maybe?), we can
> > put a rate per second of, let's say, a fifth of that, meaning 50, and a burst of 200.
> >
> > Over the normal 3 intro points a service has, it means 150 introductions
> > per second are allowed with a burst of 600 in total. Or in other words, 150
> > clients can reach the service every second up to a burst of 600 at once. This
> > will probably ring alarm bells for very popular services that probably get
> > 1000+ users a second so please check next section.
> >
> > I'm not that excited about hardcoded network wide values so this is why the
> > next section is more exciting but much more work for us!
> >
> 
> Yes, I'm also very afraid of imposing network wide values here. What
> happens to hypothetical onion services that outperform the hard limits
> we impose here, even when they are not DoSed? The limits above are
> extremely low when we are looking at normal busy websites on the
> clearnet, so by activating them we are basically putting hard limits to
> the adoption of onion services.
> 
> Perhaps that's something we want to do anyway, because not knowing how
> many clients an onion service can support is also not ideal, but we
> should really think twice (and then again twice) before doing it and
> also talk to some people who manage busy sites in the onionspace and
> outside of it.

They need to be at least consensus parameters so the entire network can adapt
if the default values end up being very bad or, worse, ineffective.

Second thing is that I'm thinking more and more that this feature is not
complete/useful without a way for the service operator to have control over
those knobs. Fortunately, we have #30790 in the pipe for this.
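
For reference, the arithmetic behind the numbers quoted above fits in a few
lines. The function names are made up for this sketch; the inputs (15000 cells
in 60 seconds, a fifth of the observed rate, a burst of 200, 3 intro points)
are the ones from the quoted text.

```python
def per_intro_point_rate(observed_cells, observed_seconds, divisor):
    """Observed cells/second at the service, scaled down by `divisor`."""
    return (observed_cells / observed_seconds) / divisor


def service_wide(rate, burst, intro_points):
    """Totals a service sees when every intro point applies the same limits."""
    return rate * intro_points, burst * intro_points


observed = 15000 / 60                            # 250 cells per second
rate = per_intro_point_rate(15000, 60, 5)        # a fifth of that: 50
total_rate, total_burst = service_wide(rate, 200, 3)
print(observed, rate, total_rate, total_burst)
```

Changing `intro_points` or the per-intro-point values shows quickly how far a
given setting is from what a busy service would need.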

> 
> == What about false positives?
> 
> Also given that the rate limiting happens on the intro point layer here,
> how does a service learn that it's getting DoSed? Are we looking at a
> special IP->HS cell that says "we are throttling your clients"? How much
> to overengineer here?

For now, it would go unnoticed by the operator, which I'm not that worried
about. The likely scenario here is that users start complaining to the service
operator that they can't reach it.

> 
> == What's the ideal client behavior when the limit gets hit?
> 
> So given that these hard limits can be hit quite easily by an attacker,
> what is the client behavior when they get hit? Will normal clients keep
> on retrying intro points until they get service, and continuously
> extending their circuits? This behavior is particularly important for
> the availability of the service under this feature.

The code right now, in #15516, will send a NACK. The reason for this is
because we want legit clients to re-extend and not create a new intro circuit.
That is more efficient and puts less pressure on the network.

After getting NACKed by all introduction points, the client will stop
retrying. It will be allowed to retry when the "failure cache" cleans up, which
right now is a 5 minute timeout, or if new intro points are found in a new
descriptor.

I'm in favor of the re-extend option here, which is the normal behavior a
client will encounter in normal circumstances, and also the one that creates
less pressure.
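
The client behavior described above can be modelled roughly as follows. This
is a simplified sketch, not tor's actual client code: the class, the function
and the `send_introduce1` callback are all hypothetical names, and the real
failure cache keeps more state than this. The 300 second lifetime is the
5 minute timeout mentioned above.

```python
class IntroFailureCache:
    """Remembers which intro points NACKed us, for `lifetime_seconds`."""

    def __init__(self, lifetime_seconds=300):
        self.lifetime = lifetime_seconds
        self.failed_at = {}  # intro point id -> time of the NACK

    def note_nack(self, intro_point, now):
        self.failed_at[intro_point] = now

    def usable(self, intro_points, now):
        """Intro points never NACKed, or whose cache entry has expired."""
        return [ip for ip in intro_points
                if ip not in self.failed_at
                or now - self.failed_at[ip] >= self.lifetime]


def try_introduce(intro_points, cache, send_introduce1, now):
    """Try each usable intro point in turn (re-extending the same circuit in
    this model); return the one that ACKed, or None if all of them NACKed."""
    for ip in cache.usable(intro_points, now):
        if send_introduce1(ip):
            return ip
        cache.note_nack(ip, now)
    return None


def always_nack(intro_point):
    return False  # stand-in for a rate-limited intro point


cache = IntroFailureCache()
intro_points = ["ip-a", "ip-b", "ip-c"]
result = try_introduce(intro_points, cache, always_nack, now=0)
blocked = cache.usable(intro_points, now=60)    # still inside the 5 minutes
expired = cache.usable(intro_points, now=300)   # cache entries have aged out
fresh = cache.usable(["ip-d"], now=60)          # intro point from a new descriptor
print(result, blocked, expired, fresh)
```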

Cheers!
David

-- 
1mNwEGRBGwA+KV0QAcUNjXeckIqFcZmhiwewCdRWKac=