Re: [tor-dev] Onion Service - Intropoint DoS Defenses

3 Jul 2019

      On 30 May (09:49:26), David Goulet wrote:
...
Greetings!
[snip]

Hi everyone,

I'm writing to give an update on where we are with the feature that rate
limits introductions at the intro point.

The branch for #15516 (https://trac.torproject.org/15516) is ready to be
merged upstream. It implements a simple rate/burst combo for controlling the
number of INTRODUCE2 cells that are relayed to the service.

As previously detailed in this thread, the default values are a rate of 25
introductions per second and a burst of 200 per second. These values can be
controlled by consensus parameters, meaning they can be changed network wide.
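
For readers who want to picture the rate/burst combo, here is a minimal
token-bucket sketch in Python. It is a generic illustration under my own
assumptions, not the code from the #15516 branch: the class name, the method
name and the explicit `now` argument are all hypothetical. Only the numbers
25 and 200 come from the defaults above.

```python
class TokenBucket:
    """Generic token bucket (illustrative; not tor's implementation)."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate            # tokens added per second
        self.burst = burst          # maximum number of stored tokens
        self.tokens = float(burst)  # start with a full bucket
        self.last = now             # time of the last refill, in seconds

    def allow(self, now):
        """Return True if one request may be relayed at time `now`."""
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over the limit; the request would be refused


if __name__ == "__main__":
    bucket = TokenBucket(rate=25, burst=200)
    relayed = sum(bucket.allow(0.0) for _ in range(300))
    print("relayed out of 300 simultaneous requests:", relayed)
```

In this model a sudden flood is absorbed up to the burst size, after which
requests only pass at the sustained rate.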

We first asked big service operators. I'm not going to detail the values they
provided us in private, but from what they gave us, those defaults look large
enough to sustain heavy traffic.

The second thing we did was experimental testing to see how CPU usage and
availability are affected. We've tested this with 3 _fast_ introduction points
and then 3 rate limited introduction points.

The good news is that once the attack stops, the rain of introduction requests
to the service stops very quickly.

With the default rate/burst values, on an Intel(R) Xeon(R) CPU E5-2650 v4 @
2.20GHz (8 cores), the tor service CPU doesn't go above ~60% (on one single
core), and it drops to almost 0 as soon as the attack ends.

The bad news is that availability is _not_ improved. One of the big reasons
for that is that the rate limit defenses, once engaged at the intro point,
will send back a NACK to the client. A vanilla tor client will stop using that
introduction point for 120 seconds if it gets 3 NACKs from it. This leads to
tor quickly giving up on trying to connect and thus telling the client that
connection to the .onion is impossible.
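
To make that client behavior concrete, here is a small Python model of it. It
is a sketch that only encodes the two numbers given above (3 NACKs, 120
seconds); the class and method names are hypothetical and the real client
logic in tor is more involved than this.

```python
NACK_LIMIT = 3         # NACKs before an intro point is set aside (from above)
EXCLUDE_SECONDS = 120  # how long it stays set aside (from above)


class IntroPointTracker:
    """Toy model of a client setting aside intro points after NACKs."""

    def __init__(self):
        self.nacks = {}           # intro point id -> NACKs seen so far
        self.excluded_until = {}  # intro point id -> time it is usable again

    def record_nack(self, ip_id, now):
        self.nacks[ip_id] = self.nacks.get(ip_id, 0) + 1
        if self.nacks[ip_id] >= NACK_LIMIT:
            # Too many NACKs: stop using this intro point for a while.
            self.excluded_until[ip_id] = now + EXCLUDE_SECONDS
            self.nacks[ip_id] = 0

    def usable(self, ip_ids, now):
        """Return the intro points that are not currently set aside."""
        return [ip for ip in ip_ids if self.excluded_until.get(ip, 0) <= now]


if __name__ == "__main__":
    tracker = IntroPointTracker()
    intro_points = ["ip-a", "ip-b", "ip-c"]
    # Every intro point is rate limiting, so each one NACKs three times.
    for ip in intro_points:
        for _ in range(NACK_LIMIT):
            tracker.record_nack(ip, now=0)
    print("usable shortly after:", tracker.usable(intro_points, now=10))
```

With 3 intro points all under attack, 9 NACKs are enough in this model to
leave the client with nothing to try, which matches the quick give-up
described above.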

We've hacked a tor client to play along and keep retrying after NACKs, to see
how much time it would take to reach the service. On average, a client needed
roughly 70 seconds and more than 40 NACKs.

However, it varied a _lot_ during our experiments with many outliers from 8
seconds with 1 NACK up to 160 seconds with 88 NACKs. (For this, the
SocksTimeout had to be bumped quite a bit).

There is an avenue of improvement here: make the intro point send a specific
NACK reason (like "Under heavy load" or ...) which the client would treat as
"I should retry soon-ish", thus making the client possibly able to connect
after many seconds (or until the SocksTimeout).

More bad news there! We can't do that anytime soon because of a bug that
basically crashes clients if an unknown status code (that is, a new NACK
value) is sent back: https://trac.torproject.org/30454. So yeah... quite
unfortunate there but also a superb reason for everyone out there to
upgrade :).

One piece of good news is that having fast intro points instead of slow ones
doesn't seem to change the overall load on the service much, so for now our
experiment shows it doesn't matter.

Overall, this rate limit feature does two things:

1. Reduce the overall network load.

   Soaking the introduction requests at the intro point helps avoid the
   service creating pointless rendezvous circuits which makes it "less" of an
   amplification attack.

2. Keep the service usable.

   The tor daemon doesn't go into massive CPU load and thus can actually be
   used properly during the attack.

The problem with (2) is the availability part: without lots of luck, it is
close to impossible for a legit client running a vanilla tor to reach the
service. However, if, let's say, the tor daemon were configured with 2 .onion
addresses, one public and the other private with client authorization, then
the second .onion would be totally usable because the tor daemon is not CPU
overloaded.

The third thing we did: in order to make this feature a bit more "malleable",
we are working on https://trac.torproject.org/30924 which is proposal 305.

In short, torrc options are added so an operator can change the rate/burst
that the intro points will use. We can do that using the ESTABLISH_INTRO cell
that will have an extension to define the DoS defense parameters (proposal
305).

That way, a service operator can disable this feature, or turn the knobs on
the rate/burst in order to basically adjust the defenses.

At this point in time, we don't have a good grasp on what happens in terms of
CPU if the rate or the burst is bumped up or even how availability is
affected. During our experimentation, we did observe a "sort of" linear
progression between CPU usage and rate. But we barely touched the surface
since it was changed from 25 to 50 to 75 and that is it.

We would require much more experimentation which is something we want to avoid
as much as possible on the real network.

Finally, many more changes are cooking up. One in particular is
https://trac.torproject.org/projects/tor/ticket/26294 that will make tor
rotate its intro points only when the number of introduction requests reaches
a random value between 150k and 300k; currently that value is between 16k and
32k. See the ticket for the benefits here, which mostly help with (1).
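
As a rough picture of that rotation rule, here is a short Python sketch. It is
only an assumption-laden illustration: the function names are hypothetical,
the bounds are the rounded "k" figures quoted above rather than the exact
constants in tor, and a uniform draw is my guess at what "random value" means.

```python
import random

# Rounded bounds taken from the text above (not tor's exact constants).
CURRENT_RANGE = (16_000, 32_000)
PROPOSED_RANGE = (150_000, 300_000)


def pick_rotation_threshold(bounds, rng=random):
    """Pick how many introductions an intro point may see before rotation."""
    low, high = bounds
    return rng.randint(low, high)  # inclusive on both ends


def should_rotate(intro_count, threshold):
    """True once the intro point has handled `threshold` introductions."""
    return intro_count >= threshold


if __name__ == "__main__":
    threshold = pick_rotation_threshold(PROPOSED_RANGE)
    print("rotate after", threshold, "introductions")
    print("rotate at 20000 introductions?", should_rotate(20_000, threshold))
```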

There has been much talk about a client PoW (see the proposal 305 thread on
this list) which in theory would help out with service availability.

We will also soon merge upstream this ticket https://trac.torproject.org/24962
which goes one step further at denying single-hop connections to the
HSDir/Intro in order to try as much as possible to shut down the Tor2web
connections (or any attacker that speeds things up on their side by single
hopping).

We are making progress here... This is really a non-trivial problem and
solutions for service availability are not that simple. Our priority is to
protect the network as much as possible and then move to possible solutions
for availability.

I'll stop for now. Huge thanks to everyone who provided service logs, ideas,
code review and future testers :).

Cheers!
David

-- 
ccaxzx2hoGOJKo8/00JcH6h3YBw/9SJzFt8yQ65rl9Y=