[tor-dev] Denial of service defences for onion services

Tue Apr 30 17:12:09 UTC 2019

Hello list,

This thread summarizes and brainstorms various denial of service defences for
onion services, following an in-depth discussion with David Goulet.

We've been thinking about denial of service defences for onion services
lately. This is a recurrent topic that crops up every once in a while: the
last time we had to tackle it was in early 2018, when we had to design a DoS
mitigation subsystem because the network was crumbling
(https://trac.torproject.org/projects/tor/ticket/24902).

Unfortunately, while the DoS mitigation subsystem improved the health of the
network and stopped the DoS attacks back then, it did not address the total
space of possible attacks, and onion services and the network are still open
to various attacks. The main DoS attack right now is the naive one of flooding
the service with too many introduction requests, and this is the attack that
this post deals with.

We don't like DoS attacks because they cause two issues to Tor:

   a) They damage the health of the Tor network, impacting every user.
   b) They kill the availability of legitimate onion services.

In this thread we will handle these two issues independently, as there is no
single solution that improves both areas at once. We have some pretty good
ideas on (a), but we would appreciate ideas on (b), so feel free to give us
your input.

== a) Minimizing the damage to the network caused by DoS attacks:

   Most of the damage caused during DoS attacks is from the circuits created by
   the attacker to introduce/rendezvous to the victim onion service, and also
   by the circuits created by the victim onion service as it tries to
   rendezvous with all those clients. An attacker can literally create tens of
   thousands of introduction circuits in less than a minute, which get
   amplified by the service launching that many rendezvous circuits. Not good.

   Here are a few ways to reduce the damage to the network:

   == 1) Rate limiting introduction circuits

      There should be a way to rate-limit introductions so that services do not
      get overwhelmed. There are various places where we can rate-limit: we
      could rate-limit on the guard-layer, or on the intro-point layer or on
      the service-layer.

      We have already attempted rate-limiting on the guard-layer with
      #24902, but it's hard to go deeper there because the guard does not know
      whether the circuit belongs to a DoS attacker, a busy onion service, or
      150 Tor users in an airport. We also think that rate-limiting on the
      service-layer won't do much good since that's too far down the circuit,
      and we are trying to reduce the work the service has to do so that it
      doesn't get overwhelmed (see #15463 for various queue-management
      approaches for rate-limiting on the service side).

      So we've been thinking of rate-limiting on the introduction point layer,
      since it's a nice soaking point that does not do much right now. See
      #15516 (comment 28) for a concrete proposal by arma which results in far
      less damage to the network (since evil traffic does not get carried
      through to the service-side introduction circuit, and no extra rendezvous
      circuits get launched), and also a swifter way for legit clients to know
      that an onion-service circuit won't work.
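
      As a rough illustration of what intro-point-level rate limiting could
      look like, here is a minimal token-bucket sketch in Python. It is not
      the proposal from #15516 and not Tor code: the class name, the
      rate/burst parameters and the idea of refusing the introduction at the
      intro point are all assumptions made for the example.

```python
import time


class IntroRateLimiter:
    """Token bucket for INTRODUCE1 cells arriving at one intro point.

    Hypothetical sketch: names and parameters are illustrative, not Tor's.
    """

    def __init__(self, rate_per_sec, burst, now=None):
        self.rate = float(rate_per_sec)   # tokens refilled per second
        self.burst = float(burst)         # maximum bucket size
        self.tokens = float(burst)        # start with a full bucket
        self.last = time.monotonic() if now is None else now

    def allow(self, now=None):
        """Return True if one introduction may be forwarded to the service."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.last = now
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Over the limit: the intro point would refuse here (an assumption)
        # instead of passing the cell to the service-side circuit.
        return False
```

      With rate_per_sec=1 and burst=2, two back-to-back introductions are
      forwarded, a third is refused, and one more is allowed a second later.
      Excess traffic never reaches the service, so no rendezvous circuit is
      launched for it.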

   == 2) Stop needless circuit rotation on service-side

      Right now, services will rotate their introduction circuits after a
      certain number of introductions (#26294). This means that during an
      attack, the service not only needs to handle thousands of fake
      introduction circuits, but also continuously tear down and recreate
      introduction circuits and publish new descriptors. See comment 8 on that
      ticket for a short-term proposal on how to improve the situation here,
      by not continuously rotating introduction points.

   == 3) Optimize CPU performance on the service-side

      Right now, onion services during an attack are actually CPU bound. See
      #30221 for various improvements we can make to the performance of
      services. However, improving CPU performance might have the opposite
      effect, since processing cells quicker means that the service will make
      even more rendezvous circuits.

   == 4) Make sure attackers don't take shortcuts around the protocol

      We should make sure that attackers don't take shortcuts around the Tor
      protocol to launch their attacks. Examples here involve requiring a
      proof-of-rendezvous from clients (#25066), and not allowing single-hop
      proxies to do introductions (#22689).

   The above suggestions (maybe in priority order) are ways we can reduce the
   damage dealt to the network by DoS attackers. But that still does not make
   DoS attacks less effective against the service itself. So here follows the
   section about improving service availability:

== b) Improve service availability during DoS attacks

   Unfortunately, it's really hard to accurately stop DoS attacks in the Tor
   protocol. There is just no good way to distinguish between innocent clients
   trying to access content, and a bad actor trying to disable an onion service.
   Here is the main way we've thought of addressing this issue:

   == 1) Binding the application-layer with the Tor introduction-layer

     We think that the Tor protocol layer might not be the right place for
     handling DoS attacks. There are literally million-dollar companies trying
     hard to tackle this issue on the application-layer, where it's easier
     since you can do machine learning, give out captchas, zone out users,
     etc. And that's why we think that the solution to this issue lies on the
     application-layer and not on the Tor protocol layer.

     In particular, a plausible solution here might involve the client
     embedding application-layer information (e.g. a username/password) in its
     INTRODUCE1 cell, which then gets passed to the service. The service can
     then check whether the given username/password should be allowed to
     connect (see "rendezvous approver" concept at #16059), and allow or reject
     the connection as it wishes. This way onion service operators can have
     complicated application-layer software that analyzes the activity of users
     and decides whether users should be allowed in or not (based on the number
     of introductions, or their application-layer (web) activity).

              +===========================================+
              |                Tor network                |
              +===========================================+
                ^                                   ^
                |         +-----+                   |
                +-------->| Tor |-------------------+
                  INTRO2  |  HS |  rendezvous circuit
                   with   +-----+    only if approved
                 user/pass   ^
                             |
                             |
                             v
                          +----------+         +-------+
                          |Rendezvous|<------->|sqlite?|
                          |approver  |         +-------+
                          +----------+

     We think that this is a solution that could allow onion services to
     continue existing under high-load scenarios, since no rendezvous circuits
     would be established for unapproved clients during DoS scenarios (and we
     know that rendezvous circuits are what cause the most
     CPU/network/availability damage).
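
     To make the "rendezvous approver" idea a bit more concrete, here is a
     minimal sketch in Python backed by sqlite. No such software exists today,
     so everything here is hypothetical: the table schema, the function names
     and the per-user introduction cap are assumptions, and how Tor would hand
     the username/password to the approver is left out entirely.

```python
import hashlib
import hmac
import sqlite3


def make_db():
    """Create an in-memory membership table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE members ("
        "username TEXT PRIMARY KEY, salt BLOB, pw_hash BLOB, intros INTEGER)"
    )
    return db


def _hash(password, salt):
    # Salted password hash; the iteration count is illustrative.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def add_member(db, username, password, salt):
    """Register a user with zero introductions so far."""
    db.execute(
        "INSERT INTO members VALUES (?, ?, ?, 0)",
        (username, salt, _hash(password, salt)),
    )


def approve(db, username, password, max_intros=10):
    """Decide whether the service should launch a rendezvous circuit."""
    row = db.execute(
        "SELECT salt, pw_hash, intros FROM members WHERE username = ?",
        (username,),
    ).fetchone()
    if row is None:
        return False                      # unknown user
    salt, pw_hash, intros = row
    if not hmac.compare_digest(pw_hash, _hash(password, salt)):
        return False                      # wrong password
    if intros >= max_intros:
        return False                      # too many introductions already
    db.execute(
        "UPDATE members SET intros = intros + 1 WHERE username = ?",
        (username,),
    )
    return True
```

     In this sketch the service would only build a rendezvous circuit when
     approve() returns True. A real approver would likely also look at
     application-layer activity, as described above.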

     However, this is a very complicated solution from an engineering
     perspective given that it requires changes on the client-side (to enhance
     INTRO1 cells with application-layer data), and also involves various
     enhancements on the service-side (various control port commands to
     interact with the (nonexistent) "rendezvous approver" software, which in
     turn needs to interact with other application-layer software, e.g. SQL
     databases to manage membership).

     There are also serious UX concerns about how this would look on the
     client-side. Also, how does this interact with client auth? And how does
     this interact with the intro-point-level rate limiting proposed above
     (onions should be given the option to disable intro-layer rate limiting)?
     How is this related to #17254?

All in all, we feel like we have pretty good options for reducing the
damage that DoS attacks cause on our network, but we are still lacking
easy and practical solutions for ensuring availability of onion services
that are under DoS. For the next few months, we plan to focus on reducing
the damage on the network, since that damage has a cumulative effect as
circuits fail and get endlessly retried, until nothing ends up working
right. At the same time, we will be thinking of good solutions for keeping
services that receive DoS attacks highly available.

We would love your feedback and suggestions.

Thanks!