Hi everybody,
Thanks for your patience. Here is a quick update -- hopefully we'll have another one in the coming days too.
On Sat, Dec 30, 2017 at 06:25:28PM -0500, Roger Dingledine wrote:
(0) Thanks everybody for your work keeping the network going in the meantime! I see that the total number of relays has dropped off a tiny bit: https://metrics.torproject.org/networksize.html but the overall capacity and load on the network has stayed about the same: https://metrics.torproject.org/bandwidth.html So I wouldn't say the sky is falling at this point.
This part is still true! :)
(1) I don't currently have any reason to think this is an intentional denial-of-service attack.
Actually, I now think there is an intentional component to it. But it's not as straightforward as we might have thought.
I think the pain started because somebody is trying to overload a set of onion services with rendezvous requests. But the real pain for the network as a whole comes when those onion services try to keep up with responding to the rendezvous requests.
Counterintuitively, by generating so many response circuits on the network, they're actually loading down the network enough that many of their response attempts will fail.
For one concrete example, when a v2 (that is, non-nextgen) onion service is building its response rendezvous circuit, the last hop in that circuit (the one to the rendezvous point) uses the old "TAP" circuit handshake, which takes a lot more cpu and is given much lower priority by that relay. So if people are flooding the relay with a bunch of circuit create requests, it will take an extra long time to get around to processing the TAP cell, which is part of why their rendezvous circuits are failing. That explanation also matches how people here observed a spike in TAP cells on their relays.
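To make the starvation effect concrete, here is a toy model, not Tor's actual scheduling code. It assumes a relay with a fixed CPU budget per tick that always serves cheap ntor handshakes first and only spends leftover budget on TAP. The 10x cost ratio, the budget, and the function name are all made-up illustration values.

```python
from collections import deque

def simulate_relay(ntor_per_tick, tap_per_tick, ticks,
                   budget=100, ntor_cost=1, tap_cost=10):
    """Toy relay: ntor handshakes are served first, TAP gets leftover CPU.

    All numbers are illustrative assumptions, not measured Tor values.
    Returns (list of waits for TAP handshakes that got served, TAP backlog).
    """
    ntor_q, tap_q = deque(), deque()
    tap_waits = []
    for t in range(ticks):
        # New handshake requests arrive, stamped with their arrival tick.
        ntor_q.extend([t] * ntor_per_tick)
        tap_q.extend([t] * tap_per_tick)
        cpu = budget
        # High priority: cheap ntor handshakes.
        while ntor_q and cpu >= ntor_cost:
            ntor_q.popleft()
            cpu -= ntor_cost
        # Low priority: expensive TAP handshakes, only with what is left.
        while tap_q and cpu >= tap_cost:
            tap_waits.append(t - tap_q.popleft())
            cpu -= tap_cost
    return tap_waits, len(tap_q)

# Light load: every TAP handshake is handled in the tick it arrives.
calm_waits, calm_backlog = simulate_relay(20, 2, ticks=50)
# Flood of creates: the same TAP arrival rate now queues up and waits.
flood_waits, flood_backlog = simulate_relay(85, 2, ticks=50)
print(max(calm_waits), calm_backlog)
print(max(flood_waits), flood_backlog)
```

In this model the TAP arrival rate is identical in both runs; only the volume of competing create requests changes, and that alone is enough to make the TAP queue grow without bound.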
(2b) If anybody has great contacts at Hetzner or OVH and can help us get a message to whoever is running these clients, that would be grand. ("Hi, did you know that you're hurting the Tor network? The Tor people would love to talk to you to help you do whatever it is you're trying to do, in a less harmful way.")
We talked to some OVH abuse people who are Tor fans, who requested that we file a formal abuse ticket asking for contact. I did, and they passed it on to "the customer", but then the OVH Tor clients mysteriously vanished a few days later, with as far as I can tell no attempts at contact. https://metrics.torproject.org/userstats-relay-country.html?start=2017-10-15...
The Hetzner clients still remain so far: https://metrics.torproject.org/userstats-relay-country.html?start=2017-10-15... and we've actually heard from some of them, who are onion service operators trying to keep up with the load.
But the number of people we have heard from only explains a tiny fraction of the "million plus" new users in Germany, so there are still some good mysteries left.
But again, it seems that (some of) these connections from OVH and Hetzner aren't really the origin of the problem. So defenses that focus only on stopping these "attacks" are leaving out a big piece of the puzzle.
(3) I took some steps on Dec 22 to reduce the load that these clients (well, all clients) are putting on the network in terms of circuit creates. It seems like maybe it helped a bit, or maybe it didn't, but I'm the only one who has posted any stats for comparison. You can read more here: https://trac.torproject.org/24716
Alas, I think these consensus param changes didn't make a huge difference. We still have the main change in place, but I plan to try backing it out sometime soon, to see if we see any difference.
The other directions we're working on fall into four categories:
A) Bugfixes and design changes to help onion services not overload the network when they're trying to respond to so many requests. That is, ways to make them more efficient at responding to the most actual users with the fewest wasted circuits.
B) Ways to block or throttle jerks who are trying to overload the onion services. I actually think I have a good way to do it for this particular attack, but I'd like to work harder to be a few steps ahead in the arms race first -- that is, move from "bump out these jerks" to "make it harder to use the Tor internal protocols for amplification attacks".
C) Mitigations that relays can use to be more fair with their available resources. This one is actually quite tough from a design perspective, because if one relay is really fast, meaning it could handle all of the create cells it receives, maybe it should nonetheless opt to fail some of them, for the good of later relays in those circuits.
D) Talking to the humans involved to try to get them to stop and/or make things less bad in the meantime.
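As a rough illustration of the throttling idea in (B) and the fairness idea in (C), here is a minimal sketch of a per-client token bucket for circuit create requests. This is a generic rate-limiting technique, not a description of what Tor does or will do; the class names, the rate, and the burst size are hypothetical.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens/sec, holds at most `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)  # start full so short bursts are allowed
        self.last = None            # time of the previous request, if any

    def allow(self, now):
        if self.last is not None:
            # Refill for the time elapsed since the last request.
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class CreateThrottle:
    """Hypothetical per-client limiter for create requests (illustration only)."""

    def __init__(self, rate=1.0, burst=5):
        self.rate = rate
        self.burst = burst
        self.buckets = {}

    def allow_create(self, client_addr, now):
        # Each client address gets its own bucket, so one noisy client
        # cannot use up the allowance of the others.
        bucket = self.buckets.setdefault(
            client_addr, TokenBucket(self.rate, self.burst))
        return bucket.allow(now)


throttle = CreateThrottle(rate=1.0, burst=5)
# A noisy client sends 20 creates at once; only the burst allowance passes.
noisy = sum(throttle.allow_create("noisy", now=0.0) for _ in range(20))
# A quiet client arriving at the same moment is unaffected.
quiet = throttle.allow_create("quiet", now=0.0)
print(noisy, quiet)
```

Note that this only covers the easy half of (C): it limits how much any one client can ask of a relay, but it does not answer the harder question of when a fast relay should fail creates it could handle, for the sake of slower relays later in the circuit.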
Hope that helps explain. More soon as we learn more and/or as we merge in defenses and/or as we get permission to share things from the people who have told us things.
Thanks, --Roger