[tor-dev] Proposal: The move to two guard nodes

Fri Apr 13 06:04:09 UTC 2018

On Sat, Mar 31, 2018 at 06:52:51AM +0000, Mike Perry wrote:
>   The main argument for switching to two guards is that because of Tor's
>   path restrictions, we're already using two guards, but we're using them
>   in a suboptimal and potentially dangerous way.
> 
>   Tor's path restrictions enforce the condition that the same node cannot
>   appear twice in the same circuit, nor can nodes from the same /16 subnet
>   or node family be used in the same circuit.
> 
>   Tor's paths are also built such that the exit node is chosen first and
>   held fixed during guard node choice, as are the IP, HSDIR, and RPs for
>   onion services. This means that whenever one of these nodes happens to
>   be the guard[4], or be in the same /16 or node family as the guard, Tor
>   will build that circuit using a second "primary" guard, as per proposal
>   271[7].
> 
>   Worse still, the choice of RP, IP, and exit can all be controlled by an
>   adversary (to varying degrees), enabling them to force the use of a
>   second guard at will.

I agree with you that we should do something about this bug, where Tor
clients will switch to a rarely used guard in some situations. Our fix
from ticket #14917 was not a good fix. More on that below in Section 3.1.
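
To make the failure mode concrete, here is a rough sketch of how a fixed
final hop can push a client onto its second primary guard. This is not
Tor's actual path selection code: the Relay type, the conflicts() check,
and pick_guard() are hypothetical names, and family membership is
simplified to a one-sided declaration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    """Hypothetical, simplified relay record (not Tor's data structure)."""
    nickname: str
    ipv4: str
    family: frozenset = frozenset()  # nicknames this relay lists as family


def same_slash16(a: str, b: str) -> bool:
    """True if both IPv4 addresses fall inside the same /16 network."""
    net = ipaddress.ip_network(f"{a}/16", strict=False)
    return ipaddress.ip_address(b) in net


def conflicts(guard: Relay, fixed_hop: Relay) -> bool:
    """Apply the three restrictions: same node, same /16, same family."""
    return (
        guard.nickname == fixed_hop.nickname
        or same_slash16(guard.ipv4, fixed_hop.ipv4)
        or fixed_hop.nickname in guard.family
        or guard.nickname in fixed_hop.family
    )


def pick_guard(primary_guards, fixed_hop):
    """Return (index, guard) for the first primary guard that is allowed."""
    for index, guard in enumerate(primary_guards):
        if not conflicts(guard, fixed_hop):
            return index, guard
    return None


guards = [Relay("guardA", "198.51.100.7"), Relay("guardB", "203.0.113.9")]
# An adversary names a rendezvous point in guardA's /16.
adversary_rp = Relay("rpX", "198.51.44.1")
print(pick_guard(guards, adversary_rp))
```

With these made-up addresses the chosen index is 1, that is, the rarely
used second guard, and the adversary picked the moment it happened.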

>   Not surprisingly,
>   the two guard adversary gets to compromise clients roughly twice as
>   quickly, but the timescales are still rather large even for the 10%
>   adversary: they only have 50% chance of success after 4 rotations, which
>   will take about 14 months with Tor's 3.5 month guard rotation.

Three thoughts here:

(A) You're right, 14 months doesn't sound bad here.

(B) This calculation was ignoring churn, right? That is, guards going
away before you wanted to rotate from them. So another way to phrase that
would be "once eight of your guards (four rotations times two guards)
have gone away, you're in bad shape"?
Looking at it that way, it seems like two guards is more than twice
as scary as one, since *either* of them going away moves you one step
closer on the path. Not the end of the world, but worth noticing. And
maybe partially solvable by your "when one of your two goes away, stick
to the remaining one" design; more on that below.

(C) Similarly, we should be sure to remember the network adversary
here too. I don't know a simple way to reason about it well. Using more
guards over time could be *less* than twice as scary, because sometimes
the network paths overlap so you don't expose as much new surface area
as you might have. And using more guards over time could be *more*
than twice as scary, if the question is whether your traffic ever goes
over that one bad place, since you have an exponentially low chance to
*never* pick a guard where your traffic to/from that guard travels over
the bad place. It really depends on your location, the guard locations,
the Internet topology, and a bunch of other confusing factors.
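
For reference, the arithmetic behind the quoted "50% after 4 rotations"
figure can be sketched as below. It assumes each guard pick independently
lands on the adversary with probability equal to the adversary's share
(10% here), and it ignores churn and the network adversary, which are
exactly the caveats in (B) and (C). The function names are mine.

```python
import math


def p_compromised(adversary_fraction, guards_per_rotation, rotations):
    """Chance that at least one guard picked so far belongs to the adversary."""
    picks = guards_per_rotation * rotations
    return 1.0 - (1.0 - adversary_fraction) ** picks


def rotations_until(adversary_fraction, guards_per_rotation, target=0.5):
    """Smallest whole number of rotations at which the chance reaches target."""
    per_rotation_safe = (1.0 - adversary_fraction) ** guards_per_rotation
    return math.ceil(math.log(1.0 - target) / math.log(per_rotation_safe))


ROTATION_MONTHS = 3.5  # guard rotation period quoted in the proposal

for guards in (1, 2):
    n = rotations_until(0.10, guards)
    print(guards, "guard(s):", n, "rotations,", n * ROTATION_MONTHS, "months")
```

Under these assumptions two guards cross 50% at 4 rotations (14 months)
and one guard at 7 rotations (24.5 months), which matches "roughly twice
as quickly". Since the exponent only counts picks, churn shows up as
extra picks per month rather than as a different formula.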

>   Furthermore, our use of separate directory guards (and three of them)
>   means that we're not really changing the situation much with the
>   addition of another regular guard. Right now, directory guard use alone
>   is enough to track all Tor users across the entire world.

Shit, you're right. The guard set fingerprint issue remains right now,
because we never solved the directory guard side of it. :(

>   While the directory guard problem could be fixed[12] (and should be
>   fixed), it is still the case that another mechanism should be used for
>   the general problem of guard-vs-location management[9].

The part that freaks me out about all the designs I've seen here is the
attack where the local adversary advertises a series of local wireless
addresses, first to make you keep generating new guard contexts (similar
to forcing quick guard rotation), or second to guess-and-check whether
you've already got a guard context for some wireless address in the next
city over. Maybe it can be solved by proper UI ("we'll just delegate
the decision to the user"), but hoo boy. But that's a separate proposal
fortunately. :)

> 3.1. Eliminate path restrictions entirely
> 
I'm increasingly a fan of this option, the more I read these threads.

Let's examine the two attacker assumptions behind two of the attacks
we're worried about.

Attack one: the client's local ISP collects coarse netflow logs. These
logs aren't detailed enough to allow a traffic volume detection attack on
an existing long-lived TLS flow, so the connection to that first guard
is safe. But a connection to that second guard will be unusual, not
multiplexed, and made at exactly the time of the adversary-controlled
circuit that triggered it. So that second guard, because it is used so
rarely, is dangerous to use.

Attack two: if the client uses its guard as the first hop of its circuit
and also the adversary-requested fourth hop, then the guard can do
pairwise traffic correlation attacks on all of its circuits and realize
that these two circuits it has are really two pieces of the same circuit.

This second attack seems weird to me. One reason is that in attack
one we're brushing aside the traffic analysis as hard, whereas in attack
two we're assuming it's trivial and perfect. But the simpler reason is:
if your guard is going to participate in a traffic correlation attack
against you, then it could just as easily team up with some other relay
that the adversary picked. That is, avoiding reusing your guard on the
other end of the circuit isn't going to save you if your guard is out
to get you.

Part of why it's hard to compare these two attacks directly is that
one is a client-side-observer adversary and the other is a relay-level
adversary.

Let's look at "attack one" from a relay-level-adversary perspective:
if your first guard is bad, you're screwed already. But if that second
guard might be bad, you really want to do anything you can do to not
reach out to it even once.

And "attack two" from the client-side-observer-level-adversary
perspective: well, if the attacker is watching the *client*, there's
no visible hint that it's reusing its guard later in the path -- and
that's the whole point. But if the attacker is watching the *relay*, then
suddenly we don't have as much diversity of traffic location as we thought
we had. That is, even if your relay is nice, somebody watching the relay's
network could do the pairwise correlation attacks we described earlier.

Another part of what bothers me about attack two -- the one where the
adversary gives you your fourth hop -- is that the adversary has *other*
hops in their side of the circuit, and you don't even know about them.
What if they chose your guard for their middle hop? Or for *their*
guard? There's nothing you can do about those cases, because you can't
know that they're happening. My conclusion is that if we can't solve
significant instances of this attack, we should be wary of paying a
large price to solve only a piece of it.

>   If Tor decided to stop enforcing /16, node family, and also allowed the
>   guard node to be chosen twice in the path, then under normal conditions,
>   it should retain the use of its primary guard.

To be clear, the design I've been considering here is simply allowing
reuse between the guard hop and the final hop, when it can't be avoided. I
don't mean to allow the guard (or its family) to show up as all four
hops in the path. Is that the same as what you meant, or did you mean
something more thorough?

I think "can't be avoided" means HSDir, IP, RP -- which I note are all
onion service related circuits.
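
As a rough sketch of that rule (not an implementation; every name here is
hypothetical, and relays are plain strings to keep it short): if the
primary guard conflicts with a final hop the client could not choose,
reuse the primary guard instead of transitioning. The fallback branch for
other purposes is just a stand-in for whatever the client does today.

```python
# Circuit purposes whose final hop the client cannot choose freely.
FORCED_FINAL_HOP_PURPOSES = frozenset(
    {"hsdir", "intro_point", "rendezvous_point"}
)


def guard_for_circuit(primary_guard, second_guard, final_hop, purpose, conflicts):
    """Pick the guard for one circuit under the 'reuse when forced' rule.

    conflicts(guard, hop) is a caller-supplied predicate for the path
    restrictions (same node, /16, or family).
    """
    if not conflicts(primary_guard, final_hop):
        return primary_guard
    if purpose in FORCED_FINAL_HOP_PURPOSES:
        # The overlap can't be avoided, so accept the reuse rather than
        # reaching out to a rarely used guard at an attacker-chosen time.
        return primary_guard
    return second_guard


def same_relay(guard, hop):
    """Toy restriction check: only the 'same node' rule."""
    return guard == hop


print(guard_for_circuit("guardA", "guardB", "guardA", "rendezvous_point", same_relay))
print(guard_for_circuit("guardA", "guardB", "guardA", "exit", same_relay))
```

The first call keeps guardA even though it is also the rendezvous point;
the second falls back to guardB.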

I'd like to hear more about the "cleverly crafted exit policy" attack, and
I wonder if we can't solve that differently. For example, if it's about
making you do a request to a port that only one exit relay allows, and
ha ha whoops your guard was on the same /16 as that exit relay... maybe
it's time for the dir auths to not advertise super rare ports? This was
one of the topics in the users-get-routed paper too.

One non-starter idea would be to move onion-service-related Tors to two
guards, and leave other Tors at one guard. It's a non-starter because of
course advertising which you are to your local network is no good. But
that idea gave me a different perspective on this discussion: I wonder
how much this design decision comes down to making all Tors use two
guards in order to protect the onion-service-related Tors, which are
the only ones who actually need it?

>   This approach is not as extreme as it seems on face. In fact, it is hard
>   to come up with arguments against removing these restrictions. Tor's
>   /16 restriction is of questionable utility against monitoring, and it can
>   be argued that since only good actors use node family, it gives influence
>   over path selection to bad actors in ways that are worse than the benefit
>   it provides to paths through good actors[10,11].

Yep.

One remaining feature for MyFamily though is that relay operators can say
"No, even though I run these eight relays, I'm not in a position to do
traffic correlation attacks on users, because I told the users to not
put me in that position." This angle of the feature is about protecting
relays, not about protecting clients.

>   However, while removing path restrictions will solve the immediate
>   problem, it will not address other instances where Tor temporarily opts
>   to use a second guard due to congestion, OOM, or failure of its primary
>   guard, and we're still running into bugs where this can be adversarially
>   controlled or just happen randomly[5].

I continue to think we need to fix these. I'm glad to see that George
has been putting some energy into looking more at them. The bugs that
we don't understand are especially worrying, since it's hard to know
how bad they are. Moving to two guards might put a bit of a bandaid on
the issues, but it can't be our long-term plan for fixing them.

>   Note that for this analysis to hold, we have to ensure that nodes that
>   are at RESOURCELIMIT or otherwise temporarily unresponsive do not cause
>   us to consider other primary guards beyond the two we have chosen.
>   This is accomplished by setting guard-n-primary-guards to 2 (in addition
>   to setting guard-n-primary-guards-to-use to 2). With this parameter
>   set, the proposal 271 algorithm will avoid considering more than our two
>   guards, unless *both* are down at once.

I like this general idea of not immediately replacing guards so long as
you have a working one. In fact, we used to do something similar back
in the day:
https://blog.torproject.org/improving-tors-anonymity-changing-guard-parameters
says (emphasis mine)
"""
Tor 0.2.3's entry guard behavior is "choose three guards, ***adding
another one if two of those three go down*** but going back to the
original ones if they come back up, and also throw out (aka rotate)
a guard 4-8 weeks after you chose it."
"""

There are still some fiddly decisions to make here. For example, as you
say we probably shouldn't replace a guard just because we failed to
connect to one of our guards once. We might decide that it's time to add
a new second guard if the consensus tells us that one of them is down
(so we have confirmation that it isn't down for just us, it's down for
everybody). Or we might decide to wait on adding a new one even if it
really is down, because maybe it'll come back soon. But how long do
we wait? And if, while we're down to one, we encounter one of these
situations where the requested fourth hop overlaps with our remaining
guard, what do we do?
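
One way to write down the options above, purely as a sketch: the function
name and the grace_hours knob are hypothetical, and the right value for
the grace period is exactly the open "how long do we wait?" question.

```python
def should_add_second_guard(consensus_says_running, hours_since_marked_down,
                            grace_hours):
    """Decide whether to pick a replacement for a guard we can't reach.

    consensus_says_running: the latest consensus still lists the guard as up.
    hours_since_marked_down: how long the consensus has listed it as down.
    grace_hours: placeholder policy knob, not a real Tor parameter.
    """
    if consensus_says_running:
        # It may be down only for us; one failed connect is not enough.
        return False
    # Everybody sees it as down, but it might still come back soon.
    return hours_since_marked_down >= grace_hours


print(should_add_second_guard(True, 0, grace_hours=24))
print(should_add_second_guard(False, 2, grace_hours=24))
print(should_add_second_guard(False, 30, grace_hours=24))
```

It doesn't answer the last question, what to do when the one remaining
guard overlaps with a requested fourth hop.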

In fact, here's a hopefully useful insight that I've just realized:
you're not concerned about one guard vs two guards, you're concerned
about *transitioning* between guards. It's that moment when you're
starting to use a new guard, if the attacker can observe that you're
doing it, and especially if the attacker can make you do it, that is
vulnerable. And starting with two guards can help, in that it postpones
the time until you're forced to transition, and maybe also because if
we do it right it can make the transition less visible.

But I wonder if we're looking at this backwards, and the primary
question we should be asking is "How can we protect the transition between
guards?" Then one of the potential answers to consider is "Maybe we should
start out with two guards rather than just one." Framing it that way,
are there more options that we should consider too? For example, removing
the ability of the non-local attacker to trigger a transition? Then
there would still be visibility of a transition, but the (non-local)
attacker can't impact the timing of the transition. How much does that
solve? Need to think more.

> 3.2. No Guard-flagged nodes as exit, RP, IP, or HSDIRs
> 
>   Similar to 3.1, we could instead forbid the use of Guard-flagged nodes
>   for the exit, IP, RP, and HSDIR positions.
> 
>   This solution has two problems: First, like 3.1, it also does not handle
>   the case where resource exhaustion could force the use of a second
>   guard. Second, it requires clients to upgrade to the new behavior and
>   stop using Guard flagged nodes before it can be deployed.

I'm not much of a fan of this approach (it seems so inelegant!), but
I find the two problems that you identified to be unsatisfying for
ruling it out. I wonder if we can find some stronger arguments against
this approach?

Otherwise I might find myself starting to like it. :)

One stronger argument might be: "the attacker can always use Guard-flagged
nodes for other hops on its half of the circuit, and you wouldn't even
be able to know that it's doing it, so if the goal is to never have a
circuit with your guard both at your end and also reused elsewhere in
the circuit, sorry you can't achieve that goal, so stop messing stuff
up while trying to achieve what can only ever be a partial solution."

> 4. The future is confluxed
> 
>   An additional benefit of using a second guard is that it enables us to
>   eventually use conflux[6].

I think the performance benefits are the main arguments in favor
of doing two guards. In fact, I still think that it's mainly a
performance-vs-safety tradeoff.

I agree with George that moving to two guards now so that we can maybe
do Conflux later is doing it the wrong way round. Since it's so easy
to switch to two guards, that should be one of the very easy steps in
moving to Conflux when we do, and taking the safety hit now in exchange
for the potential performance benefit later doesn't seem best.

But there's another performance argument we shouldn't forget: if you have
two guards, you're much more likely to have at least one guard that's
adequately fast. Right now some of the guards are fast (relative to
others), and some are slow (relative to others). If you get one of the
lower-end guards, your Tor performance is sad -- for months! We tried
to mitigate that issue when we switched to one guard, by raising the
required bandwidth to get the Guard flag, so there would be no truly
terrible guards. But still, some guards are more equal than others.
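
A back-of-the-envelope sketch of that claim: if a single guard pick has
probability p of landing on a slow guard, and picks are independent, then
the chance of having at least one adequately fast guard is 1 - p^n for n
guards. The p = 0.2 below is a made-up illustration, not a measurement,
and this also assumes the client can actually steer traffic to the faster
guard.

```python
def p_at_least_one_fast(p_slow_pick, num_guards):
    """Chance that not every chosen guard is slow, assuming independent picks."""
    return 1.0 - p_slow_pick ** num_guards


P_SLOW = 0.2  # hypothetical value chosen only for illustration

for n in (1, 2):
    print(n, "guard(s):", round(p_at_least_one_fast(P_SLOW, n), 4))
```

With that made-up p, the chance of being stuck with only slow guards
drops from 20% to 4% when going from one guard to two.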

This issue came up especially in the context of the December/January CPU
overload attacks, where some guards were overwhelmed by circuit creation
requests, and if you had a happy guard, lucky you, but if you had a sad
guard, you might as well delete your Tor Browser and try again.

Now, in an ideal world we should come up with fixes for all of those other
issues, for example by taking the Guard flag away from relays that can't
be great guards. But in the world we live in right now, we can relieve
some of that pressure-to-be-perfect by giving people two guards.

But if we're only going on a performance vs safety basis, I don't see a
huge rush to trade off safety until we have a better handle on what sort
of performance benefits we'd actually get, and until we've compared to
other low-hanging performance fruit.

In summary:

(1) I think we should fix the bug from #14917 where the attacker can
push us off our guard just by naming our guard as the HSDir/IP/RP,
and I think we should fix it by being willing to reuse our guard when
it can't be avoided. That step will resolve some, but not all, of the
pressure about moving to two guards. Then

(2) Hopefully the above discussion has helped us move forward on the
remaining reasons for switching to two guards. To me the two biggest
questions left to resolve are (a) how best to protect the vulnerable
transition to a new guard, and whether two guards is the best idea we've
got for that, and (b) how big an issue it really is that having only one
guard can sometimes give you a low-performance guard, and whether two
guards is the best idea we've got for that one too.

--Roger