[tor-dev] Proposal: The move to two guard nodes

Roger Dingledine arma at mit.edu
Wed Apr 18 08:27:51 UTC 2018

On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
> > To be clear, the design I've been considering here is simply allowing
> > reuse between the guard hop and the final hop, when it can't be avoided. I
> > don't mean to allow the guard (or its family) to show up as all four
> > hops in the path. Is that the same as what you meant, or did you mean
> > something more thorough?
> By all path restrictions I mean for the last hop of the circuit and the
> first (though vanguards would be simpler if we got rid of them for other
> hops, too).

Can you lay out for us the things to think about in the Vanguard design?
Last I checked there were quite a few Vanguard design variants, ranging
from "two vanguards per guard, tree style" to some sort of mesh.

In particular, it would be convenient if there is a frontrunner design
that really would benefit from relaxing many path restrictions, and a
frontrunner design that is not so tied together to the path restrictions.

> But I do mean all restrictions, not just guard node choice.
> The adversary also gets to force you to use a second network path
> whenever they want via the /16 and node family restrictions.

Can you give us a specific example here, for this phrase "network
path"? When you say "second network path" are you thinking in the
Vanguard world?

> We're not using one guard in the current Tor. We're using two, and the
> second one is only used for unmultiplexed activity. That is one property
> I don't like about our "let's pretend to use one guard" status quo.

Right, I agree.

> > I'd like to hear more about the "cleverly crafted exit policy" attack
> Another way to do this type of exit rotation attack is to cause
> a client to look up a DNS name where you control the resolver, and keep
> timing out on the DNS response. The client will then retry the stream
> request with a new exit. The same thing can also be done by timing out
> the TCP handshake to a server you control. Both of these attacks can be
> done with only the ability to inject an img tag into a page.
> You repeat this until an exit is chosen that is in the same /16 or
> family as the guard, and then the client uses a second network path for
> an unmultiplexed request at a time you control.

Hm! Yes, this is a yucky one. (I don't think just an img tag would be
enough, because Tor will try a few circuits and then give up. You'd need
some sort of javascript or refresh chain or the like that generates new
addresses and tries them in succession. But that's totally feasible.)

This one is also yucky because we could also imagine a different way to
pick your path, where when you're selecting your exit, you avoid choosing
exits which would conflict with your guard, and thus you'll never be
pushed off of your guard. But then the destination website can do this
same attack over time and notice which exit you never try to use. So
this is a case where to blend in best, we *need* to be willing to use
all of the potential exits.
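
A minimal sketch of the conflict check discussed above, and of why
avoiding conflicting exits leaks information. The Relay record and the
function names are hypothetical, not Tor's real data structures, and
the family rule is simplified to "both relays list each other".

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    # Hypothetical, simplified relay record (not Tor's real structure).
    name: str
    addr: str
    family: frozenset = frozenset()


def same_slash16(a: Relay, b: Relay) -> bool:
    # True when both IPv4 addresses fall in the same /16 network.
    net = ipaddress.ip_network(f"{a.addr}/16", strict=False)
    return ipaddress.ip_address(b.addr) in net


def conflicts(guard: Relay, exit_relay: Relay) -> bool:
    # Assumption: a family relation counts only if declared by both sides.
    same_relay = guard.name == exit_relay.name
    same_family = (exit_relay.name in guard.family
                   and guard.name in exit_relay.family)
    return same_relay or same_family or same_slash16(guard, exit_relay)


def never_used_exits(guard: Relay, exits: list) -> list:
    # Exits that an "avoid conflicts with my guard" client would never pick.
    # A destination that observes this set over time learns about the guard.
    return [e.name for e in exits if conflicts(guard, e)]
```

The set returned by never_used_exits() is exactly what the destination
could observe over time, which is why the client needs to be willing to
use every potential exit.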

But since normal exit circuits are three hops, if we simply relax the
path restrictions, we could be making a circuit of the form "A - B - A",
which would not only stand out as weird to B, but actually right now a
relay in B's position will refuse such a circuit. Bad news all around.

The three fixes that come to mind are:

(A) "Have two guards": so you can pick any exit you like, and then just
use the guard that doesn't conflict with the exit you picked.

(B) "Add a bonus hop when needed": First relax the /16 and family
restrictions, so the remaining issue is reuse of your guard. Then if
you find that you just chose your guard as your exit, insert an extra
hop in the middle of that circuit.

(C) "Exits can't be Guards": First relax the /16 and family restrictions,
so the remaining issue is reuse of your guard. Then notice that due
to exit scarcity, guards aren't actually used in the exit position
anyway. Then enforce that rule (so they can't be in the future either).

All three of these choices have downsides. But all three of them look
like improvements over the current situation -- because of how crappy
the current situation is.

(Rejected option (D): "Just start allowing it": Relax the /16 and
family restrictions, and also relax the rule where relays refuse a
circuit that goes right back where it came from. Giving the middle node
that much information about the circuit just wigs me out.)
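
Fixes (A) and (B) above can be sketched roughly as follows. This is
only an illustration under stated assumptions: relays are plain
strings, the conflicts() predicate is passed in by the caller, and the
function names are made up for this example.

```python
def build_path_two_guards(guards, middle, exit_relay, conflicts):
    """Fix (A): pick any exit, then use a guard that does not conflict."""
    for guard in guards:
        if not conflicts(guard, exit_relay):
            return [guard, middle, exit_relay]
    # Every guard conflicts with this exit (for example, only one guard
    # is currently working), so some other fallback is still needed.
    return None


def build_path_bonus_hop(guard, middles, exit_relay):
    """Fix (B): /16 and family rules are relaxed; only guard reuse remains."""
    if guard == exit_relay:
        # Insert an extra middle hop so the path is A - B - C - A
        # rather than A - B - A, which relay B would refuse.
        return [guard, middles[0], middles[1], exit_relay]
    return [guard, middles[0], exit_relay]
```

The None return in build_path_two_guards() is the "you need to do
*something*" case: with one working guard that was also picked as the
exit, (A) alone is not enough.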

Also, notice that I think Mike's proposed design will turn out to be some
combination of "A" and also something like "B" or "C", because even if
you start with two guards, if you don't add a new guard right when your
first guard goes down, you might find yourself in the situation where
you have one working guard, and you pick it as your exit, and now you
need to do *something*.

> Our path restrictions also cause normal exiting clients to use a second
> guard for unmultiplexed activity, at adversary controlled times, or just
> periodically at random.

Just to make sure I understand: at least on the current network,
that's because of the /16 rule and the family rule, and not because of
the "if the exit you picked turns out to be your guard too, move to a
different guard" rule, because exits aren't normally used as guards on
our current network?

On more examination though, that's not something to rely on with our
current design, since I bet there are weird edge cases like a relay
loses its Guard flag, but it's still your Guard so you keep using it
(depending on the current advice from #17773), but now the weightings
let you pick it for your Exit, and oops.

Another problematic example would be a relay that you picked as your
Guard, and later it opened up its exit policy and became an Exit.

So if I wanted to try to flesh out my "Then enforce that rule" approach
above, we would need to (1) Have dir auths take away the Guard flag from
relays that can be used as Exits, and (2) Make sure that clients know
that if their guards lose the Guard flag, they should treat them as being
no longer guardworthy. I think we're doing that second one right now,
based on my latest reading of #17773, so this would actually be a pretty
easy change. But still, it's not exactly elegant.
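
A rough sketch of the two steps above, assuming that "can be used as
Exits" is modeled by an Exit flag and that flags are plain sets of
strings. Both function names are hypothetical.

```python
def strip_guard_from_exits(flags_by_relay):
    """Step (1), dir auth side: no relay keeps both Guard and Exit."""
    result = {}
    for relay, flags in flags_by_relay.items():
        flags = set(flags)
        if "Exit" in flags:
            flags.discard("Guard")
        result[relay] = flags
    return result


def guardworthy(my_guards, consensus_flags):
    """Step (2), client side: drop guards that lost the Guard flag."""
    return [g for g in my_guards
            if "Guard" in consensus_flags.get(g, set())]
```

With both steps in place, a relay that opens up its exit policy after
a client picked it as a guard stops being used as that client's guard
once the new flags are seen.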

> > >   However, while removing path restrictions will solve the immediate
> > >   problem, it will not address other instances where Tor temporarily opts
> > >   to use a second guard due to congestion, OOM, or failure of its primary
> > >   guard, and we're still running into bugs where this can be adversarially
> > >   controlled or just happen randomly[5].
> > 
> > I continue to think we need to fix these. I'm glad to see that George
> > has been putting some energy into looking more at them. The bugs that
> > we don't understand are especially worrying, since it's hard to know
> > how bad they are. Moving to two guards might put a bit of a bandaid on
> > the issues, but it can't be our long-term plan for fixing them.
> We're choosing fixes for these bugs that enable an adversary to deny
> service to clients at a particular guard, *without* letting those
> clients move to a second guard. This enables confirmation attacks, and
> these confirmation attacks can be extended to guard discovery attacks by
> DoSing guards one at a time until an onion service fails.

I would find non-onion-service examples more compelling here, since I
want to avoid falling back into the "well, onion services need special
treatment to be safe, so we have to choose between hurting normal clients
and hurting onion services" trap.

How is this for an alternative scenario to be considering: the attacking
website gives the Tor Browser user some page content that causes the
browser to initiate periodic events. Then it starts congesting guards
one at a time until the events stop arriving.

Are those two scenarios basically equivalent in terms of the confirmation
attacks you are worrying about? I hope yes, and now I can stop getting
distracted by wondering if going to this effort is worth it only to
protect onion services? :)

> You keep focusing on the performance aspects of conflux, but that is not
> the argument I am making. My arguments for conflux in Section 4 are
> about resilience to congestion, downtime, circuit killing, and DoS, as
> well as traffic analysis resistance. I see the performance benefits as
> secondary. 

I like conflux in theory, but somebody needs to do the other 90%
of the work to make it a concrete thing that we can consider.

I continue to think "Tor should switch to two guards, because one day
we should design and deploy conflux" is a terrible reason to switch to
two guards now.

So I didn't mean to mix the conflux discussion and the performance
discussion. I meant to mostly ignore the conflux discussion (because it
is a future proposal, not this one), while also making sure that we don't
forget the potential performance benefits of having two guards in general.

> > But I wonder if we're looking at this backwards, and the primary
> > question we should be asking is "How can we protect the transition between
> > guards?" Then one of the potential answers to consider is "Maybe we should
> > start out with two guards rather than just one." Framing it that way,
> > are there more options that we should consider too? For example, removing
> > the ability of the non-local attacker to trigger a transition? Then
> > there would still be visibility of a transition, but the (non-local)
> > attacker can't impact the timing of the transition. How much does that
> > solve? Need to think more.
> One guard is inherently more fragile than two, and no matter what we do,
> it means that there will be a risk of attacks that can confirm guard
> choice, because the downtime during this transition can never be hidden
> without at least some redundancy.

How's this for another option: clients have two guards, but they have
a first guard and a backup guard. They do the traffic padding to both
of them, to ensure continuous netflow sessions in their local ISP's
logs. But they try to send most of their traffic over the first guard,
thus avoiding most of the "increased surface area" concerns about using
two guards at once. And we try to reduce the frequency of situations where
they can't use their first guard. But in the "transition" situations
that we decide we need to keep, they use their backup guard, and it's
already available and ready and that netflow session is already active
in the eyes of their ISP.

This approach isn't conflux (yet), but it's not incompatible with later
changing things so we do conflux.
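
A minimal sketch of the first/backup idea, assuming each guard
connection tracks whether it is open and whether it can currently
build circuits. GuardConn and both functions are hypothetical names
for illustration only.

```python
from dataclasses import dataclass


@dataclass
class GuardConn:
    name: str
    connected: bool = True   # connection is open (and being padded)
    can_build: bool = True   # guard can build new circuits right now


def pick_guard(first: GuardConn, backup: GuardConn) -> GuardConn:
    # Send traffic over the first guard whenever it is usable; fall
    # back to the already-connected backup only in transition cases.
    if first.connected and first.can_build:
        return first
    return backup


def padding_targets(first: GuardConn, backup: GuardConn) -> list:
    # Both open connections get padding, so the ISP-visible netflow
    # session to the backup already exists before any transition.
    return [g.name for g in (first, backup) if g.connected]
```

The point of padding_targets() returning both names in the normal case
is that a switch from first to backup does not create a new connection
for a netflow-level observer to notice.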

It also doesn't get us the lower variance of performance that having
two equally used guards would get us. But I am ok with that for now,
at least until somebody has done some performance analysis to show that
we're really suffering now and we would stop suffering then.

It adds load onto the relays, by almost doubling the number of sockets
used by guards for clients, and also by adding more bandwidth load from
the padding cells to/from the backup guard. (How much bandwidth load is
this, per client?)

And it doesn't actually provide as much "real" cover traffic onto the
backup guard in most situations, so somebody who can look more thoroughly
at the traffic flows will still be able to distinguish a transition
event from the first to the backup. Maybe that's a problem? Or maybe
the netflow level adversary that we declared in the threat model can't
do that, and a real attacker would be able to see the traffic details
anyway, so we're fine^W^Wno worse off than before?

Assuming this design meets all of our goals, let's examine two variants
of it to make sure we understand what we're actually trading off. In
particular, consider a design where we maintain (and pad) these two
connections, vs a design where we maintain a connection to our first
guard and then launch a connection to the backup guard on demand. The
downside of keeping the backup connection open is the extra network-wide
socket and bandwidth load on relays, while the downsides of launching
a connection on demand are the risk that a local netflow-level ISP can
see when we transition to using the backup guard, plus the risk that a
remote attacker who can cripple guards will be able to notice the delay
in the "launch on demand" case but could not distinguish the delay in
the "two connections" case.

That second risk doesn't seem so scary to me, since local handshakes
should be a small fraction of the overall time it takes to build and use
a new circuit. But above you say "the downtime during this transition can
never be hidden without at least some redundancy", so if you think this
risk is scary, I'd like to hear more details about why. (Maybe the design
you were concerned about was one where we just freeze in place and fail
when we don't want to use our first guard? I agree, that's a bad design,
and we can do better, for example by "be willing to use the second guard".)

Whereas that first risk does seem plausible to me -- worth trying to
reduce. I think we should start by enumerating as many scary scenarios
as we can (where scary means "currently we would shift away from our
first guard"), and then fix as many of them as we can. Then we should
look at the remaining scenarios where we would switch over to using our
backup guard (like, when our first guard isn't able to build new circuits
for us), and decide if the cost of the additional load on the network is
worth hiding that transition timing from a netflow-level client-side-ISP
adversary. I can see the answer being "yes, it's worth it", but I think it
will be useful to have a good handle on which transition scenarios remain.

