[tor-dev] Proposal: The move to two guard nodes

Mike Perry mikeperry at torproject.org
Wed Apr 18 23:31:26 UTC 2018

Roger Dingledine:
> On Wed, Apr 11, 2018 at 11:15:44AM +0000, Mike Perry wrote:
> > > To be clear, the design I've been considering here is simply allowing
> > > reuse between the guard hop and the final hop, when it can't be avoided. I
> > > don't mean to allow the guard (or its family) to show up as all four
> > > hops in the path. Is that the same as what you meant, or did you mean
> > > something more thorough?
> > 
> > By all path restrictions I mean for the last hop of the circuit and the
> > first (though vanguards would be simpler if we got rid of them for other
> > hops, too).
> Can you lay out for us the things to think about in the Vanguard design?
> Last I checked there were quite a few Vanguard design variants, ranging
> from "two vanguards per guard, tree style" to some sort of mesh.
> In particular, it would be convenient if there is a frontrunner design
> that really would benefit from relaxing many path restrictions, and a
> frontrunner design that is not so tied together to the path restriction
> question.

There are two frontrunner forms. One has no path restrictions, the other
would try to perform restriction checks on each layer to ensure that it
is valid and doesn't leak info about other layers or prevent circuit

They are otherwise the same. Both are mesh; both are tunable in the
number of guards and rotation times in each layer.

I am leaning towards "no restrictions" for vanguards for 0.3.4 because
it is simpler, and it did not strike me that the arguments in favor of
restrictions justified trying to implement them quickly in a way that
might cause reachability or path influence risks.

> > But I do mean all restrictions, not just guard node choice.
> > The adversary also gets to force you to use a second network path
> > whenever they want via the /16 and node family restrictions.
> Can you give us a specific example here, for this phrase "network
> path"? When you say "second network path" are you thinking in the
> Vanguard world?

Second path to entry into the Tor network (and a second guard),
regardless of vanguards.

> > > I'd like to hear more about the "cleverly crafted exit policy" attack
> > 
> > Another way to do this type of exit rotation attack is to cause
> > a client to look up a DNS name where you control the resolver, and keep
> > timing out on the DNS response. The client will then retry the stream
> > request with a new exit. The same thing can also be done by timing out
> > the TCP handshake to a server you control. Both of these attacks can be
> > done with only the ability to inject an img tag into a page.
> > 
> > You repeat this until an exit is chosen that is in the same /16 or
> > family as the guard, and then the client uses a second network path for
> > an unmultiplexed request at a time you control.
> The three fixes that come to mind are
> (A) "Have two guards": so you can pick any exit you like, and then just
> use the guard that doesn't conflict with the exit you picked.
> (B) "Add a bonus hop when needed": First relax the /16 and family
> restrictions, so the remaining issue is reuse of your guard. Then if
> you find that you just chose your guard as your exit, insert an extra
> hop in the middle of that circuit.
> (C) "Exits can't be Guards": First relax the /16 and family restrictions,
> so the remaining issue is reuse of your guard. Then notice that due
> to exit scarcity, guards aren't actually used in the exit position
> anyway. Then enforce that rule (so they can't be in the future either).
> All three of these choices have downsides. But all three of them look
> like improvements over the current situation -- because of how crappy
> the current situation is.
> (Rejected option (D): "Just start allowing it": Relax the /16 and
> family restrictions, and also relax the rule where relays refuse a
> circuit that goes right back where it came from. Giving the middle node
> that much information about the circuit just wigs me out.)
> Also, notice that I think Mike's proposed design will turn out to be some
> combination of "A" and also something like "B" or "C", because even if
> you start with two guards, if you don't add a new guard right when your
> first guard goes down, you might find yourself in the situation where
> you have one working guard, and you pick it as your exit, and now you
> need to do *something*.

The one-guard-down case does impact things. But even when this does
happen (which should be rare), it should only be true for a small window
of time before the consensus updates.

The "down" guard should either be temporarily overloaded, or fully down
and kicked off the consensus. I think we should only add a new guard
when one falls out of the consensus, or both are unreachable/unusable.

This is why I think it is OK to take an incremental approach and
start with A, and roll out things like B and C and other restriction

During these edge cases, the most important property that we should
strive to preserve is overall reachability. I don't like situations
where the adversary gains information by certain nodes being overloaded
or down. In my view, trying to make smart decisions to minimize exposure
to more nodes is secondary to overall reachability. (Reachability
failures let a *non-network* adversary gain information about how
clients are using our network. That strikes me as a lower-resource,
more dangerous attack than the unknown risk of possible partial network
observers. In other words, I believe we made the right short-term call
in #14917 in terms of preserving reachability.)

> > Our path restrictions also cause normal exiting clients to use a second
> > guard for unmultiplexed activity, at adversary-controlled times, or just
> > periodically at random.
> Just to make sure I understand: at least on the current network,
> that's because of the /16 rule and the family rule, and not because of
> the "if the exit you picked turns out to be your guard too, move to a
> different guard" rule, because exits aren't normally used for guards on
> our current network?
> On more examination though, that's not something to rely on with our
> current design, since I bet there are weird edge cases like a relay
> loses its Guard flag, but it's still your Guard so you keep using it
> (depending on the advice from #17773), but now the weightings
> let you pick it for your Exit, and oops.
> Another problematic example would be a relay that you picked as your
> Guard, and later it opened up its exit policy and became an Exit.

I am in favor of preventing guards from being exits. Intuitively, it
means fewer "one stop shop" surveillance points to see both entry and
exit traffic. It also makes flag-based load balancing equations much
simpler, and makes it easier to account for padding overhead.

> So if I wanted to try to flesh out my "Then enforce that rule" approach
> above, we would need to (1) Have dir auths take away the Guard flag from
> relays that can be used as Exits, and (2) Make sure that clients know
> that if their guards lose the Guard flag, they should treat them as being
> no longer guardworthy. I think we're doing that second one right now,
> based on my latest reading of #17773, so this would actually be a pretty
> easy change. But still, it's not exactly elegant.

In the world where we keep path restrictions, these would be my rules:
1. Two equal guards, chosen from different /16s and families.
2. Choose each vanguard layer's members such that each layer has at least
   one node from a unique /16 and family.
3. Build paths in a strict order, from last hop towards guard. If you
   can't build a path with this ordering, start over with a sampled guard.
   (With rule #1 and #2, this should be very rare and should mean that
   a guard is marked down locally but still marked up in the consensus.)
4. No guards as exits (Not needed but do it anyway for other reasons).

Then, under these rules, you use a new primary guard in two cases:
0. When a guard leaves the consensus, replace it with a new primary.
1. When your two primaries are locally down or unusable (i.e. rule #3
   above fails), temporarily pick a new guard.
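
As an illustration only, rules #1 and #3 above can be sketched in
Python. The Relay type, the helper names, and the uniform random choice
are assumptions made for this sketch. They are not tor's data
structures or its bandwidth-weighted selection, and family membership is
simplified to a set of relay names.

```python
import ipaddress
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    # Hypothetical relay record; real consensus entries carry far more.
    name: str
    ip: str
    family: frozenset = frozenset()  # names of relays in the same family


def same_slash16(a, b):
    """True if both relays' IPv4 addresses fall in the same /16."""
    net_a = ipaddress.ip_network(f"{a.ip}/16", strict=False)
    return ipaddress.ip_address(b.ip) in net_a


def conflicts(a, b):
    """Path restriction check: same relay, same /16, or same family."""
    return (
        a.name == b.name
        or same_slash16(a, b)
        or a.name in b.family
        or b.name in a.family
    )


def pick_two_guards(candidates, rng=random):
    """Rule #1: two guards that do not share a /16 or family.

    Assumption: uniform sampling, purely to keep the sketch short.
    """
    pool = list(candidates)
    if len(pool) < 2:
        raise ValueError("need at least two candidate guards")
    rng.shuffle(pool)
    first = pool[0]
    for second in pool[1:]:
        if not conflicts(first, second):
            return first, second
    raise ValueError("no non-conflicting guard pair found")


def guard_for_exit(exit_relay, guards):
    """Rule #3: the exit is fixed first, then a compatible guard is chosen.

    Returns None if every guard conflicts with the exit.
    """
    for guard in guards:
        if not conflicts(guard, exit_relay):
            return guard
    return None
```

With two guards from different /16s and families, guard_for_exit()
returns None only when the chosen exit happens to conflict with both
guards, which is the rare case that replacement rule #1 covers.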

> > > >   However, while removing path restrictions will solve the immediate
> > > >   problem, it will not address other instances where Tor temporarily opts
> > > >   to use a second guard due to congestion, OOM, or failure of its primary
> > > >   guard, and we're still running into bugs where this can be adversarially
> > > >   controlled or just happen randomly[5].
> > > 
> > > I continue to think we need to fix these. I'm glad to see that George
> > > has been putting some energy into looking more at them. The bugs that
> > > we don't understand are especially worrying, since it's hard to know
> > > how bad they are. Moving to two guards might put a bit of a bandaid on
> > > the issues, but it can't be our long-term plan for fixing them.
> > 
> > We're choosing fixes for these bugs that enable an adversary to deny
> > service to clients at a particular guard, *without* letting those
> > clients move to a second guard. This enables confirmation attacks, and
> > these confirmation attacks can be extended to guard discovery attacks by
> > DoSing guards one at a time until an onion service fails.
> I would find non-onion-service examples more compelling here, since I
> want to avoid falling back into the "well, onion services need special
> treatment to be safe, so we have to choose between hurting normal clients
> and hurting onion services" trap.
> How is this for an alternative scenario to be considering: the attacking
> website gives the Tor Browser user some page content that causes the
> browser to initiate periodic events. Then it starts congesting guards
> one at a time until the events stop arriving.
> Are those two scenarios basically equivalent in terms of the confirmation
> attacks you are worrying about? I hope yes, and now I can stop getting
> distracted by wondering if going to this effort is worth it only to
> protect onion services? :)

> > > But I wonder if we're looking at this backwards, and the primary
> > > question we should be asking is "How can we protect the transition between
> > > guards?" Then one of the potential answers to consider is "Maybe we should
> > > start out with two guards rather than just one." Framing it that way,
> > > are there more options that we should consider too? For example, removing
> > > the ability of the non-local attacker to trigger a transition? Then
> > > there would still be visibility of a transition, but the (non-local)
> > > attacker can't impact the timing of the transition. How much does that
> > > solve? Need to think more.
> > 
> > One guard is inherently more fragile than two, and no matter what we do,
> > it means that there will be a risk of attacks that can confirm guard
> > choice, because the downtime during this transition can never be hidden
> > without at least some redundancy.
> How's this for another option: clients have two guards, but they have
> a first guard and a backup guard. They do the traffic padding to both
> of them, to ensure continuous netflow sessions in their local ISP's
> logs. But they try to send most of their traffic over the first guard,
> thus avoiding most of the "increased surface area" concerns about using
> two guards at once. And we try to reduce the frequency of situations where
> they can't use their first guard. But in the "transition" situations
> that we decide we need to keep, they use their backup guard, and it's
> already available and ready and that netflow session is already active
> in the eyes of their ISP.
> This approach isn't conflux (yet), but it's not incompatible with later
> changing things so we do conflux.
> It also doesn't get us the lower variance of performance that having
> two equally used guards would get us. But I am ok with that for now,
> at least until somebody has done some performance analysis to show that
> we're really suffering now and we would stop suffering then.

FYI, we actually do have one form of this info in figure 10 of

We get the largest performance gains from going from one guard to two,
in terms of reducing the variance (flatness) of that CDF.

Qualitatively, this means way fewer users who try Tor and experience a
very slow Tor, telling their friends that it is too slow and should not
be used. This is a real thing. Web UX folks have found that it happens
with perf variances in the sub-second range with websites.

> It adds load onto the relays, by almost doubling the number of sockets
> used by guards for clients, and also by adding more bandwidth load from
> the padding cells to/from the backup guard. (How much bandwidth load is
> this, per client?)
> And it doesn't actually provide as much "real" cover traffic onto the
> backup guard in most situations, so somebody who can look more thoroughly
> at the traffic flows will still be able to distinguish a transition
> event from the first to the backup. Maybe that's a problem? Or maybe
> the netflow level adversary that we declared in the threat model can't
> do that, and a real attacker would be able to see the traffic details
> anyway, so we're fine^W^Wno worse off than before?

There are a couple things here that make me think we may still be worse
off:

1. The netflow padding is not designed to simulate client traffic. It is
designed to aggregate client traffic together over time in the
adversary's logs. Instead of seeing a discrete "520KB xfer in this 15
second period, 80KB in that one, and 2300KB in that one, and then
silence for 25 minutes", the adversary records "2900KB traffic total in
this half hour". For this aggregation to help, there really needs to be
other traffic during that half hour. This is why I keep saying that more
concurrent activity is better than only using the second guard
sometimes. (WTF-PAD could do things like you describe above, but we need
to program histograms+state machines for that).
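
The aggregation arithmetic in point 1 can be illustrated with a toy
model. This is a sketch under stated assumptions: the start times of the
three bursts are made up (only their sizes and the 15 second and half
hour periods come from the text above), and a flow log is modeled as
nothing more than byte totals per fixed window.

```python
from collections import defaultdict


def aggregate(transfers, window_s):
    """Sum transfer sizes per logging window.

    transfers: iterable of (start_second, kilobytes) pairs.
    Returns {window_index: total_kilobytes}.
    """
    totals = defaultdict(int)
    for start_s, kilobytes in transfers:
        totals[start_s // window_s] += kilobytes
    return dict(totals)


# Hypothetical burst start times; the sizes are the ones quoted above.
bursts = [(0, 520), (15, 80), (30, 2300)]

# Without padding: records at 15 second granularity expose each burst.
fine = aggregate(bursts, 15)

# With padding keeping the flow alive: one record for the half hour.
coarse = aggregate(bursts, 30 * 60)
```

Here fine holds three separate entries (520, 80 and 2300 KB), while
coarse holds a single 2900 KB total. If nothing else happens in that
half hour, the single total still describes only this one activity,
which is why concurrent traffic matters.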

2. Detection of when to switch to this second guard seems complicated
and error prone, and if it results in unavailability, it is strictly
worse. If it switches to the second guard at the first sign of
RESOURCELIMIT and path selection issues, well, then you're adding a lot
of complexity for how much benefit (and also complexity that could be
manipulated by the adversary).

> Whereas that first risk does seem plausible to me -- worth trying to
> reduce. I think we should start by enumerating as many scary scenarios
> as we can (where scary means "currently we would shift away from our
> first guard"), and then fix as many of them as we can. Then we should
> look at the remaining scenarios where we would switch over to using our
> backup guard (like, when our first guard isn't able to build new circuits
> for us), and decide if the cost of the additional load on the network is
> worth hiding that transition timing from a netflow-level client-side-ISP
> adversary. I can see the answer being "yes, it's worth it", but I think it
> will be useful to have a good handle on which transition scenarios remain.

Well, "fixing" the largest, most frequent, and adversary-controlled
classes of these requires:

1. Removing path restrictions.
2. Recognizing DoS attacks and differentiating them from bad network

#2 is what worries me. Any solution to #2 that is agile enough to avoid
downtime strikes me as no better than "switch to guard #2 with
probability 1/2 after a RESOURCELIMIT or any other circuit failure"
(which is what the code would do today with two equal guards), and a
hell of a lot more complex (with risk of a downtime signal or adversary
path influence if we get it wrong).
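
The "probability 1/2" figure follows if each circuit attempt picks
uniformly between two equal guards, independent of which guard just
failed. A toy simulation, assuming exactly that (the function names are
hypothetical, and real guard selection keeps more state than this):

```python
import random


def next_guard(guards, rng):
    """Assumed policy: every attempt picks uniformly among equal guards."""
    return rng.choice(guards)


def switch_fraction(trials=20_000, seed=0):
    """Fraction of retries that land on the guard that did not just fail."""
    rng = random.Random(seed)
    guards = ("guard1", "guard2")
    switched = 0
    for _ in range(trials):
        failed_on = next_guard(guards, rng)  # attempt that hit the failure
        retry_on = next_guard(guards, rng)   # independent retry
        if retry_on != failed_on:
            switched += 1
    return switched / trials
```

Under this assumption switch_fraction() comes out close to 0.5, with no
failure-classification logic involved at all.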

Mike Perry