tl;dr: analysis seems to indicate that switching to one guard node might not be catastrophic for the performance of Tor. To improve performance, some increased guard bandwidth thresholds are proposed that seem to help without completely destroying the anonymity of the network. Enjoy the therapeutic qualities of the graphs, and please read the whole post.
We start this post by assuming that we _should_ switch to one guard for the security/anonymity arguments that were detailed in Tariq's paper and Roger's blog post.
=== Performance implications of switching to 1 guard ===
The question now becomes: if we indeed switch to 1 guard, how does that influence the performance of the Tor network? To answer this question, we look at the following graph, which shows the expected bandwidth for a client circuit:
https://people.torproject.org/~asn/guards2/perf_cdf_guard_bw_desc.png (see green and orange lines)
(I calculate the bandwidth using the descriptor bandwidth values [0], and in the case of 3 guards I measure the expected bandwidth as the average of the bandwidths of the three guards. [1])
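To make the methodology concrete, here is a minimal sketch of how such a CDF can be computed. It assumes a hypothetical `guards' list of (consensus_bw, descriptor_bw) pairs parsed elsewhere; the real code lives in guard_probs.py [4] and may differ in the details:

    import random

    def expected_bw_cdf(guards, n_guards=1, cutoff=0, samples=100000):
        """Return the sorted expected circuit bandwidths of simulated clients."""
        # Discard guards below the (hypothetical) Guard-flag threshold.
        eligible = [g for g in guards if g[1] >= cutoff]
        # Clients pick guards weighted by consensus bandwidth.
        weights = [g[0] for g in eligible]
        results = []
        for _ in range(samples):
            # random.choices samples with replacement; Tor picks distinct
            # guards, but for a sketch the difference is negligible.
            picked = random.choices(eligible, weights=weights, k=n_guards)
            # Expected bandwidth: mean of the guards' descriptor bandwidths.
            results.append(sum(g[1] for g in picked) / n_guards)
        return sorted(results)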
For example, looking at the graph, we see that when three guards are used, 1/5th of the clients will have performance below 5MB/s, whereas with one guard 1/5th of the clients will have performance below 3MB/s. If our assumptions hold, that's almost half the bandwidth for the unlucky 1/5th of single-guard clients who happened to pick a weak guard: not good.
Further along the CDF, we see that in the three-guard case, half of the clients will have performance below 8MB/s, whereas in the one-guard case they will have performance below 7MB/s. This is not terribly bad, and the reason is that powerful guards are more likely to be selected, so single-guard clients will tend to pick those.
Finally, a crossover happens for the lucky 2/5ths of the single-guard clients, who actually experience better performance than the three-guard clients, since they picked a powerful guard and use only that. This is interesting, but in real life the results might not be so peachy, because the powerful guards will get more overloaded.
=== Client performance implications of bumping up the guard bandwidth threshold ===
So, now that we analyzed the performance implications of using a single guard, let's see if we can improve the performance. One obvious way of doing so is by increasing the bandwidth threshold for the Guard flag. The threshold is currently at 250KB/s (according to dir-spec), but let's see what happens from a performance perspective if we bump it up to 2MB/s. Looking at the same graph as before, now pay attention to the blue line.
We can see that for the unlucky 1/5th of the single-guard clients who had a bandwidth of 3MB/s, their bandwidth now becomes 4MB/s, which seems like a decent improvement. Furthermore, the crossover happens earlier now, which means that _supposedly_ half of the clients are going to have better performance (modulo guard overload) compared to the three-guard case!
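For instance, with the sketch above, the effect of the bump can be eyeballed like this (assuming descriptor bandwidths are in bytes/s; adjust the constant otherwise):

    # 20th-percentile bandwidth with and without a 2MB/s guard cutoff:
    base = expected_bw_cdf(guards, n_guards=1)
    bumped = expected_bw_cdf(guards, n_guards=1, cutoff=2 * 1000 * 1000)
    print(base[len(base) // 5], bumped[len(bumped) // 5])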
I also made graphs for a bandwidth threshold of 1MB/s (since 2MB/s sounded too crazy), you can find them here [2]: https://people.torproject.org/~asn/guards2/perf_cdf_guard_bw_desc_1000.png https://people.torproject.org/~asn/guards2/perf_cdf_guard_bw_consensus_1000....
=== Network performance implications of bumping up the guard bandwidth threshold ===
Now that we analyzed the performance difference for individual clients, let's see what will happen to the total bandwidth of the Tor network if we bump up the guard bandwidth threshold. This might help us understand how much we will overload the Tor network with this change.
Here is a graph that shows the fraction of the total guard bandwidth we discard when we impose various bandwidth thresholds [3]: https://people.torproject.org/~asn/guards2/perf_bw_fraction.png {1}
The graph above is not very meaningful on its own, but it combos well with the following metrics graph: https://metrics.torproject.org/network.html#bandwidth-flags {2} (see yellow and orange lines)
From {2}, we see that the Tor network has 6000MiB/s advertised guard bandwidth (orange line), but supposedly is only using 3500MiB/s of it (yellow line). This means that, supposedly, we are only using 3/5ths of our guard capacity: we have 2500MiB/s spare.
Looking back at {1}, we see that if we increase the guard bandwidth threshold to 2MB/s we will discard 1/10th of our total guard bandwidth. This is not a terrible problem if we have 2/5ths of spare guard capacity...
.oO(this sounds too good to be true, doesn't it?)
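The discarded-bandwidth fraction in {1} boils down to a simple sum. A sketch, again over the hypothetical `guards' list of (consensus_bw, descriptor_bw) pairs from before:

    def discarded_bw_fraction(guards, cutoff):
        """Fraction of the total guard descriptor bandwidth below a cutoff."""
        total = sum(g[1] for g in guards)
        lost = sum(g[1] for g in guards if g[1] < cutoff)
        return lost / total

    for mb in (0.25, 1, 2, 5):
        frac = discarded_bw_fraction(guards, mb * 1000 * 1000)
        print("%.2fMB/s cutoff discards %.0f%% of guard bandwidth" % (mb, 100 * frac))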
=== Security implications of bumping up the guard bandwidth threshold ===
Unfortunately, we can't simply go and discard most of our guard nodes. Discarding nodes has definite implications for the anonymity of the Tor network. Let's try to understand them.
Here is a graph that shows the number of guard nodes and how it changes over different bandwidth thresholds: https://people.torproject.org/~asn/guards2/diversity_guards_n.png
For example, we see that increasing the bandwidth threshold to 2MB/s will cut our guard nodes in half: from 2000 to 1000. This is not really good. Even a smaller threshold of 1MB/s will cut them down to 1400 or so.
But before we pull a Filliol, let's try to understand how much discarding 1000 guard nodes influences the diversity of our guard selection. Here is a graph that shows the probability of picking any of the guard nodes we discard, for different bandwidth thresholds: https://people.torproject.org/~asn/guards2/diversity_discarded_prob.png
So, for example, we see that the 1000 nodes we discarded in the 2MB/s case only had a 0.07 probability of being selected. That's around a 1/15 chance of picking one of those 1000 guard nodes, so even though there were many of them, they were not providing much diversity to the guard selection process. Of course, there are many possible attacks and threat models involving guards, so this analysis might be valid for some and irrelevant for others.
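Since selection probability is proportional to consensus weight, the discarded probability mass is just the discarded weight fraction. A sketch, under the same assumptions as before:

    def discarded_guards(guards, cutoff):
        """How many guards a cutoff discards, and the chance of picking one."""
        total_weight = sum(g[0] for g in guards)
        discarded = [g for g in guards if g[1] < cutoff]
        lost_weight = sum(g[0] for g in discarded)
        return len(discarded), lost_weight / total_weight

    n, prob = discarded_guards(guards, 2 * 1000 * 1000)
    print("%d guards discarded, carrying %.2f of the selection probability" % (n, prob))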
The fact that those guards had only a 1/15 chance of being selected also gives us hope that we will not overload the network by discarding them, since only a "small" portion of clients were choosing them anyway. These clients will now spread over the remaining 1000 nodes, which are much better at handling them (famous last words).
=== Fingerprinting implications of switching to 1 guard ===
See https://bugs.torproject.org/10969 for the background of this.
Here is a graph with the expected number of clients for the biggest and smallest guard over different bandwidth thresholds: https://people.torproject.org/~asn/guards2/fingerprinting_expected_clients.p... The graph considers 500k clients choosing guards simultaneously.
Switching to 1 guard will make guard set fingerprinting harder if you are a lucky client that picked a popular guard, since now you are blending in with thousands of other clients who are using that guard.
If you were unlucky enough to choose a small guard, your anonymity set is still shit. For example, without considering bandwidth cutoffs, the smallest guard has an expected number of clients of less than 1, which means that it will uniquely identify you. Even with a bandwidth cutoff of 2MB/s, the expected number of clients is 10, which is not much better. Heck, even with a cutoff of 9MB/s, there will only be 100 clients on average for the smallest guard; that's a pretty small number if we consider Tor clients all over the globe.
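These expected-clients numbers follow directly from the selection probabilities: with N clients choosing guards independently, guard i expects N * p_i clients. A sketch, with the same hypothetical `guards' list:

    def expected_clients(guards, n_clients=500000):
        """Expected client load of the smallest and the biggest guard."""
        total_weight = sum(g[0] for g in guards)
        loads = [n_clients * g[0] / total_weight for g in guards]
        return min(loads), max(loads)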
=== Conclusions ===
It seems that the performance implications of switching to 1 guard are not terrible. The performance of some clients will indeed get worse, but we might be able to help that by increasing the bandwidth threshold for being a Guard node.
A guard bandwidth threshold of 2MB/s (or 1MB/s if that sounds too crazy) seems like it would considerably improve client performance without screwing terribly with the security or the total performance of the network.
The fingerprinting problem will be improved in some cases, but it remains unsolved for many users (TODO: calculate the percentage). A proper solution might involve guard node buckets, as explained in https://trac.torproject.org/projects/tor/ticket/9273#comment:4
Also, the analysis suggests that people who pick slow guards are simply unlucky (even though they will share those guards with fewer people). Should we do anything about people who will keep choosing new guards until they hit the good ones? Or about torrc lines on the Internet that statically pick the best guard nodes?
=== Closing notes and disclaimers ===
I would say that our analysis has shown that switching to one guard is probably viable but we should be aware of the drawbacks and be prepared for possible surprises.
Furthermore, I would like to disclose that one month ago I didn't even know how guard node selection happens, and now I'm partly responsible for choosing whether we switch to one guard node or not. Also, even though this is a serious research project, I felt that I had to rush it and do it in 3 weeks. This was not ideal, because I don't feel I understand all the variables in the equation. So please read the whole document and make sure that I have not fucked up majorly. I would like to avoid being the man who destroyed the Tor network ;)
Also, it's my first time producing graphs with Python, so I wouldn't be surprised if there are errors. Fortunately, most of the graphs I produced seem to agree with the graphs that Nick Hopper and Tariq have produced, which gives me some slight confidence.
The code I used can be found at https://gitorious.org/guards/guards [4]. You can find all the graphs here: https://people.torproject.org/~asn/guards2/
Don't worry, be happy.
[0]: Important note: even though I calculate the plotted bandwidth using descriptor bandwidth values, I still calculate the guard probabilities using the consensus bandwidth values. This seemed to me to be the correct way; if it's not, I can easily change it.
Also see https://people.torproject.org/~asn/guards2/perf_cdf_guard_bw_consensus.png for the same graph but using the bandwidth values from the consensus (measured by the bandwidth authorities) everywhere.
[1]: This graph makes the pretty bold assumption that "higher guard bandwidth" == "better client performance", which is probably not entirely true because of the bandwidth-based load balancing during path selection. However, we need an assumption to work with, and this one might not be too bad.
It also assumes that the mean of the bandwidths of three guards represents the actual performance of a client, which is not entirely true. A correct solution should take the circuit-build-times (CBT) logic of tor into account.
[2]: Because of technical difficulties I could not put everything in one graph! Graphs are hard!
[3]: Nick Hopper made a similar graph earlier in this thread: https://www-users.cs.umn.edu/~hopper/guards/guard_thresholds_bandwidth.png
[4]: It's rushed research-quality code, which means that I'm probably the only person who can use it atm. If you feel experimental, you can try generating some graphs, for example: $ python guard_probs.py consensus descriptors
On Thu, Mar 13, 2014 at 10:21:38PM +0000, George Kadianakis wrote:
> From {2}, we see that the Tor network has 6000MiB/s advertised guard bandwidth (orange line), but supposedly is only using the 3500MiB/s (yellow line). This means, that supposedly we are only using 3/5ths of our guard capacity: we have 2500MiB/s spare.
> Looking back at {1}, we see that if we increase the guard bandwidth threshold to 2MB/s we will discard 1/10th of our total guard bandwidth. This is not a terrible problem if we have 2/5ths of spare guard capacity...
> .oO(this sounds too good to be true, doesn't it?)
> [snip]
> So for example, we see that those 1000 nodes that we discarded in the 2MB/s case, only had 0.07 probability of being selected.
There's an interesting interaction here, where by being more selective about what counts as a guard, we push more relays into only being suitable for the middle hop of the circuit.
While we always talk about how the Tor network is a clique, in approximation it's really three layers:
{ fast non-exits } -------- { slow non-exits } -------- { exits }
And very broadly speaking, our proposal here pushes half of the relays from the first set into the second set.
I wonder what other effects this change has, e.g. on the expected number of file descriptors that relays of each category will use.
It would be interesting to learn, from your 6000MiB/s and 3500MiB/s numbers above, how much of that bandwidth was from what position in the circuit. For example, a pretty big fraction (by bandwidth) of the fast guards are also fast exits, so by making guard choice more selective, we're moving those relays *out* of other positions in the circuit, with implications that I don't fully understand. I don't think there's an easy way to learn this breakdown though.
Looking at it this way also makes me wonder about using Conflux to glue together two relays from the middle category, since the middle category is where the small relays go.
Or looking at it from the other direction, if we raise the threshold for being a guard to 2MB/s, and we get a bunch of volunteer non-exit relays on fast cablemodems (1MB/s), the only position we can use those smaller non-exit relays is in the middle hop. So we could imagine a world where we have a glut of extra capacity in the middle hop, since you can't exit from it and it's not concentrated enough to use any of the relays as guards.
Might we end up with oscillations, where the non-exit non-guards receive inflated consensus weights because they're underused, pushing them over the edge to get the Guard flag, until they accumulate some users and don't look so hot anymore, at which point they lose the Guard flag?
I don't think any of these issues is enough to slow us down on the general direction. But they illustrate that there's a lot more complexity underneath.
--Roger
Roger Dingledine <arma@mit.edu> writes:
> On Thu, Mar 13, 2014 at 10:21:38PM +0000, George Kadianakis wrote:
>> From {2}, we see that the Tor network has 6000MiB/s advertised guard bandwidth (orange line), but supposedly is only using the 3500MiB/s (yellow line). This means, that supposedly we are only using 3/5ths of our guard capacity: we have 2500MiB/s spare.
>> Looking back at {1}, we see that if we increase the guard bandwidth threshold to 2MB/s we will discard 1/10th of our total guard bandwidth. This is not a terrible problem if we have 2/5ths of spare guard capacity...
>> .oO(this sounds too good to be true, doesn't it?)
>> [snip]
>> So for example, we see that those 1000 nodes that we discarded in the 2MB/s case, only had 0.07 probability of being selected.
>
> There's an interesting interaction here, where by being more selective about what counts as a guard, we push more relays into only being suitable for the middle hop of the circuit.
> While we always talk about how the Tor network is a clique, in approximation it's really three layers:
> { fast non-exits } -------- { slow non-exits } -------- { exits }
> And very broadly speaking, our proposal here pushes half of the relays from the first set into the second set.
> I wonder what other effects this change has, e.g. on the expected number of file descriptors that relays of each category will use.
> It would be interesting to learn, from your 6000MiB/s and 3500MiB/s numbers above, how much of that bandwidth was from what position in the circuit. For example, a pretty big fraction (by bandwidth) of the fast guards are also fast exits, so by making guard choice more selective, we're moving those relays *out* of other positions in the circuit, with implications that I don't fully understand. I don't think there's an easy way to learn this breakdown though.
Hm, that would be helpful to have, yes.
Maybe we need to add 'guard-write-history', 'middle-write-history', 'exit-write-history' fields in extra-info descriptors, so that we can analyze the different types of traffic that each relay pushes.
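Such fields could plausibly mirror the existing write-history lines from dir-spec. A purely hypothetical example of what they might look like in an extra-info descriptor (field names and values made up):

    guard-write-history 2014-03-13 00:00:00 (86400 s) 3415740032,3153555456
    middle-write-history 2014-03-13 00:00:00 (86400 s) 2643615744,2973954048
    exit-write-history 2014-03-13 00:00:00 (86400 s) 1220616192,1423724544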
> Looking at it this way also makes me wonder about using Conflux to glue together two relays from the middle category, since the middle category is where the small relays go.
> Or looking at it from the other direction, if we raise the threshold for being a guard to 2MB/s, and we get a bunch of volunteer non-exit relays on fast cablemodems (1MB/s), the only position we can use those smaller non-exit relays is in the middle hop. So we could imagine a world where we have a glut of extra capacity in the middle hop, since you can't exit from it and it's not concentrated enough to use any of the relays as guards.
If this happens, maybe we could increase the weights of those underused middle nodes for other tasks, like being rendezvous points or introduction points (risky anonymity implications here).
Or maybe, instead of relying that much on absolute bandwidth thresholds, we should revise our relative bandwidth thresholds: so that, for example, guard nodes _need_ to be in the top 1/8th of the fastest relays.
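A minimal sketch of how such a relative cutoff might be computed, assuming `relays' is a list of (consensus_bw, descriptor_bw) pairs as before:

    def relative_guard_cutoff(relays, top_fraction=0.125):
        """Bandwidth needed to be among the fastest top_fraction of relays."""
        bws = sorted((r[1] for r in relays), reverse=True)
        k = max(1, int(len(bws) * top_fraction))
        return bws[k - 1]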