Thus spake Nick Mathewson (nickm@alum.mit.edu):
On Fri, Oct 12, 2012 at 3:17 PM, Mike Perry <mikeperry@torproject.org> wrote:
Thus spake Nick Mathewson (nickm@torproject.org):
Discussion:
The rule that the set of guards and the set of directory guards need to be disjoint, and the rule that multiple directory guards need to be providing descriptors, are both attempts to make it harder for a single node to capture a route.
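(For concreteness, here is a minimal Python sketch of what enforcing those two rules could look like; the function and the MIN_DIR_GUARDS threshold are illustrative only, not Tor's actual implementation.)

  import random

  MIN_DIR_GUARDS = 3  # illustrative threshold, not a real Tor constant

  def pick_directory_guards(entry_guards, dir_caches, num=MIN_DIR_GUARDS):
      """Pick directory guards for a client (illustrative sketch).

      entry_guards: set of relay fingerprints used as circuit guards
      dir_caches:   relay fingerprints that serve descriptors
      """
      # Rule 1: keep the sets disjoint, so no single relay is both your
      # first hop and your source of directory information.
      candidates = [r for r in dir_caches if r not in entry_guards]
      # Rule 2: require several descriptor sources, not just one.
      if len(candidates) < num:
          raise RuntimeError("not enough disjoint directory caches")
      return random.sample(candidates, num)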
Can you explain the route capture opportunities available to directory guards? Is it #5343/#5956?
Like that general class, yes. It worries me to have too few sources of directory info; with bridges we have no choice, but with directory guards, we can make sure that we have multiple sources.
In particular, it's a little obnoxious for the same party to be both the first hop of your circuit, *and* to know exactly what you know about possible candidates for hop 2 and hop 3.
Ok, so it sounds like this is more about the second rule than the first?
And how does the attack work? Can directory mirrors simply say "Sorry man, that descriptor doesn't exist", even though the client sees it listed in the consensus?
No, but they can say "Sorry, I don't have that descriptor." (Which amounts to the same thing, but isn't overtly suspicious. Maybe we should analyze it and figure out how often that really happens in practice for an honest guard.)
Shouldn't clients just try another directory source in this case?
Maaybe. If all their directory guards but *one* are down, my claim is that they should not rely on just that guard. There are alternative designs where you don't add directory guards unless all your guards are down, and I don't think those are right.
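Concretely, the policy I have in mind looks something like this sketch (Python, illustrative names only, not Tor's actual code): rather than falling back on a lone surviving directory guard, the client tops its set back up from other directory caches.

  import random

  def usable_dir_sources(dir_guards, entry_guards, dir_caches, is_up, minimum=2):
      """Return the directory sources a client should query (sketch)."""
      live = [g for g in dir_guards if is_up(g)]
      # Never rely on a single surviving directory guard: extend the set
      # with other reachable caches (still disjoint from the entry guards)
      # until we have at least `minimum` sources.
      while len(live) < minimum:
          extra = [r for r in dir_caches
                   if r not in live and r not in entry_guards and is_up(r)]
          if not extra:
              break  # nothing else reachable; use what we have
          live.append(random.choice(extra))
      return live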
Ok, this makes sense. Also second rule?
The reason I'm asking is because if we use the same Guard nodes for both directory and normal traffic, this adds additional traffic patterns to the set of things that Website Traffic Fingerprinting attacks must classify, which further reduces the accuracy of that attack.
Hm. An interesting thought.
My first inclination here is to ask, "Can we analyze this to figure out the benefit/risk of each approach and somehow make a mathy/quantitative argument about which is better?" I don't know that we'll come up with a final answer, but I think we could do well to try to figure out how large/small benefits are likely to be.
My favorite work in the Tor Website Traffic Fingerprinting space[1] actually measures this effect quite well. Have a look at Figure 4 in section 5.2.2 in the "Open World" dataset (page 8). As we add more background noise to the "Open World" of things that are fetched through Tor Guard nodes, the true positive accuracy of the attack drops off.
In general, with more objects to classify and only a few features to extract, either the true positive accuracy goes down or the false positive rate goes up, especially when the objects are relatively low-resolution in terms of additional reliable features to extract.
Further, because of the base rate fallacy[2], the adversary needs to make heavy, heavy tradeoffs to keep their false positive rate way, way down. This means any objects we add to the "world" of Tor Guard traffic are pretty much guaranteed to decrease the attack's true positive accuracy in terms of the webpages it can reliably recognize.
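To make that concrete, here is a back-of-the-envelope Bayes calculation (the numbers are made up purely for illustration): even with a 90% true positive rate and a 1% false positive rate, a target page that accounts for only 1 in 1000 of the traces a guard observes yields mostly false alarms.

  def precision(tpr, fpr, base_rate):
      """P(trace really is the target page | classifier says it is),
      by Bayes' theorem."""
      hits = tpr * base_rate
      false_alarms = fpr * (1.0 - base_rate)
      return hits / (hits + false_alarms)

  # Illustrative numbers only: 90% TPR, 1% FPR, target page is 0.1% of traces.
  print(precision(0.90, 0.01, 0.001))  # ~0.08: over 90% of alerts are false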
(Incidentally, I believe the authors of [1] understood the danger of false positives, and that's why their graphs look the way they do. It's not clear other traffic fingerprinting authors understand this concept. In fact, for many of them, it's quite clear they do not.)
So, any games we can play to make directory activity look like client web activity (especially different types and sizes of web activity) are a bonus win against the attack that costs us no traffic overhead.
[1]. http://lorre.uni.lu/~andriy/papers/acmccs-wpes11-fingerprinting.pdf
[2]. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.1.8982