"A. Johnson" aaron.m.johnson@nrl.navy.mil writes:
Hi George,
Thanks for the really thoughtful comments.
Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
- The number of descriptor fetches received by a hidden-service directory (HSDir)
- The number of client introduction requests at an introduction point (IP)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a measurement period. Similarly, the privacy issue with #2 is that the set of IPs is (likely) unique to an HS, and so the number of client introductions at its IPs could reveal the number of client connections it received.
I was wondering, why do we care so much about these two statistics? From what I see in this post, you also just care about their total numbers (without connecting them to specific HSDirs or IPs). Is it because you are curious about the total number of HS users?
If that's the case, why not focus on "Number of rendezvous requests" which is not tied to specific hidden services or clients. It seems to me like an easier target (that would still need to be *thoroughly analyzed* of course).
Yes, there are other ways to estimate the number of clients and/or connections than descriptor fetches or INTRODUCE1 cells. However, I don’t want to give up on these statistics for a couple of reasons. First, none of these alternatives is exactly the same, and each might miss certain events that we want to know about. For example, I believe that there are a ton of “zombie” fetches from botnet clients whose HS has died. We would never see this any other way. Or we might miss DoS attacks that work by flooding IPs with INTRODUCE1 cells. Second, I see this as a first step to improving the privacy/accuracy tradeoff of statistics collection in general. For example, it would be great to be able to add noise just once, to a final network-wide output, rather than once for each relay.
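To illustrate the last point, here is a minimal sketch comparing the two ways of adding noise. It assumes Laplace noise with a fixed scale, which the thread does not specify; the function names and the scale value are hypothetical and chosen only for illustration.

```python
import random
import statistics

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def total_with_per_relay_noise(counts, scale, rng):
    # Assumption: every relay perturbs its own count before reporting.
    return sum(c + laplace_noise(scale, rng) for c in counts)

def total_with_single_noise(counts, scale, rng):
    # Assumption: noise is added once, to the network-wide sum.
    return sum(counts) + laplace_noise(scale, rng)

def noise_variance(num_noise_terms, scale):
    # Var(Laplace(0, b)) = 2 * b^2, and independent noise terms add up.
    return num_noise_terms * 2.0 * scale ** 2

if __name__ == "__main__":
    rng = random.Random(1)
    counts = [rng.randint(0, 500) for _ in range(1000)]
    print("true total:     ", sum(counts))
    print("per-relay noise:", total_with_per_relay_noise(counts, 10.0, rng))
    print("single noise:   ", total_with_single_noise(counts, 10.0, rng))
```

Under these assumptions the variance of the error in the total grows linearly with the number of reporting relays when each relay adds its own noise, while it stays constant when noise is added once to the final output.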
I'm going to mainly focus on AnonStats2, since IIUC AnonStats1 leaks the relay's probability of being selected as an IP, which in turn leaks the identity of the relay:
AnonStats1 doesn’t leak the relay identity. The relay probability is sent over a separate circuit (at a random time). I intentionally did that just to avoid the problem you describe.
Ah, I see, that makes sense.
Some more notes from reading AnonStats1 then:
a) How do relays get more tokens when they deplete the initial 2k tokens? Is it easy for the StatAuth to generate 2k such tokens, or can relays DoS them by asking for tokens repeatedly?
b) It seems a bit weird to assume that all relay operators are good citizens, but still not trust the rest of the Internet at all (that's why we are doing the blind signature scheme, right?).
If an outside attacker wanted to influence the results, he could still sign up 10 relays on the network, get the blind signature tokens, and have them publish anonymized bad statistics, right?
In the median-based approach of AnonStats2, blind signatures make more sense, since they ensure that the adversary can't insert 1000 fake statistics into the network to skew the median.
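A small sketch of why bounding the number of reports matters for a median. The honest values and the fake value are made up, and the idea that tokens cap an adversary at a handful of reports is an assumption about how the scheme would be used, not a description of AnonStats2 itself.

```python
import statistics

def median_with_fakes(honest_reports, fake_value, num_fakes):
    # Median of all reports when an adversary submits `num_fakes`
    # copies of `fake_value` alongside the honest reports.
    return statistics.median(list(honest_reports) + [fake_value] * num_fakes)

# Hypothetical honest reports: 21 values from 90 to 110, median 100.
honest = list(range(90, 111))

if __name__ == "__main__":
    # A few fakes (fewer than the honest reports) move the median
    # only within the range of honest values.
    print(median_with_fakes(honest, 10**9, 5))
    # Unbounded fakes let the adversary pick the median outright.
    print(median_with_fakes(honest, 10**9, 1000))
```

As long as the fake reports are a minority, the median stays inside the range of honest values; once they are a majority, the adversary controls it.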
Let's consider the descriptor fetches statistic (#1) and assume that you are a StatAuth. Also let's say that there are 5 hidden services that generate most of the descriptor fetches of the network (e.g. botnets). It should be easy for you to find the responsible HSDirs of those hidden services, and then it might be possible to match them up with the blinded relays that report the highest counts of descriptor fetches. Is this wanted?
You’re right about this issue. If you know in advance which HS is likely to have statistics in a certain position among all HSes (e.g. highest, lowest, middle, etc.), then you may be able to pick out those anonymized estimates that belong to that HS. Then to get the count you would have to guess the relay identity and divide by its statistical weight. A possible solution to this would be to add a bunch of dummy statistics that are distributed randomly but such that they don’t alter the median much. This would be done by the relays themselves.
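One way the dummy-statistics idea could look, as a rough sketch only. It assumes relays know some public reference value close to the expected median (for example a previous period's result); that assumption, the pairing strategy, and all names below are hypothetical and not part of the proposal.

```python
import random
import statistics

def balanced_dummies(reference, num_pairs, max_rel_offset, rng):
    # Generate dummy statistics in pairs placed symmetrically around
    # `reference`: one strictly below it and one strictly above it.
    # Assumes reference > 0 and max_rel_offset >= 0.01.
    dummies = []
    for _ in range(num_pairs):
        offset = rng.uniform(0.01, max_rel_offset) * reference
        dummies.append(reference - offset)
        dummies.append(reference + offset)
    rng.shuffle(dummies)  # do not reveal the pairing through ordering
    return dummies

if __name__ == "__main__":
    rng = random.Random(3)
    real = [80.0, 120.0, 100.0, 150.0, 60.0]  # made-up real statistics
    dummies = balanced_dummies(statistics.median(real), 10, 0.5, rng)
    print(statistics.median(real), statistics.median(real + dummies))
```

When the reference equals the true median, adding equal numbers of values below and above it leaves the median unchanged. When the reference is only approximately right, the median is pulled toward the reference, which is the "don't alter the median much" caveat.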
Also, since the 'number of descriptor fetches' and 'number of IP requests' are related to each other, it might be possible to correlate these two statistics over different reporting relays. So for example, if you know that the network has a *very* popular HS, you can match its HSDir measurements with its IP measurements.
That's because the highest counts of both statistics will likely correspond to the HSDirs and IPs of the most popular hidden service of the network, if the most popular HS has a large user count difference from the least popular ones.
I can't think of a concrete attack behind this behavior, but it's something that blind signatures can't really protect us against, and we should think more about it and its consequences.