Well, friends, here's an attack on the whole anonymization approach to reporting HSDir (or other) statistics:

1. The adversary creates a large number of onion services that together almost always cover the entire set of HSDirs.
2. The adversary performs actions with his onion services that add to the statistics (e.g. by doing directory fetches for an onion service "covering" a certain HSDir) such that each HSDir has its statistics value "set" to roughly a certain magnitude.
3. The adversary looks at the anonymized statistics and deanonymizes the HSDirs by locating the statistic with the magnitude he set for that HSDir.
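To make the attack concrete, here's a toy sketch (Python, with made-up sizes and a hypothetical padding scheme) where the adversary pads each HSDir's count up to a distinctive magnitude and then re-links the shuffled, "anonymized" reports back to individual HSDirs:

```python
import random

# Toy model of the tagging attack. Sizes and the padding scheme are
# illustrative assumptions, not measurements of the real network.
NUM_HSDIRS = 50
BASE_NOISE = 100  # honest fetch traffic per HSDir (assumed small)

# Step 2: the adversary pads HSDir i's count to a distinctive magnitude.
targets = {i: 10_000 * (i + 1) for i in range(NUM_HSDIRS)}
reports = [targets[i] + random.randint(0, BASE_NOISE)
           for i in range(NUM_HSDIRS)]

random.shuffle(reports)  # anonymization hides only *who* reported what

# Step 3: deanonymize each report by its nearest planted magnitude.
recovered = {round(r / 10_000) - 1: r for r in reports}
assert set(recovered) == set(targets)
print("all", NUM_HSDIRS, "HSDirs re-identified from anonymized reports")
```

The point is that shuffling the reports buys nothing once the adversary controls the magnitudes, as long as the honest traffic (BASE_NOISE) is small relative to the spacing between planted values.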
Does this just kill this whole idea? Maybe true aggregation is the only thing that can work.
Sorry! Aaron
On Feb 18, 2015, at 10:42 AM, George Kadianakis <desnacked@riseup.net> wrote:
"A. Johnson" <aaron.m.johnson@nrl.navy.mil> writes:
Hello tor-dev,
<snip>
Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
- The number of descriptor fetches received by a hidden-service directory (HSDir)
- The number of client introduction requests received at an introduction point (IP)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a measurement period. Similarly, the privacy issue with #2 is that the set of IPs are (likely) unique to an HS, and so the number of client introductions at its IPs could reveal the number of client connections it received.
<snip>
The AnonStats1 protocol to privately publish both statistics if we trust relays not to pollute the statistics (i.e. #2 is not a problem) is as follows:
1. Each StatAuth provides 2k partially-blind signatures on authentication tokens to each relay in a consensus during the measurement period. The blind part of a signed token is simply a random number chosen by the relay. The non-blind part of a token consists of the start time of the current measurement period. The 2k tokens will allow the relay to submit k values to the StatAuths. Note that we could avoid using partially-blind signatures by changing keys at the StatAuths every measurement period and then simply providing blind signatures on random numbers.
2. At the end of the measurement period, for each statistic, each relay uploads the following, each on its own Tor circuit and accompanied by a unique token from each StatAuth:
   - The count
   - The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 and the probability of selection as an IP for statistic #2)
3. The StatAuths verify that each uploaded value is accompanied by a unique token from each StatAuth that is valid for the current measurement period. To infer the global statistic from the anonymous per-relay statistics, the StatAuths add the counts, add the weights, and divide the former by the latter.
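The inference in step 3 is simple arithmetic; here's a minimal sketch with made-up numbers (only three uploads shown, so the division extrapolates to the whole network):

```python
# Sketch of AnonStats1 step 3: the StatAuths divide the summed counts by
# the summed statistical weights. The uploads below are illustrative;
# in practice every HSDir would upload, making the weights sum to ~1.
NUM_HSDIRS = 3000

# Each tuple: (descriptor fetches seen, weight = 1 / #HSDirs for stat #1).
uploads = [(40, 1 / NUM_HSDIRS), (55, 1 / NUM_HSDIRS), (62, 1 / NUM_HSDIRS)]

total_count = sum(c for c, _ in uploads)
total_weight = sum(w for _, w in uploads)

# Estimated network-wide number of descriptor fetches.
estimate = total_count / total_weight
print(round(estimate))  # (40 + 55 + 62) / (3/3000) = 157000
```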
Some more thoughts on AnonStats1:
- The two statistics proposed here are not independent. I suspect that
the two numbers will actually be quite close to each other, since to do an intro request you need to first fetch a descriptor.
(In practice, the numbers *will be* different because a user might do multiple intro requests without fetching the descriptor multiple times. Or maybe a descriptor fetch failed so the client could not follow with an introduction request.)
My worry is that the numbers might be quite close most of the time. This means that about 9 relays (6 HSDirs + 3 IPs) will include that number -- the popularity of the HS -- in their result in the end. Of course, that number will get smudged along with all the other measurements that the reporting relay sees, but if the number is big enough then it will dominate the other measurements and the actual value might be visible in the results.
The above might sound stupid. Here are some brief calculations:
There are 30000 hidden services and 3000 HSDirs. The recent tech report shows that each HSDir is responsible for about 150 hidden services. This means that there are about 150 numbers that get smudged together every time. If *most* of those 30k hidden services are tiny non-popular ones, there is a non-negligible chance that most of those 150 numbers are also going to be tiny, which means that any moderately big number will stand out. And for every measurement period, there are 9 relays that have a chance of making this number stand out.
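This "standing out" effect is easy to demonstrate numerically. Here's a toy simulation (all sizes and counts are assumptions for illustration) where 149 tiny services and one popular one are smudged into a single HSDir total:

```python
import random

# Toy model of the smudging concern: ~150 per-HS fetch counts summed into
# one HSDir total. If most services are tiny, a single popular service
# dominates the sum and its popularity is nearly readable off the total.
random.seed(1)

tiny = [random.randint(0, 5) for _ in range(149)]  # unpopular services
popular = 50_000                                   # one popular service
reported_total = sum(tiny) + popular

noise_from_tiny = sum(tiny)
print("reported HSDir total:", reported_total)
print("error if read as the popular HS's count:", noise_from_tiny)
assert noise_from_tiny < 0.02 * popular  # noise under 2% of the signal
```

Under these assumptions the 149 tiny services contribute at most 745 fetches, so the reported total is within a couple of percent of the popular service's true count.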
Another issue here is that if you assume that the popularity of hidden services doesn't change drastically overnight, and you believe the above paragraph, it's even possible to track the popularity of hidden services even if you don't know their actual popularity values. To do that, every day you check the reported measurements for numbers close to yesterday's numbers. If this consistently happens over a few days, you can be pretty confident that you have found the popularity of a hidden service.
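The cross-day matching is just a tolerance comparison. Here's a sketch with invented totals and an assumed 5% overnight drift bound:

```python
# Sketch of day-to-day popularity tracking: match today's anonymized HSDir
# totals against yesterday's within a tolerance. All numbers are made up.
yesterday = [312, 48_712, 905, 1_533]  # anonymized totals, day 1
today     = [1_498, 287, 48_930, 881]  # same services, day 2, shuffled

TOLERANCE = 0.05  # assume popularity drifts less than 5% overnight

matches = [(y, t)
           for y in yesterday
           for t in today
           if abs(y - t) <= TOLERANCE * y]

# Three of the four totals re-link across days; only the smallest (312)
# drifted too much to match.
print(matches)
```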
To take this stupidity one step further, you can model this whole thing as a system of 3000 equations with 150 unknown variables each. Each day you get a new system of equations. It wouldn't surprise me if the values of most variables are negligible (tiny hidden services), in which case they can be ignored. Every time you find the popularity of a hidden service, you learn the value of another variable. If you assume that only 300 hidden services generate a substantial amount of HSDir requests, how many days do you need to find the values of those 300 variables?
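The equations view can be sketched at toy scale (nothing like the real 3000x150; sizes and assignments below are invented). Each HSDir total is the sum of the unknown popularities of the services it hosts, and stacking several days of equations can make the system solvable:

```python
import numpy as np

# Toy version of the linear-system attack: each day, each HSDir's total is
# the sum of the (unknown) popularities of the services assigned to it.
# Sizes are illustrative; real values would be ~3000 HSDirs, ~150 HSes each.
rng = np.random.default_rng(0)

n_services, n_hsdirs, n_days = 12, 6, 4
popularity = rng.integers(0, 200, size=n_services).astype(float)  # unknowns

rows, totals = [], []
for _ in range(n_days):
    # Each service lands on 2 HSDirs per day (stand-in for the real 6).
    A = np.zeros((n_hsdirs, n_services))
    for s in range(n_services):
        for d in rng.choice(n_hsdirs, size=2, replace=False):
            A[d, s] = 1.0
    rows.append(A)
    totals.append(A @ popularity)  # the reported HSDir totals

A_all = np.vstack(rows)
b_all = np.concatenate(totals)
solved, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)

# The stacked system is consistent by construction; whether the solution
# uniquely pins down every popularity depends on the rank of A_all, which
# grows as more days of assignments accumulate.
print("rank:", np.linalg.matrix_rank(A_all), "of", n_services, "unknowns")
print("equations satisfied:", np.allclose(A_all @ solved, b_all))
```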
Unfortunately, this is one of those worries that is hard to address without first building the whole thing and seeing what the actual numbers look like...
- And even if all the above is garbage, I'm still a bit concerned
about the fact that the popularity of the *most popular* hidden service will be trackable using the above scheme. That's because the most popular hidden service will almost always dominate the other measurements.
- Also, the measurement period will have to change. Currently, each
relay sends its extrainfo descriptor every 24 hours. For the AnonStats1 scheme to work, the measurement period needs to be non-deterministic, otherwise the StatAuths can link relay measurements across different days based on when they reported stats.