Well, friends, here's an attack on the whole anonymization approach to reporting HSDir (or other) statistics:

1. The adversary creates a large number of onion services that together almost always cover the entire set of HSDirs.
2. The adversary performs actions with his onion services that add to the statistics (e.g. by doing directory fetches for an onion service "covering" a certain HSDir) such that each HSDir has its statistics value "set" to roughly a certain magnitude.
3. The adversary looks at the anonymized statistics and deanonymizes the HSDirs by locating the statistic with the magnitude he set for that HSDir.
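To make the attack concrete, here's a toy sketch (Python, with made-up sizes and a hypothetical padding scheme) where the adversary pads each HSDir's count up to a distinctive magnitude and then re-links the shuffled, "anonymized" reports back to individual HSDirs:

```python
import random

# Toy model of the tagging attack. Sizes and the padding scheme are
# illustrative assumptions, not measurements of the real network.
NUM_HSDIRS = 50
BASE_NOISE = 100  # honest fetch traffic per HSDir (assumed small)

# Step 2: the adversary pads HSDir i's count to a distinctive magnitude.
targets = {i: 10_000 * (i + 1) for i in range(NUM_HSDIRS)}
reports = [targets[i] + random.randint(0, BASE_NOISE)
           for i in range(NUM_HSDIRS)]

random.shuffle(reports)  # anonymization hides only *who* reported what

# Step 3: deanonymize each report by its nearest planted magnitude.
recovered = {round(r / 10_000) - 1: r for r in reports}
assert set(recovered) == set(targets)
print("all", NUM_HSDIRS, "HSDirs re-identified from anonymized reports")
```

The point is that shuffling the reports buys nothing once the adversary controls the magnitudes, as long as the honest traffic (BASE_NOISE) is small relative to the spacing between planted values.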
Does this just kill this whole idea? Maybe true aggregation is the only thing that can work.
Sorry! Aaron
On Feb 18, 2015, at 10:42 AM, George Kadianakis <desnacked@riseup.net> wrote:
"A. Johnson" <aaron.m.johnson@nrl.navy.mil> writes:
Hello tor-dev,
<snip>
Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
- The number of descriptor fetches received by a hidden-service directory (HSDir)
- The number of client introduction requests received at an introduction point (IP)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a measurement period. Similarly, the privacy issue with #2 is that the set of IPs are (likely) unique to an HS, and so the number of client introductions at its IPs could reveal the number of client connections it received.
<snip>
The AnonStats1 protocol to privately publish both statistics if we trust relays not to pollute the statistics (i.e. #2 is not a problem) is as follows:
1. Each StatAuth provides 2k partially-blind signatures on authentication tokens to each relay in a consensus during the measurement period. The blind part of a signed token is simply a random number chosen by the relay. The non-blind part of a token consists of the start time of the current measurement period. The 2k tokens will allow the relay to submit k values to the StatAuths. Note that we could avoid using partially-blind signatures by changing keys at the StatAuths every measurement period and then simply providing blind signatures on random numbers.
2. At the end of the measurement period, for each statistic, each relay uploads the following, each on its own Tor circuit and accompanied by a unique token from each StatAuth:
   - The count
   - The ``statistical weight'' of the relay (1/(# HSDirs) for statistic #1 and the probability of selection as an IP for statistic #2)
3. The StatAuths verify that each uploaded value is accompanied by a unique token from each StatAuth that is valid for the current measurement period. To infer the global statistic from the anonymous per-relay statistics, the StatAuths add the counts, add the weights, and divide the former by the latter.
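The inference in step 3 is simple arithmetic; here's a minimal sketch with made-up numbers (only three uploads shown, so the division extrapolates to the whole network):

```python
# Sketch of AnonStats1 step 3: the StatAuths divide the summed counts by
# the summed statistical weights. The uploads below are illustrative;
# in practice every HSDir would upload, making the weights sum to ~1.
NUM_HSDIRS = 3000

# Each tuple: (descriptor fetches seen, weight = 1 / #HSDirs for stat #1).
uploads = [(40, 1 / NUM_HSDIRS), (55, 1 / NUM_HSDIRS), (62, 1 / NUM_HSDIRS)]

total_count = sum(c for c, _ in uploads)
total_weight = sum(w for _, w in uploads)

# Estimated network-wide number of descriptor fetches.
estimate = total_count / total_weight
print(round(estimate))  # (40 + 55 + 62) / (3/3000) = 157000
```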
Some more thoughts on AnonStats1:
- The two statistics proposed here are not independent. I suspect that
the two numbers will actually be quite close to each other, since to do an intro request you need to first fetch a descriptor.
(In practice, the numbers *will be* different because a user might do multiple intro requests without fetching the descriptor multiple times. Or maybe a descriptor fetch failed so the client could not follow with an introduction request.)
My worry is that the numbers might be quite close most of the time. This means that about 9 relays (6 HSDirs + 3 IPs) will include that number -- the popularity of the HS -- in their result in the end. Of course, that number will get smudged along with all the other measurements that the reporting relay sees, but if the number is big enough then it will dominate the other measurements and the actual value might be visible in the results.
The above might sound stupid. Here are some brief calculations:
There are 30000 hidden services and 3000 HSDirs. The recent tech report shows that each HSDir is responsible for about 150 hidden services. This means that there are about 150 numbers that get smudged together every time. If *most* of those 30k hidden services are tiny non-popular ones, there is a non-negligible chance that most of those 150 numbers are also going to be tiny, which means that any moderately big number will stand out. And for every measurement period, there are 9 relays that have a chance of making this number stand out.
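This "standing out" effect is easy to demonstrate numerically. Here's a toy simulation (all sizes and counts are assumptions for illustration) where 149 tiny services and one popular one are smudged into a single HSDir total:

```python
import random

# Toy model of the smudging concern: ~150 per-HS fetch counts summed into
# one HSDir total. If most services are tiny, a single popular service
# dominates the sum and its popularity is nearly readable off the total.
random.seed(1)

tiny = [random.randint(0, 5) for _ in range(149)]  # unpopular services
popular = 50_000                                   # one popular service
reported_total = sum(tiny) + popular

noise_from_tiny = sum(tiny)
print("reported HSDir total:", reported_total)
print("error if read as the popular HS's count:", noise_from_tiny)
assert noise_from_tiny < 0.02 * popular  # noise under 2% of the signal
```

Under these assumptions the 149 tiny services contribute at most 745 fetches, so the reported total is within a couple of percent of the popular service's true count.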
Another issue here is that if you assume that the popularity of hidden services doesn't change drastically overnight, and you believe the above paragraph, it's even possible to track the popularity of hidden services even if you don't know their actual popularity values. To do that, every day you check the reported measurements for numbers close to yesterday's numbers. If this consistently happens over a few days, you can be pretty confident that you have found the popularity of a hidden service.
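The cross-day matching is just a tolerance comparison. Here's a sketch with invented totals and an assumed 5% overnight drift bound:

```python
# Sketch of day-to-day popularity tracking: match today's anonymized HSDir
# totals against yesterday's within a tolerance. All numbers are made up.
yesterday = [312, 48_712, 905, 1_533]  # anonymized totals, day 1
today     = [1_498, 287, 48_930, 881]  # same services, day 2, shuffled

TOLERANCE = 0.05  # assume popularity drifts less than 5% overnight

matches = [(y, t)
           for y in yesterday
           for t in today
           if abs(y - t) <= TOLERANCE * y]

# Three of the four totals re-link across days; only the smallest (312)
# drifted too much to match.
print(matches)
```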
To take this stupidity one step further, you can model this whole thing as a system of 3000 equations with 150 unknown variables each. Each day you get a new system of equations. It wouldn't surprise me if the values of most variables are negligible (tiny hidden services), in which case they can be ignored. Every time you find the popularity of a hidden service, you learn the value of another variable. If you assume that only 300 hidden services generate a substantial amount of HSDir requests, how many days do you need to find the values of those 300 variables?
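The equations view can be sketched at toy scale (nothing like the real 3000x150; sizes and assignments below are invented). Each HSDir total is the sum of the unknown popularities of the services it hosts, and stacking several days of equations can make the system solvable:

```python
import numpy as np

# Toy version of the linear-system attack: each day, each HSDir's total is
# the sum of the (unknown) popularities of the services assigned to it.
# Sizes are illustrative; real values would be ~3000 HSDirs, ~150 HSes each.
rng = np.random.default_rng(0)

n_services, n_hsdirs, n_days = 12, 6, 4
popularity = rng.integers(0, 200, size=n_services).astype(float)  # unknowns

rows, totals = [], []
for _ in range(n_days):
    # Each service lands on 2 HSDirs per day (stand-in for the real 6).
    A = np.zeros((n_hsdirs, n_services))
    for s in range(n_services):
        for d in rng.choice(n_hsdirs, size=2, replace=False):
            A[d, s] = 1.0
    rows.append(A)
    totals.append(A @ popularity)  # the reported HSDir totals

A_all = np.vstack(rows)
b_all = np.concatenate(totals)
solved, *_ = np.linalg.lstsq(A_all, b_all, rcond=None)

# The stacked system is consistent by construction; whether the solution
# uniquely pins down every popularity depends on the rank of A_all, which
# grows as more days of assignments accumulate.
print("rank:", np.linalg.matrix_rank(A_all), "of", n_services, "unknowns")
print("equations satisfied:", np.allclose(A_all @ solved, b_all))
```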
Unfortunately, this is one of those worries that is hard to address without first building the whole thing and seeing what the actual numbers look like...
- And even if all the above is garbage, I'm still a bit concerned
about the fact that the popularity of the *most popular* hidden service will be trackable using the above scheme. That's because the most popular hidden service will almost always dominate the other measurements.
- Also, the measurement period will have to change. Currently, each
relay sends its extrainfo descriptor every 24 hours. For the AnonStats1 scheme to work, the measurement period needs to be non-deterministic, otherwise the StatAuths can link relay measurements across different days based on when they reported stats.