Hi George,
Thanks for the really thoughtful comments.
Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
- The number of descriptor fetches received by a hidden-service directory (HSDir)
- The number of client introduction requests received at introduction points (IPs)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a measurement period. Similarly, the privacy issue with #2 is that the set of IPs is (likely) unique to an HS, and so the number of client introductions at its IPs could reveal the number of client connections it received.
I was wondering, why do we care so much about these two statistics? From what I see in this post, you also just care about their total numbers (without connecting them to specific HSDirs or IPs). Is it because you are curious about the total number of HS users?
If that's the case, why not focus on "Number of rendezvous requests" which is not tied to specific hidden services or clients. It seems to me like an easier target (that would still need to be *thoroughly analyzed* of course).
Yes, there are other ways to estimate the number of clients and/or connections than descriptor fetches or INTRODUCE1 cells. However, I don’t want to give up on these statistics for a couple of reasons. First, none of these alternatives is exactly the same, and they might miss certain events that we want to know about. For example, I believe that there are a ton of “zombie” fetches from botnet clients whose HS has died. We would never see this any other way. Or we might miss DoS attacks that work by flooding IPs with INTRODUCE1. Second, I see this as a first step to improving the privacy/accuracy tradeoff of statistics collection in general. For example, it would be great to be able to add noise just once, to a final network-wide output, rather than once for each relay.
I'm going to mainly focus on AnonStats2, since IIUC AnonStats1 leaks the probability of selection as an IP of that relay, which leaks the identity of the relay:
AnonStats1 doesn’t leak the relay identity. The relay probability is sent over a separate circuit (at a random time). I intentionally did that just to avoid the problem you describe.
The AnonStats1 protocol is vulnerable to a relay that publishes manipulated statistics (i.e. challenge #2). If this is a concern, the AnonStats2 protocol mitigates it by using a median statistic instead of a sum:
- Relays are divided into b bins by consensus weight. The bins have exponentially-increasing length, and the rate of increase c is set such that the ith bin by increasing weights has at least r relays, each of weight at least some minimum min_weight (say, r=500, min_weight=100). The first bin has weights in [0, min_weight), and the ith bin has weights in [min_weight*c^(i-2), min_weight*c^(i-1)).
I'm curious to see how this binning strategy would work with the current Tor network. The network and hence the anonymity set of its relays is not that big. Also, the *big* hidden services are not that secret lately.
Here’s a rough estimate. There were 5829 fast relays in the 11/30/2014 00:00 consensus. That consensus has a max consensus weight of 190000. If we set min_weight=500 and c=2, then we get 9 bins with the following number of relays in each bin: [2927, 545, 660, 416, 463, 439, 199, 142, 37]. There were 3290 fast HSDirs in that consensus, with a max weight also of 190000. With the same bin parameters, we get 9 bins with the following numbers of HSDirs in each bin: [1375, 323, 396, 231, 315, 342, 154, 120, 34].
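For what it's worth, here is a minimal Python sketch of the bin assignment (parameter names follow the description above; the example weights at the bottom are made up, not taken from that consensus):

    import math
    from collections import Counter

    def bin_index(weight, min_weight=500, c=2):
        """0-indexed bin for a consensus weight: bin 0 covers [0, min_weight),
        and bin i (i >= 1) covers [min_weight * c**(i-1), min_weight * c**i)."""
        if weight < min_weight:
            return 0
        return int(math.floor(math.log(weight / min_weight, c))) + 1

    def bin_left_edge(i, min_weight=500, c=2):
        """Smallest weight in bin i; this is the value used later as the
        bin's weight in the weighted median."""
        return 0 if i == 0 else min_weight * c**(i - 1)

    # Hypothetical consensus weights; Counter gives the per-bin relay counts,
    # analogous to the lists above (which came from the real consensus).
    weights = [80, 450, 600, 1300, 5000, 20000, 190000]
    print(Counter(bin_index(w) for w in weights))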
Let's consider the descriptor fetches statistic (#1) and assume that you are a StatAuth. Also let's say that there are 5 hidden services that generate most of the descriptor fetches of the network (e.g. botnets). It should be easy for you to find the responsible HSDirs of those hidden services, and then it might be possible to match them up with the blinded relays that report the highest counts of descriptor fetches. Is this wanted?
You’re right about this issue. If you know in advance which HS is likely to have statistics in a certain position among all HSes (e.g. highest, lowest, middle, etc.), then you may be able to pick out those anonymized estimates that belong to that HS. Then to get the count you would have to guess the relay identity and divide by its statistical weight. A possible solution to this would be to add a bunch of dummy statistics that are distributed randomly but such that they don’t alter the median much. This would be done by the relays themselves.
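To make that a bit more concrete, here is one way the padding could work (purely a sketch of my own; the exact dummy scheme isn't specified here, and the ranges and names below are hypothetical). Dummies get added in low/high pairs, so each pair contributes one value below and one value above the real reports and leaves their median untouched:

    import random
    import statistics

    def pad_with_dummies(reports, num_pairs, low_range, high_range, rng=random):
        """Illustrative only: add dummy reports in low/high pairs. As long as
        the low dummies fall below every real report and the high dummies fall
        above every real report, the median of the padded list is exactly the
        median of the real reports (each pair adds one value on each side)."""
        dummies = []
        for _ in range(num_pairs):
            dummies.append(rng.uniform(*low_range))
            dummies.append(rng.uniform(*high_range))
        return reports + dummies

    real = [120.0, 90.0, 300.0, 150.0, 80.0]                     # hypothetical estimates
    padded = pad_with_dummies(real, 3, low_range=(0, 50), high_range=(1000, 2000))
    assert statistics.median(padded) == statistics.median(real)  # median unchanged

Of course, dummies drawn from a more plausible range (so that they actually blend in with real reports) would move the median a little, which is presumably why the goal is only that they don't alter it much.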
- Each StatAuth provides k partially-blind signatures on authentication tokens to each relay in a consensus during the measurement period. The blind part of a signed token is simply a random number chosen by the relay. The non-blind part of a token consists of the start time of the current measurement period and the bin containing the relay's consensus weight.
- At the end of the measurement period, each relay divides each statistic by the relay's ``statistical weight'' (1/(# HSDirs) for statistic #1 and the probability of selection as an IP for statistic #2). The result is the relay's estimate of the global value of the statistic. The relay then uploads this value via a new Tor circuit, accompanied by a unique token from each StatAuth.
It's worth mentioning that different relays use different consensuses and hence have different views of the network. So, relays will calculate their <probability of selection as an IP> using different/older consensuses than others.
I think that’s fine. The purpose of using the weighted median is to give those relays that see more traffic more of a say in the outcome. It doesn’t need to be exact.
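For concreteness, the extrapolation step from the quoted text amounts to something like the following (the variable names and numbers are made up for illustration):

    def extrapolate(local_count, statistical_weight):
        """Scale a relay's local count up to a network-wide estimate, where
        statistical_weight is the fraction of all such events the relay
        expects to observe."""
        return local_count / statistical_weight

    # Statistic #1: the relay's statistical weight is 1/(# HSDirs).
    num_hsdirs = 3290                 # e.g. the fast HSDirs counted above
    local_desc_fetches = 4200         # hypothetical local count
    est_fetches = extrapolate(local_desc_fetches, 1.0 / num_hsdirs)

    # Statistic #2: the statistical weight is the relay's probability of
    # being selected as an IP, taken from whichever consensus the relay
    # happens to have (which, as you note, may be older than others').
    ip_selection_prob = 0.0004        # hypothetical
    local_intro_requests = 1700       # hypothetical local count
    est_intros = extrapolate(local_intro_requests, ip_selection_prob)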
- The StatAuths verify that each uploaded value is accompanied by a unique token from each StatAuth that is valid for the current measurement period and that contains the same relay-weight bin. To infer a final global statistic from the anonymous per-relay estimates, the StatAuths use the weighted median of the received estimates, where the weight of an estimate is taken to be the smallest value of its accompanying bin (i.e. the bin's left edge).
I'm now wondering how useful this final statistic will be for us.
How useful is it to learn the median value of "client introduction requests" of all the relays of the network? And in our case we get an extrapolated weighted version of the median. How can such a number be used by us?
This median should be a good estimate of the network-wide total. An example of this is the blue line in the “single” plot of Figure 10 in the stats report that Karsten sent (that blue line was technically a mean rather than a median, but same idea). Each individual relay is actually reporting its estimate of the network-wide total, given its local observations, and the median among these should be pretty accurate.
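Here is a small self-contained sketch (toy numbers, numpy, my own function names; it ignores the bin granularity and any added noise) of why that works: each relay observes a share of a known true total in proportion to its weight, extrapolates, and the weighted median of the per-relay estimates lands close to the true total.

    import numpy as np

    def weighted_median(values, weights):
        """Smallest value v such that the total weight of values <= v is at
        least half of the overall weight (the StatAuths would use each bin's
        left edge as the estimate's weight)."""
        order = np.argsort(values)
        values, weights = np.asarray(values)[order], np.asarray(weights)[order]
        return values[np.searchsorted(np.cumsum(weights), weights.sum() / 2.0)]

    rng = np.random.default_rng(0)
    true_total = 1_000_000                        # hypothetical network-wide count
    weights = rng.exponential(2000, size=3000)    # hypothetical relay weights
    probs = weights / weights.sum()               # each relay's statistical weight

    observed = rng.multinomial(true_total, probs) # events land on relays at random
    estimates = observed / probs                  # per-relay network-wide estimates

    print(weighted_median(estimates, weights))    # comes out close to true_total

With the binning, each estimate would be weighted by its bin's left edge rather than the exact relay weight, which coarsens things a bit but shouldn't change the overall picture.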
Best, Aaron