On Tue, Jan 6, 2015 at 12:14 PM, A. Johnson <aaron.m.johnson@nrl.navy.mil> wrote:
Hello tor-dev,
While helping design ways to publish statistics about hidden services in a privacy-preserving manner, it has become clear to me that certain statistics cannot be safely reported using the current method of having each relay collect and report measurements. I am going to describe a couple of simple protocols to handle this problem that I think should be implementable without much effort. I'd be happy to get feedback in particular about the security or ease-of-implementation of these protocols.
Two HS statistics that we (i.e. people working on Sponsor R) are interested in collecting are:
1. The number of descriptor fetches received by a hidden-service directory (HSDir)
2. The number of client introduction requests at an introduction point (IP)
The privacy issue with #1 is that the set of HSDirs is (likely) unique to an HS, and so the number of descriptor fetches at its HSDirs could reveal the number of clients it had during a measurement period. Similarly, the privacy issue with #2 is that the set of IPs is (likely) unique to an HS, and so the number of client introductions at its IPs could reveal the number of client connections it received.
An approach to solve this problem would be to anonymize the reported statistics. Doing so raises a couple of challenges, however:
1. Anonymous statistics should be authenticated as coming from some relay. Otherwise, statistics could be polluted by any malicious actor.
2. Statistical inference should be made robust to outliers. Without the relay identities, it will be difficult to detect and remove values that are incorrect, whether due to faulty measurement or malicious action by a relay.
You know, when I got to the above paragraph, I asked myself, "Well clearly *I'd* use blind signatures, but I wonder what Aaron is going to suggest?"
And then I saw:
I propose some simple cryptographic techniques to privately collect the above statistics while handling the above challenges.
:)
I think that there are some details to work out, but the general approach you describe sounds reasonable. IMO it doesn't need to be directory authorities who are StatsAuths, and we could use a "blinded token once per relay per period" scheme for other stuff too down the line.
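To make the "blinded token once per relay per period" idea concrete, here is a rough sketch using textbook RSA blinding. It is not necessarily the construction Aaron has in mind: the toy key, the `sign_once` bookkeeping, the relay and period labels, and all function names are assumptions made up for illustration, and a real deployment would need a proper key size, padding, and a token derived from a hash of a random nonce.

```python
import math
import secrets

# Toy RSA key for a hypothetical StatsAuth (illustration only, insecure size).
N = 61 * 53      # modulus 3233
E = 17           # public exponent
D = 2753         # private exponent: E * D == 1 (mod 60 * 52)

issued = set()   # (relay_id, period) pairs that already received a signature


def blind(token, n=N, e=E):
    """Relay side: hide the token behind a random factor r before sending it."""
    while True:
        r = secrets.randbelow(n - 2) + 2
        if math.gcd(r, n) == 1:
            return (token * pow(r, e, n)) % n, r


def sign_once(relay_id, period, blinded, d=D, n=N):
    """StatsAuth side: sign one blinded token per authenticated relay per period."""
    if (relay_id, period) in issued:
        raise ValueError("relay already has a token for this period")
    issued.add((relay_id, period))
    return pow(blinded, d, n)


def unblind(blind_sig, r, n=N):
    """Relay side: strip r to get a signature on the original token."""
    return (blind_sig * pow(r, -1, n)) % n   # modular inverse needs Python 3.8+


def verify(token, sig, e=E, n=N):
    """Anyone: check the signature without learning which relay requested it."""
    return pow(sig, e, n) == token % n


token = 1234                                  # stand-in for a hashed nonce
blinded, r = blind(token)
sig = unblind(sign_once("relayA", "2015-01-06", blinded), r)
print(verify(token, sig))
```

The StatsAuth only ever sees `blinded`, so it cannot link the final `(token, sig)` pair attached to an anonymous report back to the relay that requested it, while the `issued` set caps each relay at one token per period.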
Median-and-quartiles or median-and-deciles sounds smarter than just-average to me.
cheers,