Safely collecting data to estimate the number of Tor users
rransom.8774 at gmail.com
Sun Aug 29 15:37:07 UTC 2010
On Sun, 29 Aug 2010 13:17:35 +0200
Björn Scheuermann <scheuermann at cs.uni-duesseldorf.de> wrote:
> Steven J. Murdoch wrote:
> > Another thing to do is to shrink the hash size. If we assume that
> > there are going to be, say, no more than 1 million distinct IP
> > addresses, we could use a 40 bit hash with only a small number of
> > collisions (due to the Birthday Paradox). However, someone who tries
> > to reverse the full 32 bit IPv4 space will get many collisions.
> with a 40 bit hash, the vast majority of 2^32 IP addresses will not
> collide with any other IP address, despite the birthday paradox. So this
> would not really be a significant additional level of protection. It
> would make a difference only for the small fraction of the IP addresses
> that actually do collide.
> However, you don't need that many bits anyway, as collisions are not a
> big issue when your goal is to only count how many different addresses
> you have seen. If you proceed as you suggested, you can, as detailed
> below, do well with a 21-bit hash. With additional tweaks, you can
> further reduce that to 12-bit hash values.
And then it's more efficient to use FM sketches rather than a list of
hash values. The *only* problem I had with your first message in this
thread was your suggestion that the FM sketches could be exchanged
between the operators; the FM sketches may still be quite sensitive
(only $AGENCY knows exactly how sensitive they are; we really can't
Other than the fact that the FM sketches must remain secret, your
approach is also safer than Dr. Murdoch's idea.
There is a slight improvement, though:
* Each instrumented Tor instance should generate an FM sketch in memory
as it is contacted by directory clients. This loses most of the
sensitive information instantly.
* Every N minutes, the Tor instance should encrypt its current FM
sketch with Dr. Loesing's public key for this purpose, base64-encode
it, and write it to the log on a single line. As long as the
encryption algorithm is randomized, this packages up the rest of the
sensitive information so only one trusted person can recover it. The
Tor instance then erases its FM sketch and starts accumulating a new
* The Tor operators use one or more invocations of grep to extract
*only* the encrypted FM sketches from the log. Sort the resulting
list, and delete the original log files. This loses any potentially
sensitive timing information, so that Dr. Loesing can't recover it.
Each operator should then merge the encrypted-sketch lists produced
by all of his/her/its Tor relays into one sorted list before sending
it in for analysis.
* Dr. Loesing decrypts all of the FM sketches, ORs them together, and
publishes only the resulting count. Then he erases the decrypted FM
sketches and sets his private key on fire.
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 198 bytes
Desc: not available
More information about the tor-dev