Safely collecting data to estimate the number of Tor users
Steven J. Murdoch
tor+Steven.Murdoch at cl.cam.ac.uk
Sat Aug 28 09:21:00 UTC 2010
[Following up from our IRC conversation]
On Thu, Aug 26, 2010 at 01:31:14PM +0200, Karsten Loesing wrote:
> So, here's my plan for researching this more: I'd like to run an
> experiment with multiple fast directory mirrors run by the same operator
> on the same host (like Jake's trusted and Pandora*, Olaf's blutmagie*,
> Moritz's torserversNet*, etc.). I'm going to write a patch for Tor to
> accept some key string in its torrc and extend SafeLogging to accept the
> value 'encrypt'. Tor will then pass all client IP addresses through a
> keyed hash function using the provided key string and write the result
> to its logs. I'm also going to implement #1668 to make log granularity
> configurable. The operators configure the same key string for all their
> relays and run them with the new SafeLogging option and logging
> granularity of 15 minutes for, say, a week. Operators then delete the
> key string and only keep the logs. The operators do not give out these
> logs to me or anyone else. I'm going to write Python scripts to analyze
> the logs and publish them for the operators and others to review. The
> operators will run these scripts and publish the results.
That sounds fine, but as I am sure you are aware, there are only
2^32 IPv4 addresses, so anyone who obtains the key string can reverse
the keyed hash by brute force. The key string will be stored on hard
disks, so I'd suggest adding an additional level of security.
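
As a rough illustration of the brute-force concern, here is a minimal
Python sketch. It assumes the keyed hash is HMAC-SHA256, which is only a
guess since the patch has not been written yet; the names keyed_hash and
reverse_hashes, the key string, and the example addresses are all made up
for this example. It searches a single /16 so that it runs quickly; covering
the whole IPv4 space is the same loop over 2^32 candidates.

```python
import hashlib
import hmac
import ipaddress


def keyed_hash(key: bytes, ip: str) -> str:
    """Keyed hash of an IPv4 address (HMAC-SHA256 is an assumption)."""
    packed = ipaddress.IPv4Address(ip).packed
    return hmac.new(key, packed, hashlib.sha256).hexdigest()


def reverse_hashes(key: bytes, logged: set, network: str) -> dict:
    """Recover addresses by hashing every candidate in `network` with the key."""
    recovered = {}
    for addr in ipaddress.ip_network(network):
        digest = hmac.new(key, addr.packed, hashlib.sha256).hexdigest()
        if digest in logged:
            recovered[digest] = str(addr)
    return recovered


key = b"operator key string"  # hypothetical key string from torrc
logged = {keyed_hash(key, "192.0.2.17"), keyed_hash(key, "192.0.77.200")}

# Anyone holding the key can enumerate candidates; a /16 is 65,536 tries.
found = reverse_hashes(key, logged, "192.0.0.0/16")
print(sorted(found.values()))
```

Without the key the same search recovers nothing, which is why the key
string sitting on disk is the weak point.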
On an offline analysis machine you could generate a public/private
keypair. Give the public half to the server operators.
On the directory mirrors, Tor then generates a symmetric key which
it keeps only in RAM. Tor logs the symmetric key encrypted under the
public key, and encrypts each hash under this symmetric key before
logging it. Each time Tor restarts it generates a new symmetric key.
On the offline analysis machine, decrypt the records from all the
directory mirrors, and encrypt them under a single new key. Discard
all the keys and only export the encrypted logs.
Another thing to do is to shrink the hash size. If we assume that
there are going to be, say, no more than 1 million distinct IP
addresses, we could use a 40-bit hash with only a small number of
collisions (by the Birthday Paradox). However, someone who tries
to reverse the full 32-bit IPv4 space will get many collisions.
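
To put rough numbers on this, here is a small Python sketch. It assumes the
truncated hash behaves like a uniformly random 40-bit function and, again,
that the underlying keyed hash is HMAC-SHA256; truncated_hash and
expected_colliding_pairs are names invented for this example. The estimate
counts colliding pairs using the usual birthday approximation.

```python
import hashlib
import hmac
import ipaddress

HASH_BITS = 40  # truncated hash size suggested above


def truncated_hash(key: bytes, ip: str, bits: int = HASH_BITS) -> int:
    """Keep only the top `bits` bits of an HMAC-SHA256 (construction assumed)."""
    packed = ipaddress.IPv4Address(ip).packed
    digest = hmac.new(key, packed, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)


def expected_colliding_pairs(n: int, bits: int = HASH_BITS) -> float:
    """Birthday estimate: n*(n-1)/2 pairs, each colliding with probability 2**-bits."""
    return n * (n - 1) / 2 / 2**bits


print(expected_colliding_pairs(10**6))  # about 0.45 among 1 million addresses
print(expected_colliding_pairs(2**32))  # about 8.4 million across all of IPv4
```

Under these assumptions, 1 million observed addresses are expected to produce
less than one colliding pair, while hashing all 2^32 addresses produces
roughly 2^23 colliding pairs.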