# Safely collecting data to estimate the number of Tor users

Björn Scheuermann scheuermann at cs.uni-duesseldorf.de
Fri Aug 27 11:12:08 UTC 2010

```Hi,

> If an adversary knows that only one IP address with a certain hash
> value could possibly be using Tor, the adversary can use an operator's
> FM sketch to determine whether or not that IP address accessed the
> operator's directory mirror in the FM sketch's time period. The
> tin-foil-hat crowd and their parodists will also point out that the
> adversary may design the hash function to single out certain users.
> (Are you enjoying that CIA time-share, Dr. Loesing?)

you certainly have a point there. Philosophizing about whether this is
practically doable and relevant (esp. if you use a cryptographic hash as
a basis) is probably futile.

Yet, I almost expected that you guys would come up with this argument,
so I have a fallback at hand. ;-) It appears to me that the problem can
be overcome relatively easily if you do the following: instead of
initializing the FM sketch to all-zero, you set each individual bit to
one with a certain, fixed probability. For instance, you could
initialize the bit field such that each bit is zero with 95%
probability, and one with 5% probability. You do that independently and
randomly at each relay, and never communicate the generated initial bit
pattern to anyone.

The additional one bits avoid the problem that you can know for sure
from a 1-bit in the sketch that there was "at least one" IP address
which has been mapped to that particular bit. In fact, for those bits
that are hit by only very few IP addresses, if the respective bit is 1
in the sketch, then the probability that it has been set to one by the
above stated random initialization is much higher than the probability
that the respective IP address really contacted the relay.

When an estimate for the total number is extracted from such a sketch,
However, the resulting bias is relatively easy to eliminate if the
probability with which bits have been initialized to one is known (also
for combined bit fields from multiple relays, as long as neither the
number of relays nor the fraction of "artificial" 1-bits becomes too
large).

Better?

Cheers

Björn

--
Jun.-Prof. Dr. Björn Scheuermann
Mobile and Decentralized Networks Group
Heinrich Heine University
Universitätsstr. 1, D-40225 Düsseldorf, Germany

Building 25.12, Room 02.42
Tel: +49 211 81 11692
Fax: +49 211 81 11638
scheuermann at cs.uni-duesseldorf.de
http://www.cn.uni-duesseldorf.de

```