Safely collecting data to estimate the number of Tor users

Karsten Loesing karsten.loesing at
Mon Aug 30 09:09:07 UTC 2010

Hi Björn, Robert, Steven,

woah, quite some discussion on the topic. That's great! Sorry for the
delay, but, as you all know, weekends are CIA time... ;)

So, I think my original posting led to some confusion about what I'm
trying to achieve. Björn did a great job summarizing this in his 29 Aug
2010 19:06:13 +0200 email. Let me make another attempt to describe my
intentions and sort your thoughts into them:

There are three things that I'm interested in:

1) I want to analyze 'anonymized' directory requests on a couple of
directory mirrors run by the same operator (who doesn't share any data
but the results with me or anyone else) for a limited time to

2) learn more about merging unique IP address sets from many directory
mirrors run by different operators on a regular basis and

3) learn more about client uptime to better translate network status
requests to an overall client number, also on a regular basis.

In my initial posting I was mostly referring to 1) and only briefly
mentioned 2) and 3). Most of your discussion so far was about 2). In
particular, the following ideas were discussed (greatly simplified):

a) Björn introduced the idea of using FM sketches, possibly initialized
with some false 1 bits, to solve problem 2) above. AFAIU, the FM sketch
idea is comparable to my Bloom filter idea (4.2 in countingusers.pdf as
linked from my initial posting), except that it uses a 'specifically
crafted hash function'. I discussed my Bloom filter idea with Steven in
July in Berlin and he was also concerned that publishing Bloom filters
could leak sensitive information. I'd very much like to work on this idea more
to evaluate the risks when choosing a rather small filter size. The goal
should be to find out if we can make all directory mirrors publish their
filters to the directory authorities where everyone can download them or
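
To make the merging part of a) concrete, here is a minimal sketch of an
FM-style (Flajolet-Martin) sketch in Python. It is not Björn's proposal
and not anything Tor implements: the SHA-1-based hashing, the names
make_sketch/merge/estimate, and the parameters NUM_BITS and NUM_SKETCHES
are all assumptions made up for illustration. The point it shows is that
two mirrors can combine their sketches with a bitwise OR and get the
same result as if one party had seen the union of both IP address sets.

```python
import hashlib

PHI = 0.77351        # correction constant from the Flajolet-Martin analysis
NUM_BITS = 32        # bits per bitmap (hypothetical choice)
NUM_SKETCHES = 64    # independent bitmaps averaged for accuracy (hypothetical)


def _rho(value: int) -> int:
    """Index of the least-significant 1 bit; NUM_BITS - 1 if value is 0."""
    if value == 0:
        return NUM_BITS - 1
    return (value & -value).bit_length() - 1


def make_sketch(ip_addresses, salt: bytes = b""):
    """Build NUM_SKETCHES bitmaps from an iterable of IP address strings."""
    bitmaps = [0] * NUM_SKETCHES
    for ip in ip_addresses:
        for i in range(NUM_SKETCHES):
            # Assumption: one salted SHA-1 per bitmap stands in for the
            # 'specifically crafted hash function'.
            digest = hashlib.sha1(salt + bytes([i]) + ip.encode()).digest()
            h = int.from_bytes(digest[:4], "big")
            bitmaps[i] |= 1 << _rho(h)
    return bitmaps


def merge(a, b):
    """Merge two sketches built with the same salt: bitwise OR per bitmap."""
    return [x | y for x, y in zip(a, b)]


def estimate(bitmaps) -> float:
    """Estimate the number of distinct items from the lowest zero bits."""
    total = 0
    for bm in bitmaps:
        r = 0
        while bm & (1 << r):
            r += 1
        total += r
    return 2 ** (total / len(bitmaps)) / PHI
```

Each mirror would call make_sketch() on its own address set, and anyone
holding the published sketches could call merge() and estimate() without
seeing the addresses. The sketch leaves out the false 1 bits and says
nothing about whether publishing such bitmaps is safe, which is exactly
the open question.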

b) Robert is working on merging encrypted Bloom-filter-like objects. I'm
interested in this idea, too, even though it implies we fail in a), that
is, we're unable to make the Bloom filters safe enough to publish
without encryption. I'm mostly concerned about having a single trusted
party and would rather distribute trust among, say, the majority of the
currently eight directory authority operators. Still, having a solution
that doesn't require additional encryption would be best.

c) Steven proposes to i) encrypt logs to a public key (or rather to a
symmetric session key which is encrypted to a public key) and ii) to
reduce IP address hashes in those logs to 40 bits. That means he's
referring to problem 1) above. I think that i) is a good approach to
move sensitive logs from an Internet host to a more secure place to run
the evaluation on. I could imagine implementing this as a general Tor
feature, so that people who need verbose logs for debugging can encrypt
them on their server and evaluate them on a safe machine. I'm slightly
concerned that this could encourage people to log more than they need. I
also like idea ii), because we really don't need 160-bit hashes in the
logs, but should be fine with 40 bits.
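
As a rough illustration of idea ii), the following Python sketch
truncates an IP address hash to 40 bits. SHA-1 is an assumption inferred
from the 160-bit figure; the function names and the optional secret
prefix are hypothetical and not part of Steven's proposal or of Tor.

```python
import hashlib

HASH_BITS = 40  # target size from idea ii)


def truncated_ip_hash(ip: str, secret: bytes = b"") -> str:
    """Return the first 40 bits of SHA-1(secret + ip) as 10 hex digits.

    The 'secret' prefix is a hypothetical per-operator value, not
    something the proposal specifies.
    """
    digest = hashlib.sha1(secret + ip.encode("ascii")).digest()  # 160 bits
    return digest[: HASH_BITS // 8].hex()


def expected_collisions(num_addresses: int) -> float:
    """Birthday-bound estimate of colliding pairs among 40-bit hashes."""
    pairs = num_addresses * (num_addresses - 1) / 2
    return pairs / 2 ** HASH_BITS
```

With a million distinct addresses in a log, expected_collisions() comes
out below one colliding pair, which is why 40 bits should be enough for
counting unique addresses.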

So, here's what I'd like to do next:

I) Research the FM-or-Bloom-filter idea analytically, gladly accepting
Björn's offer to help,

II) implement log sanitizing and encryption to keep as little sensitive
information in logs as possible,

III) design and have someone run an experiment to evaluate how well the
filter idea works in practice, and

IV) design and run an experiment to learn about client sessions.

Thanks for your input so far! More comments/corrections are highly
welcome!


More information about the tor-dev mailing list