Safely collecting data to estimate the number of Tor users

Sun Aug 29 17:06:13 UTC 2010

Hi,

On So, 2010-08-29 at 08:37 -0700, Robert Ransom wrote:
> And then it's more efficient to use FM sketches rather than a list of
> hash values.  The *only* problem I had with your first message in this
> thread was your suggestion that the FM sketches could be exchanged
> between the operators; the FM sketches may still be quite sensitive
> (only $AGENCY knows exactly how sensitive they are; we really can't
> tell).

granted.

> Other than the fact that the FM sketches must remain secret, your
> approach is also safer than Dr. Murdoch's idea.
> 
> There is a slight improvement, though:
> 
> * Each instrumented Tor instance should generate an FM sketch in memory
>   as it is contacted by directory clients.  This loses most of the
>   sensitive information instantly.
> 
> * Every N minutes, the Tor instance should encrypt its current FM
>   sketch with Dr. Loesing's public key for this purpose, base64-encode
>   it, and write it to the log on a single line.  As long as the
>   encryption algorithm is randomized, this packages up the rest of the
>   sensitive information so only one trusted person can recover it.  The
>   Tor instance then erases its FM sketch and starts accumulating a new
>   one.
> 
> * The Tor operators use one or more invocations of grep to extract
>   *only* the encrypted FM sketches from the log.  Sort the resulting
>   list, and delete the original log files.  This loses any potentially
>   sensitive timing information, so that Dr. Loesing can't recover it.
>   Each operator should then merge the encrypted-sketch lists produced
>   by all of his/her/its Tor relays into one sorted list before sending
>   it in for analysis.
> 
> * Dr. Loesing decrypts all of the FM sketches, ORs them together, and
>   publishes only the resulting count.  Then he erases the decrypted FM
>   sketches and sets his private key on fire.

Wait a minute, I'm losing track. If I got Karsten's and Stephen's
discussion right, they did not intend to exchange anything between
operators. Karsten wrote that "the operators do not give out these
logs to me or anyone else. I'm going to write Python scripts to analyze
the logs and publish them for the operators and others to review. The
operators will run these scripts and publish the results."

There are two options, which we should clearly keep apart. Option #1 is
to provide a tool chain to the directory mirror operators to collect and
evaluate how many distinct IP addresses contacted their mirrors, and
only their mirrors (i.e., a group of mirrors that is under the same
administrative control). An operator would then only publish the final
result, i.e., the number of distinct IP addresses for his group of
mirrors during some time period. The tool chain should of course still
keep as little sensitive information as possible. It could be
implemented in several different ways, including what Stephen suggested
and including a solution based on FM sketches.

The drawback of option #1 is that it can only provide information on the
number of users of each individual operator, not a total estimate for
the whole Tor network. The benefit is that - whatever the tool chain
looks like in detail - no potentially sensitive data needs to be
exchanged between different operators or between operators and Karsten
Loesing. As far as I got it, this option that Karsten and Stephen are
referring to.

Option #2 is to find some kind of mechanism that allows to merge
information from multiple operators, ideally without revealing sensitive
information to anyone. This would provide "better" information, as it
gives an estimate for the number of distinct users of the Tor network in
total. In his first mail, Karsten was asking for ideas how this could
possibly be achieved. This is what *I* was referring to in my first two
mails, and apparently this is also what you have in mind.

Variant #2a would be to craft a data structure in such a way that we can
be absolutely sure that it is no longer sensitive. In that case, the
operators could publish the data structure. I admit that this is most
likely impossible (even though I believe that sketches with additional
"noise" bits, as suggested in my second mail, come at least pretty
close).

Variant #2b is to craft the data structure in such a way that we are
*reasonably* sure that the data are no longer sensitive. In that case,
the data structures should be encrypted by the operators so that only
Karsten Loesing can decrypt them, and all traces of them should be
eliminated once Karsten has finished the merging and evaluation process,
just as you suggest.

Out of the variants discussed so far, sketches reveal the least
information; it seems that we agree on that. Adding random noise would
be a measure to provide additional protection in all scenarios, unless
the number of sketches to be ORed becomes too large.

There is one potential issue/limitation, though, in what you suggest. If
sketches are collected over N-minute intervals, and are conveyed to
Karsten without any indication of timing and order, then all Karsten can
do is to calculate the total number of distinct user IPs over the whole
period during which all these sketches have been generated. If this is
what is desired, it will be important a) that all operators collect
information during the same time interval, and only during that time
interval, and b) that this time interval is not too long, otherwise we
will end up with the number of distinct IP addresses during a whole week
or so, which might not be very helpful (due to dynamic IPs etc.).
Furthermore, if all sketches from this time period are combined by
Karsten anyway, then we might consider to let each operator OR all his
sketches locally and to send only the result to Karsten, so as to reveal
even less information to him.

Best regards

Björn