[tor-dev] Proposal: Capturing Traffic Statistics from Exit Relays

Mon Mar 18 16:05:38 UTC 2013

[This draft proposal is also available at
git.torproject.org:users/zwol/torspec.git/proposals/ideas/xxx-exit-statistics.txt.]

Filename: xxx-exit-statistics.txt
Title: Capturing Traffic Statistics from Exit Relays
Author: Zack Weinberg
Created: 04-Feb-2013
Status: Draft

1. Motivation

We propose to collect additional aggregate traffic information from
exit nodes which will be of use to researchers, similar to how the
project already collects statistics about its user population.  We
also propose to bring existing entry and exit statistics under the
same umbrella so that anonymity protection can be applied in a
principled fashion to the entire data set.  There is some prior
discussion in tickets #6002 and #6003.

Exit nodes currently collect statistics on the destination TCP ports
of exiting traffic, but this is insufficient information for research;
in particular, there have been repeated requests for information about
destination hosts.  Individual researchers have instrumented exit
nodes on an ad-hoc basis (e.g. [1]) but the project has historically
been skeptical about both the value and the safety of doing so
(e.g. [2] warns of running afoul of wiretapping statutes; I understand
that the paper [1] was met with severe criticism, but I have not been
able to find a reference for that).  A principled, network-wide policy
and mechanism for
collecting exit metrics can satisfy the demand for data for research
purposes while protecting our users and minimizing additional legal
risk to exit node operators.

2. Design

At each exit node, we propose to measure the number of exiting TCP
connections and total bytes transferred in each direction, per day,
categorized three different ways, with a possible fourth:

 * TCP port
 * "Public suffix" + 1 domain component of destination
   (example.com, example.co.uk)
 * Country of IP address of destination
 * Possibly: ASN of destination and ASN of exit (this may reveal
   potential traffic-correlation attackers that would not be visible
   any other way, but may also be too fine-grained to be safe)

These are sufficiently coarse-grained that we believe they can be
safely measured without damaging the network's anonymity guarantees;
see below for further measures that will be taken to preserve
anonymity.  At the same time, they will enable interesting analyses of
what the network is used _for_, in much the same way that the existing
entry-side metrics enable analysis of _who_ uses the network.
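
As a rough illustration of the exit-side bookkeeping described above
(a sketch only; the actual specification is TBD), the following Python
keeps per-day counters for the three main categories.  The names
ExitStats, record, and registrable_domain, and the tiny
PUBLIC_SUFFIXES set, are all hypothetical.  A real implementation
would consult the full Public Suffix List and a GeoIP database, and
would live inside the Tor process rather than in Python.

```python
from collections import defaultdict

# Hypothetical stand-in for the real Public Suffix List.
PUBLIC_SUFFIXES = {"com", "org", "net", "uk", "co.uk"}


def registrable_domain(hostname):
    """Return the public suffix plus one label, or None if unknown."""
    labels = hostname.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


class ExitStats:
    """One day's aggregate counters for a single exit (sketch)."""

    def __init__(self):
        # category -> key -> [connections, bytes_sent, bytes_received]
        self.counters = defaultdict(lambda: defaultdict(lambda: [0, 0, 0]))

    def record(self, port, hostname, country, bytes_sent, bytes_received):
        keys = {
            "port": port,
            "domain": registrable_domain(hostname) or "unknown",
            "country": country,
        }
        for category, key in keys.items():
            row = self.counters[category][key]
            row[0] += 1
            row[1] += bytes_sent
            row[2] += bytes_received


stats = ExitStats()
stats.record(443, "www.example.co.uk", "gb", 100, 2000)
stats.record(443, "mail.example.com", "us", 50, 500)
```

Only the aggregated rows are retained; individual connections are
folded into the counters as soon as they are recorded.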

We also propose to rationalize entry-side data collection, which
currently relies on directory queries rather than actual traffic to
actual entry nodes.  This will improve accuracy and will also allow us
to apply an anonymity-protection algorithm consistently to the entire
data set available from the Metrics server.

Entry nodes (including bridges) should record entering TCP connections
and traffic volume, per day, categorized by:

 * Country of IP address of traffic source
 * ASN of IP address of traffic source, and ASN of entry node
   (subject to the same caveat as above)
 * "I am a bridge" flag

(I think this is a superset of the information currently collected via
directory queries.  If I am mistaken, please let me know.)

All collected data should be passed through a differential-privacy
sanitization algorithm before it leaves the Tor process's memory, and
should then be uploaded to a central server (probably via a write-only
hidden service API) which applies a second layer of sanitization.  See
below for further discussion.

3. Security implications

Any collection of information about the operation of the Tor network
requires careful analysis to ensure that the anonymity of its users is
preserved.  In this case, we are contemplating adding to what is
publicly available about what the network is used _for_, and we must
ensure that this does not lead to a correlation attack that would
reveal _who_ did a particular thing.  For instance, suppose the
adversary knows that _someone_ posted a "sensitive" document on a
particular site on a particular date.  They also have reason to
suspect that the poster hails from a particular country.  If our
published metrics reveal that only one person used Tor to access that
site on that date, and/or that only one person from that country made
use of Tor on that date, the adversary's case against their suspect
has now gotten considerably stronger.

The theoretical framework that deals with this class of exposure is
called _differential privacy_.  It rests on two observations.  First,
it is impossible to guarantee that _no one's_ privacy will be
compromised via the release of statistics -- consider the somewhat
contrived but still evocative case where the adversary already knows
that Alice is 6cm shorter than the average citizen of Ruritania;
publication of the Ruritanian average height reveals Alice's actual
height.  Note that Alice _does not_ have to be included in the average
for this to work!  But the adversary had to know something about Alice
already.  This leads to the second observation, which is that we _can_
make a reasonable privacy guarantee by changing the model.  Instead of
trying to protect everyone _whether or not_ they are in the database,
we guarantee that the probability of any particular _released
statistic_ changes only slightly with the inclusion or removal of any
one individual from the underlying data set.  (This can be generalized
to groups of N individuals, although of course the statistics get less
and less accurate as N goes up.)  Differential privacy prevents the
example
scenario described above: the adversary would only learn a _range_ of
possible values for 'how many people visited site X today via Tor' and
would therefore not be able to draw new conclusions about their
suspect.
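
For reference, the standard formal statement of this guarantee, as
given in surveys of the field such as [3] and not specific to this
proposal, is as follows.  A randomized mechanism M is
epsilon-differentially private if, for every pair of data sets D and
D' that differ in one individual's data, and every set S of possible
outputs:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

Smaller values of epsilon give a stronger guarantee.  For groups of N
individuals the same bound holds with epsilon replaced by N*epsilon,
so protecting larger groups at the same level requires more noise.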

The basic technique of differential privacy is to add "noise" to the
aggregate statistics before publishing them.  There are a wide variety
of "mechanisms" for doing so, of which the most basic is simply to add
"independent samples from a Laplace distribution" to each statistic.
Research in this area focuses on finding mechanisms that minimize the
amount of noise required to provide a given level of privacy
(according to a security parameter, 'epsilon' as usual). [3] is a good
overview of the field and [4] proposes a particular algorithm that
seems well-suited for the data set we're looking at here.
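
To make the basic Laplace mechanism concrete, here is a minimal Python
sketch.  It is not the algorithm from [4], and the function names
(laplace_noise, sanitize_counts) are hypothetical.  It also assumes,
for illustration only, that one individual changes each count by at
most `sensitivity`; a real Tor client opens many connections, so a
real design would need to bound or account for that.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential samples with
    rate 1/scale is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sanitize_counts(counts, epsilon, sensitivity=1.0, rng=random):
    """Add independent Laplace noise to every count (sketch).

    `sensitivity` is the assumed maximum change one individual can
    cause in the counts; the noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng)
            for key, value in counts.items()}


raw = {"example.com": 120, "example.co.uk": 7}
noisy = sanitize_counts(raw, epsilon=0.5, rng=random.Random(0))
```

A smaller epsilon gives a larger noise scale, so small counts such as
the 7 above become hard to distinguish from zero, which is the intended
effect for rarely visited destinations.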

We must also consider the security of the reporting process.  The
Metrics server is already trusted, and applying "noise" there will
allow us to minimize the overall amount of noise applied for the same
level of differential privacy.  However, if this is the only place
noise is applied, a malicious node operator could capture the 'raw'
statistics by reconfiguring their node to report to a server that
doesn't sanitize (but perhaps forwards the report to the real server,
avoiding suspicion).  If each node applies some level of noise before
the statistics ever leave the Tor process's memory, this class of
attack is much harder, although not impossible (by modifying code or
extracting raw statistics from process memory with a debugger).

There is at least one proposal [5] for distributed sanitization of a
data set.  It is Byzantine fault-tolerant and therefore _may_ be
overkill for our purposes ... but we're talking about malicious node
operators here, so maybe it isn't!

4. Specification

TBD

5. Compatibility

should be moot

6. Implementation

TBD

7. Performance and Scalability

Minor overhead at each exit (and entry); additional load on Metrics;
if reporting is via hidden service, additional load on the network.

8. References

[1] "Shining Light in Dark Places: Understanding the Tor Network".
    McCoy, Bauer, Grunwald, Kohno, Sicker.  PETS 2008.
    http://freehaven.net/anonbib/cache/mccoy-pet2008.pdf

[2] https://www.torproject.org/eff/tor-legal-faq.html.en

[3] "Differential Privacy for Statistics: What we Know and What we
    Want to Learn".  Dwork and Smith.  Journal of Privacy and
    Confidentiality, 2009.  http://repository.cmu.edu/jpc/vol1/iss2/2/

[4] "Optimizing [Linear/Histogram] Counting Queries under Differential
    Privacy". Li, Hay, Rastogi, Miklau, McGregor.  Principles of
    Database Systems, 2010.  http://arxiv.org/abs/0912.4742

[5] "Our data, ourselves: Privacy via distributed noise generation".
    Dwork, Kenthapadi, McSherry, Mironov, Naor.  EUROCRYPT 2006.
    http://www.truststc.org/pubs/101/ourDataOurselves.pdf