[This draft proposal is also available at git.torproject.org:users/zwol/torspec.git/proposals/ideas/xxx-exit-statistics.txt.]
Filename: xxx-exit-statistics.txt
Title: Capturing Traffic Statistics from Exit Relays
Author: Zack Weinberg
Created: 04-Feb-2013
Status: Draft
1. Motivation
We propose to collect additional aggregate traffic information from exit nodes which will be of use to researchers, similar to how the project already collects statistics about its user population. We also propose to bring existing entry and exit statistics under the same umbrella so that anonymity protection can be applied in a principled fashion to the entire data set. There is some prior discussion in tickets #6002 and #6003.
Exit nodes currently collect statistics on the destination TCP ports of exiting traffic, but this is insufficient information for research; in particular, there have been repeated requests for information about destination hosts. Individual researchers have instrumented exit nodes on an ad-hoc basis (e.g. [1]) but the project has historically been skeptical about both the value and the safety of doing so (e.g. [2] warns of running afoul of wiretapping statutes; I understand that the [1] paper also drew severe criticism, but I cannot find that discussion right now). A principled, network-wide policy and mechanism for collecting exit metrics can satisfy researchers' demand for data while protecting our users and minimizing additional legal risk to exit node operators.
2. Design
At each exit node, we propose to measure the number of exiting TCP connections and the total bytes transferred in each direction, per day, categorized in the following ways:
  * TCP port
  * "Public suffix" + 1 domain component of destination
    (example.com, example.co.uk); see the sketch below
  * Country of IP address of destination
  * possibly: ASN of destination and ASN of exit (this may reveal
    potential traffic-correlation attackers that would not be visible
    any other way, but may also be too fine-grained to be safe).
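As a rough sketch of the "public suffix + 1" bucketing above, the Python fragment below reduces a destination hostname to its public suffix plus one label. It is only an illustration: a real implementation would consult the full Public Suffix List (including its wildcard and exception rules) and would apply only when the exit sees a hostname rather than a literal IP address; the tiny SUFFIXES set here is a stand-in.

    SUFFIXES = {"com", "org", "net", "co.uk", "ac.uk"}   # stand-in for the PSL

    def bucket_domain(hostname):
        """Reduce a destination hostname to public suffix + one label."""
        labels = hostname.lower().rstrip(".").split(".")
        # Walk from the longest candidate suffix down to the shortest;
        # the first match is the longest known public suffix.
        for i in range(1, len(labels)):
            if ".".join(labels[i:]) in SUFFIXES:
                return ".".join(labels[i - 1:])
        return hostname   # no known suffix: leave the name unchanged

    # bucket_domain("www.example.co.uk")      -> "example.co.uk"
    # bucket_domain("cdn.images.example.com") -> "example.com"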
These are sufficiently coarse-grained that we believe they can be safely measured without damaging the network's anonymity guarantees; see below for further measures that will be taken to preserve anonymity. At the same time, they will enable interesting analyses of what the network is used _for_, in much the same way that the existing entry-side metrics enable analysis of _who_ uses the network.
At the same time, we propose to rationalize entry-side data collection, which currently relies on directory queries rather than actual traffic to actual entry nodes. This will improve accuracy and will also allow us to apply an anonymity-protection algorithm consistently to the entire data set available from the Metrics server.
Entry nodes (including bridges) should record entering TCP connections and traffic volume, per day, categorized by:
  * Country of IP address of traffic source
  * ASN of IP address of traffic source, and ASN of entry node
    (subject to the same caveat as above)
  * "I am a bridge" flag
(I think this is a superset of the information currently collected via directory queries. If I am mistaken, please let me know.)
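For concreteness, a relay's in-memory bookkeeping for the categories above might look roughly like the following Python sketch. Names and layout are illustrative only; they are not a proposed wire or descriptor format.

    from collections import defaultdict

    def new_counter():
        return {"connections": 0, "bytes_in": 0, "bytes_out": 0}

    class DailyStats:
        """One day's worth of per-category traffic counters."""
        def __init__(self):
            # One table per category; keys are the port number, domain
            # bucket, country code, or ASN being counted.
            self.tables = {
                "port":    defaultdict(new_counter),
                "domain":  defaultdict(new_counter),
                "country": defaultdict(new_counter),
                "asn":     defaultdict(new_counter),
            }

        def record(self, category, key, bytes_in, bytes_out):
            c = self.tables[category][key]
            c["connections"] += 1
            c["bytes_in"]    += bytes_in
            c["bytes_out"]   += bytes_out

    # e.g., at an exit:  stats.record("country", "DE", 1200, 48000)
    #       at an entry: stats.record("country", "US", 9100, 350)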
All collected data should be passed through a differential-privacy sanitization algorithm before it leaves the Tor process's memory, and should then be uploaded to a central server (probably via a write-only hidden service API) which applies a second layer of sanitization. See below for further discussion.
3. Security implications
Any collection of information about the operation of the Tor network requires careful analysis to ensure that the anonymity of its users is preserved. In this case, we are contemplating adding to what is publicly available about what the network is used _for_, and we must ensure that this does not lead to a correlation attack that would reveal _who_ did a particular thing. For instance, suppose the adversary knows that _someone_ posted a "sensitive" document on a particular site on a particular date. They also have reason to suspect that the poster hails from a particular country. If our published metrics reveal that only one person used Tor to access that site on that date, and/or that only one person from that country made use of Tor on that date, the adversary's case against their suspect has now gotten considerably stronger.
The theoretical framework that deals with this class of exposure is called _differential privacy_. It rests on two observations. First, it is impossible to guarantee that _no one's_ privacy will be compromised via the release of statistics -- consider the somewhat contrived but still evocative case where the adversary already knows that Alice is 6cm shorter than the average citizen of Ruritania; publication of the Ruritanian average height reveals Alice's actual height. Note that Alice _does not_ have to be included in the average for this to work! But the adversary had to know something about Alice already.

This leads to the second observation, which is that we _can_ make a reasonable privacy guarantee by changing the model. Instead of trying to protect everyone _whether or not_ they are in the database, we guarantee that, with high probability, the _released statistics_ would not be changed by the inclusion or removal of any one individual from the underlying data set. (This can be generalized to groups of N individuals, although of course the statistics get less and less accurate as N goes up.) Differential privacy prevents the example scenario described above: the adversary would only learn a _range_ of possible values for 'how many people visited site X today via Tor' and would therefore not be able to draw new conclusions about their suspect.
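For reference, the formal guarantee sketched above (epsilon-differential privacy, in the standard formulation used by the literature cited below) can be written as follows; smaller epsilon means stronger privacy but noisier published statistics.

    % A randomized mechanism M is epsilon-differentially private if,
    % for every pair of data sets D, D' differing in the records of a
    % single individual, and for every set S of possible outputs,
    \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]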
The basic technique of differential privacy is to add "noise" to the aggregate statistics before publishing them. There are a wide variety of "mechanisms" for doing so; the most basic is simply to add independent samples from a Laplace distribution to each statistic. Research in this area focuses on finding mechanisms that minimize the amount of noise required to provide a given level of privacy (controlled by a privacy parameter, conventionally called 'epsilon'). [3] is a good overview of the field and [4] proposes a particular algorithm that seems well-suited to the data set we're looking at here.
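As a minimal sketch (not a commitment to any particular mechanism for this proposal), noising a single counter with the basic Laplace mechanism might look like this in Python. A counting query has sensitivity 1 -- adding or removing one connection changes the count by at most 1 -- so Lap(1/epsilon) noise suffices for that counter.

    import math, random

    def laplace_sample(scale):
        """Draw one sample from a zero-mean Laplace distribution
        (inverse-CDF method; ignores the measure-zero boundary case)."""
        u = random.random() - 0.5                  # uniform in [-0.5, 0.5)
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def sanitize_count(true_count, epsilon, sensitivity=1.0):
        """Release a noisy version of one counter."""
        noisy = true_count + laplace_sample(sensitivity / epsilon)
        # Rounding and clamping are post-processing and do not weaken
        # the differential-privacy guarantee.
        return max(0, round(noisy))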
We must also consider the security of the reporting process. The Metrics server is already trusted, and applying "noise" there will allow us to minimize the overall amount of noise applied for the same level of differential privacy. However, if this is the only place noise is applied, a malicious node operator could capture the 'raw' statistics by reconfiguring their node to report to a server that doesn't sanitize (but perhaps forwards the report to the real server, avoiding suspicion). If each node applies some level of noise before the statistics ever leave the Tor process's memory, this class of attack is much harder, although not impossible (by modifying code or extracting raw statistics from process memory with a debugger).
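To make the two-layer idea concrete, here is one way the noise might be split between the relay and the Metrics server. The particular epsilon values and the simple "noise twice" composition are assumptions for illustration only; choosing the actual mechanism and budget split is left to the specification.

    import numpy as np

    EPSILON_NODE   = 2.0   # looser guarantee applied inside the Tor process
    EPSILON_SERVER = 0.5   # tighter guarantee applied by the Metrics server

    def add_noise(counters, epsilon):
        """Perturb each counter with Lap(1/epsilon) noise (sensitivity 1)."""
        out = {}
        for k, v in counters.items():
            noisy = v + float(np.random.laplace(scale=1.0 / epsilon))
            out[k] = max(0, round(noisy))
        return out

    def node_report(raw_counters):
        # First layer: applied before the statistics ever leave the Tor
        # process, so a rogue collector never sees exact counts.
        return add_noise(raw_counters, EPSILON_NODE)

    def metrics_publish(uploaded_report):
        # Second layer: applied by the trusted Metrics server before the
        # aggregate statistics are published.
        return add_noise(uploaded_report, EPSILON_SERVER)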
There is at least one proposal [5] for distributed sanitization of a data set. It is Byzantine fault-tolerant and therefore _may_ be overkill for our purposes ... but we're talking about malicious node operators here, so maybe it isn't!
4. Specification
TBD
5. Compatibility
should be moot
6. Implementation
TBD
7. Performance and Scalability
minor overhead at each exit (and entry); additional load on Metrics; if reporting is via hidden service, additional load on the network
8. References
[1] "Shining Light in Dark Places: Understanding the Tor Network". McCoy, Bauer, Grunwald, Kohno, Sicker. PETS 2008. http://freehaven.net/anonbib/cache/mccoy-pet2008.pdf
[2] https://www.torproject.org/eff/tor-legal-faq.html.en
[3] "Differential Privacy for Statistics: What we Know and What we Want to Learn". Dwork and Smith. Journal of Privacy and Confidentiality, 2009. http://repository.cmu.edu/jpc/vol1/iss2/2/
[4] "Optimizing [Linear/Histogram] Counting Queries under Differential Privacy". Li, Hay, Rastogi, Miklau, McGregor. Principles of Database Systems, 2010. http://arxiv.org/abs/0912.4742
[5] "Our data, ourselves: Privacy via distributed noise generation". Dwork, Kenthapadi, McSherry, Mironov, Naor. EUROCRYPT 2006. http://www.truststc.org/pubs/101/ourDataOurselves.pdf
Hi,
On 18.03.2013 12:05, Zack Weinberg wrote:
>   * TCP port
>   * "Public suffix" + 1 domain component of destination
>     (example.com, example.co.uk)

I am not sure I like this. Maybe we should limit it to popular
destinations -- drop sites that only get a few hits? And report
access counts only at a coarser granularity (50 hits, 100, etc.)?
-Moritz
On Monday, March 18, 2013, Moritz Bartl wrote:
> On 18.03.2013 12:05, Zack Weinberg wrote:
> >   * TCP port
> >   * "Public suffix" + 1 domain component of destination
> >     (example.com, example.co.uk)
>
> I am not sure I like this. Maybe we should limit it to popular
> destinations -- drop sites that only get a few hits? And report
> access counts only at a coarser granularity (50 hits, 100, etc.)?
The "differential privacy" sanitization algorithm discussed in the next section is in fact a more systematic and theoretically grounded way of doing just this. Sites that are rarely visited will have their true visit rate overwhelmed by the added noise, which can either add to or subtract from the number. Sites that are frequently visited will simply have their true visit count rendered uncertain.
I shall look into the possibility of adding completely fake visits to the statistics as well.