[metrics-team] Should we disable the collection of some stats published in extra-infos?

Rob Jansen rob.g.jansen at nrl.navy.mil
Fri Jan 15 22:00:04 UTC 2016


Hello,

I was recently reviewing the statistics that Tor allows relays to collect and report to the directory servers [1], which are then published in extra-info documents [2]. Most of these can be enabled simply by setting a torrc option. There are quite a few statistics that I feel should not be collected. I'm wondering whether the original purpose for collecting many of them still exists, and whether we still consider the privacy trade-offs made when the collection was implemented to be acceptable in most cases.

Here are the stats I am most worried about, and why:

[unique ips per country code]
*-ips (there are many of these, e.g. "entry-ips")
Usually this involves storing individual user IP addresses in memory (in order to track uniqueness) over some period of time (usually 24 hours), sometimes for longer than the user would otherwise have been known to Tor (if a user's session lasts 1 hour, Tor could remember the IP for up to 23 additional hours). This is reported, e.g., per entry; there are many cases in the data where it is very likely that only one user is connecting to a guard from a given country (the reported count is 8, the smallest value possible because counts are rounded up to the nearest multiple of 8). Users in small countries are at the greatest risk (intersection attacks become really easy).
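
To make the rounding point concrete, here is a minimal sketch, assuming the binning is a plain round-up to the nearest multiple of 8. The function names are made up for illustration and are not taken from the Tor source.

```python
def round_up_to_multiple(count: int, multiple: int = 8) -> int:
    """Round a positive count up to the nearest multiple; zero stays zero."""
    if count <= 0:
        return 0
    return ((count + multiple - 1) // multiple) * multiple


def possible_true_counts(reported: int, multiple: int = 8) -> range:
    """Return the true counts that are consistent with a rounded-up report."""
    if reported <= 0:
        return range(0, 1)
    return range(reported - multiple + 1, reported + 1)


# A single user from a small country is reported as 8 ...
single_user_report = round_up_to_multiple(1)
# ... and a reported 8 only tells an observer that 1 to 8 users were seen.
consistent_counts = list(possible_true_counts(single_user_report))
```

The rounding hides the exact count, but it does not help when background knowledge suggests that only one or two users in that country use Tor at all.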

[exit statistics by port number]
exit-kibibytes-written
exit-kibibytes-read
exit-streams-opened
Tor classifies its exit traffic by port, which could uniquely identify the application being used by the client. It also tracks bandwidth usage per port (and per exit); again, this is bad for those using random or unique-looking ports (ones that a given exit does not see very often), because it could be used to create a fingerprint. Intersection attacks become easier with this information.
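
As a rough illustration of the intersection concern, the sketch below assumes an adversary who knows, from some other vantage point, which users were online in each reporting period, and who reads the per-port statistics of an exit for the same periods. The data, user names, and function name are all hypothetical.

```python
def intersect_candidates(observations, target_port):
    """Return users who were online in every period that reported target_port.

    observations is a list of (users_online, ports_reported) pairs,
    one pair per reporting period.
    """
    candidates = None
    for users_online, ports_reported in observations:
        if target_port not in ports_reported:
            continue
        if candidates is None:
            candidates = set(users_online)
        else:
            candidates &= set(users_online)
    return candidates if candidates is not None else set()


# Toy data: three reporting periods, with a rarely seen port 31337.
observations = [
    ({"alice", "bob", "carol"}, {443, 31337}),
    ({"bob", "carol", "dave"}, {443}),
    ({"alice", "dave"}, {80, 31337}),
]
suspects = intersect_candidates(observations, 31337)
```

With a common port such as 443 the candidate set stays large; with a rare port it shrinks quickly, which is why unusual ports are the risky case.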

The less problematic stats:

[circuit-based cell statistics]
cell-processed-cells
cell-queued-cells
cell-time-in-queue
cell-circuits-per-decile
These provide queue timings and the number of cells being processed at a relay. The number of cells can be used to compute the bandwidth of circuits. It may be possible to launch attacks that create several circuits with the intent of changing which decile buckets some legitimate circuits are placed into, but this is a less worrisome attack than the others.
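
The step from cell counts to bandwidth is simple arithmetic. A minimal sketch, assuming a fixed cell size of 512 bytes and a 24-hour reporting interval (both are assumptions of this example, as is the function name):

```python
CELL_BYTES = 512                  # assumed fixed cell size for this sketch
INTERVAL_SECONDS = 24 * 60 * 60   # assumed 24-hour reporting interval


def mean_circuit_rate(mean_cells_per_circuit: float,
                      interval_seconds: int = INTERVAL_SECONDS,
                      cell_bytes: int = CELL_BYTES) -> float:
    """Estimate the average bytes per second carried by a circuit."""
    return mean_cells_per_circuit * cell_bytes / interval_seconds


# Hypothetical decile value: 86400 cells per circuit over the interval.
rate = mean_circuit_rate(86400)
```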

Should Tor still be collecting these things? Should Tor disable the collection of these statistics until we have a more privacy-preserving way to collect and aggregate them?

The good news is that privacy-preserving techniques exist that can reduce information leakage. I'm developing a tool based on the secret-sharing variant of PrivEx [3] to collect some of these types of statistics while providing privacy guarantees. We are currently using it to collect only those stats that are useful for producing Tor traffic models. A great advantage of this tool is that the various counters we store during the collection phase have noise added and are randomized during initialization; only the aggregates are ever known and revealed by the aggregation server, which limits the information lost if a relay is compromised. This is a large improvement over the current collection method, which only adds noise before publication and reveals statistics on a per-relay basis.
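
The idea of noisy, randomized counters can be sketched with additive secret sharing. This is only an illustration in the spirit of the secret-sharing scheme in [3], not the tool itself: the class and function names, the modulus, and passing the noise in as a plain integer are all assumptions made for this example.

```python
import secrets

Q = 2**61 - 1  # modulus for share arithmetic (arbitrary choice for this sketch)


class Relay:
    def __init__(self, num_share_keepers: int, noise: int):
        # Draw one uniformly random share per share keeper.
        self.shares = [secrets.randbelow(Q) for _ in range(num_share_keepers)]
        # The stored counter starts as noise minus the shares, so the value
        # held in memory looks random from the first moment.
        self.counter = (noise - sum(self.shares)) % Q

    def hand_out_shares(self):
        """Return the shares for the keepers and forget them locally."""
        shares, self.shares = self.shares, []
        return shares

    def increment(self, amount: int = 1):
        self.counter = (self.counter + amount) % Q


def aggregate(relay_counters, keeper_sums):
    """Combine published counters and keeper sums; the shares cancel out."""
    return (sum(relay_counters) + sum(keeper_sums)) % Q


# Two relays, three share keepers, fixed noise values for reproducibility.
r1, r2 = Relay(3, noise=5), Relay(3, noise=-2)
s1, s2 = r1.hand_out_shares(), r2.hand_out_shares()
keeper_sums = [(a + b) % Q for a, b in zip(s1, s2)]
r1.increment(10)
r2.increment(7)
total = aggregate([r1.counter, r2.counter], keeper_sums)
```

Here total equals the true sum of events plus the combined noise (10 + 7 + 5 - 2), while each relay's counter on its own is indistinguishable from a random number to anyone who does not hold all of its shares.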

All the best,
Rob

[1] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt
[2] https://collector.torproject.org/recent/relay-descriptors/extra-infos/
[3] www.cypherpunks.ca/~iang/pubs/privex-ccs14.pdf