On 15/01/16 23:00, Rob Jansen wrote:
Hello,
Hi Rob,
I'm moving this discussion from metrics-team@ to tor-dev@, because I think it's relevant for little-t-tor devs who are not subscribed to metrics-team@. Hope you don't mind.
I was recently reviewing the statistics that Tor allows relays to collect and report to the dir servers [1], which then get published in extra-info documents [2]. Most of this can be enabled by simply setting a torrc option. There are quite a few statistics that I feel should not be collected. I'm wondering whether the original purpose for collecting many of these statistics still exists, and whether we feel that the privacy compromises made when the collection was implemented are still justified in most cases.
Here are the stats I am most worried about, and why:
[unique IPs per country code]
*-ips (there are many of these, e.g. "entry-ips")
Usually this involves storing individual user IP addresses in memory (in order to track uniqueness) over some period of time (usually 24 hours), sometimes for longer than the user would otherwise have been known to Tor (if a user's session is 1 hour, Tor could remember the IP for at most 23 additional hours). This is reported, e.g., per entry guard; there are many cases in the data where it is very likely that only one user is connecting to a guard from a given country (the reported count is 8 only because counts are rounded up to a multiple of 8). Users in small countries have the greatest risk (intersection attacks become really easy).
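To make the concern concrete, here is a minimal Python sketch of how a per-country unique-IP count might be produced. It is not Tor's code: the function names, the sample addresses, and the bin size of 8 are illustrative assumptions. It shows that the relay has to keep every address for the whole interval in order to deduplicate, and that a single user and eight users produce the same reported value.

```python
from collections import defaultdict

BIN_SIZE = 8  # assumption: reported counts are rounded up to a multiple of 8


def round_up(count, bin_size=BIN_SIZE):
    """Round a count up to the next multiple of bin_size."""
    return ((count + bin_size - 1) // bin_size) * bin_size


def entry_ips_report(seen):
    """Build a country -> rounded unique-IP count map.

    seen is an iterable of (ip, country_code) pairs observed during one
    reporting interval (hypothetical input format).
    """
    unique = defaultdict(set)
    for ip, cc in seen:
        # The relay must remember each address to tell repeat visits apart.
        unique[cc].add(ip)
    return {cc: round_up(len(ips)) for cc, ips in unique.items()}


# One user from "is" connecting twice, two distinct users from "de".
observed = [
    ("198.51.100.7", "is"),
    ("198.51.100.7", "is"),
    ("203.0.113.1", "de"),
    ("203.0.113.2", "de"),
]
report = entry_ips_report(observed)
```

Both countries come out as 8 here, so the rounding hides the exact count but still reveals that at least one client from each country used this relay during the interval.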
I agree that we might just lose these statistics. We used them in the past as a first approximation to counting users, but obviously that only works as long as clients only connect to a single relay. The only place where we're still using them is in a workaround for estimating bridge users. See #15469 for more details and #8786 for something we'd have to implement before taking these statistics out.
[exit statistics by port number]
exit-kibibytes-written
exit-kibibytes-read
exit-streams-opened
Tor is classifying its traffic by port, which could uniquely identify the application being used by the client. It also tracks bandwidth usage per port (and per exit); again, this is bad for those using random or unique-looking ports (that a given exit does not see very often) because it could be used to create a fingerprint. Intersection attacks become easier with this information.
Agreed, I can see us dropping these statistics, too. We're currently not using them. But also see my suggestion below.
The less problematic stats:
[circuit-based cell statistics]
cell-processed-cells
cell-queued-cells
cell-time-in-queue
cell-circuits-per-decile
This provides queue timings and the number of cells being processed at a relay. The number of cells can be used to compute the bandwidth of circuits. It may be possible to launch attacks that create several circuits with the intent of changing which decile buckets some legitimate circuits get placed into, but this is a less worrisome attack than the others.
I'm less worried about this one. But, suggestion below.
Should Tor still be collecting these things? Should Tor disable the collection of these statistics until we have a more privacy-preserving way to collect and aggregate them?
The good news is that privacy-preserving techniques exist that can reduce information leakage. I'm developing a tool based on the secret-sharing variant of PrivEx [3] to collect some of these types of statistics while providing privacy guarantees. We are currently using it to collect only those stats that are useful for producing Tor traffic models. A great advantage of this tool is that the various counters that we store during the collection phase get noise added and are randomized during initialization; only the aggregates are ever known and revealed by the aggregation server, limiting the information that is lost if a relay is compromised. This is a large improvement over the current collection method, which only adds noise before publication and reveals statistics on a per-relay basis.
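As a rough illustration of the idea (not the actual tool, and not the exact PrivEx protocol), the following Python sketch shows additive secret sharing with noise added at initialization. The modulus, the Gaussian noise, the function names, and the notion of "share keepers" are assumptions made for the example.

```python
import random
import secrets

Q = 2**61 - 1  # hypothetical modulus; all arithmetic is done mod Q


def init_counter(num_keepers, sigma):
    """Return (blinded_counter, shares) for one relay and one statistic.

    The relay keeps only blinded_counter; each share would be sent to a
    different share keeper and then forgotten by the relay.
    """
    shares = [secrets.randbelow(Q) for _ in range(num_keepers)]
    noise = round(random.gauss(0, sigma))  # noise is added at initialization
    blinded = (noise - sum(shares)) % Q
    return blinded, shares


def increment(blinded, amount=1):
    """Count an event; the relay only ever stores the blinded value."""
    return (blinded + amount) % Q


def aggregate(blinded_counters, all_shares):
    """Sum all blinded counters and all shares; the shares cancel out."""
    total = (sum(blinded_counters) + sum(all_shares)) % Q
    # Map back to a signed integer so a small negative noisy total is readable.
    return total - Q if total > Q // 2 else total


# Two relays, three share keepers, noise disabled so the result is exact.
a, a_shares = init_counter(3, sigma=0)
b, b_shares = init_counter(3, sigma=0)
for _ in range(3):
    a = increment(a)
for _ in range(5):
    b = increment(b)
total = aggregate([a, b], a_shares + b_shares)
```

With `sigma=0` the aggregate equals the true sum of 8; with a positive `sigma` only a noisy total is recovered. A relay's stored value on its own looks like a uniformly random number, which is why compromising a single relay reveals little.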
Suggestion: How about we evaluate these statistics published by relays in the past years to see if there are other benefits or risks we didn't think of, and then we decide whether to leave them in, modify them, or take them out?
The reason is that I'd want to avoid removing this code only to realize shortly after that we overlooked a good reason for keeping it. These statistics have been collected for years now, and it might take another year or so for relays to upgrade and stop collecting them. So what's another month?
Thanks for (re-)starting this discussion!
All the best, Rob
All the best, Karsten
[1] https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt
[2] https://collector.torproject.org/recent/relay-descriptors/extra-infos/
[3] www.cypherpunks.ca/~iang/pubs/privex-ccs14.pdf