Hi Samir,
It is my understanding that the Tor metrics team plans to handle this problem in a different way. IPs are kept in memory to provide statistics about users’ countries, and so they will instead just keep the country statistics directly. That is, a counter will be kept for all countries, upon the establishment of a new "OR connection” (a Tor term that I believe translates to a TLS connection) the IP address will be mapped to a country, and then that country’s counter will be incremented. As is done currently, further privacy-preserving techniques would be applied to these counters before publishing them, such as rounding, adding random noise, or removing some of the counters. These counters could even potentially be locally stored in a differentially-private way, which would make the local counters even less interesting to a possible attacker. The suitability of adding such local noise depends on on how inaccurate this would make the results.
You may wish to contact Karsten Loesing of the Tor Metrics team to verify my understanding.
Best, Aaron
On Apr 1, 2017, at 7:19 AM, Florian Tschorsch tschorsch@informatik.hu-berlin.de wrote:
Hi Samir,
this sounds like an interesting summer project.
Since you are interested in using PCSA, our work on privacy-preserving statistics, which actually develops a privacy-enhanced version of PCSA, might be helpful. We also propose it as a way to collect distributed statistics.
In our HotPETs paper [1], we sketch the basic idea. In our journal paper [2], we provide additional details on the algorithm. If you have any questions, just let me know.
Cheers, Florian.
[1] https://petsymposium.org/2011/papers/hotpets11-final5Tschorsch.pdf [2] https://www.sciencedirect.com/science/article/pii/S1389128613001941
On 30. Mar 2017, at 03:45, samir menon menon.samir@gmail.com wrote:
Hi there!
I'm Samir, a Computer Science student at Stanford University, with a focus in applied cryptography and computer security. This summer, I want to work (through GSoC) on computing usage statistics without keeping IP addresses in memory (see tickets #7532 and #15469) [1] [2].
Currently, we keep sets of IP's (or hashed IP's) in memory so that we can compute the number of unique client connections. This has been pointed out as a pretty serious concern, because the IP's themselves are sensitive info that we don't want an attacker to acquire, but the statistics are relatively valuable.
As Nick first pointed out in #15469, we can use proven techniques to compute these statistics without actually explicitly storing any IP's (or IP hashes) in memory. The technique I want to use, "Probabilistic Counting with Stochastic Averaging", or PCSA, is relatively well-studied, and can provide good estimates (<5% error) of the number of unique elements in a time series.
The basic idea is to count the number of 0's before the least significant 1 in every (Jenkins hashed) IP, and then recognize that the more unique IP's we encounter, the more likely it is that we see a hashed IP with a large number of 0's before the least significant 1. (Shoutout to Jaskaran and [3] for helping me understand this). A more detailed explanation and more resources for understanding PCSA are in the proposal.
Here is my draft proposal (also attached, but links don't work): http://stanford.edu/~samir2/TorGSoCApplication.html
I'd love to hear feedback on it - what's feasible, what's most useful, and what I should focus on, etc. You can also chat with me about it on IRC at `samir2`!
Thanks, ~Samir Menon menon.samir@gmail.com Stanford University, B.S. Computer Science, 2019
[1] https://trac.torproject.org/projects/tor/ticket/7532 [2] https://trac.torproject.org/projects/tor/ticket/15469 [3] https://www.cs.princeton.edu/~rs/talks/AC11-Cardinality.pdf <TorGSoCAnonymousLocalStats.pdf>_______________________________________________ tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev