Hi there!
I'm Samir, a Computer Science student at Stanford University, with a focus in applied cryptography and computer security. This summer, I want to work (through GSoC) on computing usage statistics without keeping IP addresses in memory (see tickets #7532 and #15469) [1] [2].
Currently, we keep sets of IPs (or hashed IPs) in memory so that we can compute the number of unique client connections. This has been pointed out as a pretty serious concern: the IPs themselves are sensitive info that we don't want an attacker to acquire, yet the statistics are relatively valuable.
As Nick first pointed out in #15469, we can use proven techniques to compute these statistics without explicitly storing any IPs (or IP hashes) in memory. The technique I want to use, "Probabilistic Counting with Stochastic Averaging", or PCSA, is relatively well-studied and can provide good estimates (<5% error) of the number of unique elements in a time series.
The basic idea is to count the number of 0s before the least significant 1 in every (Jenkins-hashed) IP, and then recognize that the more unique IPs we encounter, the more likely it is that we see a hashed IP with a large number of 0s before the least significant 1. (Shoutout to Jaskaran and [3] for helping me understand this.) A more detailed explanation and more resources for understanding PCSA are in the proposal.
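The trailing-zero idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposal's implementation: SHA-256 stands in for the Jenkins hash, the class name `PCSASketch` and its parameters are hypothetical, and 256 bitmaps are chosen because the standard error of PCSA is roughly 0.78/sqrt(m) for m bitmaps, which lands near the 5% figure.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from Flajolet and Martin's analysis


class PCSASketch:
    """Estimates the number of distinct items without storing the items."""

    def __init__(self, num_bitmaps: int = 256, bitmap_bits: int = 32) -> None:
        self.num_bitmaps = num_bitmaps
        self.bitmap_bits = bitmap_bits
        self.bitmaps = [0] * num_bitmaps  # the only state that is kept

    def add(self, item: bytes) -> None:
        # Assumption: SHA-256 replaces the Jenkins hash for this sketch.
        h = int.from_bytes(hashlib.sha256(item).digest()[:8], "big")
        index = h % self.num_bitmaps   # stochastic averaging: pick a bitmap
        rest = h // self.num_bitmaps   # remaining hash bits
        if rest == 0:
            zeros = self.bitmap_bits - 1
        else:
            # Number of 0s before the least significant 1.
            zeros = min((rest & -rest).bit_length() - 1, self.bitmap_bits - 1)
        self.bitmaps[index] |= 1 << zeros

    def estimate(self) -> float:
        # R is the position of the lowest unset bit in each bitmap.
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:
                r += 1
            total += r
        # Known limitation: the estimate is biased when few items were added.
        return self.num_bitmaps / PHI * 2 ** (total / self.num_bitmaps)


sketch = PCSASketch()
for i in range(10000):
    sketch.add(f"10.0.{i // 256}.{i % 256}".encode())
print(round(sketch.estimate()))
```

Adding the same address again sets a bit that is already set, so repeat connections do not change the estimate.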
Here is my draft proposal (also attached, but links don't work): http://stanford.edu/~samir2/TorGSoCApplication.html
I'd love to hear feedback on it - what's feasible, what's most useful, and what I should focus on, etc. You can also chat with me about it on IRC at `samir2`!
Thanks,
~Samir Menon
menon.samir@gmail.com
Stanford University, B.S. Computer Science, 2019
[1] https://trac.torproject.org/projects/tor/ticket/7532
[2] https://trac.torproject.org/projects/tor/ticket/15469
[3] https://www.cs.princeton.edu/~rs/talks/AC11-Cardinality.pdf
Hi Samir,
this sounds like an interesting summer project.
Since you are interested in using PCSA, our work on privacy-preserving statistics, which actually develops a privacy-enhanced version of PCSA, might be helpful. We also propose it as a way to collect distributed statistics.
In our HotPETs paper [1], we sketch the basic idea. In our journal paper [2], we provide additional details on the algorithm. If you have any questions, just let me know.
Cheers, Florian.
[1] https://petsymposium.org/2011/papers/hotpets11-final5Tschorsch.pdf
[2] https://www.sciencedirect.com/science/article/pii/S1389128613001941
On 30. Mar 2017, at 03:45, samir menon menon.samir@gmail.com wrote:
[...]
Hi Samir,
It is my understanding that the Tor metrics team plans to handle this problem in a different way. IPs are kept in memory to provide statistics about users’ countries, and so they will instead just keep the country statistics directly. That is, a counter will be kept for each country; upon the establishment of a new “OR connection” (a Tor term that I believe translates to a TLS connection), the IP address will be mapped to a country, and then that country’s counter will be incremented. As is done currently, further privacy-preserving techniques would be applied to these counters before publishing them, such as rounding, adding random noise, or removing some of the counters. These counters could even potentially be stored locally in a differentially-private way, which would make the local counters even less interesting to a possible attacker. The suitability of adding such local noise depends on how inaccurate this would make the results.
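A rough Python sketch of the per-country counting described above may help make it concrete. Everything named here is an assumption for illustration: `lookup_country` and its prefix table stand in for a real GeoIP lookup, and the `epsilon` and `bin_size` defaults are placeholders, not values used by Tor. Laplace noise is one common way to add random noise to a counter; the noise scale assumes that one connection changes a counter by exactly one.

```python
import math
import random
from collections import Counter

# Hypothetical stand-in for a GeoIP database (documentation address ranges).
PREFIX_TO_COUNTRY = {"192.0.2.": "de", "198.51.100.": "us", "203.0.113.": "se"}


def lookup_country(ip: str) -> str:
    for prefix, country in PREFIX_TO_COUNTRY.items():
        if ip.startswith(prefix):
            return country
    return "??"


class CountryCounters:
    def __init__(self) -> None:
        self.counts = Counter()

    def on_connection(self, ip: str) -> None:
        # Only the country counter is updated; the IP itself is not stored.
        self.counts[lookup_country(ip)] += 1

    def publish(self, epsilon: float = 0.3, bin_size: int = 8, rng=None) -> dict:
        rng = rng or random.Random()
        scale = 1.0 / epsilon  # assumes one connection changes a count by 1
        published = {}
        for country, count in self.counts.items():
            # The difference of two exponentials is a Laplace(0, scale) sample.
            noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
            noisy = max(0.0, count + noise)
            # Round up to the next multiple of bin_size before publishing.
            published[country] = bin_size * math.ceil(noisy / bin_size)
        return published


counters = CountryCounters()
for ip in ["192.0.2.7", "192.0.2.7", "198.51.100.23"]:
    counters.on_connection(ip)
print(counters.publish(rng=random.Random(0)))
```

Note that the repeated address increments the "de" counter twice, so this counts connections rather than unique IPs.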
You may wish to contact Karsten Loesing of the Tor Metrics team to verify my understanding.
Best, Aaron
On Apr 1, 2017, at 7:19 AM, Florian Tschorsch tschorsch@informatik.hu-berlin.de wrote:
[...]
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev
Hi Aaron,
These statistics don't just report the user's country; they also keep track of the unique IP addresses connecting from each country. This is needed to present more realistic stats. If we increment the counter for every connecting IP address rather than for every unique IP address, the statistics would also reflect users connecting again and again. If we don't count unique IPs, we would have stats about per-country usage rather than per-country users. We could do much better and implement a method (as described by the OP of this thread) that counts unique IPs while preserving privacy.
As for your second point, about hiding the actual counter from an adversary: I agree that exposing it can potentially de-anonymize a client. An adversary (let's say the government of some small, less populous country) could try to fingerprint the traffic of its target(s) and later correlate it with the data we publish on the metrics site. This attack could work very well for countries where the Tor users can be counted on one's fingers. So, I believe hiding the counter data should also be implemented along with hiding the IP addresses.
Regards,
--
Jaskaran Veer Singh (jvsg)
jvsg1303 at gmail dot com
PGP 2814 3FB7 A32D 429B 092E 27F0 8AA3 C532 9E1A 6AD8
Aaron,
I think Jaskaran explained it well - basically, we compute statistics other than requests per country, and one of those stats is unique clients, which we can use PCSA for. The `format_client_stats_heartbeat` function in `/src/or/geoip.c` is where we actually compute the unique clients and log that in the heartbeat message.
I think perhaps my proposal doesn't make clear that this PCSA change is in addition to other methods of getting IPs out of memory; I will try to update it to emphasize this. I will also do more research on the 'fuzzing' of country counts, and I will definitely contact Karsten Loesing.
Thanks, ~Samir Menon
On Sat, Apr 1, 2017 at 11:41 AM, Jaskaran Singh jvsg1303@gmail.com wrote:
[...]
> These statistics don't just report the user's country; they also keep track of the unique IP addresses connecting from each country. This is needed to present more realistic stats. If we increment the counter for every connecting IP address rather than for every unique IP address, the statistics would also reflect users connecting again and again. If we don't count unique IPs, we would have stats about per-country usage rather than per-country users. We could do much better and implement a method (as described by the OP of this thread) that counts unique IPs while preserving privacy.
It is true that this would count connections rather than unique IPs. However, Tor already infers the number of users by counting directory downloads and then adjusting that number based on how many each user is expected to make. In addition, each user doesn’t necessarily correspond to a different IP because of NAT, and so counting connections may actually be more accurate.
Best, Aaron
Which stats are you talking about, Aaron?
On Sun, Apr 2, 2017 at 5:45 PM, Aaron Johnson aaron.m.johnson@nrl.navy.mil wrote:
[...]
Sorry, I should have been more clear there. Tor Metrics estimates the total number of users by counting the number of directory downloads and dividing by an estimated expected number of directory downloads per user per day (10, I believe). This statistic is in the graph under the “Relay Users” tab on https://metrics.torproject.org/userstats-relay-country.html.
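As a small worked example of the division described above (the function name and the request count are made up for illustration; the default of 10 is the figure given in the message, which is itself stated as a belief):

```python
def estimate_daily_users(directory_requests: int,
                         requests_per_user_per_day: float = 10.0) -> float:
    """Estimate users from a daily count of directory downloads."""
    return directory_requests / requests_per_user_per_day


# Hypothetical input: 25,000 directory downloads observed in one day.
print(estimate_daily_users(25000))
```

With these inputs the estimate is 2,500 users, regardless of how many distinct IP addresses made the requests.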
Best, Aaron
On Apr 2, 2017, at 8:51 AM, Veer Kalantri mads.531998@gmail.com wrote:
about which stats are you talking Aaron?
Also, I think that counting users by IP is still a fine way to do it (absent the privacy issue that PCSA tries to address). I was just stating that my understanding based on talking to the Tor Metrics people is that the plan is to handle the privacy issue by moving to per-connection country statistics instead of by implementing PCSA.
I would also wonder how the privacy of PCSA actually compares to the privacy of per-country (noisy) counting, especially if the local statistics could be locally stored in a differentially-private way (again, this requires an accuracy analysis). As Tschorsch and Scheuermann note [0], the FM sketch used by PCSA can indicate the presence of an individual user (Sec. 4). Thus they propose to add noise by independently flipping some of the PCSA bits (Sec. 5). This seems quite similar to the differentially-private technique of adding noise to a counter. It is not clear to me that it is better to suffer the inaccuracy of the PCSA sketching plus that of the added noise when one could simply rely on adding differentially-private noise, especially when the latter provides a precise notion of privacy where the former does not.
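To illustrate the kind of perturbation discussed above, here is a minimal Python sketch that flips each bit of a set of bitmaps independently with some probability. It is an assumption-laden illustration, not the algorithm from [0]: the function name `flip_bits` and the parameter `p` are hypothetical, and the correction an estimator would need to account for the flipped bits is not shown.

```python
import random


def flip_bits(bitmaps, bits, p, rng=random):
    """Return copies of the bitmaps with every bit flipped with probability p."""
    noisy = []
    for bitmap in bitmaps:
        mask = 0
        for position in range(bits):
            if rng.random() < p:
                mask |= 1 << position  # mark this bit for flipping
        noisy.append(bitmap ^ mask)    # XOR flips exactly the marked bits
    return noisy


# Hypothetical 8-bit bitmaps, each bit flipped with probability 0.1.
print(flip_bits([0b00001111, 0b00000011], bits=8, p=0.1, rng=random.Random(0)))
```

After flipping, a set bit no longer proves that some address hashed to it, at the cost of extra estimation error on top of the sketch's own.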
Best, Aaron
[0] Florian Tschorsch and Björn Scheuermann, "An algorithm for privacy-preserving distributed user statistics", Computer Networks 57 (2013).
On Apr 2, 2017, at 9:07 AM, Aaron Johnson aaron.m.johnson@nrl.navy.mil wrote:
[...]
On 02.04.17 15:22, Aaron Johnson wrote:
Also, I think that counting users by IP is still a fine way to do it (absent the privacy issue that PCSA tries to address). I was just stating that my understanding based on talking to the Tor Metrics people is that the plan is to handle the privacy issue by moving to per-connection country statistics instead of by implementing PCSA.
That's true, and thanks, Aaron, for responding here!
The metrics team indeed has plans in this direction:
https://trac.torproject.org/projects/tor/wiki/org/teams/MetricsTeam#Objectiv...
"""
- 1.4. Reduce the amount of sensitive, potentially personally identifying data stored in memory of Tor relays and bridges by implementing new directory-request statistics based on requests by country, transport, and IP version and removing existing directory-request statistics based on unique IP addresses by country, transport, or IP version (Sponsor X 4.2. Tor daemon)
"""
Note that this is a plan of the metrics team for the current quarter which has not yet been discussed in detail with the network team. But that discussion won't happen before the GSoC student application deadline, which is in ~24 hours, I believe. We're planning to write a proposal in the next few weeks and then discuss it here.
Also note that there are still other unique IP statistics than the ones on connecting directory clients, even though they are disabled by default and thus less relevant. Still, protecting unique IP addresses of clients connecting to entry guards seems like a worthwhile project.
Sorry for not being more helpful.
Good luck with GSoC applications, everyone!
All the best, Karsten