Re: [tor-dev] Proposal 288: Privacy-Preserving Statistics with Privcount in Tor (Shamir version)

13 Dec 2017

...
On 2 Dec 2017, at 07:26, Nick Mathewson <nickm@torproject.org> wrote:
Filename: 288-privcount-with-shamir.txt Title: Privacy-Preserving Statistics
with Privcount in Tor (Shamir version) Author: Nick Mathewson, Tim
Wilson-Brown, Aaron Johnson Created: 1-Dec-2017 Supercedes: 280 Status: Draft
0. Acknowledgments
Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented the
 PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended PrivEx's
 differential privacy guarantees to multiple counters in PrivCount:
https://github.com/privcount/privcount/blob/master/README.markdown#research-...
Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental
 PrivCount code, based on the PrivEx secret-sharing variant. This
 implementation includes contributions from the PrivEx authors, and others:
https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
This research was supported in part by NSF grants CNS-1111539, CNS-1314637,
 CNS-1526306, CNS-1619454, and CNS-1640548.
The use of a Shamir secret-sharing-based approach is due to a suggestion by
 Aaron Johnson (iirc); Carolin Zöbelein did some helpful analysis here.
Aaron Johnson and Tim Wilson-Brown made improvements to the draft proposal.
1. Introduction and scope
PrivCount is a privacy-preserving way to collect aggregate statistics about
 the Tor network without exposing the statistics from any single Tor relay.
This document describes the behavior of the in-Tor portion of the PrivCount
 system.  It DOES NOT describe the counter configurations, or any other
 parts of the system. (These will be covered in separate proposals.)
2. PrivCount overview
Here follows an oversimplified summary of PrivCount, with enough
 information to explain the Tor side of things.  The actual operation of the
 non-Tor components is trickier than described below.
In PrivCount, a Data Collector (DC, in this case a Tor relay) shares
 numeric data with N different Tally Reporters (TRs). (A Tally Reporter
 performs the summing and unblinding roles of the Tally Server and Share
 Keeper from experimental PrivCount.)
All N Tally Reporters together can reconstruct the original data, but no
 (N-1)-sized subset of the Tally Reporters can learn anything about the
 data.
(In reality, the Tally Reporters don't reconstruct the original data at
 all! Instead, they will reconstruct a _sum_ of the original data across all
 participating relays.)
In brief, the system works as follow:
To share data, for each counter value V to be shared, the Data Collector
 first adds Gaussian noise to V in order to produce V', uses (K,N) Shamir
 secret-sharing to generate N shares of V' (K<=N, K being the reconstruction
 threshold), encrypts each share to a different Tally Reporter, and sends
 each encrypted share to the Tally Reporter it is encrypted for.
The Tally Reporters then agree on the set S of Data Collectors that sent
 data to all of them, and each Tally Reporter forms a share of the aggregate
 value by decrypting the shares it received from the Data Collectors in S
 and adding them together. The Tally Reporters then, collectively, perform
 secret reconstruction, thereby learning the sum of all the different values
 V'.
The use of Shamir secret sharing lets us survive up to N-K crashing TRs.
 Waiting until the end to agree on a set S of surviving relays lets us
 survive an arbitrary number of crashing DCs. In order to prevent bogus data
 from corrupting the tally, the Tally Reporters can perform the aggregation
 step multiple times, each time proceeding with a different subset of S and
 taking the median of the resulting values.
Relay subsets should be chosen at random to avoid relays manipulating their
 subset membership(s). If an shared random value is required, all relays
 must submit their results, and then the next revealed shared random value
 can be used to select relay subsets. (Tor's shared random value can be
 calculated as soon as all commits have been revealed. So all relay results
 must be received *before* any votes are cast in the reveal phase for that
 shared random value.)
Below we describe the algorithm in more detail, and describe the data
 format to use.
...
Chelsea and I had a conversation on another list about Prio.

Prio is similar to PrivCount, but would allow the Data Collectors to
prove that their responses are in range, using a zero-knowledge proof
technique.

I'm posting this here with her permission:
...
From: teor <teor2345@gmail.com>
Date: 13 December 2017 at 09:26:45 AEDT
...
On 13 Dec 2017, at 03:02, chelsea komlo <me@chelseakomlo.com> wrote:
Hi All,
This came up in a slack channel I am a part of, and it seems relevant to
PrivCount work- I wanted to pass it along.
https://crypto.stanford.edu/prio/
"Prio is a privacy-preserving system for the collection of aggregate
statistics. Each Prio client holds a private data value (e.g., its
current location), and a small set of servers compute statistical
functions over the values of all clients (e.g., the most popular
location). As long as at least one server is honest, the Prio servers
learn nearly nothing about the clients’ private data, except what they
can infer from the aggregate statistics that the system computes"
I read the paper. I'm sure I didn't understand it all.
It's very new, but it shows promise.
And there's no real-world (even experimental) implementation yet.
Maybe we can architect PrivCount in Tor so it's possible to upgrade
to a system like this in future. In particular, we seem to be using
similar primitives to Prio, so that's useful.
Are there things we can change in the current proposal to make it easier
to upgrade to a system like Prio in future?

Perhaps it's enough to version all our data exchange formats.
I'm not sure if we should lock into one particular scheme, when
something better might come along.

T

--
Tim / teor

PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n
------------------------------------------------------------------------