commit 8f56a246f8c2501e1c52594f804292ad6b7cca06 Author: Nick Mathewson nickm@torproject.org Date: Fri Dec 1 15:25:10 2017 -0500
Add privcount-with-shamir proposal --- proposals/000-index.txt | 2 + proposals/288-privcount-with-shamir.txt | 564 ++++++++++++++++++++++++++++++++ 2 files changed, 566 insertions(+)
diff --git a/proposals/000-index.txt b/proposals/000-index.txt index dd9abdc..4b055f0 100644 --- a/proposals/000-index.txt +++ b/proposals/000-index.txt @@ -208,6 +208,7 @@ Proposals by number: 285 Directory documents should be standardized as UTF-8 [OPEN] 286 Controller APIs for hibernation access on mobile [OPEN] 287 Reduce circuit lifetime without overloading the network [OPEN] +288 Privacy-Preserving Statistics with Privcount in Tor (Shamir version) [DRAFT]
Proposals by status: @@ -233,6 +234,7 @@ Proposals by status: 279 A Name System API for Tor Onion Services 280 Privacy-Preserving Statistics with Privcount in Tor 281 Downloading microdescriptors in bulk + 288 Privacy-Preserving Statistics with Privcount in Tor (Shamir version) NEEDS-REVISION: 190 Bridge Client Authorization Based on a Shared Secret NEEDS-RESEARCH: diff --git a/proposals/288-privcount-with-shamir.txt b/proposals/288-privcount-with-shamir.txt new file mode 100644 index 0000000..4489048 --- /dev/null +++ b/proposals/288-privcount-with-shamir.txt @@ -0,0 +1,564 @@ +Filename: 288-privcount-with-shamir.txt +Title: Privacy-Preserving Statistics with Privcount in Tor (Shamir version) +Author: Nick Mathewson, Tim Wilson-Brown, Aaron Johnson +Created: 1-Dec-2017 +Supercedes: 280 +Status: Draft + +0. Acknowledgments + + Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented + the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended + PrivEx's differential privacy guarantees to multiple counters in + PrivCount: + + https://github.com/privcount/privcount/blob/master/README.markdown#research-... + + Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental + PrivCount code, based on the PrivEx secret-sharing variant. This + implementation includes contributions from the PrivEx authors, and + others: + + https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown + + This research was supported in part by NSF grants CNS-1111539, + CNS-1314637, CNS-1526306, CNS-1619454, and CNS-1640548. + + The use of a Shamir secret-sharing-based approach is due to a + suggestion by Aaron Johnson (iirc); Carolin Zöbelein did some helpful + analysis here. + + Aaron Johnson and Tim Wilson-Brown made improvements to the draft proposal. + +1. Introduction and scope + + PrivCount is a privacy-preserving way to collect aggregate statistics + about the Tor network without exposing the statistics from any single + Tor relay. + + This document describes the behavior of the in-Tor portion of the + PrivCount system. It DOES NOT describe the counter configurations, + or any other parts of the system. (These will be covered in separate + proposals.) + +2. PrivCount overview + + Here follows an oversimplified summary of PrivCount, with enough + information to explain the Tor side of things. The actual operation + of the non-Tor components is trickier than described below. + + In PrivCount, a Data Collector (DC, in this case a Tor relay) shares + numeric data with N different Tally Reporters (TRs). (A Tally Reporter + performs the summing and unblinding roles of the Tally Server and Share + Keeper from experimental PrivCount.) + + All N Tally Reporters together can reconstruct the original data, but + no (N-1)-sized subset of the Tally Reporters can learn anything about + the data. + + (In reality, the Tally Reporters don't reconstruct the original data + at all! Instead, they will reconstruct a _sum_ of the original data + across all participating relays.) + + In brief, the system works as follow: + + To share data, for each counter value V to be shared, the Data Collector + first adds Gaussian noise to V in order to produce V', uses (K,N) Shamir + secret-sharing to generate N shares of V' (K<=N, K being the + reconstruction threshold), encrypts each share to a different Tally + Reporter, and sends each encrypted share to the Tally Reporter it + is encrypted for. + + The Tally Reporters then agree on the set S of Data Collectors that sent + data to all of them, and each Tally Reporter forms a share of the aggregate + value by decrypting the shares it received from the Data Collectors in S + and adding them together. The Tally Reporters then, collectively, perform + secret reconstruction, thereby learning the sum of all the different + values V'. + + The use of Shamir secret sharing lets us survive up to N-K crashing TRs. + Waiting until the end to agree on a set S of surviving relays lets us + survive an arbitrary number of crashing DCs. In order to prevent bogus + data from corrupting the tally, the Tally Reporters can perform the + aggregation step multiple times, each time proceeding with a different + subset of S and taking the median of the resulting values. + + Relay subsets should be chosen at random to avoid relays manipulating their + subset membership(s). If an shared random value is required, all relays must + submit their results, and then the next revealed shared random value can + be used to select relay subsets. (Tor's shared random value can be + calculated as soon as all commits have been revealed. So all relay results + must be received *before* any votes are cast in the reveal phase for that + shared random value.) + + Below we describe the algorithm in more detail, and describe the data + format to use. + +3. The algorithm + + All values below are B-bit integers modulo some prime P; we suggest + B=62 and P = 2**62 - 2**30 - 1 (hex 0x3fffffffbfffffff). The size of + this field is an upper limit on the largest sum we can calculate; it + is not a security parameter. + + There are N Tally Reporters: every participating relay must agree on + which N exist, and on their current public keys. We suggest listing + them in the consensus networkstatus document. All parties must also + agree on some ordering the Tally Reporters. Similarly, all parties + must also agree on some value K<=N. + + There are a number of well-known "counters", identified known by ASCII + identifiers. Each counter is a value that the participating relays + will know how to count. Let C be the number of counters. + +3.1. Data Collector (DC) side + + At the start of each period, every Data Collector ("client" below) + initializes their state as follows + + 1. For every Tally Reporter with index i, the client constructs a + random 32-byte random value SEED_i. The client then generates + a pseudorandom bitstream of C*B bits using the SHAKE-256 + XOF with SEED_i as its input, and divides this stream into + C values, with the c'th value denoted by MASK(i, c). + + [Because P is very close to a power of 2, nearly all seeds will + produce MASK values in range 0...(P-1). If any does not, the + client picks a new seed.] + + 2. The client encrypts SEED_i using the public key of Tally + Reporter i, and remembers this encrypted value. It discards + SEED_i. + + 3. For every counter c, the client generates a noise value Z_c + from an appropriate Gaussian distribution. If the noise value is + negative, the client adds P to bring Z_c into the range 0...(P-1). + (The noise MUST be sampled using the procedure in Appendix C.) + + The client then uses Shamir secret sharing to generate + N shares (x,y) of Z_c, 1 <= x <= N, with the x'th share to be used by + the x'th Tally Reporter. See Appendix A for more on Shamir secret + sharing. See Appendix B for another idea about X coordinates. + + The client picks a random value CTR_c and stores it in the counter, + which serves to locally blind the counter. + + The client then subtracts (MASK(x, c)+CTR_c) from y, giving + "encrypted shares" of (x, y0) where y0 = y-CTR_c. + + The client then discards all MASK values, all CTR values, and all + original shares (x,y), all CTR and the noise value Z_c. For each + counter c, it remembers CTR_c, and N shares of the form (x, y). + + To increment a counter by some value "inc": + + 1. The client adds "inc" to counter value, modulo P. + + (This step is chosen to be optimal, since it will happen more + frequently than any other step in the computation.) + + Aggregate counter values that are close to P/2 MUST be scaled to + avoid overflow. See Appendix D for more information. (We do not think + that any counters on the current Tor network will require scaling.) + + To publish the counter values: + + 1. The client publishes, in the format described below: + + The list of counters it knows about + The list of TRs it knows about + For each TR: + For each counter c: + A list of (i, y-CTR_c-MASK(x,c)), which corresponds + to the share for the i'th TR of counter c. + SEED_i as encrypted earlier to the i'th TR's public key. + +3.2. Tally Reporter (TR) side + + This section is less completely specified than the Data Collector's + behavior: I expect that the TRs will be easier to update as we proceed. + + (Each TR has a long-term identity key (ed25519). It also has a + sequence of short-term curve25519 keys, each associated with a single + round of data collection.) + + 1. When a group of TRs receives information from the Data Collectors, + they collectively chose a set S of DCs and a set of counters such + that every TR in the group has a valid entry for every counter, + from every DC in the set. + + To be valid, an entry must not only be well-formed, but must also + have the x coordinate in its shares corresponding to the + TR's position in the list of TRs. + + 2. For each Data Collector's report, the i'th TR decrypts its part of + the client's report using its curve25519 key. It uses SEED_i and + SHAKE-256 to regenerate MASK(0) through MASK(C-1). Then for each + share (x, y-CTR_c-MASK(x,c)) (note that x=i), the TR reconstructs the + true share of the value for that DC and counter c by adding + V+MASK(x,c) to the y coordinate to yield the share (x, y_final). + + 3. For every counter in the set, each TR computes the sum of the + y_final values from all clients. + + 4. For every counter in the set, each TR publishes its a share of + the sum as (x, SUM(y_final)). + + 5. If at least K TRs publish correctly, then the sum can be + reconstructed using Lagrange polynomial interpolation. (See + Appendix A). + + 6. If the reconstructed sum is greater than P/2, it is probably a negative + value. The value can be obtained by subtracting P from the sum. + (Negative values are generated when negative noise is added to small + signals.) + + 7. If scaling has been applied, the sum is scaled by the scaling factor. + (See Appendix D.) + +4. The document format + +4.1. The counters document. + + This document format builds on the line-based directory format used + for other tor documents, described in Tor's dir-spec.txt. + + Using this format, we describe a "counters" document that publishes + the shares collected by a given DC, for a single TR. + + The "counters" document has these elements: + + "privctr-dump-format" SP VERSION SP SigningKey + + [At start, exactly once] + + Describes the version of the dump format, and provides an ed25519 + signing key to identify the relay. The signing key is encoded in + base64 with padding stripped. VERSION is "alpha" now, but should + be "1" once this document is finalized. + + "starting-at" SP IsoTime + + [Exactly once] + + The start of the time period when the statistics here were + collected. + + "ending-at" SP IsoTime + + [Exactly once] + + The end of the time period when the statistics here were + collected. + + "share-parameters" SP Number SP Number + + [Exactly once] + + The number of shares needed to reconstruct the client's + measurements (K), and the number of shares produced (N), + respectively. + + "tally-reporter" SP Identifier SP Integer SP Key + + [At least twice] + + The curve25519 public key of each Tally Reporter that the relay + believes in. (If the list does not match the list of + participating Tally Reporters, they won't be able to find the + relay's values correctly.) The identifiers are non-space, + non-nul character sequences. The Key values are encoded in + base64 with padding stripped; they must be unique within each + counters document. The Integer values are the X coordinate of + the shares associated with each Tally Reporter. + + "encrypted-to-key" SP Key + + [Exactly once] + + The curve25519 public key to which the report below is encrypted. + Note that it must match one of the Tally Reporter options above. + + + "report" NL + "----- BEGIN ENCRYPTED MESSAGE-----" NL + Base64Data + "----- END ENCRYPTED MESSAGE-----" NL + + [Exactly once] + + An encrypted document, encoded in base64. The plaintext format is + described in section 4.2. below. The encryption is as specified in + section 5 below, with STRING_CONSTANT set to "privctr-shares-v1". + + "signature" SP Signature + + [At end, exactly once] + + The Ed25519 signature of all the fields in the document, from the + first byte, up to but not including the "signature" keyword here. + The signature is encoded in base64 with padding stripped. + +4.2. The encrypted "shares" document. + + The shares document is sent, encrypted, in the "report" element above. + Its plaintext contents include these fields: + + "encrypted-seed" NL + "----- BEGIN ENCRYPTED MESSAGE-----" NL + Base64Data + "----- END ENCRYPTED MESSAGE-----" NL + + [At start, exactly once.] + + An encrypted document, encoded in base64. The plaintext value is + the 32-byte value SEED_i for this TR. The encryption is as + specified in section 5 below, with STRING_CONSTANT set to + "privctr-seed-v1". + + "d" SP Keyword SP Integer + + [Any number of times] + + For each counter, the name of the counter, and the obfuscated Y + coordinate of this TR's share for that counter. (The Y coordinate + is calculated as y-CTR_c as in 3.1 above.) The order of counters + must correspond to the order used when generating the MASK() values; + different clients do not need to choose the same order. + +5. Hybrid encryption + + This scheme is taken from rend-spec-v3.txt, section 2.5.3, replacing + "secret_input" and "STRING_CONSTANT". It is a hybrid encryption + method for encrypting a message to a curve25519 public key PK. + + We generate a new curve25519 keypair (sk,pk). + + We run the algorithm of rend-spec-v3.txt 2.5.3, replacing + "secret_input" with Curve25519(sk,PK) | SigningKey, where + SigningKey is the DC's signing key. (Including the DC's SigningKey + here prevents one DC from replaying another one's data.) + + We transmit the encrypted data as in rend-spec-v3.txt 2.5.3, + prepending pk. + + +Appendix A. Shamir secret sharing for the impatient + + In Shamir secret sharing, you want to split a value in a finite + field into N shares, such that any K of the N shares can + reconstruct the original value, but K-1 shares give you no + information at all. + + The key insight here is that you can reconstruct a K-degree + polynomial given K+1 distinct points on its curve, but not given + K points. + + So, to split a secret, we going to generate a (K-1)-degree + polynomial. We'll make the Y intercept of the polynomial be our + secret, and choose all the other coefficients at random from our + field. + + Then we compute the (x,y) coordinates for x in [1, N]. Now we + have N points, any K of which can be used to find the original + polynomial. + + Moreover, we can do what PrivCount wants here, because adding the + y coordinates of N shares gives us shares of the sum: If P1 is + the polynomial made to share secret A and P2 is the polynomial + made to share secret B, and if (x,y1) is on P1 and (x,y2) is on + P2, then (x,y1+y2) will be on P1+P2 ... and moreover, the y + intercept of P1+P2 will be A+B. + + To reconstruct a secret from a set of shares, you have to either + go learn about Lagrange polynomials, or just blindly copy a + formula from your favorite source. + + Here is such a formula, as pseudocode^Wpython, assuming that + each share is an object with a _x field and a _y field. + + def interpolate(shares): + for sh in shares: + product_num = FE(1) + product_denom = FE(1) + for sh2 in shares: + if sh2 is sh: + continue + product_num *= sh2._x + product_denom *= (sh2._x - sh._x) + + accumulator += (sh._y * product_num) / product_denom + + return accumulator + + +Appendix B. An alternative way to pick X coordinates + + Above we describe a system where everybody knows the same TRs and + puts them in the same order, and then does Shamir secret sharing + using "x" as the x coordinate for the x'th TR. + + But what if we remove that requirement by having x be based on a hash + of the public key of the TR? Everything would still work, so long as + all users chose the same K value. It would also let us migrate TR + sets a little more gracefully. + + +Appendix C. Sampling floating-point Gaussian noise for differential privacy + + Background: + + When we add noise to a counter value (signal), we want the added noise to + protect all of the bits in the signal, to ensure differential privacy. + + But because noise values are generated from random double(s) using + floating-point calculations, the resulting low bits are not distributed + evenly enough to ensure differential privacy. + + As implemented in the C "double" type, IEEE 754 double-precision + floating-point numbers contain 53 significant bits in their mantissa. This + means that noise calculated using doubles can not ensure differential + privacy for client activity larger than 2**53: + * if the noise is scaled to the magnitude of the signal using + multiplication, then the low bits are unprotected, + * if the noise is not scaled, then the high bits are unprotected. + + But the operations in the noise transform also suffer from floating-point + inaccuracy, further affecting the low bits in the mantissa. So we can only + protect client activity up to 2**46 with Laplacian noise. (We assume that + the limit for Gaussian noise is similar.) + + Our noise generation procedure further reduces this limit to 2**42. For + byte counters, 2**42 is 4 Terabytes, or the observed bandwidth of a 1 Gbps + relay running at full speed for 9 hours. It may be several years before we + want to protect this much client activity. However, since the mitigation is + relatively simple, we specify that it MUST be implemented. + + Procedure: + + Data collectors MUST sample noise as follows: + 1. Generate random double(s) in [0, 1] that are integer multiples of + 2**-53. + TODO: the Gaussian transform in step 2 may require open intervals + 2. Generate a Gaussian floating-point noise value at random with sigma 1, + using the random double(s) generated in step 1. + 3. Multiply the floating-point noise by the floating-point sigma value. + 4. Truncate the scaled noise to an integer to remove the fractional bits. + (These bits can never correspond to signal bits, because PrivCount only + collects integer counters.) + 5. If the floating-point sigma value from step 3 is large enough that any + noise value could be greater than or equal to 2**46, we need to + randomise the low bits of the integer scaled noise value. (This ensures + that the low bits of the signal are always hidden by the noise.) + + If we use the sample_unit_gaussian() transform in nickm/privcount_nm: + A. The maximum r value is sqrt(-2.0*ln(2**-53)) ~= 8.57, and the + maximal sin(theta) values are +/- 1.0. Therefore, the generated + noise values can be greater than or equal to 2**46 when the sigma + value is greater than 2**42. + B. Therefore, the number of low bits that need to be randomised is: + N = floor(sigma / 2**42) + C. We randomise the lowest N bits of the integer noise by replacing them + with a uniformly distributed N-bit integer value in 0...(2**N)-1. + 6. Add the integer noise to the integer counter, before the counter is + incremented in response to events. (This ensures that the signal value + is always protected.) + + This procedure is security-sensitive: changing the order of + multiplications, truncations, or bit replacements can expose the low or + high bits of the signal or noise. + + As long as the noise is sampled using this procedure, the low bits of the + signal are protected. So we do not need to "bin" any signals. + + The impact of randomising more bits than necessary is minor, but if we fail + to randomise an unevenly distributed bit, client activity can be exposed. + Therefore, we choose to randomise all bits that could potentially be affected + by floating-point inaccuracy. + + Justification: + + Although this analysis applies to Laplacian noise, we assume a similar + analysis applies to Gaussian noise. (If we add Laplacian noise on DCs, + the total ends up with a Gaussian distribution anyway.) + + TODO: check that the 2**46 limit applies to Gaussian noise. + + This procedure results in a Gaussian distribution for the higher ~42 bits + of the noise. We can safely ignore the value of the lower bits of the noise, + because they are insignificant for our reporting. + + This procedure is based on section 5.2 of: + "On Significance of the Least Significant Bits For Differential Privacy" + Ilya Mironov, ACM CCS 2012 + https://www.microsoft.com/en-us/research/wp-content/uploads/2012/10/lsbs.pdf + + We believe that this procedure is safe, because we neither round nor smooth + the noise values. The truncation in step 4 has the same effect as Mironov's + "safe snapping" procedure. Randomising the low bits removes the 2**46 limit + on the sigma value, at the cost of departing slightly from the ideal + infinite-precision Gaussian distribution. (But we already know that these + bits are distributed poorly, due to floating-point inaccuracy.) + + Mironov's analysis assumes that a clamp() function is available to clamp + large signal and noise values to an infinite floating-point value. + Instead of clamping, PrivCount's arithmetic wraps modulo P. We believe that + this is safe, because any reported values this large will be meaningless + modulo P. And they will not expose any client activity, because "modulo P" + is an arithmetic transform of the summed noised signal value. + + Alternatives: + + We could round the encrypted value to the nearest multiple of the + unprotected bits. But this relies on the MASK() value being a uniformly + distributed random value, and it is less generic. + + We could also simply fail when we reach the 2**42 limit on the sigma value, + but we do not want to design a system with a limit that low. + + We could use a pure-integer transform to create Gaussian noise, and avoid + floating-point issues entirely. But we have not been able to find an + efficient pure-integer Gaussian or Laplacian noise transform. Nor do we + know if such a transform can be used to ensure differential privacy. + + +Appendix D. Scaling large counters + + We do not believe that scaling will be necessary to collect PrivCount + statistics in Tor. As of November 2017, the Tor network advertises a + capacity of 200 Gbps, or 2**51 bytes per day. We can measure counters as + large as ~2**61 before reaching the P/2 counter limit. + + If scaling becomes necessary, we can scale event values (and noise sigmas) + by a scaling factor before adding them to the counter. Scaling may introduce + a bias in the final result, but this should be insignificant for reporting. + + +Appendix Z. Remaining client-side uncertainties + + [These are the uncertainties at the client side. I'm not considering + TR-only operations here unless they affect clients.] + + Should we do a multi-level thing for the signing keys? That is, have + an identity key for each TR and each DC, and use those to sign + short-term keys? + + How to tell the DCs the parameters of the system, including: + - who the TRs are, and what their keys are? + - what the counters are, and how much noise to add to each? + - how do we impose a delay when the noise parameters change? + (this delay ensures differential privacy even when the old and new + counters are compared) + - or should we try to monotonically increase counter noise? + - when the collection intervals start and end? + - what happens in networks where some relays report some counters, and + other relays report other counters? + - do we just pick the latest counter version, as long as enough relays + support it? + (it's not safe to report multiple copies of counters) + + How the TRs agree on which DCs' counters to collect? + + How data is uploaded to DCs? + + What to say about persistence on the DC side?
tor-commits@lists.torproject.org