Hi! Tim helped me write this this draft last week, and I shared it with the PrivCount authors. I've already gotten some good comments from Aaron, which I'll repost in a followup message, with his permission.
================================
Filename: 280-privcount-in-tor.txt Title: Privacy-Preseving Statistics with Privcount in Tor Author: Nick Mathewson, Tim Wilson-Brown Created: 02-Aug-2017 Status: Draft
0. Acknowledgments
Tariq Elahi, George Danezis, and Ian Goldberg designed and implemented the PrivEx blinding scheme. Rob Jansen and Aaron Johnson extended PrivEx's differential privacy guarantees to multiple counters in PrivCount:
https://github.com/privcount/privcount/blob/master/README.markdown#research-...
Rob Jansen and Tim Wilson-Brown wrote the majority of the experimental PrivCount code, based on the PrivEx secret-sharing variant. This implementation includes contributions from the PrivEx authors, and others:
https://github.com/privcount/privcount/blob/master/CONTRIBUTORS.markdown
1. Introduction and scope
PrivCount is a privacy-preserving way to collect aggregate statistics about the Tor network without exposing the statistics from any single Tor relay.
This document describes the behavior of the in-Tor portion of the PrivCount system. It DOES NOT describe the counter configurations, or any other parts of the system. (These will be covered in separate proposals.)
2. PrivCount overview
Here follows an oversimplified summary of PrivCount, with enough information to explain the Tor side of things. The actual operation of the non-Tor components is trickier than described below.
All values in the scheme below are 64-bit unsigned integers; addition and subtraction are modulo 2^64.
In PrivCount, a Data Collector (in this case a Tor relay) shares numeric data with N different Tally Reporters. (A Tally Reporter performs the summing and unblinding roles of the Tally Server and Share Keeper from experimental PrivCount.)
All N Tally Reporters together can reconstruct the original data, but no (N-1)-sized subset of the Tally Reporters can learn anything about the data.
(In reality, the Tally Reporters don't reconstruct the original data at all! Instead, they will reconstruct a _sum_ of the original data across all participating relays.)
To share data, for each value X to be shared, the relay generates random values B_1 though B_n, and shares each B_i secretly with a single Tally Reporter. The relay then publishes Y = X + SUM(B_i) + Z, where Z is a noise value taken at random from a gaussian distribution. The Tally Reporters can reconstruct X+Z by securely computing SUM(B_i) across all contributing Data Collectors. (Tally Reporters MUST NOT share individual B_i values: that would expose the underlying relay totals.)
In order to prevent bogus data from corrupting the tally, the Tor relays and the Tally Reporters perform multiple "instances" of this algorithm, randomly sampling in each relays. The relay sends multiple Y values for each measurement, built with different sets of B_i. These "instances" are numbered in order from 1 to R.
So that the system will still produce results in the event of a single Tally Reporter failure, these instances are distributed across multiple subsets of Tally Reporters.
Below we describe a data format for this.
3. The document format
This document format builds on the line-based directory format used for other tor documents, described in Tor's dir-spec.txt.
Using this format, we describe two kinds of documents here: a "counters" document that publishes all the Y values, and a "blinding" document that describes the B_i values. But see "An optimized alternative" below.
The "counters" document has these elements:
"privctr-dump-format" SP VERSION SP SigningKey
[At start, exactly once]
Describes the version of the dump format, and provides an ed25519 signing key to identify the relay. The signing key is encoded in base64 with padding stripped. VERSION is "alpha" now, but should be "1" once this document is finalized.
[[[TODO: Do we need a counter version as well?
Noise is distributed across a particular set of counters, to provide differential privacy guarantees for those counters. Reducing noise requires a break in the collection. Adding counters is ok if the noise on each counter monotonically increases. (Removing counters always reduces noise.)
We also need to work out how to handle instances with mixed Tor versions, where some Data Collectors report different counters to other Data Collectors. (The blinding works if we substitute zeroes for missing counters on Tally Reporters. But we also need to add noise in this case.)
-teor ]]]
"starting-at" SP IsoTime
[Exactly once]
The start of the time period when the statistics here were collected.
"ending-at" SP IsoTime
[Exactly once]
The end of the time period when the statistics here were collected.
"num-instances" SP Number
[Exactly once]
The number of "instances" that the relay used (see above.)
"tally-reporter" SP Identifier SP Key SP InstanceNumbers
[At least twice]
The curve25519 public key of each Tally Reporter that the relay believes in. (If the list does not match the list of participating tally reporters, they won't be able to find the relay's values correctly.) The identifiers are non-space, non-nul character sequences. The Key values are encoded in base64 with padding stripped; they must be unique within each counters document. The InstanceNumbers are comma-separated lists of decimal integers from 0 to (num-instances - 1), in ascending order.
Keyword ":" SP Int SP Int SP Int ...
[Any number of times]
The Y values for a single measurement. There are num-instances such Y values for each measurement. They are 64-bit unsigned integers, expressed in decimal.
The "Keyword" denotes which measurement is being shared. Keyword MAY be any sequence of characters other than colon, nul, space, and newline, though implementators SHOULD avoid getting too creative here. Keywords MUST be unique within a single document. Tally Reporters MUST handle unrecognized keywords. Keywords MAY appear in any order.
It is safe to send the blinded totals for each instance to every Tally Reporter. To unblind the totals, a Tally Reporter needs: * a blinding document from each relay in the instance, and * the per-counter blinding sums from the other Tally Reporters in their instance.
[[[TODO: But is it safer to create a per-instance counters document? -- teor]]]
The semantics of individual measurements are not specified here.
"signature" SP Signature
[At end, exactly once]
The Ed25519 signature of all the fields in the document, from the first byte, up to but not including the "signature" keyword here. The signature is encoded in base64 with padding stripped.
The "blinding" document has these elements:
"privctr-secret-offsets" SP VERSION SP SigningKey
[At start, exactly once.]
The VERSION and SigningKey parameters are the same as for "privctr-dump-format".
"instances" SP Numbers
[Exactly once]
The instances that this Tally Reporter handles. They are given as comma-separated decimal integers, as in the "tally-reporter" entry in the counters document. They MUST match the instances listed in the counters document.
[[[TODO: this is redundant. Specify the constraint instead? --teor]]]
"num-counters" SP Number
[Exactly once]
The number of counters that the relay used in its counters document. This MUST be equal to the number of keywords in the counters document.
[[[TODO: this is redundant. Specify the constraint instead? --teor]]]
"tally-reporter-pubkey" SP Key
[Exactly once]
The curve25519 public key of the tally reporter who is intended to receive an decrypt this document. The key is base64-encoded with padding stripped.
"count-document-digest" SP "sha3" Digest NL "-----BEGIN ENCRYPTED DATA-----" NL Data "-----END ENCRYPTED DATA-----" NL
[Exactly once]
The SHA3-256 digest of the count document corresponding to this blinding document. The digest is base64-encoded with padding stripped. The data encodes the blinding values (See "The Blinding Values") below, and is encrypted to the tally reporter's public key using the hybrid encryption algorithm described below.
"signature" SP Signature
[At end, exactly once]
The Ed25519 signature of all the fields in the document, from the first byte, up to but not including the "signature" keyword here. The signature is encoded in base64 with padding stripped.
4. The Blinding Values
The "Data" field of the blinding documents above, when decrypted, yields a sequence of 64-bit binary values, encoded in network (big-endian) order. There are C * R such values, where C is the number of keywords in the count document, and R is the number of instances that the Tally Reporter participates in. The client generates all of these values uniformly at random.
For each keyword in the count document, in the order specified by the count document, the decrypted data holds R*8 bytes for the specified instance of that keyword's blinded counter.
For example: if the count document lists the keywords "b", "x", "g", and "a" (in that order), and lists instances "0", and "2", then the decrypted data will hold the blinding values in this order: b, instance 0 b, instance 2 x, instance 0 x, instance 2 g, instance 0 g, instance 2 a, instance 0 a, instance 2
4. Implementation Notes
A relay should, when starting a new round, generate all the blinding values and noise values in advance. The relay should then use these values to compute Y_0 = SUM(B_i) + Z for each instance of each counter. Having done this, the relay MUST encrypt the blinding values to the public key of each tally reporter, and wipe them from memory.
5. The hybrid encryption algorithm
We use a hybrid encryption scheme above, where items can be encrypted to a public key. We instantiate it as follows, using curve25519 public keys.
To encrypt a plaintext M to a public key PK1 1. the sender generates a new ephemeral keypair sk2, PK2. 2. The sender computes the shared diffie hellman secret SEED = (sk2 * PK1).
3. The sender derives 64 bytes of key material as SHAKE256(TEXT | SEED)[...64] where "TEXT" is "Expand curve25519 for privcount encryption".
The first 32 bytes of this is an aes key K1; the second 32 bytes are a mac key K2.
4. The sender computes a ciphertext C as AES256_CTR(K1, M)
5. The sender computes a MAC as SHA3_256([00 00 00 00 00 00 00 20] | K2 | C)
6. The hybrid-encrypted text is PK2 | MAC | C.
6. An optimized alternative
As an alternative, the sequences of blinding values is NOT transmitted to the tally reporters. Instead the client generates a single ephemeral keypair sk_c, PK_c, and places the public key in its counts document. It does this each time a new round begins.
For each tally reporter with public key PK_i, the client then does the handshake sk_c * PK_i to compute SEED_i.
The client then generates the blinding values for that tally reporter as SHAKE256(SEED_i)[...R*C*8].
After initializing the counters to Y_0, the client can discard the blinding values and sk_c.
Later, the tally reporters can reconstruct the blinding values as SHAKE256(sk_i * PK_c)[...]
This alternative allows the client to transmit only a single public key, when previously it would need to transmit a complete set of blinding factors for each tally reporter. Further, the alternative does away with the need for blinding documents altogether. It is, however, more sensitive to any defects in SHAKE256 than the design above. Like the rest of this design, it would need rethinking if we want to expand this scheme to work with anonymous data collectors, such as Tor clients.