On 10/9/12, Nick Mathewson <nickm@alum.mit.edu> wrote:
On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom <rransom.8774@gmail.com> wrote: [...]
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message.
Right; I meant to say the output of a stream cipher used as a PRF, but what I meant isn't what I said. Should've proofread more carefully.
Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
I read everything I cited. If there is something I didn't understand, or something I missed, or something I got wrong, or something I said wrong, that doesn't mean I didn't read the paper.
I am not going to be able to draw all the right inferences from every paper on my own, though. And I am *definitely*, *frequently*, going to read papers, come up with questions, and post those questions here sometimes even when the paper, properly understood, would answer my questions. If I were writing for publication, I'd want to keep all my ideas secret until I could answer all my questions and make sure all my answers were right, but I'm not writing for publication -- I am writing to get feedback from other people and learn things.
You have been talking about using Poly1305 truncated to 64 bits for weeks. It is truly not difficult to find Theorem 3.3 in the Poly1305 paper and figure out that the polynomial-evaluation part of Poly1305 truncated to its first (least significant) 64 bits has differential probabilities of at most 2^(-34) for Tor-cell-size messages (and thus Poly1305 has a probability of forgery of at most 2^(-34) when truncated to its first 64 bits).
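The arithmetic behind that 2^(-34) figure can be checked directly. This sketch assumes the Theorem 3.3 bound of 8*ceil(L/16)/2^106 for L-byte messages and a 509-byte Tor cell payload (the result is the same for 512 bytes, since both round up to 32 chunks):

```python
from math import ceil, log2

# Theorem 3.3 bound (assumed form): differential probabilities for
# messages of at most L bytes are at most 8*ceil(L/16) / 2**106.
L = 509                                # Tor relay cell payload, bytes
bound_full = 8 * ceil(L / 16) / 2**106 # 2**8 / 2**106 = 2**-98

# Truncating the 128-bit tag to its first 64 bits can inflate the
# bound by at most a factor of 2**(128 - 64) = 2**64.
bound_trunc = bound_full * 2**64

print(log2(bound_trunc))  # -> -34.0
```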
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
Are there any non-obvious reasons why? Does it lead to any better attacks than the obvious ones?
* Cannibalized circuits are one hop longer than non-cannibalized circuits; knowing that a particular circuit was cannibalized leaks information about the client's previous exit-selection behaviour.
* Some users might want to use alternate path-selection algorithms (e.g. http://freehaven.net/anonbib/#ccs2011-trust ); they might not want to leak the fact that they are using such algorithms to the non-first relays in their entry-guard chains.
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
That's correct. Why would it be an attack for the exit to send a covert signal to the client? The exit already has valid means to send overt signals to the client, and the client Tor is presumed not to want to break its own anonymity.
* Mallory's malicious entry and exit relays suspect (based on timing correlation) that they control both ends of a circuit, and want to confirm that. The exit tags a cell it sends to the client by XORing it with a secret random constant; the entry XORs (what it believes is) that cell with the same constant. If the circuit survives, Mallory's relays know that they control both ends.
* Mallory doesn't want to pay the bandwidth costs for non-compromised traffic, so his/her/its entry and exit relays tag *every* circuit's first exit-to-client cell with the same secret random constant. (You added a crapload of bugs to 0.2.3.x based on Mike Perry's claim that someone is likely to actually perform an ‘attack’ of this form.)
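To make the first bullet concrete, here is a toy model of the XOR-tagging confirmation attack; the names and cell handling are illustrative only, since a real attack operates on Tor's actual cell crypto:

```python
import os

CELL_LEN = 509                       # Tor relay cell payload, bytes
tag = os.urandom(CELL_LEN)           # Mallory's secret random constant

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

cell = os.urandom(CELL_LEN)          # an exit-to-client cell
tagged = xor(cell, tag)              # malicious exit applies the tag
recovered = xor(tagged, tag)         # colluding entry removes it

# If the two relays really share a circuit, the cell is restored
# intact and the circuit survives; otherwise the client sees garbage
# and the circuit dies, confirming or refuting the correlation guess.
assert recovered == cell
```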
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Ah, so I think you are claiming that LION or BEAR on its own is fine if they are only used once with each key, and that any sane cell chaining construction we use will involve re-keying them with each block, so we can ignore any attack on them that would require multiple invocations of the encryption/decryption function with the same key.
Do I understand you correctly there?
Yes.
[...]
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
Interesting; I had stopped considering CubeHash when it got dropped from the SHA-3 competition. I'll have to find out who's been working on analyzing it since then.
Can you be more explicit about where the chaining output is taken from when, and how exactly it gets used, in this design?
For encryption:
* Split the 512-byte plaintext cell into 480-byte R and 32-byte L. (The names are reversed to match the original description of LION.)
* Let header be the first 3 bytes of R; replace the first 3 bytes of R with zeros.
* Let C_1 be the first 32 bytes of ChaCha(L XOR K_1, “LION p.1”); let T be the next 480 bytes.
* XOR T into R. Replace the first 3 bytes of R with zeros.
* Let T be the first 32 bytes of CubeHash512(R); let C_2 be the other 32 bytes.
* XOR T into L.
* Let C_3 be the first 32 bytes of ChaCha(L XOR K_2, “LION p.2”); let T be the next 480 bytes.
* XOR T into R. Replace the first 3 bytes of R with header.
* The ciphertext is concat(R, L).
(Decryption is easy, and produces the same C_i.)
To compute keys for the next block:
* Compute CubeHash512(concat(K_3, C_1, C_2, C_3)); let K_1 be its first 32 bytes; let K_2 be the next 32 bytes.
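The steps above can be sketched end-to-end. This is only an illustration of the data flow: SHAKE-256 stands in for ChaCha as the keyed stream/PRF and SHA-512 stands in for CubeHash512, so the sizes line up but none of the security claims transfer:

```python
import hashlib

def stream(key32, tag8, n):
    # Stand-in for ChaCha(key, tag): n bytes of keystream.
    return hashlib.shake_256(key32 + tag8).digest(n)

def h512(data):
    # Stand-in for CubeHash512 (64-byte output).
    return hashlib.sha512(data).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def lion_encrypt(cell, k1, k2):
    assert len(cell) == 512
    r, l = bytearray(cell[:480]), cell[480:]
    header = bytes(r[:3]); r[:3] = b'\x00\x00\x00'
    ks = stream(xor(l, k1), b'LION p.1', 512)
    c1, t = ks[:32], ks[32:]
    r = bytearray(xor(r, t)); r[:3] = b'\x00\x00\x00'
    d = h512(bytes(r))
    t, c2 = d[:32], d[32:]
    l = xor(l, t)
    ks = stream(xor(l, k2), b'LION p.2', 512)
    c3, t = ks[:32], ks[32:]
    r = bytearray(xor(r, t)); r[:3] = header
    return bytes(r) + l, (c1, c2, c3)

def lion_decrypt(cell, k1, k2):
    assert len(cell) == 512
    r, l = bytearray(cell[:480]), cell[480:]
    header = bytes(r[:3]); r[:3] = b'\x00\x00\x00'
    ks = stream(xor(l, k2), b'LION p.2', 512)
    c3, t = ks[:32], ks[32:]
    r = bytearray(xor(r, t)); r[:3] = b'\x00\x00\x00'
    d = h512(bytes(r))
    t, c2 = d[:32], d[32:]
    l = xor(l, t)
    ks = stream(xor(l, k1), b'LION p.1', 512)
    c1, t = ks[:32], ks[32:]
    r = bytearray(xor(r, t)); r[:3] = header
    return bytes(r) + l, (c1, c2, c3)

def next_keys(k3, c1, c2, c3):
    # Chaining step: derive the next block's K_1 and K_2.
    d = h512(k3 + c1 + c2 + c3)
    return d[:32], d[32:]
```

Note that decryption runs the same three phases in reverse order and recomputes the identical (C_1, C_2, C_3), which is what makes the chaining step work in both directions.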
The chaining step absolutely requires 512 bits of output. That could be obtained by computing one ChaCha block using the output of BLAKE-256 as a key, but CubeHash is simpler, at least as secure, and probably as fast.
Also, why LION and not BEAR?
ChaCha12 is definitely faster than BLAKE-256 or CubeHash, and you aren't going to use a reduced-round variant of a hash function.
[...]
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
- Compute a table of c*X^i for i = 0, ..., 127. (This table takes 128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.)
- For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
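A sketch of that table-based multiply, with integers standing in for 128-bit vectors (bit i is the coefficient of X^i). The GCM reduction polynomial x^128 + x^7 + x^2 + x + 1 is an assumption here, since the message doesn't pin one down; and since Python integers aren't constant-time, the mask only illustrates the data flow:

```python
R_POLY = (1 << 128) | 0x87   # x^128 + x^7 + x^2 + x + 1 (assumed)

def xtime(a):
    # Multiply by X, reducing modulo the polynomial.
    a <<= 1
    if a >> 128:
        a ^= R_POLY
    return a

def make_table(c):
    # Precompute c * X^i for i = 0..127 (the 2048-byte table).
    table = []
    for _ in range(128):
        table.append(c)
        c = xtime(c)
    return table

def gf_mul(y, table):
    # Accumulate y_i * (c * X^i) with a mask per bit, mirroring the
    # constant-time conditional xor described below.
    acc = 0
    for i in range(128):
        mask = -((y >> i) & 1)       # all ones if bit i set, else 0
        acc ^= table[i] & mask
    return acc
```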
Hm. Did you figure out cycles-per-byte on that one?
Nope. This was just a proof-of-concept. (It also provides an upper bound on the amount of table that a secure implementation can reasonably use; anything with a larger table will need to do too much I/O.)
Back-of-the-envelope:
Assuming 64-bit words, we're looking at 128 instances of "get a bit from y" and 128*2 instances of "load the corresponding word from the table, then constant-time-conditionally-xor that word into the accumulator."
All 64-bit processors have SIMD instructions. In general, consider 32-bit words and 4-element (128-bit) vectors of 32-bit words; those are efficient enough everywhere.
The fast portable implementation of the conditional xor that I know, for a single bit in 'b' and a value in 'x' to be conditionally xored into 'a', is: "a ^= x & ~(b-1)".
Invert y (bitwise) first (in one vector instruction or four word instructions); then you can skip inverting each b-1 . (Someone may need to patch your Pynchon Gate code.)
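The mask identity behind both variants, written out on 64-bit words. Python's arbitrary-precision integers mean this only demonstrates the arithmetic, not actual constant-time behavior:

```python
M64 = (1 << 64) - 1   # emulate 64-bit wraparound

def cond_xor(a, x, b):
    # b == 1: b-1 == 0,  ~(b-1) == all ones -> a ^= x
    # b == 0: b-1 == -1, ~(b-1) == 0        -> a unchanged
    return (a ^ (x & (~(b - 1) & M64))) & M64

def cond_xor_inverted(a, x, nb):
    # With the bit pre-inverted (nb = 1 - b), the 'not' disappears:
    # (nb - 1) is already the right mask.
    return (a ^ (x & ((nb - 1) & M64))) & M64
```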
I *think* that's roughly 128 (and, sub, not) to get a mask for each bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256 xors. (Some of these operations are unnecessary; I am committing fencepost errors here. I'm also assuming all loads are one-cycle.)
So that's about 1280 operations for 16 bytes of input to be multiplied by c, so that seems like something like 80 instructions per byte for a naive implementation. There are still probably more optimizations to be done, especially if we have wacky SIMD features to play with, or we can combine some of the operations above into single ones. Dynamic multiple issue and such CPU architectural tricks might get it down even more. Nevertheless, it still looks like it'd be expensive enough getting GF(2^128) right to make GF(2^128) unattractive.
With the not operation for each bit removed, the 64-bit-word operation count drops to around 1024.
With 32-bit words, that's 512 each of load, and, xor ; the total number of instructions is around 1920 on a plain 32-bit processor.
With SIMD, that's 128 each of load, and, xor , but the ‘compute a mask for each bit’ step is annoying and processor-specific, so it's not much better in terms of instruction count than 64-bit words. However, the table can be computed in a slightly different format to make it easier to compute masks in the SIMD registers; that should get it down to not much more than 768 instructions.
Still, maybe somebody should hack this up for the public good, whether we turn out to need GF(2^128) or not.
Look for a better algorithm first -- there should be one.
Robert Ransom