On 10/8/12, Nick Mathewson nickm@torproject.org wrote:
I should share with the list an update of where I am with a design for an improved relay crypto protocol. For background and motivation, please see the last thread on the topic [Prop202].
There are three main questions remaining for me in choosing among new relay crypto protocols. Basically, they are: "Am I comfortable with this system?", "Among systems I'm comfortable with, how good is this system?", and "Do I know how to implement this system?"
Unfortunately, the stuff I am currently comfortable with and know that I could implement is not nearly as good as the stuff that I'm _nearly_ comfortable enough to use and I don't know how to implement.
Let's talk about some designs in detail, using the same terminology as proposal 202.
Note:
As usual, this is probably largely wrong. If you find it on a mailing list archive years from now, please don't try to learn cryptography from it.
What I'm looking for most right now is places where my reasoning is wrong, places where I'm mistaken about what tradeoffs I need need to make, and places where there's research that I should have read or remembered.
I. MAC-AND-PAD DESIGNS
These are the least involved to implement: all you need is an AEAD construction and a PRNG. Those are both pretty well-understood: you can build a good PRNG out of any stream cipher, and you can build the AEAD mode out of a stream cipher and a MAC, or out of various other constructions.
As a wrinkle: it would be good to have a system that generates "two MACs for the price of one." That's because in our topology, we need the ability to address relay cells to any node in the circuit, and one easy way to do that here is to have a construction that generates two MACs, and uses one for "cells that should arrive here" and another for "cells that we should relay". We also want shorter MACs: 16 bytes per hop is excessive for our needs.
Authenticators like GCM whose output is the raw result of a polynomial evaluation aren't safe to truncate: Ferguson [Ferguson] has a result showing that it's easier than it should be to perform forgery against truncated-MAC GCM constructions.
On the other hand, polynomial authenticators seem safe to truncate (generally) if the encryption step comes _after_ the polynomial evaluation. See for example [Poly1305-Trunc]'s discussion of truncated Poly1305 -- I hope it's right.
That reference contains a rather large amount of bullshit; if any of the results stated there are correct, it's purely by accident. Read the security proofs in http://cr.yp.to/papers.html#securitywcs and http://cr.yp.to/papers.html#poly1305 (I have told you to read them multiple times now) if you want to use a truncated polynomial-evaluation MAC.
Or, you can pass the polynomial-evaluation MAC output through a yucky messy one-way function, then truncate the result of *that*. (But note that the only suitable one-way functions that you can compute efficiently are Salsa20 or ChaCha; AES is not suitable and cannot be computed efficiently.)
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output of AES(nonce) to each GCM output, in hopes that it will make GCM safe to truncate. (This is probably crazy somehow.)
This is utter crap. AES-GCM already adds (in GF(2^128)) the output of AES(nonce) (roughly) to each GCM output; adding the same value in a different field/group (a) will not help and (b) will almost certainly totally break the construction.
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Poly1305AES requires nonces. Poly1305 itself requires (computationally-indistinguishable-from-) independent keys for each message. Please actually read the paper (see also http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305 now).
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
Explicitly leaking the circuit length is very very bad.
The MAC-based designs do not mention how to prevent end-to-end tagging in the exit-to-client direction. I suspect they won't try to prevent it at all.
II. WIDE-BLOCK DESIGNS
Turn we now to designs where we encrypt each block of the cipher using a 509-byte SPRP, such that each block's SPRP is keyed dependently on the original key and on the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
The notion of ‘Ind-CCA2’ is defined for public-key encryption systems; it doesn't make sense for permutations.
Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual definition, and it is a much stronger assumption than Tor's relay crypto needs. (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially useful, potentially interesting information.)
Keep in mind that Tor's current relay crypto breaks *completely* if the circuit-extension handshake ever produces the same session key twice, and some parts of Tor's protocols will still require that assumption even if the relay crypto doesn't. Therefore, you really don't need to worry about attacks that require more than one call to the permutation in either direction.
Can we get faster than LIONESS? Indeed we can! There are a pile of constructions. If I understand correctly, and I'm not missing anything, the ones we might want to use fall into two broad categories: those that (like LIONESS) split the message into a short bit and a long bit, then treat them differently; and those that apply a transformation on the message, encrypt the message, and transform it again.
The first category (split, then frob each part based on the other once or twice) would appear to be more of a patent minefield, thanks to the XCB patent. The ones to look at here are [HCTR] and [HCH], which require two applications of a universal hash function (can be polynomial-based) and approximately one encryption pass. HCH is a little more unlike XCB. Neither is exactly trivial. We could build either one out of whatever field and stream cipher we wanted, as above, though bitsliced AES-CTR wouldn't be efficient in this case: we aren't generating enough stream at once for bitslicing to pay off.
The second category (frob, encrypt, frob) is pretty elegant IMO. The best-explained of these I've seen so far are in a paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET construction [TET] might also be cool. For these, you need an invertible block-wise (Almost) (Xor-)Universal hash function, typically implemented with GF(2^128). I'm not sure if you could use a different field.
Please actually *read* http://cr.yp.to/papers.html#securitywcs this time (read the appendix first). If you use polynomial evaluation over a different field, your ‘hash function’ will have small differential properties with respect to addition *in that field*. The Poly1305 paper then proves that the polynomial-evaluation part of Poly1305 also has small differential properties with respect to addition in Z/(2^128)Z .
In short, you can use a different field for polynomial evaluation *if* you also use a different addition operation.
(If you're going to pass the result of the polynomial-evaluation function through a one-way function so that you can tee off some bits for a chaining output, you can use whatever addition operation you want after the OWF.)
The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
LION (implemented with ChaCha and CubeHash512, with an extra 256 bits of chaining output from each step) sounds like the safe-and-sane approach for now; changing the protocol again if the U.S. patent system expires won't be nearly as hard as changing the protocol the first time.
(BLAKE-512 might be a faster hash function on some processors, but requires 64-bit operations. BLAKE-256 would be faster, but then (a) you don't get a chaining output from any function computed directly over R, and (b) you still need to produce 512 bits of (computationally-indistinguishable-from-)independent key material from some hash function in the chaining step. CubeHash512 will never be broken as an entropy extractor in less time than the Curve25519-based circuit-extension handshake.)
(Does the XCB patent cover LION-like constructions with the ‘hash function’ in the middle replaced with a polynomial-evaluation function? I'm not convinced that a suitable approximately-256-bit polynomial-evaluation function will be much faster than ChaCha, especially in 32-bit code.)
Above, I haven't taken one of our requirements into account: that any change to a single cell must make all future cells unrecoverable.
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
As a blunt-force approach to the problem, we could rekey between each block, using new keys based on a MAC of the last block. This would be pretty inefficient for primitives that need any serious amount of per-key computation.
Use a ‘PRF’, not a MAC. (And don't use primitives that need per-key precomputation.)
Finally, we could look into constructions that produce an extra secret output incidental to their regular operation, and which are easily rekeyed for each operation.
This is the only option that you actually have a design for.
You'll still need to process the chaining outputs from LION with a PRF (i.e. keyed hash function).
In this whole field, we need to keep an eye out for the patents on CMC, EME, and XCB. "Joy."
III. CHOICE OF CIPHERS AND OTHER PRIMITIVES
I am hearing exactly two recommendations for encryption primitives nowadays: AES and Salsa20 (and its family). Nobody's recommending anything else as far as I can tell.
AES is still the IBM of the block cipher world, which "Nobody Ever Got Fired For Using." It's a bit of a pain to use it in software, though. Its most obvious implementations have timing attacks due to table lookups [DJB-Timing]. OpenSSL has three fast x86 implementations: one of them uses vector permutations, one uses bit-slicing, and one uses the aesni instructions to invoke the chip's built-in AES capabilities. On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles per byte respectively. The bitsliced one really needs to encrypt 4KiB at a time (yes, kibibytes), which makes it fine for counter mode but not so good for CBC. (Those numbers are from the OpenSSL source.) Key setup times are cheap for all of these (I think!), but the bitsliced implementation gets expensive if you change the key (or the stream position).
Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for people who don't want to use AES. It is faster than AES on every platform except those with hardware support for AES; 4-7 cycles per byte seems typical for high-end Intel stuff. You would probably have to go out of your way to implement it in a way that was vulnerable to timing side-channel attacks. The only arguments for its insecurity that I'm aware of are that although it has fewer known flaws than AES, it has received less attention than AES, therefore is likelier to _unsuspected_ problems. The counterargument there is that there _are_ several dozen cryptographers who have tried to attack it (or attack reduced-round variants of it), several of whom have also done successful results against AES. [DJB-Comm] Per-key setup time is basically nonexistent.
Per-key setup time consists of one vector addition (of the first half of the key into the constants). (DJB's timing reports in http://cr.yp.to/papers.html#chacha repeat this operation for each block, but I would rather do it once per key if at all possible.)
There are a lot of constructions out there that want to do multiplication in GF(2^128) as a basic operation. That turns out to be less than totally straightforward, though. On newer intel chips, you've got a "multiply without carry" instruction that can supposedly let you get to around 2 cycles per byte. On other platforms, you're reduced to worse trickery to try to get good performance without timing side-channels, for an amount of trickery dependent on what operation you're trying to do exactly.
The easiest GF(2^128) operation to implement safely and quickly is multiplication by a known compile-time value. (It's easy enough that even I could do it.) Next easiest is multiply a large number of values by the same run-time-fixed value -- you do precomputation on the value in order to generate some tables. (The scary, fast variants use big tables; the still-a-little-scary variants use 256-byte tables, and get performance on high-end Intel boxes around 7-10 cycles per byte when doing GCM.) Full-on multiplication of two arbitrary GF(2^128) values is slowest still.
The obvious way to implement GF(2^128) multiplication of a large number of secret inputs y by one learned-at-runtime secret c is:
* Compute a table of c*X^i for i = 0, ..., 127. (This table takes 128*128 bits, or 2048 bytes. Multiplication by X is easy and tolerably fast.) * For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127). (You've done this step for Pynchon Gate already; hopefully that runs in constant time.)
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
I won't believe claims that AES hardware will be widely available until it actually is present in every new processor from a major manufacturer.
Robert Ransom