I should share with the list an update of where I am with a design for an improved relay crypto protocol. For background and motivation, please see the last thread on the topic [Prop202].
There are three main questions remaining for me in choosing among new relay crypto protocols. Basically, they are: "Am I comfortable with this system?", "Among systems I'm comfortable with, how good is this system?", and "Do I know how to implement this system?"
Unfortunately, the stuff I am currently comfortable with and know that I could implement is not nearly as good as the stuff that I'm _nearly_ comfortable enough to use and I don't know how to implement.
Let's talk about some designs in detail, using the same terminology as proposal 202.
Note:
As usual, this is probably largely wrong. If you find it on a mailing list archive years from now, please don't try to learn cryptography from it.
What I'm looking for most right now is places where my reasoning is wrong, places where I'm mistaken about what tradeoffs I need need to make, and places where there's research that I should have read or remembered.
I. MAC-AND-PAD DESIGNS
These are the least involved to implement: all you need is an AEAD construction and a PRNG. Those are both pretty well-understood: you can build a good PRNG out of any stream cipher, and you can build the AEAD mode out of a stream cipher and a MAC, or out of various other constructions.
As a wrinkle: it would be good to have a system that generates "two MACs for the price of one." That's because in our topology, we need the ability to address relay cells to any node in the circuit, and one easy way to do that here is to have a construction that generates two MACs, and uses one for "cells that should arrive here" and another for "cells that we should relay". We also want shorter MACs: 16 bytes per hop is excessive for our needs.
Authenticators like GCM whose output is the raw result of a polynomial evaluation aren't safe to truncate: Ferguson [Ferguson] has a result showing that it's easier than it should be to perform forgery against truncated-MAC GCM constructions.
On the other hand, polynomial authenticators seem safe to truncate (generally) if the encryption step comes _after_ the polynomial evaluation. See for example [Poly1305-Trunc]'s discussion of truncated Poly1305 -- I hope it's right.
So to be concrete, let me suggest a few modes of operation. I believe I'm competent to implement these:
AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output of AES(nonce) to each GCM output, in hopes that it will make GCM safe to truncate. (This is probably crazy somehow.)
AES-CTR + HMAC-SHA512/256.
AES-CTR + Poly1305. Poly1305 requires nonces, but we can use a counter for those.
Salsa20 + Poly1305.
For a padding PRNG, we could use any stream cipher we like.
In each case, we'd want to authenticate not only the content of the cell, but also the previous cell's authenticator and the position of the cell within the stream.
AES-GCM+AES is going to be the fastest on CPUs with specific support for AES and GCM, but not on other architectures until/unless more chips grow instructions specialized for AES and GF(2^128). Salsa20+Poly1305 should be fastest on other architectures.
This entire category of designs still has the problems that it had before: it leaks the length of the circuit to the last node in the circuit, and consumes a fairly large portion of each cell (for one truncated mac per hop). Further, it's going to be a bit fiddly (though not impossible) to get it to work with rendezvous circuits.
II. WIDE-BLOCK DESIGNS
Turn we now to designs where we encrypt each block of the cipher using a 509-byte SPRP, such that each block's SPRP is keyed dependently on the original key and on the encrypted value of the previous block.
There be dragons here. Let's talk about some ways we could build that. I'll start by talking about wide-block encryption, then talk about getting the "unrecoverability" property where any missing or corrupted block makes all future blocks unrecoverable.
The wide-block SPRP I've used before is LIONESS [Bear-Lion]. It requires two keyed hash operations and two stream cipher operations, so it's going to be at best half the speed of the mac-and-pad designs above: a little worse, maybe, since it forces us to rekey with every block.
Might that be acceptable? Right now, AES is not a huge fraction of our runtime, and a relay does AES on each cell three times (TLS,relay crypto,TLS) and SHA1 on each cell twice (TLS,TLS). An exit does AES on each cell twice (TLS,relay) and SHA1 on each cell twice (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be increasing the stream-cipher operations by 25% and the digests by 100%. On an exit, we'd be increasing the stream-cipher operations by 33% and the digests by 100%.
If we were to go with LIONESS, we'd surely want to look into faster, better keyed-hash algorithms than SHA1. I don't know whether one of the polynomial MACs above would be good enough, or whether we need other cryptographic digest properties that those don't give us and we'd need to turn to SHA256 or something.
I've heard some suggestions that we should look into BEAR or LION instead, but I'm leery of the idea. They are faster, requiring one fewer application of their primitives, but they are PRPs, not SPRPs -- meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section 6]. I don't know a way to exploit this in the Tor network, but deviating from an ideal block cipher seems to me like one of those ideas that's practically begging for disaster.
Can we get faster than LIONESS? Indeed we can! There are a pile of constructions. If I understand correctly, and I'm not missing anything, the ones we might want to use fall into two broad categories: those that (like LIONESS) split the message into a short bit and a long bit, then treat them differently; and those that apply a transformation on the message, encrypt the message, and transform it again.
The first category (split, then frob each part based on the other once or twice) would appear to be more of a patent minefield, thanks to the XCB patent. The ones to look at here are [HCTR] and [HCH], which require two applications of a universal hash function (can be polynomial-based) and approximately one encryption pass. HCH is a little more unlike XCB. Neither is exactly trivial. We could build either one out of whatever field and stream cipher we wanted, as above, though bitsliced AES-CTR wouldn't be efficient in this case: we aren't generating enough stream at once for bitslicing to pay off.
The second category (frob, encrypt, frob) is pretty elegant IMO. The best-explained of these I've seen so far are in a paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET construction [TET] might also be cool. For these, you need an invertible block-wise (Almost) (Xor-)Universal hash function, typically implemented with GF(2^128). I'm not sure if you could use a different field. The multiplication operations here appear to be multiplication by a primitive element, and multiplication by a per-key element. The encryption step can be realized with a somewhat unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB approach. I don't know what you'd need to do to substitute in an orthodox stream cipher for the one used in iHCH. Sarkar seems to see iHCH as a successor to HCH, which is a little worrisome given that HCH is a spiritual descendant of the patented XCB, but to me the two constructions (HCH, iHCH) look practically nothing alike except for their use of a counter mode step.
I am a little intimidated by the idea of trying to implement one of these ciphers.
Above, I haven't taken one of our requirements into account: that any change to a single cell must make all future cells unrecoverable.
There are modes that are supposed to prevent this, and applying them to a decent wide-block cipher might solve the issue. IGE is one of them [IGE], but it turns out to be broken by an attacker who knows some plaintext. The Accumulated Block Chaining [ABC] construction is supposed to fix that; I'm not too sure whether it's correct or efficient.
As a blunt-force approach to the problem, we could rekey between each block, using new keys based on a MAC of the last block. This would be pretty inefficient for primitives that need any serious amount of per-key computation.
Finally, we could look into constructions that produce an extra secret output incidental to their regular operation, and which are easily rekeyed for each operation.
In this whole field, we need to keep an eye out for the patents on CMC, EME, and XCB. "Joy."
III. CHOICE OF CIPHERS AND OTHER PRIMITIVES
I am hearing exactly two recommendations for encryption primitives nowadays: AES and Salsa20 (and its family). Nobody's recommending anything else as far as I can tell.
AES is still the IBM of the block cipher world, which "Nobody Ever Got Fired For Using." It's a bit of a pain to use it in software, though. Its most obvious implementations have timing attacks due to table lookups [DJB-Timing]. OpenSSL has three fast x86 implementations: one of them uses vector permutations, one uses bit-slicing, and one uses the aesni instructions to invoke the chip's built-in AES capabilities. On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles per byte respectively. The bitsliced one really needs to encrypt 4KiB at a time (yes, kibibytes), which makes it fine for counter mode but not so good for CBC. (Those numbers are from the OpenSSL source.) Key setup times are cheap for all of these (I think!), but the bitsliced implementation gets expensive if you change the key (or the stream position).
Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for people who don't want to use AES. It is faster than AES on every platform except those with hardware support for AES; 4-7 cycles per byte seems typical for high-end Intel stuff. You would probably have to go out of your way to implement it in a way that was vulnerable to timing side-channel attacks. The only arguments for its insecurity that I'm aware of are that although it has fewer known flaws than AES, it has received less attention than AES, therefore is likelier to _unsuspected_ problems. The counterargument there is that there _are_ several dozen cryptographers who have tried to attack it (or attack reduced-round variants of it), several of whom have also done successful results against AES. [DJB-Comm] Per-key setup time is basically nonexistent.
There are a lot of constructions out there that want to do multiplication in GF(2^128) as a basic operation. That turns out to be less than totally straightforward, though. On newer intel chips, you've got a "multiply without carry" instruction that can supposedly let you get to around 2 cycles per byte. On other platforms, you're reduced to worse trickery to try to get good performance without timing side-channels, for an amount of trickery dependent on what operation you're trying to do exactly.
The easiest GF(2^128) operation to implement safely and quickly is multiplication by a known compile-time value. (It's easy enough that even I could do it.) Next easiest is multiply a large number of values by the same run-time-fixed value -- you do precomputation on the value in order to generate some tables. (The scary, fast variants use big tables; the still-a-little-scary variants use 256-byte tables, and get performance on high-end Intel boxes around 7-10 cycles per byte when doing GCM.) Full-on multiplication of two arbitrary GF(2^128) values is slowest still.
As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually expose its "multiply in GF(2^128)" functions as far as I can tell: we would need to snarf such code from elsewhere.
Oh! ARMv8 has an optional crypto instruction set that supports AES, SHA256, and GF(2^128) multiplication [ARMv8]. If it looks like most of the ARMs we care about are going to get it, that could factor into our planning.
IV. HOW MUCH DOES PERFORMANCE MATTER HERE ANYWAY?
A fine question. The right place too look would IMO be at profiles on a busy non-exit non-AESNI server, to see how much time is spent in AES there. Make sure you look at all the AES variants, since it's not uncommon to see more than one used at once.
A ridiculously large amount of that time will be in bn_sqr4x_mont and its unindicted co-conspirators , but expect this to get cut a great deal when we move to ECC for TLS DHE and for circuit extension.
So if we're willing to make a lot of assumptions, then you can figure out a worst-case performance impact like this. Take the fraction of the time spent in bn_and_friends and remove them from the profile. (Assume that circuit creation dominates over TLS renegotiation because hey why not.) Then assume that the AES calls used for relay crypto's counter mode turn into whatever new relay encryption thing we're proposing. See how much that change costs Tor's performance overall.
V. ACKNOWLEDGMENTS
Thanks to everybody who's schooled me about this stuff, especially Daniel J. Bernstein, Ian Goldberg, and Robert Ransom. Assume that the good ideas are here due to their patience and that the bad ideas are here due to my slowness on the uptake. Thanks to Susan Born for a much-needed copy-edit. Assume that every correctly structured sentence is one she fixed, and that the bad ones I messed up after her editing.
VI. References
[Prop202] https://lists.torproject.org/pipermail/tor-dev/2012-June/003649.html
[Ferguson] Niels Ferguson, "Authentication weaknesses in GCM", 2005. http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/CWC-GCM/Fergus...
[DJB-Timing] Daniel J. Bernstein, "Cache-timing attacks on AES", 2004. http://cr.yp.to/papers.html#cachetiming
[DJB-Comm] Daniel J. Bernstein, personal communication
[Poly1305-Trunc] http://osdir.com/ml/encryption.poly1305/2005-09/msg00007.html
[ARMv8] "ARMv8 Instruction Set Overview", 2011, http://board.flatassembler.net/download.php?id=5698
[Bear-Lion] Ross Anderson and Eli Biham. "Two Practical and Provably Secure Block Ciphers: BEAR and LION". http://www.cl.cam.ac.uk/~rja14/Papers/bear-lion.pdf
[Efficient-Tweakable] Palash Sarkar, "Efficient Tweakable Enciphering Schemes from (Block-Wise) Universal Hash Functions". 2008. http://eprint.iacr.org/2008/004.pdf
[IGE] Gligor and Donescu, "On Message Integrity in Symmetric Encryption" has the good IGE analysis: http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ige/ige-s... The original IGE proposal was by Carl Campell in an old NIST publication that I can't find online; the paper above has a reference for it if you want to chase it more.
[ABC] Lars R. Knudsen, "Block Cipher Chaining Modes of Operation". http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/abc/abc-s...
[HCTR] Peng Wang, Dengguo Feng, and Wenling Wu. "HCTR: A variable-input-length enciphering mode." 2005. http://delta.cs.cinvestav.mx/~debrup/hctr.pdf
[HCH] Debrup Chakraborty and Palash Sarkar. "HCH: A new tweakable enciphering scheme using the hash-encrypt-hash approach." http://biblioteca.cinvestav.mx/indicadores/texto_completo/cinvestav/2006/136...
[TET] Shai Halevi. "Invertible universal hashing and the TET encryption mode." 2007. http://www.iacr.org/archive/crypto2007/46220405/46220405.pdf