[tor-dev] Even more notes on relay-crypto constructions

Tue Oct 9 16:31:55 UTC 2012

On 10/8/12, Nick Mathewson <nickm at torproject.org> wrote:
> I should share with the list an update of where I am with a design for
> an improved relay crypto protocol.  For background and motivation,
> please see the last thread on the topic [Prop202].
>
> There are three main questions remaining for me in choosing among new
> relay crypto protocols.  Basically, they are: "Am I comfortable with
> this system?", "Among systems I'm comfortable with, how good is this
> system?", and "Do I know how to implement this system?"
>
> Unfortunately, the stuff I am currently comfortable with and know that
> I could implement is not nearly as good as the stuff that I'm _nearly_
> comfortable enough to use and I don't know how to implement.
>
> Let's talk about some designs in detail, using the same terminology as
> proposal 202.
>
> Note:
>
>    As usual, this is probably largely wrong.  If you find it on a
>    mailing list archive years from now, please don't try to learn
>    cryptography from it.
>
>    What I'm looking for most right now is places where my reasoning is
>    wrong, places where I'm mistaken about what tradeoffs I need need
>    to make, and places where there's research that I should have read
>    or remembered.
>
> I. MAC-AND-PAD DESIGNS
>
> These are the least involved to implement: all you need is an AEAD
> construction and a PRNG.  Those are both pretty well-understood: you
> can build a good PRNG out of any stream cipher, and you can build the
> AEAD mode out of a stream cipher and a MAC, or out of various other
> constructions.
>
> As a wrinkle: it would be good to have a system that generates "two
> MACs for the price of one."  That's because in our topology, we need
> the ability to address relay cells to any node in the circuit, and one
> easy way to do that here is to have a construction that generates two
> MACs, and uses one for "cells that should arrive here" and another for
> "cells that we should relay".  We also want shorter MACs: 16 bytes per
> hop is excessive for our needs.
>
> Authenticators like GCM whose output is the raw result of a polynomial
> evaluation aren't safe to truncate: Ferguson [Ferguson] has a result
> showing that it's easier than it should be to perform forgery against
> truncated-MAC GCM constructions.
>
> On the other hand, polynomial authenticators seem safe to truncate
> (generally) if the encryption step comes _after_ the polynomial
> evaluation.  See for example [Poly1305-Trunc]'s discussion of truncated
> Poly1305 -- I hope it's right.

That reference contains a rather large amount of bullshit; if any of
the results stated there are correct, it's purely by accident.  Read
the security proofs in http://cr.yp.to/papers.html#securitywcs and
http://cr.yp.to/papers.html#poly1305 (I have told you to read them
multiple times now) if you want to use a truncated
polynomial-evaluation MAC.

Or, you can pass the polynomial-evaluation MAC output through a yucky
messy one-way function, then truncate the result of *that*.  (But note
that the only suitable one-way functions that you can compute
efficiently are Salsa20 or ChaCha; AES is not suitable and cannot be
computed efficiently.)

> So to be concrete, let me suggest a few modes of operation.  I believe
> I'm competent to implement these:
>
>   AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output
>   of AES(nonce) to each GCM output, in hopes that it will make GCM
>   safe to truncate.  (This is probably crazy somehow.)

This is utter crap.  AES-GCM already adds (in GF(2^128)) the output of
AES(nonce) (roughly) to each GCM output; adding the same value in a
different field/group (a) will not help and (b) will almost certainly
totally break the construction.

>   AES-CTR + HMAC-SHA512/256.
>
>   AES-CTR + Poly1305.  Poly1305 requires nonces, but we can use a
>   counter for those.

Poly1305AES requires nonces.  Poly1305 itself requires
(computationally-indistinguishable-from-) independent keys for each
message.  Please actually read the paper (see also
http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305
now).

>   Salsa20 + Poly1305.
>
> For a padding PRNG, we could use any stream cipher we like.
>
> In each case, we'd want to authenticate not only the content of the
> cell, but also the previous cell's authenticator and the position of
> the cell within the stream.
>
> AES-GCM+AES is going to be the fastest on CPUs with specific support
> for AES and GCM, but not on other architectures until/unless more
> chips grow instructions specialized for AES and GF(2^128).
> Salsa20+Poly1305 should be fastest on other architectures.
>
>
>
> This entire category of designs still has the problems that it had
> before: it leaks the length of the circuit to the last node in the
> circuit, and consumes a fairly large portion of each cell (for one
> truncated mac per hop).  Further, it's going to be a bit fiddly
> (though not impossible) to get it to work with rendezvous circuits.

Explicitly leaking the circuit length is very very bad.

The MAC-based designs do not mention how to prevent end-to-end tagging
in the exit-to-client direction.  I suspect they won't try to prevent
it at all.

> II. WIDE-BLOCK DESIGNS
>
> Turn we now to designs where we encrypt each block of the cipher using
> a 509-byte SPRP, such that each block's SPRP is keyed dependently on
> the original key and on the encrypted value of the previous block.
>
> There be dragons here.  Let's talk about some ways we could build
> that.  I'll start by talking about wide-block encryption, then talk
> about getting the "unrecoverability" property where any missing or
> corrupted block makes all future blocks unrecoverable.
>
>
> The wide-block SPRP I've used before is LIONESS [Bear-Lion].  It
> requires two keyed hash operations and two stream cipher operations,
> so it's going to be at best half the speed of the mac-and-pad designs
> above: a little worse, maybe, since it forces us to rekey with every
> block.
>
> Might that be acceptable?  Right now, AES is not a huge fraction of
> our runtime, and a relay does AES on each cell three times (TLS,relay
> crypto,TLS) and SHA1 on each cell twice (TLS,TLS).  An exit does AES
> on each cell twice (TLS,relay) and SHA1 on each cell twice
> (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be
> increasing the stream-cipher operations by 25% and the digests by
> 100%.  On an exit, we'd be increasing the stream-cipher operations by
> 33% and the digests by 100%.
>
> If we were to go with LIONESS, we'd surely want to look into faster,
> better keyed-hash algorithms than SHA1.   I don't know whether one of
> the polynomial MACs above would be good enough, or whether we need
> other cryptographic digest properties that those don't give us and
> we'd need to turn to SHA256 or something.
>
>
>
> I've heard some suggestions that we should look into BEAR or LION
> instead, but I'm leery of the idea.  They are faster, requiring one
> fewer application of their primitives, but they are PRPs, not SPRPs --
> meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section
> 6].  I don't know a way to exploit this in the Tor network, but
> deviating from an ideal block cipher seems to me like one of those
> ideas that's practically begging for disaster.

The notion of ‘Ind-CCA2’ is defined for public-key encryption systems;
it doesn't make sense for permutations.

Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual
definition, and it is a much stronger assumption than Tor's relay
crypto needs.  (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and
http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially
useful, potentially interesting information.)

Keep in mind that Tor's current relay crypto breaks *completely* if
the circuit-extension handshake ever produces the same session key
twice, and some parts of Tor's protocols will still require that
assumption even if the relay crypto doesn't.  Therefore, you really
don't need to worry about attacks that require more than one call to
the permutation in either direction.

> Can we get faster than LIONESS?  Indeed we can!  There are a pile of
> constructions.  If I understand correctly, and I'm not missing
> anything, the ones we might want to use fall into two broad
> categories: those that (like LIONESS) split the message into a short
> bit and a long bit, then treat them differently; and those that apply
> a transformation on the message, encrypt the message, and transform it
> again.
>
> The first category (split, then frob each part based on the other once
> or twice) would appear to be more of a patent minefield, thanks to the
> XCB patent.  The ones to look at here are [HCTR] and [HCH], which
> require two applications of a universal hash function (can be
> polynomial-based) and approximately one encryption pass.  HCH is a
> little more unlike XCB.  Neither is exactly trivial.  We could build
> either one out of whatever field and stream cipher we wanted, as
> above, though bitsliced AES-CTR wouldn't be efficient in this case: we
> aren't generating enough stream at once for bitslicing to pay off.
>
> The second category (frob, encrypt, frob) is pretty elegant IMO. The
> best-explained of these I've seen so far are in a
> paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET
> construction [TET] might also be cool.  For these, you need an
> invertible block-wise (Almost) (Xor-)Universal hash function,
> typically implemented with GF(2^128).  I'm not sure if you could use a
> different field.

Please actually *read* http://cr.yp.to/papers.html#securitywcs this
time (read the appendix first).  If you use polynomial evaluation over
a different field, your ‘hash function’ will have small differential
properties with respect to addition *in that field*.  The Poly1305
paper then proves that the polynomial-evaluation part of Poly1305 also
has small differential properties with respect to addition in
Z/(2^128)Z .

In short, you can use a different field for polynomial evaluation *if*
you also use a different addition operation.

(If you're going to pass the result of the polynomial-evaluation
function through a one-way function so that you can tee off some bits
for a chaining output, you can use whatever addition operation you
want after the OWF.)

>  The multiplication operations here appear to be
> multiplication by a primitive element, and multiplication by a per-key
> element.  The encryption step can be realized with a somewhat
> unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB
> approach.  I don't know what you'd need to do to substitute in an
> orthodox stream cipher for the one used in iHCH.  Sarkar seems to see
> iHCH as a successor to HCH, which is a little worrisome given that HCH
> is a spiritual descendant of the patented XCB, but to me the two
> constructions (HCH, iHCH) look practically nothing alike except for
> their use of a counter mode step.
>
> I am a little intimidated by the idea of trying to implement one of
> these ciphers.

LION (implemented with ChaCha and CubeHash512, with an extra 256 bits
of chaining output from each step) sounds like the safe-and-sane
approach for now; changing the protocol again if the U.S. patent
system expires won't be nearly as hard as changing the protocol the
first time.

(BLAKE-512 might be a faster hash function on some processors, but
requires 64-bit operations.  BLAKE-256 would be faster, but then (a)
you don't get a chaining output from any function computed directly
over R, and (b) you still need to produce 512 bits of
(computationally-indistinguishable-from-)independent key material from
some hash function in the chaining step.  CubeHash512 will never be
broken as an entropy extractor in less time than the Curve25519-based
circuit-extension handshake.)

(Does the XCB patent cover LION-like constructions with the ‘hash
function’ in the middle replaced with a polynomial-evaluation
function?  I'm not convinced that a suitable approximately-256-bit
polynomial-evaluation function will be much faster than ChaCha,
especially in 32-bit code.)

> Above, I haven't taken one of our requirements into account: that any
> change to a single cell must make all future cells unrecoverable.
>
> There are modes that are supposed to prevent this, and applying them
> to a decent wide-block cipher might solve the issue. IGE is one of
> them [IGE], but it turns out to be broken by an attacker who knows
> some plaintext.  The Accumulated Block Chaining [ABC] construction is
> supposed to fix that; I'm not too sure whether it's correct or
> efficient.
>
> As a blunt-force approach to the problem, we could rekey between each
> block, using new keys based on a MAC of the last block.  This would be
> pretty inefficient for primitives that need any serious amount of
> per-key computation.

Use a ‘PRF’, not a MAC.  (And don't use primitives that need per-key
precomputation.)

> Finally, we could look into constructions that produce an extra secret
> output incidental to their regular operation, and which are easily
> rekeyed for each operation.

This is the only option that you actually have a design for.

You'll still need to process the chaining outputs from LION with a PRF
(i.e. keyed hash function).

> In this whole field, we need to keep an eye out for the patents on
> CMC, EME, and XCB.  "Joy."
>
>
>
> III. CHOICE OF CIPHERS AND OTHER PRIMITIVES
>
> I am hearing exactly two recommendations for encryption primitives
> nowadays: AES and Salsa20 (and its family).  Nobody's recommending
> anything else as far as I can tell.
>
> AES is still the IBM of the block cipher world, which "Nobody Ever Got
> Fired For Using."  It's a bit of a pain to use it in software, though.
> Its most obvious implementations have timing attacks due to table
> lookups [DJB-Timing].  OpenSSL has three fast x86 implementations: one
> of them uses vector permutations, one uses bit-slicing, and one uses
> the aesni instructions to invoke the chip's built-in AES capabilities.
> On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles
> per byte respectively.  The bitsliced one really needs to encrypt 4KiB
> at a time (yes, kibibytes), which makes it fine for counter mode but
> not so good for CBC.  (Those numbers are from the OpenSSL source.)
> Key setup times are cheap for all of these (I think!), but the
> bitsliced implementation gets expensive if you change the key (or
> the stream position).
>
>
> Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for
> people who don't want to use AES.  It is faster than AES on every
> platform except those with hardware support for AES; 4-7 cycles per
> byte seems typical for high-end Intel stuff.  You would probably have
> to go out of your way to implement it in a way that was vulnerable to
> timing side-channel attacks.  The only arguments for its insecurity
> that I'm aware of are that although it has fewer known flaws than AES,
> it has received less attention than AES, therefore is likelier to
> _unsuspected_ problems.  The counterargument there is that there _are_
> several dozen cryptographers who have tried to attack it (or attack
> reduced-round variants of it), several of whom have also done
> successful results against AES. [DJB-Comm] Per-key setup time is
> basically nonexistent.

Per-key setup time consists of one vector addition (of the first half
of the key into the constants).  (DJB's timing reports in
http://cr.yp.to/papers.html#chacha repeat this operation for each
block, but I would rather do it once per key if at all possible.)

> There are a lot of constructions out there that want to do
> multiplication in GF(2^128) as a basic operation.  That turns out to
> be less than totally straightforward, though.  On newer intel chips,
> you've got a "multiply without carry" instruction that can supposedly
> let you get to around 2 cycles per byte.  On other platforms, you're
> reduced to worse trickery to try to get good performance without
> timing side-channels, for an amount of trickery dependent on what
> operation you're trying to do exactly.
>
> The easiest GF(2^128) operation to implement safely and quickly is
> multiplication by a known compile-time value. (It's easy enough that
> even I could do it.)  Next easiest is multiply a large number of
> values by the same run-time-fixed value -- you do precomputation on
> the value in order to generate some tables.  (The scary, fast variants
> use big tables; the still-a-little-scary variants use 256-byte tables,
> and get performance on high-end Intel boxes around 7-10 cycles per
> byte when doing GCM.)  Full-on multiplication of two arbitrary
> GF(2^128) values is slowest still.

The obvious way to implement GF(2^128) multiplication of a large
number of secret inputs y by one learned-at-runtime secret c is:

* Compute a table of c*X^i for i = 0, ..., 127.  (This table takes
128*128 bits, or 2048 bytes.  Multiplication by X is easy and
tolerably fast.)
* For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in
GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127).  (You've done
this step for Pynchon Gate already; hopefully that runs in constant
time.)

> As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually
> expose its "multiply in GF(2^128)" functions as far as I can tell: we
> would need to snarf such code from elsewhere.
>
>
> Oh! ARMv8 has an optional crypto instruction set that supports AES,
> SHA256, and GF(2^128) multiplication [ARMv8].  If it looks like most
> of the ARMs we care about are going to get it, that could factor into
> our planning.

I won't believe claims that AES hardware will be widely available
until it actually is present in every new processor from a major
manufacturer.

Robert Ransom