[tor-dev] Even more notes on relay-crypto constructions

9 Oct 2012

      I should share with the list an update of where I am with a design for
an improved relay crypto protocol.  For background and motivation,
please see the last thread on the topic [Prop202].

There are three main questions remaining for me in choosing among new
relay crypto protocols.  Basically, they are: "Am I comfortable with
this system?", "Among systems I'm comfortable with, how good is this
system?", and "Do I know how to implement this system?"

Unfortunately, the stuff I am currently comfortable with and know that
I could implement is not nearly as good as the stuff that I'm _nearly_
comfortable enough to use and I don't know how to implement.

Let's talk about some designs in detail, using the same terminology as
proposal 202.

Note:

   As usual, this is probably largely wrong.  If you find it on a
   mailing list archive years from now, please don't try to learn
   cryptography from it.

   What I'm looking for most right now is places where my reasoning is
   wrong, places where I'm mistaken about what tradeoffs I need need
   to make, and places where there's research that I should have read
   or remembered.

I. MAC-AND-PAD DESIGNS

These are the least involved to implement: all you need is an AEAD
construction and a PRNG.  Those are both pretty well-understood: you
can build a good PRNG out of any stream cipher, and you can build the
AEAD mode out of a stream cipher and a MAC, or out of various other
constructions.

As a wrinkle: it would be good to have a system that generates "two
MACs for the price of one."  That's because in our topology, we need
the ability to address relay cells to any node in the circuit, and one
easy way to do that here is to have a construction that generates two
MACs, and uses one for "cells that should arrive here" and another for
"cells that we should relay".  We also want shorter MACs: 16 bytes per
hop is excessive for our needs.

Authenticators like GCM whose output is the raw result of a polynomial
evaluation aren't safe to truncate: Ferguson [Ferguson] has a result
showing that it's easier than it should be to perform forgery against
truncated-MAC GCM constructions.

On the other hand, polynomial authenticators seem safe to truncate
(generally) if the encryption step comes _after_ the polynomial
evaluation.  See for example [Poly1305-Trunc]'s discussion of truncated
Poly1305 -- I hope it's right.

So to be concrete, let me suggest a few modes of operation.  I believe
I'm competent to implement these:

  AES-GCM+AES: AES-GCM, except that we add (modulo 2^128) the output
  of AES(nonce) to each GCM output, in hopes that it will make GCM
  safe to truncate.  (This is probably crazy somehow.)

  AES-CTR + HMAC-SHA512/256.

  AES-CTR + Poly1305.  Poly1305 requires nonces, but we can use a
  counter for those.

  Salsa20 + Poly1305.

For a padding PRNG, we could use any stream cipher we like.

In each case, we'd want to authenticate not only the content of the
cell, but also the previous cell's authenticator and the position of
the cell within the stream.

AES-GCM+AES is going to be the fastest on CPUs with specific support
for AES and GCM, but not on other architectures until/unless more
chips grow instructions specialized for AES and GF(2^128).
Salsa20+Poly1305 should be fastest on other architectures.

This entire category of designs still has the problems that it had
before: it leaks the length of the circuit to the last node in the
circuit, and consumes a fairly large portion of each cell (for one
truncated mac per hop).  Further, it's going to be a bit fiddly
(though not impossible) to get it to work with rendezvous circuits.

II. WIDE-BLOCK DESIGNS

Turn we now to designs where we encrypt each block of the cipher using
a 509-byte SPRP, such that each block's SPRP is keyed dependently on
the original key and on the encrypted value of the previous block.

There be dragons here.  Let's talk about some ways we could build
that.  I'll start by talking about wide-block encryption, then talk
about getting the "unrecoverability" property where any missing or
corrupted block makes all future blocks unrecoverable.

The wide-block SPRP I've used before is LIONESS [Bear-Lion].  It
requires two keyed hash operations and two stream cipher operations,
so it's going to be at best half the speed of the mac-and-pad designs
above: a little worse, maybe, since it forces us to rekey with every
block.

Might that be acceptable?  Right now, AES is not a huge fraction of
our runtime, and a relay does AES on each cell three times (TLS,relay
crypto,TLS) and SHA1 on each cell twice (TLS,TLS).  An exit does AES
on each cell twice (TLS,relay) and SHA1 on each cell twice
(TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be
increasing the stream-cipher operations by 25% and the digests by
100%.  On an exit, we'd be increasing the stream-cipher operations by
33% and the digests by 100%.

If we were to go with LIONESS, we'd surely want to look into faster,
better keyed-hash algorithms than SHA1.   I don't know whether one of
the polynomial MACs above would be good enough, or whether we need
other cryptographic digest properties that those don't give us and
we'd need to turn to SHA256 or something.

I've heard some suggestions that we should look into BEAR or LION
instead, but I'm leery of the idea.  They are faster, requiring one
fewer application of their primitives, but they are PRPs, not SPRPs --
meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section
6].  I don't know a way to exploit this in the Tor network, but
deviating from an ideal block cipher seems to me like one of those
ideas that's practically begging for disaster.

Can we get faster than LIONESS?  Indeed we can!  There are a pile of
constructions.  If I understand correctly, and I'm not missing
anything, the ones we might want to use fall into two broad
categories: those that (like LIONESS) split the message into a short
bit and a long bit, then treat them differently; and those that apply
a transformation on the message, encrypt the message, and transform it
again.

The first category (split, then frob each part based on the other once
or twice) would appear to be more of a patent minefield, thanks to the
XCB patent.  The ones to look at here are [HCTR] and [HCH], which
require two applications of a universal hash function (can be
polynomial-based) and approximately one encryption pass.  HCH is a
little more unlike XCB.  Neither is exactly trivial.  We could build
either one out of whatever field and stream cipher we wanted, as
above, though bitsliced AES-CTR wouldn't be efficient in this case: we
aren't generating enough stream at once for bitslicing to pay off.

The second category (frob, encrypt, frob) is pretty elegant IMO. The
best-explained of these I've seen so far are in a
paper by Palash Sarkar [Efficient-Tweakable], though the earlier TET
construction [TET] might also be cool.  For these, you need an
invertible block-wise (Almost) (Xor-)Universal hash function,
typically implemented with GF(2^128).  I'm not sure if you could use a
different field.  The multiplication operations here appear to be
multiplication by a primitive element, and multiplication by a per-key
element.  The encryption step can be realized with a somewhat
unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB
approach.  I don't know what you'd need to do to substitute in an
orthodox stream cipher for the one used in iHCH.  Sarkar seems to see
iHCH as a successor to HCH, which is a little worrisome given that HCH
is a spiritual descendant of the patented XCB, but to me the two
constructions (HCH, iHCH) look practically nothing alike except for
their use of a counter mode step.

I am a little intimidated by the idea of trying to implement one of
these ciphers.

Above, I haven't taken one of our requirements into account: that any
change to a single cell must make all future cells unrecoverable.

There are modes that are supposed to prevent this, and applying them
to a decent wide-block cipher might solve the issue. IGE is one of
them [IGE], but it turns out to be broken by an attacker who knows
some plaintext.  The Accumulated Block Chaining [ABC] construction is
supposed to fix that; I'm not too sure whether it's correct or
efficient.

As a blunt-force approach to the problem, we could rekey between each
block, using new keys based on a MAC of the last block.  This would be
pretty inefficient for primitives that need any serious amount of
per-key computation.

Finally, we could look into constructions that produce an extra secret
output incidental to their regular operation, and which are easily
rekeyed for each operation.

In this whole field, we need to keep an eye out for the patents on
CMC, EME, and XCB.  "Joy."

III. CHOICE OF CIPHERS AND OTHER PRIMITIVES

I am hearing exactly two recommendations for encryption primitives
nowadays: AES and Salsa20 (and its family).  Nobody's recommending
anything else as far as I can tell.

AES is still the IBM of the block cipher world, which "Nobody Ever Got
Fired For Using."  It's a bit of a pain to use it in software, though.
Its most obvious implementations have timing attacks due to table
lookups [DJB-Timing].  OpenSSL has three fast x86 implementations: one
of them uses vector permutations, one uses bit-slicing, and one uses
the aesni instructions to invoke the chip's built-in AES capabilities.
On high-end Intel chips, these take about 12-25, 7-9, and 1.5 cycles
per byte respectively.  The bitsliced one really needs to encrypt 4KiB
at a time (yes, kibibytes), which makes it fine for counter mode but
not so good for CBC.  (Those numbers are from the OpenSSL source.)
Key setup times are cheap for all of these (I think!), but the
bitsliced implementation gets expensive if you change the key (or
the stream position).

Salsa20 (and its descendant, ChaCha) is the Obvious Second Choice for
people who don't want to use AES.  It is faster than AES on every
platform except those with hardware support for AES; 4-7 cycles per
byte seems typical for high-end Intel stuff.  You would probably have
to go out of your way to implement it in a way that was vulnerable to
timing side-channel attacks.  The only arguments for its insecurity
that I'm aware of are that although it has fewer known flaws than AES,
it has received less attention than AES, therefore is likelier to
_unsuspected_ problems.  The counterargument there is that there _are_
several dozen cryptographers who have tried to attack it (or attack
reduced-round variants of it), several of whom have also done
successful results against AES. [DJB-Comm] Per-key setup time is
basically nonexistent.

There are a lot of constructions out there that want to do
multiplication in GF(2^128) as a basic operation.  That turns out to
be less than totally straightforward, though.  On newer intel chips,
you've got a "multiply without carry" instruction that can supposedly
let you get to around 2 cycles per byte.  On other platforms, you're
reduced to worse trickery to try to get good performance without
timing side-channels, for an amount of trickery dependent on what
operation you're trying to do exactly.

The easiest GF(2^128) operation to implement safely and quickly is
multiplication by a known compile-time value. (It's easy enough that
even I could do it.)  Next easiest is multiply a large number of
values by the same run-time-fixed value -- you do precomputation on
the value in order to generate some tables.  (The scary, fast variants
use big tables; the still-a-little-scary variants use 256-byte tables,
and get performance on high-end Intel boxes around 7-10 cycles per
byte when doing GCM.)  Full-on multiplication of two arbitrary
GF(2^128) values is slowest still.

As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually
expose its "multiply in GF(2^128)" functions as far as I can tell: we
would need to snarf such code from elsewhere.

Oh! ARMv8 has an optional crypto instruction set that supports AES,
SHA256, and GF(2^128) multiplication [ARMv8].  If it looks like most
of the ARMs we care about are going to get it, that could factor into
our planning.

IV. HOW MUCH DOES PERFORMANCE MATTER HERE ANYWAY?

A fine question.  The right place too look would IMO be at profiles
on a busy non-exit non-AESNI server, to see how much time is spent in
AES there.  Make sure you look at all the AES variants, since it's not
uncommon to see more than one used at once.

A ridiculously large amount of that time will be in bn_sqr4x_mont
and its unindicted co-conspirators , but expect this to get cut a great
deal when we move to ECC for TLS DHE and for circuit extension.

So if we're willing to make a lot of assumptions, then you can figure
out a worst-case performance impact like this.  Take the fraction of
the time spent in bn_and_friends and remove them from the profile.
(Assume that circuit creation dominates over TLS renegotiation because
hey why not.)  Then assume that the AES calls used for relay crypto's
counter mode turn into whatever new relay encryption thing we're
proposing.  See how much that change costs Tor's performance overall.

V. ACKNOWLEDGMENTS

Thanks to everybody who's schooled me about this stuff, especially
Daniel J. Bernstein, Ian Goldberg, and Robert Ransom.  Assume that the
good ideas are here due to their patience and that the bad ideas are
here due to my slowness on the uptake.  Thanks to Susan Born for a
much-needed copy-edit.  Assume that every correctly structured
sentence is one she fixed, and that the bad ones I messed up after her
editing.

VI. References

[Prop202] https://lists.torproject.org/pipermail/tor-dev/2012-June/003649.html

[Ferguson] Niels Ferguson, "Authentication weaknesses in GCM",
   2005. http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/comments/CWC-GCM/Fergus...

[DJB-Timing] Daniel J. Bernstein, "Cache-timing attacks on AES", 2004.
   http://cr.yp.to/papers.html#cachetiming

[DJB-Comm] Daniel J. Bernstein, personal communication

[Poly1305-Trunc] http://osdir.com/ml/encryption.poly1305/2005-09/msg00007.html

[ARMv8] "ARMv8 Instruction Set Overview", 2011,
http://board.flatassembler.net/download.php?id=5698

[Bear-Lion] Ross Anderson and Eli Biham. "Two Practical and Provably
   Secure Block Ciphers: BEAR and LION".
   http://www.cl.cam.ac.uk/~rja14/Papers/bear-lion.pdf

[Efficient-Tweakable] Palash Sarkar, "Efficient Tweakable Enciphering
  Schemes from (Block-Wise) Universal Hash Functions". 2008.
  http://eprint.iacr.org/2008/004.pdf

[IGE] Gligor and Donescu, "On Message Integrity in Symmetric
      Encryption" has the good IGE analysis:
      http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/ige/ige-s...
      The original IGE proposal was by Carl Campell in an old NIST
      publication that I can't find online; the paper above has a
      reference for it if you want to chase it more.

[ABC] Lars R. Knudsen, "Block Cipher Chaining Modes of
      Operation".
http://csrc.nist.gov/groups/ST/toolkit/BCM/documents/proposedmodes/abc/abc-s...

[HCTR] Peng Wang, Dengguo Feng, and Wenling Wu. "HCTR: A
       variable-input-length enciphering mode." 2005.
       http://delta.cs.cinvestav.mx/~debrup/hctr.pdf

[HCH] Debrup Chakraborty and Palash Sarkar. "HCH: A new tweakable
      enciphering scheme using the hash-encrypt-hash approach."
      http://biblioteca.cinvestav.mx/indicadores/texto_completo/cinvestav/2006/136...

[TET] Shai Halevi. "Invertible universal hashing and the TET
      encryption mode."  2007.
      http://www.iacr.org/archive/crypto2007/46220405/46220405.pdf

[tor-dev] Even more notes on relay-crypto constructions

Nick Mathewson