[tor-dev] Even more notes on relay-crypto constructions

Nick Mathewson nickm at alum.mit.edu
Tue Oct 9 20:52:33 UTC 2012


On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom <rransom.8774 at gmail.com> wrote:
 [...]
>>   AES-CTR + HMAC-SHA512/256.
>>
>>   AES-CTR + Poly1305.  Poly1305 requires nonces, but we can use a
>>   counter for those.
>
> Poly1305AES requires nonces.  Poly1305 itself requires
> (computationally-indistinguishable-from-) independent keys for each
> message.

Right; I meant to say the output of a stream cipher used as a PRF, but
what I meant isn't what I said.  Should've proofed more carefully

>  Please actually read the paper (see also
> http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305
> now).

I read everything I cited.  If there is something I didn't understand,
or something I missed, or something I got wrong, or something I said
wrong, that doesn't mean I didn't read the paper.

I am not going to be able to draw all the right inferences from every
paper on my own, though.  And I am *definitely*, *frequently*, going
to read papers, come up with questions, and post those questions here
sometimes even when the paper, properly understood, would answer my
questions.  If I were writing for publication, I'd want to keep all my
ideas secret until I could answer all my questions and make sure all
my answers were right, but I'm not writing for publication -- I am
writing to get feedback from other people and learn things.  Thank you
for helping me learn things!

Also thank you for reminding me to read
http://cr.yp.to/antiforgery/securitywcs-20050227.pdf again.  I'll
check it out.

>>   Salsa20 + Poly1305.
>>
>> For a padding PRNG, we could use any stream cipher we like.
>>
>> In each case, we'd want to authenticate not only the content of the
>> cell, but also the previous cell's authenticator and the position of
>> the cell within the stream.
>>
>> AES-GCM+AES is going to be the fastest on CPUs with specific support
>> for AES and GCM, but not on other architectures until/unless more
>> chips grow instructions specialized for AES and GF(2^128).
>> Salsa20+Poly1305 should be fastest on other architectures.
>>
>>
>>
>> This entire category of designs still has the problems that it had
>> before: it leaks the length of the circuit to the last node in the
>> circuit, and consumes a fairly large portion of each cell (for one
>> truncated mac per hop).  Further, it's going to be a bit fiddly
>> (though not impossible) to get it to work with rendezvous circuits.
>
> Explicitly leaking the circuit length is very very bad.

Are there any non-obvious reasons why?  Does it lead to any better
attacks than the obvious ones?

> The MAC-based designs do not mention how to prevent end-to-end tagging
> in the exit-to-client direction.  I suspect they won't try to prevent
> it at all.

That's correct.  Why would it be an attack for the exit to send a
covert signal to the client?  The exit already has valid means to send
overt signals to the client, and the client Tor is presumed not to
want to break its own anonymity.

>> II. WIDE-BLOCK DESIGNS
>>
>> Turn we now to designs where we encrypt each block of the cipher using
>> a 509-byte SPRP, such that each block's SPRP is keyed dependently on
>> the original key and on the encrypted value of the previous block.
>>
>> There be dragons here.  Let's talk about some ways we could build
>> that.  I'll start by talking about wide-block encryption, then talk
>> about getting the "unrecoverability" property where any missing or
>> corrupted block makes all future blocks unrecoverable.
>>
>>
>> The wide-block SPRP I've used before is LIONESS [Bear-Lion].  It
>> requires two keyed hash operations and two stream cipher operations,
>> so it's going to be at best half the speed of the mac-and-pad designs
>> above: a little worse, maybe, since it forces us to rekey with every
>> block.
>>
>> Might that be acceptable?  Right now, AES is not a huge fraction of
>> our runtime, and a relay does AES on each cell three times (TLS,relay
>> crypto,TLS) and SHA1 on each cell twice (TLS,TLS).  An exit does AES
>> on each cell twice (TLS,relay) and SHA1 on each cell twice
>> (TLS,relay). So for relays, with a naive SHA1-AES LIONESS, we'd be
>> increasing the stream-cipher operations by 25% and the digests by
>> 100%.  On an exit, we'd be increasing the stream-cipher operations by
>> 33% and the digests by 100%.
>>
>> If we were to go with LIONESS, we'd surely want to look into faster,
>> better keyed-hash algorithms than SHA1.   I don't know whether one of
>> the polynomial MACs above would be good enough, or whether we need
>> other cryptographic digest properties that those don't give us and
>> we'd need to turn to SHA256 or something.
>>
>>
>>
>> I've heard some suggestions that we should look into BEAR or LION
>> instead, but I'm leery of the idea.  They are faster, requiring one
>> fewer application of their primitives, but they are PRPs, not SPRPs --
>> meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section
>> 6].  I don't know a way to exploit this in the Tor network, but
>> deviating from an ideal block cipher seems to me like one of those
>> ideas that's practically begging for disaster.
>
> The notion of ‘Ind-CCA2’ is defined for public-key encryption systems;
> it doesn't make sense for permutations.
>
> Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual
> definition, and it is a much stronger assumption than Tor's relay
> crypto needs.  (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and
> http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially
> useful, potentially interesting information.)
>
> Keep in mind that Tor's current relay crypto breaks *completely* if
> the circuit-extension handshake ever produces the same session key
> twice, and some parts of Tor's protocols will still require that
> assumption even if the relay crypto doesn't.  Therefore, you really
> don't need to worry about attacks that require more than one call to
> the permutation in either direction.

Ah, so I think you are claiming that LION or BEAR on its own is fine
if they are only used once with each key, and that any sane cell
chaining construction we use will involve re-keying them with each
block, so we can ignore any attack on them that would require multiple
invocations of the encryption/decryption function with the same key.

Do I understand you correctly there?

[...]
>> The multiplication operations here appear to be
>> multiplication by a primitive element, and multiplication by a per-key
>> element.  The encryption step can be realized with a somewhat
>> unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB
>> approach.  I don't know what you'd need to do to substitute in an
>> orthodox stream cipher for the one used in iHCH.  Sarkar seems to see
>> iHCH as a successor to HCH, which is a little worrisome given that HCH
>> is a spiritual descendant of the patented XCB, but to me the two
>> constructions (HCH, iHCH) look practically nothing alike except for
>> their use of a counter mode step.
>>
>> I am a little intimidated by the idea of trying to implement one of
>> these ciphers.
>
> LION (implemented with ChaCha and CubeHash512, with an extra 256 bits
> of chaining output from each step) sounds like the safe-and-sane
> approach for now; changing the protocol again if the U.S. patent
> system expires won't be nearly as hard as changing the protocol the
> first time.
>
> (BLAKE-512 might be a faster hash function on some processors, but
> requires 64-bit operations.  BLAKE-256 would be faster, but then (a)
> you don't get a chaining output from any function computed directly
> over R, and (b) you still need to produce 512 bits of
> (computationally-indistinguishable-from-)independent key material from
> some hash function in the chaining step.  CubeHash512 will never be
> broken as an entropy extractor in less time than the Curve25519-based
> circuit-extension handshake.)

Interesting; I had stopped considering CubeHash when got dropped from
the SHA-3 competition.  I'll have to find out who's been working on
analyzing it since then.

Can you be more explicit about where the chaining output is taken from
when, and how exactly it gets used, in this design?

Also, why LION and not BEAR?


> (Does the XCB patent cover LION-like constructions with the ‘hash
> function’ in the middle replaced with a polynomial-evaluation
> function?  I'm not convinced that a suitable approximately-256-bit
> polynomial-evaluation function will be much faster than ChaCha,
> especially in 32-bit code.)

I haven't read the XCB patent, and won't, pending legal advice from
Wendy telling me that it's okay to read it.

 [...]
> The obvious way to implement GF(2^128) multiplication of a large
> number of secret inputs y by one learned-at-runtime secret c is:
>
> * Compute a table of c*X^i for i = 0, ..., 127.  (This table takes
> 128*128 bits, or 2048 bytes.  Multiplication by X is easy and
> tolerably fast.)
> * For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in
> GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127).  (You've done
> this step for Pynchon Gate already; hopefully that runs in constant
> time.)

Hm. Did you figure out cycles-per-byte on that one?

Back-of-the-envelope:

Assuming 64-bit words, we're looking at a 128 instances of "get a bit
from y," 128*2 instances of "Load the corresponding word from the
table, then constant-time-conditionally-xor that word to the
accumulator."

The fast portable implementation of the conditional-xor I know, for a
single bit in 'b', and a value in 'x' to be conditionally xored into
'a' is:  "a ^=  x & ~(b-1)".

I *think* that's roughly 128 (and, sub, not) to get a mask for each
bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256
xors.  (Some of these operations are unnecessary; I am committing
fencepost errors here. I'm also assuming all loads are one-cycle.)

So that's about 1280 operations for 16 bytes of input to be multiplied
by c, so that seems like something like 80 instructions per byte for a
naive implementation. There are still probably more optimizations to
be done, especially if we have wacky SIMD features to play with, or we
can combine some of the operations above into single ones.  Dynamic
multiple issue and such CPU architectural tricks might get it down
even more.  Nevertheless, it still looks like it'd be expensive enough
getting GF(2^128) right to make GF(2^128) unattractive.

Still, maybe somebody should hack this up for the public good, whether
we turn out to need GF(2^128) or not.

>> As a further wrinkle with GF(2^128), OpenSSL doesn't seem to actually
>> expose its "multiply in GF(2^128)" functions as far as I can tell: we
>> would need to snarf such code from elsewhere.
>>
>>
>> Oh! ARMv8 has an optional crypto instruction set that supports AES,
>> SHA256, and GF(2^128) multiplication [ARMv8].  If it looks like most
>> of the ARMs we care about are going to get it, that could factor into
>> our planning.
>
> I won't believe claims that AES hardware will be widely available
> until it actually is present in every new processor from a major
> manufacturer.

Agreed.

-- 
Nick


More information about the tor-dev mailing list