[tor-dev] Even more notes on relay-crypto constructions

Robert Ransom rransom.8774 at gmail.com
Wed Oct 10 03:16:28 UTC 2012

On 10/9/12, Nick Mathewson <nickm at alum.mit.edu> wrote:
> On Tue, Oct 9, 2012 at 12:31 PM, Robert Ransom <rransom.8774 at gmail.com>
> wrote:
>  [...]
>>>   AES-CTR + HMAC-SHA512/256.
>>>   AES-CTR + Poly1305.  Poly1305 requires nonces, but we can use a
>>>   counter for those.
>> Poly1305AES requires nonces.  Poly1305 itself requires
>> (computationally-indistinguishable-from-) independent keys for each
>> message.
> Right; I meant to say the output of a stream cipher used as a PRF, but
> what I meant isn't what I said.  Should've proofed more carefully
>>  Please actually read the paper (see also
>> http://cr.yp.to/papers.html#pema section 2.5 for how DJB uses Poly1305
>> now).
> I read everything I cited.  If there is something I didn't understand,
> or something I missed, or something I got wrong, or something I said
> wrong, that doesn't mean I didn't read the paper.
> I am not going to be able to draw all the right inferences from every
> paper on my own, though.  And I am *definitely*, *frequently*, going
> to read papers, come up with questions, and post those questions here
> sometimes even when the paper, properly understood, would answer my
> questions.  If I were writing for publication, I'd want to keep all my
> ideas secret until I could answer all my questions and make sure all
> my answers were right, but I'm not writing for publication -- I am
> writing to get feedback from other people and learn things.

You have been talking about using Poly1305 truncated to 64 bits for
weeks.  It is truly not difficult to find Theorem 3.3 in the Poly1305
paper and figure out that the polynomial-evaluation part of Poly1305
truncated to its first (least significant) 64 bits has differential
probabilities of at most 2^(-34) for Tor-cell-size messages (and thus
Poly1305 has a probability of forgery of at most 2^(-34) when
truncated to its first 64 bits).

>>> This entire category of designs still has the problems that it had
>>> before: it leaks the length of the circuit to the last node in the
>>> circuit, and consumes a fairly large portion of each cell (for one
>>> truncated mac per hop).  Further, it's going to be a bit fiddly
>>> (though not impossible) to get it to work with rendezvous circuits.
>> Explicitly leaking the circuit length is very very bad.
> Are there any non-obvious reasons why?  Does it lead to any better
> attacks than the obvious ones?

* Cannibalized circuits are one hop longer than non-cannibalized
circuits; knowing that a particular circuit was cannibalized leaks
information about the client's previous exit-selection behaviour.

* Some users might want to use alternate path-selection algorithms
(e.g. http://freehaven.net/anonbib/#ccs2011-trust ); they might not
want to leak the fact that they are using such algorithms to the
non-first relays in their entry-guard chains.

>> The MAC-based designs do not mention how to prevent end-to-end tagging
>> in the exit-to-client direction.  I suspect they won't try to prevent
>> it at all.
> That's correct.  Why would it be an attack for the exit to send a
> covert signal to the client?  The exit already has valid means to send
> overt signals to the client, and the client Tor is presumed not to
> want to break its own anonymity.

* Mallory's malicious entry and exit relays suspect (based on timing
correlation) that they control both ends of a circuit, and want to
confirm that.  The exit tags a cell it sends to the client by XORing
it with a secret random constant; the entry XORs (what it believes is)
that cell with the same constant.  If the circuit survives, Mallory's
relays know that they control both ends.
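A minimal sketch of that confirmation attack in plain Python (the 509-byte payload size and function names are illustrative, not Tor code):

```python
import os

# Mallory's secret random constant, shared by the two colluding relays.
SECRET = os.urandom(509)

def exit_tag(cell):
    # The exit XORs the constant into a cell it sends toward the client.
    return bytes(a ^ b for a, b in zip(cell, SECRET))

def entry_untag(cell):
    # The colluding entry XORs the same constant back in.  If the two
    # relays really do share a circuit, this restores the original cell
    # and the circuit survives; otherwise the cell is garbage and the
    # client tears the circuit down -- which is the confirmation signal.
    return bytes(a ^ b for a, b in zip(cell, SECRET))
```

Nothing in the MAC-based designs authenticates the exit-to-client direction, so the tagged cell passes every check the client performs.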

* Mallory doesn't want to pay the bandwidth costs for non-compromised
traffic, so his/her/its entry and exit relays tag *every* circuit's
first exit-to-client cell with the same secret random constant.  (You
added a crapload of bugs to 0.2.3.x based on Mike Perry's claim that
someone is likely to actually perform an ‘attack’ of this form.)

>>> I've heard some suggestions that we should look into BEAR or LION
>>> instead, but I'm leery of the idea.  They are faster, requiring one
>>> fewer application of their primitives, but they are PRPs, not SPRPs --
>>> meaning I think that they don't give you Ind-CCA2 [Bear-Lion, section
>>> 6].  I don't know a way to exploit this in the Tor network, but
>>> deviating from an ideal block cipher seems to me like one of those
>>> ideas that's practically begging for disaster.
>> The notion of ‘Ind-CCA2’ is defined for public-key encryption systems;
>> it doesn't make sense for permutations.
>> Even LIONESS is not an ‘ideal block cipher’ -- that term has an actual
>> definition, and it is a much stronger assumption than Tor's relay
>> crypto needs.  (http://cs.nyu.edu/~dodis/ps/ic-ro.pdf and
>> http://cs.nyu.edu/~dodis/ps/ext-cipher.pdf contain some potentially
>> useful, potentially interesting information.)
>> Keep in mind that Tor's current relay crypto breaks *completely* if
>> the circuit-extension handshake ever produces the same session key
>> twice, and some parts of Tor's protocols will still require that
>> assumption even if the relay crypto doesn't.  Therefore, you really
>> don't need to worry about attacks that require more than one call to
>> the permutation in either direction.
> Ah, so I think you are claiming that LION or BEAR on its own is fine
> if they are only used once with each key, and that any sane cell
> chaining construction we use will involve re-keying them with each
> block, so we can ignore any attack on them that would require multiple
> invocations of the encryption/decryption function with the same key.
> Do I understand you correctly there?


> [...]
>>> The multiplication operations here appear to be
>>> multiplication by a primitive element, and multiplication by a per-key
>>> element.  The encryption step can be realized with a somewhat
>>> unorthodox counter-mode stream cipher, or a ciphertext-stealing ECB
>>> approach.  I don't know what you'd need to do to substitute in an
>>> orthodox stream cipher for the one used in iHCH.  Sarkar seems to see
>>> iHCH as a successor to HCH, which is a little worrisome given that HCH
>>> is a spiritual descendant of the patented XCB, but to me the two
>>> constructions (HCH, iHCH) look practically nothing alike except for
>>> their use of a counter mode step.
>>> I am a little intimidated by the idea of trying to implement one of
>>> these ciphers.
>> LION (implemented with ChaCha and CubeHash512, with an extra 256 bits
>> of chaining output from each step) sounds like the safe-and-sane
>> approach for now; changing the protocol again if the U.S. patent
>> system expires won't be nearly as hard as changing the protocol the
>> first time.
>> (BLAKE-512 might be a faster hash function on some processors, but
>> requires 64-bit operations.  BLAKE-256 would be faster, but then (a)
>> you don't get a chaining output from any function computed directly
>> over R, and (b) you still need to produce 512 bits of
>> (computationally-indistinguishable-from-)independent key material from
>> some hash function in the chaining step.  CubeHash512 will never be
>> broken as an entropy extractor in less time than the Curve25519-based
>> circuit-extension handshake.)
> Interesting; I had stopped considering CubeHash when it got dropped
> from the SHA-3 competition.  I'll have to find out who's been working on
> analyzing it since then.
> Can you be more explicit about where the chaining output is taken from
> when, and how exactly it gets used, in this design?

For encryption:

* Split the 512-byte plaintext cell into 480-byte R and 32-byte L.
(The names are reversed to match the original description of LION.)
* Let header be the first 3 bytes of R; replace the first 3 bytes of R
with zeros.
* Let C_1 be the first 32 bytes of ChaCha(L XOR K_1, “LION p.1”); let
T be the next 480 bytes.
* XOR T into R.  Replace the first 3 bytes of R with zeros.
* Let T be the first 32 bytes of CubeHash512(R); let C_2 be the other 32 bytes.
* XOR T into L.
* Let C_3 be the first 32 bytes of ChaCha(L XOR K_2, “LION p.2”); let
T be the next 480 bytes.
* XOR T into R.  Replace the first 3 bytes of R with header.
* The ciphertext is concat(R, L).

(Decryption is easy, and produces the same C_i.)

To compute keys for the next block:

* Compute CubeHash512(concat(K_3, C_1, C_2, C_3)); let K_1 be its
first 32 bytes; let K_2 be the next 32 bytes.

The chaining step absolutely requires 512 bits of output.  That could
be obtained by computing one ChaCha block using the output of
BLAKE-256 as a key, but CubeHash is simpler, at least as secure, and
probably as fast.
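
The whole procedure above can be sketched as runnable Python.  SHAKE-256 stands in for ChaCha as the keyed stream generator and SHA-512 stands in for CubeHash512, purely to keep the sketch dependency-free; the actual design calls for ChaCha and CubeHash512:

```python
import hashlib

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _stream(key, label):
    # 512 bytes of keystream.  Stand-in for ChaCha keyed with `key` and
    # the given label constant.
    return hashlib.shake_256(label + key).digest(512)

def _h512(data):
    # Stand-in for CubeHash512 (64 bytes of output).
    return hashlib.sha512(data).digest()

def lion_encrypt(cell, k1, k2):
    assert len(cell) == 512 and len(k1) == len(k2) == 32
    r, l = cell[:480], cell[480:]               # 480-byte R, 32-byte L
    header, r = r[:3], b"\x00" * 3 + r[3:]      # save header, zero it
    s = _stream(_xor(l, k1), b"LION p.1")
    c1, t = s[:32], s[32:]
    r = b"\x00" * 3 + _xor(r, t)[3:]            # XOR T into R, re-zero header
    h = _h512(r)
    t, c2 = h[:32], h[32:]
    l = _xor(l, t)
    s = _stream(_xor(l, k2), b"LION p.2")
    c3, t = s[:32], s[32:]
    r = header + _xor(r, t)[3:]                 # XOR T into R, restore header
    return r + l, (c1, c2, c3)

def lion_decrypt(cell, k1, k2):
    # The three steps run in reverse; the same C_i fall out.
    r, l = cell[:480], cell[480:]
    header, r = r[:3], b"\x00" * 3 + r[3:]
    s = _stream(_xor(l, k2), b"LION p.2")
    c3, t = s[:32], s[32:]
    r = b"\x00" * 3 + _xor(r, t)[3:]
    h = _h512(r)
    t, c2 = h[:32], h[32:]
    l = _xor(l, t)
    s = _stream(_xor(l, k1), b"LION p.1")
    c1, t = s[:32], s[32:]
    r = header + _xor(r, t)[3:]
    return r + l, (c1, c2, c3)

def next_keys(k3, c1, c2, c3):
    # Chaining step: derive the next block's K_1, K_2 from the chaining
    # key K_3 and this block's chaining outputs.
    h = _h512(k3 + c1 + c2 + c3)
    return h[:32], h[32:]
```

Note that the 3-byte header rides through in the clear, and that encryption and decryption compute identical chaining values, so both endpoints can rekey in lockstep.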

> Also, why LION and not BEAR?

BEAR uses two hash applications and one stream-cipher application;
LION uses two stream-cipher applications and one hash.  ChaCha12 is
definitely faster than BLAKE-256 or CubeHash, and you aren't going to
use a reduced-round variant of a hash function.

>  [...]
>> The obvious way to implement GF(2^128) multiplication of a large
>> number of secret inputs y by one learned-at-runtime secret c is:
>> * Compute a table of c*X^i for i = 0, ..., 127.  (This table takes
>> 128*128 bits, or 2048 bytes.  Multiplication by X is easy and
>> tolerably fast.)
>> * For each input y (= y_0*X^0 + ... + y_127*X^127, with the y_i in
>> GF(2)), compute y_0*(c*X^0) + ... + y_127*(c*X^127).  (You've done
>> this step for Pynchon Gate already; hopefully that runs in constant
>> time.)
> Hm. Did you figure out cycles-per-byte on that one?

Nope.  This was just a proof-of-concept.  (It also provides an upper
bound on the amount of table that a secure implementation can
reasonably use; anything with a larger table will need to do too much
work to access it in constant time.)

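For reference, the table method quoted above can be written out as a functional sketch (Python ints standing in for 128-bit words, with x^128 + x^7 + x^2 + x + 1 as an assumed reduction polynomial; Python integers are not constant-time, so this checks only the algebra, not the side-channel property):

```python
R_POLY = 0x87              # x^128 = x^7 + x^2 + x + 1 (assumed reduction)
MASK = (1 << 128) - 1

def times_x(c):
    # Multiply a field element by X: shift, then fold the overflow bit
    # back in via the reduction polynomial.
    return ((c << 1) ^ (R_POLY if c >> 127 else 0)) & MASK

def make_table(c):
    # c * X^i for i = 0..127: 128 entries of 128 bits = 2048 bytes.
    table = []
    for _ in range(128):
        table.append(c)
        c = times_x(c)
    return table

def gf_mul(table, y):
    # y * c = XOR of c*X^i over the set bits y_i of y.  The mask-and-xor
    # per bit mirrors the constant-time pattern described in the post.
    acc = 0
    for i in range(128):
        mask = -((y >> i) & 1) & MASK   # all-ones iff bit i of y is set
        acc ^= table[i] & mask
    return acc

def naive_mul(a, b):
    # Schoolbook carry-less multiply plus reduction, for cross-checking.
    acc = 0
    for i in range(128):
        if (b >> i) & 1:
            acc ^= a << i
    for i in range(254, 127, -1):
        if (acc >> i) & 1:
            acc ^= (1 << i) ^ (R_POLY << (i - 128))
    return acc
```
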
> Back-of-the-envelope:
> Assuming 64-bit words, we're looking at a 128 instances of "get a bit
> from y," 128*2 instances of "Load the corresponding word from the
> table, then constant-time-conditionally-xor that word to the
> accumulator."

All 64-bit processors have SIMD instructions.  In general, consider
32-bit words and 4-element (128-bit) vectors of 32-bit words; those
are efficient enough everywhere.

> The fast portable implementation of the conditional-xor I know, for a
> single bit in 'b', and a value in 'x' to be conditionally xored into
> 'a' is:  "a ^=  x & ~(b-1)".

Invert y (bitwise) first (in one vector instruction or four word
instructions); then you can skip inverting each b-1 .  (Someone may
need to patch your Pynchon Gate code.)
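
The two variants, demonstrated with 64-bit-masked Python ints (illustrative only; in C these would be single word operations):

```python
M64 = (1 << 64) - 1

def ctxor(a, x, b):
    # a ^= x & ~(b - 1): with bit b in {0, 1}, ~(b - 1) is all-ones
    # exactly when b == 1, so x is xored in only then.
    return a ^ (x & ~((b - 1) & M64) & M64)

def ctxor_prenot(a, x, nb):
    # With nb = 1 - b taken from a bitwise-inverted copy of y, the
    # per-bit NOT disappears: (nb - 1) is all-ones exactly when b == 1.
    return a ^ (x & ((nb - 1) & M64))
```
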

> I *think* that's roughly 128 (and, sub, not) to get a mask for each
> bit, 128 shifts, then 256 loads, 256 ands (to apply the mask), and 256
> xors.  (Some of these operations are unnecessary; I am committing
> fencepost errors here. I'm also assuming all loads are one-cycle.)
> So that's about 1280 operations for 16 bytes of input to be multiplied
> by c, so that seems like something like 80 instructions per byte for a
> naive implementation. There are still probably more optimizations to
> be done, especially if we have wacky SIMD features to play with, or we
> can combine some of the operations above into single ones.  Dynamic
> multiple issue and such CPU architectural tricks might get it down
> even more.  Nevertheless, it still looks like it'd be expensive enough
> getting GF(2^128) right to make GF(2^128) unattractive.

With the not operation for each bit removed, the 64-bit-word operation
count drops to around 1024.

With 32-bit words, that's 512 each of load, and, xor; the total
number of instructions is around 1920 on a plain 32-bit processor.

With SIMD, that's 128 each of load, and, xor, but the ‘compute a mask
for each bit’ step is annoying and processor-specific, so it's not
much better in terms of instruction count than 64-bit words.  However,
the table can be computed in a slightly different format to make it
easier to compute masks in the SIMD registers; that should get down to
not much more than 768 instructions.

> Still, maybe somebody should hack this up for the public good, whether
> we turn out to need GF(2^128) or not.

Look for a better algorithm first -- there should be one.

Robert Ransom
