Design changes for connection-level (TLS) key rotation

Tue Apr 20 19:51:58 UTC 2004

We want to periodically rotate the TLS keys between Tor nodes. This
isn't just updating the symmetric key for the TLS connection; we want
to change the asymmetric key too, to get perfect forward secrecy after
the rotation. The way to do this is to periodically build new
connections, and migrate existing circuits over to them.

Right now, the code expects there will be at most one connection between
two onion routers, or between an onion router and an onion proxy. If a
second one is established (finishes its TLS handshake), then the receiver
closes it.

So we need three changes:

1) Change the accessors for finding a connection for a given router,
so they return the "best" one to use.
2) Change tls_finish_handshake so when a new connection arrives, it
initiates the shutdown of all non-best connections.
3) Periodically decide that an old connection should be rotated. Launch
a new connection, and migrate any remaining circuits to the new one.

First issue: what if we get a race, where two ORs decide at the same time
to launch a new conn? Which one is 'best'? We break these ties by leaving
the responsibility to the OR with the lexicographically smaller nickname.
It's his job to decide to build a new conn, and his job to decide which
conns should be expired. The higher-nicknamed OR can choose to simply
close a connection which should have been rotated but hasn't been. In
the case of an OP-OR connection, the OP has the responsibility for
rotating conns.
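
The tie-breaking rule above could be sketched roughly as follows. This is
only an illustration: the function name and its parameters are made up
here, nicknames are assumed to be unique, and plain string comparison
stands in for whatever "lexicographically smaller" ends up meaning in
the real code.

```python
def is_responsible_for_rotation(local, remote,
                                local_is_op=False, remote_is_op=False):
    """Return True if the local node should launch and expire conns.

    Hypothetical helper, not part of the existing code.
    """
    if local_is_op:
        # OP-OR connection: the OP is always responsible.
        return True
    if remote_is_op:
        return False
    # OR-OR connection: the lexicographically smaller nickname decides.
    # Assumes the two nicknames differ.
    return local < remote
```

For any pair of distinct OR nicknames exactly one side gets True, so the
two ORs never both believe they are in charge of the same link.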

Second issue: how do we migrate circuits? Fortunately, the cells
between two ORs don't have to be in order. We just need to make sure
that the cells _within a given circuit_ are in the right order. So now
let circs point to two conns: one for receiving cells and one for sending
cells. When it's time to migrate a circuit, the OR sends a SHUTDOWN cell
down the old conn. He then repoints the sending conn of all circs on that
conn to the new one. When the other OR receives the SHUTDOWN cell, he
repoints both conns of those circs, responds with a SHUTDOWN cell of his
own, and closes the socket. When that final SHUTDOWN cell arrives at the
original OR, he repoints the receiving conn and is done. The original OR
also repoints the receiving conn if the socket closes, and closes the
socket himself if he's waited too long.

(So circ IDs now have to be unique between two nodes, not just on a
single conn.)
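
A minimal sketch of the migration steps described above, seen from one
side of the link. The Conn and Circuit classes and the three function
names are invented for illustration; real conns would write cells to a
socket rather than append to a list, and the timeout handling is left
out.

```python
class Conn:
    """Stand-in for one TLS connection (hypothetical)."""
    def __init__(self, name):
        self.name = name
        self.open = True
        self.sent = []  # cells "written" to this conn, for illustration


class Circuit:
    """A circ with separate sending and receiving conns."""
    def __init__(self, circ_id, conn):
        self.circ_id = circ_id
        self.send_conn = conn
        self.recv_conn = conn


def start_rotation(circs, old, new):
    """Initiator: send SHUTDOWN on the old conn, repoint sending side only."""
    old.sent.append("SHUTDOWN")
    for c in circs:
        if c.send_conn is old:
            c.send_conn = new


def on_shutdown_received(circs, old, new):
    """Responder: repoint both directions, answer with SHUTDOWN, close."""
    for c in circs:
        if c.send_conn is old:
            c.send_conn = new
        if c.recv_conn is old:
            c.recv_conn = new
    old.sent.append("SHUTDOWN")
    old.open = False


def finish_rotation(circs, old, new):
    """Initiator: on the answering SHUTDOWN, a closed socket, or a
    timeout, repoint the receiving side and drop the old conn."""
    for c in circs:
        if c.recv_conn is old:
            c.recv_conn = new
    old.open = False
```

Between start_rotation and finish_rotation the initiator sends on the new
conn but still receives on the old one, which is why each circ needs two
conn pointers.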

Third issue: what about cells that want to be sent before the shutdown
has been acknowledged? Or cells from the other guy that arrive on the
new conn before his acknowledgement has arrived? We need to be able to
receive cells from a circ on any conn. One approach is to add a counter
to the cell header (on a per-circuit basis or overall). Then we could
simply switch over to the new conn, and the ORs would know whether they
missed any cells.
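
To make the counter idea concrete, here is a rough sketch of a
per-circuit counter on the receiving side. Everything in it is an
assumption for illustration (the class name, starting the counter at 0,
holding early cells in a dict); nothing like this exists in the cell
header today.

```python
class SequencedCircuit:
    """Receiver state for one circ, assuming each cell carries a counter."""
    def __init__(self):
        self.next_expected = 0  # counter value of the next cell to deliver
        self.pending = {}       # cells that arrived ahead of their turn
        self.delivered = []     # payloads handed on, in counter order

    def on_cell(self, seq, payload):
        """Accept a cell from any conn and deliver whatever is now in order."""
        self.pending[seq] = payload
        while self.next_expected in self.pending:
            self.delivered.append(self.pending.pop(self.next_expected))
            self.next_expected += 1

    def missing(self):
        """Counter values we are still waiting for below the highest held one."""
        if not self.pending:
            return []
        return [s for s in range(self.next_expected, max(self.pending))
                if s not in self.pending]
```

With this, a cell can arrive on either conn and the OR can tell from
missing() whether something sent on the old conn is still outstanding.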

But let's back up a second. If we're going to do counters, then we
definitely need queues on each side (to handle reordering cells). But
queues on each side might be sufficient by themselves. If we batch up
the cells we get on the 'wrong' conn, then when we get the SHUTDOWN cell
on the old conn we can process them all. This way we can avoid
implementing much more of TCP ourselves.
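
The queue-only approach might look something like this. It is a sketch
under the assumption that the peer sends everything it has for the old
conn before its SHUTDOWN cell and only then starts using the new conn;
the class and method names are hypothetical, and cells are plain strings
here.

```python
from collections import deque


class CellReceiver:
    """Hold cells that show up early on the new conn (hypothetical)."""
    def __init__(self, old_conn, new_conn):
        self.old_conn = old_conn
        self.new_conn = new_conn
        self.held = deque()    # cells batched from the 'wrong' conn
        self.delivered = []    # cells processed, in order
        self.old_done = False  # True once SHUTDOWN arrived on the old conn

    def on_cell(self, conn, cell):
        if cell == "SHUTDOWN" and conn == self.old_conn:
            # Old conn is drained: process everything we batched up.
            self.old_done = True
            while self.held:
                self.delivered.append(self.held.popleft())
        elif conn == self.new_conn and not self.old_done:
            # Arrived before the SHUTDOWN did; hold it for now.
            self.held.append(cell)
        else:
            self.delivered.append(cell)
```

For example, if "a" arrives on the old conn, then "c" on the new conn,
then "b" and SHUTDOWN on the old conn, the cells are processed as a, b,
c, with no counter needed in the cell header.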

On the third hand, having counters would give us more flexibility. We
could explore Dan Kaminsky's idea of having hotswap links (when one jams,
migrate to another right quick). We'd be closer to being able to queue and
resend cells end-to-end, which would allow us to tolerate node failures
in the middle of the circuit.

I guess we should stick to our standard approach, which is to only build
the infrastructure we need when we know we need it. That means no counters
for now, and just a few extra queues.

Other perspectives?
--Roger