Parallelizing (some of) Tor's AES Operations

Mon Apr 21 16:34:04 UTC 2008

[replying on or-dev with permission]

On Mon, Apr 21, 2008 at 02:29:05AM -0400, Matt Edman wrote:
 [...]
> 
> I've been sticking to trying to parallelize encryption of cells on circuits.
> Trying to get all OpenSSL crypto parallelized would be great, but it also
> seems quite a bit trickier and perhaps too ambitious for this small project. 

Agreed.  Sadly, though, OpenSSL AES is 2/3 of all symmetric crypto
done on a typical relay: each cell gets openssl-decrypted on the way
in, processed with relay_crypt, then openssl-encrypted on the way out.
For circuits at an exit node, OpenSSL's AES is only 1/2 of symmetric
crypto done at a relay, and for circuits at the client it's even
higher.

> Here are some of the parallelization approaches I've been pondering:
>
>  1) The most straightforward approach I can think of is to launch
> $NumCpus threads when Tor starts up. When aes_crypt() or
> aes_crypt_inplace() get called from the main thread, the input gets
> split into $length/$NumCpus chunks (or maybe fewer depending on
> input length), which get farmed out to the worker threads.  The main
> thread then collects the results and returns.
>
> A potential downside to this straightforward approach is that the input is
> likely often so small that perhaps the overhead from mutex locking and thread
> signaling dominates any potential gain from the parallelization. It may also
> be a little unbalanced since the input length isn't often a multiple of the
> number of threads (e.g., 509 byte cell sizes). The latter concern probably
> isn't so critical though.

Right.  Personally, I don't expect that this will wind up working very
well in practice, what with the locking but if you want to benchmark
it, it might work out well after all.  Having the main thread wait
while all the subthreads are doing AES seems like a losing proposition.

> 2) I also took a look at a parallel AES-CTR implementation done by some folks
> at the Pittsburgh Supercomputing Center:
> 
>   http://www.psc.edu/networking/projects/hpn-ssh/openssh5.0-CTR-threading.diff
> 
>
> The gist of their approach is to launch two threads for every
> EVP_CIPHER_CTX when it is created. The threads handle pregenerating
> keystream blocks in the background. The main thread reads from one
> of the pregenerated buffers they call keystream queues (there is one
> queue for every thread, plus two), marks the queue as draining, and
> signals the worker threads to buffer up some more blocks if
> necessary. The advantage of having multiple keystream queues is that
> while the main thread is reading from one queue, the others can be
> getting refilled by the worker threads.
>
> I'm not sure how many aes_cnt_cipher_t's a busy Tor relay has, but I'm
> guessing that launching two threads for every one of them results in...a lot
> of threads.  Their performance results for OpenSSH are impressive, but this
> approach may not be exactly appropriate for Tor.

A relay server has around 2 aes_cnt_cipher_t instances per active
circuit, so let's just skip to the next one below.

> 3) A modification of their idea is to only launch $NumCpus threads when Tor
> starts up, rather than a couple for every aes_cnt_cipher_t. Each
> aes_cnt_cipher_t still maintains a couple keystream queues that get filled by
> the worker threads in the background. The PSC implementation allocates 4096
> bytes for each keystream queue. Clearly the downside here is memory
> consumption. I know you've been working to reduce Tor's memory consumption,
> and this change would seem to go in the opposite direction. I'm guessing the
> 4096 bytes could be tuned down quite a bit, but I think we'd still like to
> buffer up at least 512 bytes so we can encrypt an entire cell without having
> to wait for the threads to refill a keystream queue.

Right.  The success of this approach would depend totally on
predicting in advance which circuits are going to want to encrypt lots
of cells and which aren't.  Also, you'd need some way for the main
thread to tell the threads, "I have a bunch of cells waiting for
circuit X, but there's no pregenerated stream left or that circuit, so
please prioritize X's AES stream generation over other circuits'."

I like the latency win on this approach in the ideal case, but the RAM
hit seems nasty, and I worry about increased latency in the case where
the main thread has to wait for more stream.

>  4) A fourth approach would be to farm entire cells out to
> threads. Admittedly, I don't have a complete grasp on the flow of
> cells through the code yet. From looking at relay.c, it looks like
> this would entail (very vaguely) farming out
> relay_crypto_one_payload() and then continuing normal processing of
> the cell when the crypto is done. If I'm understanding correctly, I
> guess we would also need to take care to avoid situations in which
> two cells from the same circuit get farmed out to separate threads,
> but get append_cell_to_circuit_queue()ed out of order. IMO, this
> would be the way to go, but it's not clear to me at the moment how
> to allow processing of cells on other circuits to continue while
> waiting for the crypto to complete on a cell on one circuit.

I agree this is the best way to go.  Before you start, you should
check out the way that Tor handles cells in circuits right now:
that'll make your job even easier.

First off, ignore cells generated and received by clients.  Client
performance matters, but servers are likelier to benefit from
parallelization, to have multiple CPUs, and to be CPU-bound. 

Basics: every connection has two buffers: one for ingoing data and one
for outgoing data.  For servers, cells are received on OR connections
over TLS.  Once OpenSSL has decrypted the connection data, the cells
are in now in their incoming OR connection's input buffer, or "inbuf".

Having read data into the inbuf makes Tor call
connection_or_process_cells_from_inbuf() in connection_or.c.  This
pulls a cell off the buffer and calls command_process_cell() in
command.c.  If the cell is a relay cell, we go to
command_process_relay_cell() and the magic starts.

command_process_relay_cell() doees the following:
  - Given the connection that the cell came in on, and the cell's ciruit
    ID, it looks up the appropriate or_circuit_t for the circuit.
  - It passes the cell to circuit_receive_relay_cell(), which does the
    following:
    - It calls relay_crypt() to actually do the crypto on the cell.[**]
    - If the cell is intended for this hop on the circuit (that is,
      we're the exit or we're the client), pass it off to the
      appropriate edge stream.
    - Otherwise, find the next or_conn in sequence that should get
      this newly decrypted or encrypted cell.
    - Call append_cell_to_ciruit_queue(), which will queue the
      cell on the circuit's cell queue and do some housekeeping.

Note that there's exactly one place in the entire code where
relay_crypt() is called, marked [**] above.  To get things to
parallelize effectively, I think you'd want to replace the call to
relay_crypt, and have it instead place the cell in a work queue for
the worker threads to process.  When the worker threads were done with
one or more cell, you'd need a way for them to tell the main thread
about them, and which circuit's cell queue to put them in.

Alternatively, you could give each circuit _two_ cell queues for each
direction: one for cells that need to be crypted, and one for cells
that are already crypted and are ready to transmit.  All you'd need to
tell the workers about is which circuits have to be processed; all
you'd need to tell the main thread about is which circuits now have
more data.  I like this approach a little better because it doesn't
require quite so much bookkeeping to tell the workers about what keys
to use to crypt what, or to tell the main thread where to put cells.

It will definitely take some thinking to figure out exactly what kind
of locking and notification mechanisms you'd want to use here.  If you
have any questions about waking the main thread, or what kind of
integrity Tor's data structures need, just ask.

This sounds like a really fun project in any case, and one I'd very
much like to include in the next Tor release.  What timeframe will you
be doing this on?

Many thanks,
-- 
Nick