[tor-dev] Request for comments on notes on parallel cell crypto

Mon Nov 19 22:06:58 UTC 2012

On Sun, Nov 18, 2012 at 7:43 PM, Andrea Shepard <andrea at torproject.org> wrote:
> I've made some notes on parallelizing cell crypto on relays which I've
> attached to this mail and added to the wiki page on this [1], which I
> would like comment on.  Particularly, I want to resolve the following
> points:
>
>  * Should we operate on cell_t or packed_cell_t?

After reading your comments, I don't feel strongly that packed_cell_t
is correct.  Either would probably be okay.  I think I might have said
"packed_cell_t" on the grounds that we already have a specialized
allocator for them, whereas we don't currently heap-allocate as many
cell_t as we would need for this.  I might also have said
"packed_cell_t" on the theory that copying data from cell_t to
packed_cell_t would waste cycles, but I'm not sure that's significant
here.  Or maybe I was thinking of the amount of space wasted in
packed_cell_t vs cell_t, and worried about the number of queued cells
we're likely to have at once.

For queueing the cell_ts, and where to put a next pointer: I *think*
(but can't be sure) that most of our cell_t objects are currently
short-lived, and most of our long-lived cell objects are currently
packed_cell_t.  So adding fields to cell_t wouldn't hurt much, since
they don't currently represent very much of our total allocation, I
think.  (We should double-check that!)

>  * Is it acceptable for the main thread to call functions to
>    manipulate job-dispatching data structures that need to
>    acquire a lock if we can be sure everything that ever
>    holds the lock only does so very briefly to manipulate
>    those structures and will never block while holding it,
>    or do we need to design something that's lock-free for
>    the main thread?

I don't think the main thread needs to be lock-free so long as it
won't often block for too long.  I was initially planning to have
locks protect the work-queues initially, and only substitute in a
lock-free work-queue implementation if it turned out to matter.

(How long the lock is held by the main thread is less important for
performance IIUC than whether the main thread will spend much time
blocked waiting for the lock, or how much lock contention there will
be.)

>  * The functions that currently call relay_crypt()/
>    relay_crypt_one_payload() will have to be split into
>    two parts, one that queues a cell to be crypted and
>    another that handles the finished, crypted cell.
>    How will the worker threads cause the second part to
>    be called in the main thread?  (NickM, there's probably
>    some libevent voodoo I don't know about yet for this?)

So, I'd suggest the we start with a work queue each way: one queue (or
set of queues) to enqueue stuff for worker threads, and one to enqueue
the crypted stuff for the main cell to process it.

That said, we'll need a way to alert the main thread "Hey, you have
more data on your queue!"  The easiest way to do that is, using
Libevent 2.0 or later, to just call event_active() on a "struct event"
whose callback processes the data.  (This needs to require Libevent
2.0 or later, since Libevent 1.4 and earlier didn't let you do this
threadsafely.)

The next easiest alternative would be to construct a socketpair, have
the main thread listen on the read side of it (using a read event),
and have the worker thread write a byte to the socketpair whenever the
queue went from "observed" .

There are faster ways to do all of the above, and more ways we could
optimize libevent to help with our particular case, but for now I'll
suggest we ignore them: getting even one extra core to do crypto with
will more than pay for all the overhead in a v1 implementation.  We
can find the painful points and optimize them later -- and waiting to
do so will ensure we don't waste time optimizing outside of critical
paths.  In fact, we should probably be doing this for the rest of the
design: looking to see if there are cases where even if we think we
might need to get complex eventually, we can still do them simply at
first.

(For the case of sending work to the worker threads, a
mutex/condition-var pair will do.  Windows Vista added real condition
variables, and there's plenty of compat code around to fake them with
earlier Windowses.)

I suggest that for the first implementation of this, we can just make
the multithreaded crypto stuff require Libevent 2.0 or later.  We'll
want a compile option to turn it off for old OpenBSDs anyway.

Turning multithreaded cypto off should be easy, right?  Instead of
"Thread1: enqueue. Thread2: unqueue, encrypt, enqueue. Thread1:
unqueue, process", you just replace the first "enqueue" operation with
"encrypt and process" -- perhaps making a stub relaycrypt_dispatcher_t
that doesn't actually dispatch anything, but just does the crypto
operation immediately.  This might even be a good intermediate step
while refactoring the crypto.

> Once these are resolved I'll be making a branch for this and
> writing the headers with proposed data structures and interfaces
> for review.

Cool; thank you!

Another observation about your design sketch: If we implement this
nicely, I think it will be possible to stress-test it and load-test
the relaycrypt_dispatcher_t part of it without actually requiring the
rest of Tor to be running.  IOW, we could have a performance-test for
this component that didn't require any actual Tor traffic to be
happening, or circuits to exist.  IMO that would be a pretty good
idea, since it would help us optimize as needed, and it would
encourage loose coupling.

cheers,
--
Nick