Tor multithreaded crypto

Nick Mathewson nickm at
Fri Sep 3 19:45:10 UTC 2010

[Replying to or-dev with permission.  George is talking about the
draft ideas for multithreading and crypto as discussed on the wiki at
.  Please read that before commenting, if you can.]

On Thu, Sep 2, 2010 at 3:12 PM, George Kadianakis <desnacked at> wrote:
> Greetings Nick,
> I have some questions on the multithreading implementation.
> First of all, is it possible to use the cpuworker (obviously, with
> alterations to fill our needs) infrastructure for this?
> If we rename cpuworker_main() to cpuworker_onionskin() and generalize
> a bit the whole thing, we could add a
> assign_relay_crypt_to_cpuworker() which would get called from
> circuit_package_relay_cell() and relay_crypt() and manage the whole
> multithreaded AES crypt deal.

There are a few problems with the way cpuworkers handle communication
now that might make them a poor fit for AES crypto.  I'm not sure
whether the best move is to make them a better fit, or to just build
something else that fits to begin with.

The issues with using cpuworkers for AES crypto are:
  1) cpuworkers assume that they can run in a multiprocess or
multithreaded mode.  Our AES stuff really wants to happen in-process.
  2) Currently, all work is passed to cpuworkers over one half of a
socketpair, and their answers are passed back over the other half.
This is okay for public-key data structures, where the expense of the
crypto outweighs the cost of passing the data around, but for
stream-cipher stuff, we don't want to copy our data more than
necessary -- and copying it to and from kernel-space *twice* for every
crypto operation seems like it would probably hurt a lot.
  3) Cpuworkers have no notion of work priority.
  4) Cpuworkers are balanced by the main thread/process, which only
hands work to non-busy cpuworkers, and queues it otherwise.  This
isn't really an efficient way to assign work to workers.
  5) Even if we relax issue 1 above (so that we don't have to pass
data to cpuworkers over socketpairs) so we can relax assumption 2, we
still have cpuworkers using socketpairs as their
notification/synchronization mechanism.  That's not good either; we
can probably do an order of magnitude faster at least if we build
something using real thread synchronization primitives.

(At this point, somebody will point out that inter-process shared
memory exists.  Yes indeed it does.  Let's not got there. ;) )

> We need to create, like you said, two per-circuit cell queue
> structures, let's call them unencrypted_cell_queue_{in,out} for now
> which will mimic cell_queue_t, and possibly a
> circuit_relay_crypt_queue_t structure à la mimic onion_queue_t that
> will hold all circuits that contain unencrypted cells.
>> We will want some kind of a work-queue implementation for passing info
>> to the worker threads.
> What info should be passed to the worker threads? My naive mind says
> that we need to basically pass the arguments of
> relay_crypt_one_payload() which are:
> "crypto_cipher_env_t *cipher, char *in, int encrypt_mode"
> can't these be passed through pthread_create() in the same way that
> spawn_func() and tor_pthread_helper_fn() are currently doing it?

No, you wouldn't want to do it like that: pthread_create() is called
only once per thread, and we sure don't want to create one thread per
cell, or even one thread per circuit.  We want to get closer to O(1)
cell per CPU.  Instead, you want to put this (or something like it,
maybe at a higher level) on a work queue, then have the threads pull
off pieces of work one at a time.

For instance, you could stick individual cells on a work queue, but
then you'd have to be careful about making sure they wound up
in-order.  Alternatively, you could stick "circuits in need of getting
some cells en/decrypted" on the work queue, and have the worker
threads just encrypt cells from them in-order.  Or maybe there's some
appropriate level in between.

>> We need to keep multiple workers from messing with the same circuit at once.
> Indeed, and as you may have noticed, I'm just stating obvious points
> in my mail without touching the hot topics, like points were locking
> should be implemented.
> I haven't delved enough into concurrent programming to be able to talk
> about this yet, but I guess/hope I'll clear the matter up soon :)

The way to avoid craziness here is probably to work out the smallest
amount of info that worker threads really need to touch, and lock that

>> We don't necessarily want to crypt all circuits' cells with the same
>> priority. Rather, we will probably want to use similar calculations to
>> those used by the circuit priority logic, maybe.
> I guess this is a matter of implementing some sort of scheduling (like
> the circuit priority logic you mentioned) in our
> process_pending_relay_crypto_task() function; no?

Well, "maybe".  It could be that we'll be better off just using
round-robin here, or partitioning circuits into two or three groups,
or something.  The overhead of the pqueue code that we use for circuit
priorities is definitely a win when it comes to dividing up or-conn
bandwidth (which is comparatively expensive), but it might not be such
a huge win for dividing up CPU.

>> We should design our data structures carefully and avoid like the
>> plague anything that will force us to iterate over all circuits, or
>> over all connections. That's really slow.
> I know I'm thinking freakingly short scale at the moment, but I don't
> see this happening in our utopian implementation.
>> Alerting the main thread is a little nontrivial with pre-2.0 Libevent,
>> but we can always fake it with a socketpair.
> You mean alerting circuit_package_relay_cell() or relay_crypt() that
> their cells got encrypted? How is this usually done?

Well, you'd have another queue of work that needs to get done in the
main thread; have the workers add to it; have them alert the main
thread by waking it up whenever the queue's state goes from "empty" to
"nonempty".  The trick here is that the normal way of doing this
(using condition variables to build a work queue) doesn't work so well
since the main thread is using a select/poll/kqueue/epoll/whatever
event wakeup mechanism.  So we'll have to fake it, either with
Libevent's normal wakeup mechanisms (if we have Libevent 2.0), or via
a socketpair (if we don't).

Anyways, I hope this helps.  I'll try to come up with more
implementation detail, and answers to some of the coding questions for
this some time today or over the weekend.


More information about the tor-dev mailing list