On Sun, Nov 18, 2012 at 7:43 PM, Andrea Shepard andrea@torproject.org wrote:
I've made some notes on parallelizing cell crypto on relays which I've attached to this mail and added to the wiki page on this [1], which I would like comment on. Particularly, I want to resolve the following points:
- Should we operate on cell_t or packed_cell_t?
After reading your comments, I don't feel strongly that packed_cell_t is correct. Either would probably be okay. I think I might have said "packed_cell_t" on the grounds that we already have a specialized allocator for them, whereas we don't currently heap-allocate as many cell_t as we would need for this. I might also have said "packed_cell_t" on the theory that copying data from cell_t to packed_cell_t would waste cycles, but I'm not sure that's significant here. Or maybe I was thinking of the amount of space wasted in packed_cell_t vs cell_t, and worried about the number of queued cells we're likely to have at once.
For queueing the cell_ts, and where to put a next pointer: I *think* (but can't be sure) that most of our cell_t objects are currently short-lived, and most of our long-lived cell objects are currently packed_cell_t. So adding fields to cell_t wouldn't hurt much, since they don't currently represent very much of our total allocation, I think. (We should double-check that!)
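Concretely, the change I have in mind is just an intrusive link field, something like this (illustrative only; this is not the real cell_t definition, and the exact fields are from memory):

    /* Illustrative sketch -- not the actual Tor definition of cell_t. */
    typedef struct cell_t {
      uint32_t circ_id;                     /* circuit this cell belongs to */
      uint8_t command;                      /* CELL_RELAY, CELL_DESTROY, ... */
      uint8_t payload[CELL_PAYLOAD_SIZE];   /* relay payload bytes */
      struct cell_t *next;                  /* NEW: lets us queue cells for the
                                             * workers with no extra allocation */
    } cell_t;

With a next pointer like that, a queue is just a head/tail pair, and no per-cell queue node ever needs to be allocated.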
- Is it acceptable for the main thread to call functions to manipulate job-dispatching data structures that need to acquire a lock if we can be sure everything that ever holds the lock only does so very briefly to manipulate those structures and will never block while holding it, or do we need to design something that's lock-free for the main thread?
I don't think the main thread needs to be lock-free so long as it won't often block for too long. I was planning to have locks protect the work-queues initially, and only substitute in a lock-free work-queue implementation if it turned out to matter.
(How long the lock is held by the main thread is less important for performance IIUC than whether the main thread will spend much time blocked waiting for the lock, or how much lock contention there will be.)
- The functions that currently call relay_crypt()/ relay_crypt_one_payload() will have to be split into two parts, one that queues a cell to be crypted and another that handles the finished, crypted cell. How will the worker threads cause the second part to be called in the main thread? (NickM, there's probably some libevent voodoo I don't know about yet for this?)
So, I'd suggest that we start with a work queue each way: one queue (or set of queues) to enqueue stuff for the worker threads, and one to enqueue the crypted stuff for the main thread to process.
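Roughly, I'm picturing something like this (all names are placeholders for whatever interface Andrea ends up proposing; I'm assuming Tor's tor_mutex_t/tor_cond_t wrappers):

    /* Sketch only -- placeholder names, not a proposed final interface. */
    typedef struct relaycrypt_queue_t {
      tor_mutex_t *lock;                /* held only briefly to link/unlink cells */
      cell_t *head, *tail;              /* singly linked via cell_t.next */
    } relaycrypt_queue_t;

    typedef struct relaycrypt_dispatcher_t {
      relaycrypt_queue_t to_workers;    /* main thread enqueues; workers drain */
      relaycrypt_queue_t to_main;       /* workers enqueue; main thread drains */
      tor_cond_t *work_available;       /* workers sleep on this when idle */
      int n_threads;                    /* 0 means "no worker threads" */
    } relaycrypt_dispatcher_t;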
That said, we'll need a way to alert the main thread "Hey, you have more data on your queue!" The easiest way to do that, using Libevent 2.0 or later, is to just call event_active() on a "struct event" whose callback processes the data. (This approach requires Libevent 2.0 or later, since Libevent 1.4 and earlier didn't let you do this threadsafely.)
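Something like this is what I have in mind (Libevent 2.0 API; the callback body and the queue it drains are placeholders):

    #include <event2/event.h>
    #include <event2/thread.h>

    static struct event_base *the_base;
    static struct event *crypted_cells_ev;

    /* Runs in the main thread's event loop; drains the workers->main queue. */
    static void
    crypted_cells_cb(evutil_socket_t fd, short what, void *arg)
    {
      (void)fd; (void)what; (void)arg;
      /* ... pop finished cells off the queue and process them ... */
    }

    void
    main_thread_setup(void)
    {
      evthread_use_pthreads();   /* must precede event_base_new() so that
                                  * event_active() is safe from other threads;
                                  * on Windows, evthread_use_windows_threads() */
      the_base = event_base_new();
      crypted_cells_ev = event_new(the_base, -1, 0, crypted_cells_cb, NULL);
    }

    /* Called from a worker thread after it pushes finished cells. */
    void
    notify_main_thread(void)
    {
      event_active(crypted_cells_ev, EV_READ, 1);
    }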
The next easiest alternative would be to construct a socketpair, have the main thread listen on the read side of it (using a read event), and have the worker thread write a byte to the socketpair whenever the queue went from empty to nonempty.
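Sketched out, that looks roughly like this (old Libevent 1.4-style API, error handling omitted; real code in Tor would go through tor_socketpair() and the usual compat layers):

    #include <sys/socket.h>
    #include <unistd.h>
    #include <event.h>

    static int wake_fds[2];        /* [0]: main thread reads; [1]: workers write */
    static struct event wake_ev;

    /* Main-thread read callback: discard the wakeup byte(s), then drain the
     * workers->main queue. */
    static void
    wakeup_cb(int fd, short what, void *arg)
    {
      char buf[64];
      ssize_t n = read(fd, buf, sizeof(buf));
      (void)n; (void)what; (void)arg;  /* level-triggered: leftovers re-fire us */
      /* ... pop finished cells off the queue and process them ... */
    }

    void
    setup_wakeup(void)
    {
      socketpair(AF_UNIX, SOCK_STREAM, 0, wake_fds);
      event_set(&wake_ev, wake_fds[0], EV_READ|EV_PERSIST, wakeup_cb, NULL);
      event_add(&wake_ev, NULL);
    }

    /* Worker thread: call only when the queue goes from empty to nonempty. */
    void
    worker_wake_main_thread(void)
    {
      char byte = 0;
      write(wake_fds[1], &byte, 1);
    }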
There are faster ways to do all of the above, and more ways we could optimize libevent to help with our particular case, but for now I'll suggest we ignore them: getting even one extra core to do crypto will more than pay for all the overhead in a v1 implementation. We can find the painful points and optimize them later -- and waiting to do so will ensure we don't waste time optimizing outside of critical paths. In fact, we should probably take that approach for the rest of the design: look for cases where, even if we think we might eventually need something complex, we can still do it simply at first.
(For the case of sending work to the worker threads, a mutex/condition-var pair will do. Windows Vista added real condition variables, and there's plenty of compat code around to fake them with earlier Windowses.)
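The worker side of that is the usual pattern -- shown here with raw pthreads for clarity, though real code would go through Tor's portability wrappers so it also builds on Windows; the queue and crypto helpers are placeholders:

    #include <pthread.h>

    /* Placeholder queue type: lock + condition variable + list of cells. */
    typedef struct workqueue_t {
      pthread_mutex_t lock;
      pthread_cond_t cond;
      cell_t *head, *tail;
      int shutting_down;
    } workqueue_t;

    void *
    crypto_worker_main(void *arg)
    {
      workqueue_t *q = arg;
      for (;;) {
        pthread_mutex_lock(&q->lock);
        while (!q->head && !q->shutting_down)
          pthread_cond_wait(&q->cond, &q->lock);    /* sleep until work arrives */
        cell_t *cell = q->shutting_down ? NULL : workqueue_pop(q);
        pthread_mutex_unlock(&q->lock);
        if (!cell)
          return NULL;                              /* clean shutdown */
        relay_crypt_cell(cell);          /* the AES work, off the main thread */
        hand_back_to_main_thread(cell);  /* enqueue on to_main and wake it up */
      }
    }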
I suggest that for the first implementation of this, we can just make the multithreaded crypto stuff require Libevent 2.0 or later. We'll want a compile option to turn it off for old OpenBSDs anyway.
Turning multithreaded crypto off should be easy, right? Instead of "Thread1: enqueue. Thread2: unqueue, encrypt, enqueue. Thread1: unqueue, process", you just replace the first "enqueue" operation with "encrypt and process" -- perhaps making a stub relaycrypt_dispatcher_t that doesn't actually dispatch anything, but just does the crypto operation immediately. This might even be a good intermediate step while refactoring the crypto.
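In sketch form (again, placeholder names throughout):

    /* Sketch of the single-threaded stub: same entry point, but "dispatch"
     * just does the crypto inline and hands the result straight back. */
    void
    relaycrypt_dispatch(relaycrypt_dispatcher_t *d, cell_t *cell)
    {
      if (d->n_threads == 0) {
        relay_crypt_cell(cell);       /* encrypt/decrypt right here */
        process_crypted_cell(cell);   /* the "second part" of the split-up caller */
        return;
      }
      relaycrypt_queue_push(&d->to_workers, cell);  /* threaded path */
    }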
Once these are resolved I'll be making a branch for this and writing the headers with proposed data structures and interfaces for review.
Cool; thank you!
Another observation about your design sketch: If we implement this nicely, I think it will be possible to stress-test it and load-test the relaycrypt_dispatcher_t part of it without actually requiring the rest of Tor to be running. IOW, we could have a performance-test for this component that didn't require any actual Tor traffic to be happening, or circuits to exist. IMO that would be a pretty good idea, since it would help us optimize as needed, and it would encourage loose coupling.
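For instance, a standalone benchmark could look roughly like this (everything named here is hypothetical -- the point is just that no circuits or network I/O need to be involved):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    int
    main(int argc, char **argv)
    {
      int n_threads = (argc > 1) ? atoi(argv[1]) : 2;
      int n_cells   = (argc > 2) ? atoi(argv[2]) : 1000000;
      relaycrypt_dispatcher_t *d = relaycrypt_dispatcher_new(n_threads);
      struct timeval start, end;

      gettimeofday(&start, NULL);
      for (int i = 0; i < n_cells; ++i)
        relaycrypt_dispatch(d, make_dummy_cell());  /* fake cells, no circuits */
      relaycrypt_wait_until_done(d);                /* block until all crypted */
      gettimeofday(&end, NULL);

      printf("%d cells, %d threads: %.3f s\n", n_cells, n_threads,
             (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6);
      relaycrypt_dispatcher_free(d);
      return 0;
    }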
cheers, -- Nick
On Mon, Nov 19, 2012 at 05:06:58PM -0500, Nick Mathewson wrote:
On Sun, Nov 18, 2012 at 7:43 PM, Andrea Shepard andrea@torproject.org wrote:
I've made some notes on parallelizing cell crypto on relays which I've attached to this mail and added to the wiki page on this [1], which I would like comment on. Particularly, I want to resolve the following points:
- Should we operate on cell_t or packed_cell_t?
I seem to have missed the original email here. What is the link [1]?
Having just upgraded my exit node to a reasonable machine, I found that even though the AES should be super-fast (with AES-NI, etc.), the public-key ops are still killing the CPU. I split the node into two processes, but multithreading the public-key ops (to a configurable number of threads, like say 7) would appear to be a big bang-for-buck win.
- Ian
On Mon, Nov 19, 2012 at 5:18 PM, Ian Goldberg iang@cs.uwaterloo.ca wrote:
On Mon, Nov 19, 2012 at 05:06:58PM -0500, Nick Mathewson wrote:
On Sun, Nov 18, 2012 at 7:43 PM, Andrea Shepard andrea@torproject.org wrote:
I've made some notes on parallelizing cell crypto on relays which I've attached to this mail and added to the wiki page on this [1], which I would like comment on. Particularly, I want to resolve the following points:
- Should we operate on cell_t or packed_cell_t?
I seem to have missed the original email here. What is the link [1]?
[1] https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/Multithreaded...
Having just upgraded my exit node to a reasonable machine, I found that even though the AES should be super-fast (with AES-NI, etc.), the public-key ops are still killing the CPU. I split the node into two processes, but multithreading the public-key ops (to a configurable number of threads, like say 7) would appear to be a big bang-for-buck win.
Circuit-related public-key work should already be parallelized (configurably, via the NumCPUs option). Parallelizing most SSL public-key work will be a little trickier, though I guess we could farm off certificate verification and the public-key parts of the v3 handshake to cpuworkers too, with a nontrivial amount of effort.
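For reference, that's just a torrc setting; e.g. on a four-core machine:

    # torrc: use up to 4 workers for onionskin (circuit-extend
    # public-key) processing and similar parallelizable operations.
    NumCPUs 4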
We once had code that tried to track how many PK operations of each kind Tor does, for what purpose, so we could figure out which ones were costing us the most. We should try to get that working again. note_crypto_pk_op() is the core function in question -- for some incomprehensible reason, it would report its results via the directory port. Instead, it should probably use the heartbeat code or something.
Also on the crypto performance front, I'm hoping that enabling ECDHE ciphersuites for TLS will save some of the time we currently spend on PK, and that ntor with a good EC group will save some more.
On Mon, Nov 19, 2012 at 05:29:50PM -0500, Nick Mathewson wrote:
Circuit-related public-key work should already be parallelized (configurably, via the NumCPUs option).
Wow; I totally missed that. I've turned the node back to a single process with a 14MB limit, and set NumCPUs to 4. Let's see what happens.
Thanks!
- Ian