Title: Multithreaded cryptographic scheme for Tor relay cells
Author: David Goulet
Created: 18 September 2012
Status: Open

   The master bug for this proposal is #1749.

Overview:

   This document proposes a multithreaded design for the cryptographic
   jobs done on cells. The goal is to take advantage of the SMP machines,
   especially the powerful servers, that run Tor exit nodes nowadays.
   Moreover, almost all OSes running Tor nodes today support
   multithreading, which we should use to increase Tor's performance.

   TODO: OUTLINE

1. Background

1.1 Overview of current cell crypto

   As stated in [WikiThreadCrypto], three main kinds of crypto occur:

     1) Onion skin processing (already parallelized on the server side).
     2) SSL handshakes and crypto.
     3) AES-crypto on cells.

   This proposal focuses on the third kind, the direct AES cryptographic
   jobs on cells (encryption and decryption).

   The relay cell crypto is done in relay_crypt_one_payload(), which gets
   called in a few places, all of which are fortunately in relay.c,
   making things easier to change.

     * circuit_package_relay_cell() gets called 1..N times for a cell
       that we're packaging from an edge connection and about to stick
       on a cell queue.

     * relay_crypt(...): we just got a relay cell on a circuit, and we
       are about to decrypt it, see whether it should get handled by us,
       and maybe pass it on or maybe process it. If we handle it, we are
       going to pass it to connection_edge_process_relay_cell(...). If
       we pass it on, we are going to append it to a cell_queue_t. This
       function is called by circuit_receive_relay_cell(...), which is
       used when we just received a relay cell on a circuit.

   For pure relays, the relay_crypt() case dominates. For exit nodes,
   the circuit_package_relay_cell() case matters too.

   The whole point of using a new threading model is to offload the
   crypto part from the main loop and, as stated, take advantage of SMP
   systems.

2. Design

   Since cell order matters on a circuit, it makes sense to create two
   new priority queues in the circuit:

     * Crypto cell queue
         The cell has NOT yet been through the encryption/decryption
         step.

     * Transport cell queue
         The cell has passed through the crypto cell queue and is ready
         to be sent.

   The following schema shows the cell movement once a cell arrives on
   circuit A, where Cq is the crypto queue and Tq is the transport
   queue. This is as simple as it gets. The tricky and fun part of this
   scheme is described in detail in section 3, threading and
   synchronization.

                                          circuit A
                             +----------------------------------+
         +--------+  enqueue |  +------+------+      +-------+  |
         | cell c |--------->|  | Cq   |      | ...  |   c   |  |
         +--------+          |  +------+------+      +-------+  |
                             +----------------------------------+
                                              |
                                              | dispatch(Cq, Tq)
                                              |
                                Thread pool   v
                             +-------+-------+     +--------+--------+
                             |       |       | ... |        | circ A |
                             +-------+-------+     +--------+--------+
                                              |
                         exec crypto job() <--+
                                              |
                                              v
                             +----------------------------------+
         +--------+  dequeue |  +-------+      +------+------+  |
         | cell c |<---------|  |   c   | ...  |      | Tq   |  |
         +--------+          |  +-------+      +------+------+  |
                             +----------------------------------+
                                          circuit A

   It goes like this. Enqueue the cell c of circuit A on its crypto
   queue (Cq). Each circuit is dispatched to a thread pool, where it is
   assigned to an available thread that processes N cells (value TBD)
   from the Cq and enqueues them on the Tq once done. On each round, the
   main loop of Tor only needs to empty out the transport queue of each
   circuit; the rest is handled by the crypto thread pool. This thread
   pool model is described in detail in the next section.
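   To make the queue layout concrete, here is a minimal C sketch of the
   per-circuit dual-queue idea described above. Every name in it (the
   structs, the fields, the crypto_queue_append() helper) is an
   illustrative stand-in rather than the real Tor data structures; the
   only design assumption it encodes is that the dispatcher in the main
   loop is the sole writer of Cq, as required by section 3.1.1 below.

     #include <pthread.h>
     #include <stddef.h>

     typedef struct cell_s {
       unsigned char payload[509];    /* 509 == CELL_PAYLOAD_SIZE */
       struct cell_s *next;
     } cell_t;

     typedef struct cell_queue_s {
       cell_t *head, *tail;
       size_t n_cells;                /* number of queued cells */
       pthread_mutex_t lock;          /* one lock per queue, section 3.2 */
     } cell_queue_t;

     typedef struct circuit_s {
       cell_queue_t crypto_queue;     /* Cq: awaiting AES processing */
       cell_queue_t transport_queue;  /* Tq: ready to be sent */
       double ewma_value;             /* activity level, section 3.1.2 */
     } circuit_t;

     /* Dispatcher side: append a freshly received cell to the crypto
      * queue of its circuit. The main loop is the only caller, hence
      * the one and only writer of Cq; the lock only guards against the
      * worker thread dequeuing concurrently. */
     static void
     crypto_queue_append(circuit_t *circ, cell_t *cell)
     {
       cell_queue_t *cq = &circ->crypto_queue;
       pthread_mutex_lock(&cq->lock);
       cell->next = NULL;
       if (cq->tail)
         cq->tail->next = cell;
       else
         cq->head = cell;
       cq->tail = cell;
       cq->n_cells++;
       pthread_mutex_unlock(&cq->lock);
     }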
3. Threading and Synchronization

   The basic idea is to dispatch the relay cells' cryptographic jobs to
   a thread pool upon reception, to offload the Tor main loop. The
   thread pool should handle circuits rather than cells: reordering
   cells after encryption is an unnecessary step that we absolutely want
   to avoid, for both complexity and latency reasons.

   The next sections describe the threading model and the dynamics
   between the dispatcher and the pool (worker threads).

3.1 Threading model

   This section first explains how the dispatch mechanism works, then
   describes the worker threads of the pool.

3.1.1 Dispatcher

   The dispatch mechanism is part of the main loop and has two roles.

   First, it appends each received cell to the crypto queue of its
   circuit. We have to make sure that the dispatcher is the one and only
   writer of that queue, so we can avoid complex synchronization between
   threads and keep the design simple for this proposal. New circuits
   are also managed by the dispatcher, which assigns each of them to a
   worker in the thread pool.

   The second role is to keep a priority queue of all worker threads,
   ordered by the number of cells they are currently handling across all
   of their crypto queues. After adding a cell or a new circuit to a
   thread, this queue is rebalanced. The rebalancing process is tricky,
   since we might need to steal circuits from some threads and push them
   onto others. This process might be hard on lock contention, but for
   this first proposal it should be good enough. Section 3.3 describes
   the priority queue data structure proposed for the dispatcher.

3.1.2 Worker thread

   The worker threads are much simpler. Each of them contains a priority
   queue of circuits ordered by EWMA value, so the quietest circuit gets
   handled first, which is an important design requirement for the
   current state of Tor. As proposed by Nick Mathewson, it might also be
   okay to do the EWMA bookkeeping strictly when adding cells to the
   transport queue (outbound). Tests should be done to see whether that
   makes things faster.

   Once a thread gets CPU time, it chooses N cells from the crypto
   priority queue, depending on each circuit's load, processes them one
   at a time, and moves each to the transport queue (Tq) of the
   corresponding circuit once done. The value of N has to be determined
   through testing. Another possibility is to make this value vary with
   the rate of the relay or the CPU power available.

   Since cell insertion is done by the dispatcher, and we want to put as
   little work as possible on it, the priority queue balancing operation
   should be done in the worker thread. So, after each round of N cells
   of work, the queue should be rebalanced, since the dispatcher could
   have added more. This can be detected with a simple flag that the
   dispatcher sets in the priority queue.

3.2 Locking scheme

   For this first draft I will not propose a lockless mechanism, since
   that kind of change is nontrivial and maybe unnecessary for a first
   implementation. So the locking scheme is a simple lock on each
   proposed queue. We can improve it over time, and also during the
   stabilization process.

3.3 Data structures

   For the priority queue, the minimum value of the queue is the most
   important node, since we have to process the quietest circuit first,
   as ordered by EWMA value. A binary heap keyed on the minimum (a
   min-heap) makes sense, since the minimum value can be looked up in
   O(1). This structure is simple and well known, and can be replaced
   afterwards by all sorts of funky structures such as the Fibonacci
   heap or the Brodal queue [BrodalQueue].
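   As an illustration of sections 3.1.2 through 3.3, here is a hedged C
   sketch of one worker round, reusing the illustrative circuit_t and
   cell_t types from the sketch in section 2. The helpers
   (minheap_pop_min(), minheap_insert(), minheap_rebalance(),
   crypto_queue_pop(), transport_queue_append(), process_cell()) and the
   N_CELLS_PER_ROUND constant are hypothetical glue, not existing Tor
   code; in a real implementation the crypto step would end up calling
   relay_crypt_one_payload().

     #define N_CELLS_PER_ROUND 16  /* the "N" of section 3.1.2; TBD */

     typedef struct worker_s {
       /* Hypothetical binary min-heap of this worker's circuits, keyed
        * on circuit_t.ewma_value, so the quietest circuit sits at the
        * root and can be peeked in O(1) (section 3.3). */
       circuit_t **heap;
       size_t heap_len;
       /* Set by the dispatcher when it has added cells or circuits and
        * the heap needs rebalancing (sections 3.1.1 and 3.1.2). */
       int rebalance_flag;
     } worker_t;

     /* Hypothetical helpers, declared here only for the sketch. */
     circuit_t *minheap_pop_min(worker_t *w);
     void minheap_insert(worker_t *w, circuit_t *circ);
     void minheap_rebalance(worker_t *w);
     cell_t *crypto_queue_pop(circuit_t *circ);
     void transport_queue_append(circuit_t *circ, cell_t *cell);
     void process_cell(circuit_t *circ, cell_t *cell);  /* AES step */

     /* One round of work for a worker thread. */
     static void
     worker_round(worker_t *w)
     {
       circuit_t *circ = minheap_pop_min(w);  /* quietest circuit first */
       if (!circ)
         return;
       for (int i = 0; i < N_CELLS_PER_ROUND; i++) {
         cell_t *cell = crypto_queue_pop(circ);  /* locks Cq internally */
         if (!cell)
           break;
         process_cell(circ, cell);              /* encrypt or decrypt */
         transport_queue_append(circ, cell);    /* locks Tq internally */
       }
       minheap_insert(w, circ);  /* reinsert with its updated EWMA */
       if (w->rebalance_flag)    /* dispatcher added work meanwhile */
         minheap_rebalance(w);
     }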
4. Conclusion

   In a nutshell, this proposal is probably not the most efficient way
   to multithread crypto jobs in Tor, but its goal is to provide a
   simple and scalable implementation that can be improved over time
   with testing, measurements, and better algorithms, such as more
   efficient data structures or lockless mechanisms.

5. Acknowledgments

   TODO

References:
----------------

   [WikiThreadCrypto] Multithreaded crypto on Tor
     https://trac.torproject.org/projects/tor/wiki/org/projects/Tor/MultithreadedCrypto

   [BrodalQueue] Gerth Stølting Brodal (1996). "Worst-Case Efficient
     Priority Queues".
     https://users.info.unicaen.fr/~karczma/TEACH/Doc/brodal_okasaki.pdf