There are early plans to distribute crypto operations across multiple cores [https://trac.torproject.org/projects/tor/ticket/1749], but there might be a better way.
(I registered, but I couldn't find a way to annotate the ticket, so I'm emailing for now)
The ticket gives the goal as saturating the available bandwidth (by using all the cores as efficiently as possible).
I don't understand why a relay needs to have a "main thread". Network traffic arrives as an async operation and can be sent back out asynchronously. So a final strategy shouldn't have a central thread. The main thread might still be needed for startup, runtime adjustment, and system upkeep, but not for the core network-crypto processing; that should never need to touch the main thread.
The current proposal is about multi-threading crypto operations; let's call that "A) Speed - Speeding up processing of a single cell". Instead, I propose "B) Concurrency - Restructuring so multiple cells can be processed concurrently".
A cell of data should arrive via an I/O completion thread on an arbitrary CPU core, have its crypto transformation applied on that same core, then be dispatched onward over the network. This seems quite a simple approach, and I would expect the crypto code to remain the same "single-threaded" implementation.
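As a rough illustration of approach [B], here is a minimal Python sketch in which each cell is handled start to finish by a single worker. Everything in it is an assumption made for illustration: `transform_cell` is a single-byte XOR standing in for the real per-cell crypto (it is not cryptography), `relay_cells` and the pool size are hypothetical, and Python threads only show the structure here, not real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def transform_cell(cell: bytes, key: int) -> bytes:
    """Stand-in for the per-cell crypto step (single-byte XOR, NOT real crypto)."""
    return bytes(b ^ key for b in cell)


def relay_cells(cells, key, workers=4):
    """Hand each cell to one worker, which runs the whole transform for that cell."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in input order, even though the work
        # itself may run on different workers at the same time.
        return list(pool.map(lambda c: transform_cell(c, key), cells))


# Sixteen small dummy cells; real cells would come from the network.
cells = [bytes([i]) * 8 for i in range(16)]
out = relay_cells(cells, key=0x5A)
```

The transform itself stays ordinary single-threaded code; only the dispatch around it is concurrent.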
Approach [A] will have diminishing returns as the number of cores increases. You can only split a single cell's work so far before you're encrypting one byte per CPU core. However, with approach [B], if you have millions of CPU cores (as an extreme) you can be processing millions of cells concurrently. Therefore, I believe approach [B] would be more scalable.
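The scaling argument can be put into a small back-of-the-envelope calculation. The numbers are assumptions for illustration only (a 512-byte cell and a 16-byte minimum chunk), the function names are hypothetical, and the sketch ignores any ordering constraints between cells.

```python
import math


def max_useful_cores_split(cell_bytes: int, min_chunk_bytes: int) -> int:
    """Approach [A]: one cell yields at most this many independent chunks."""
    return math.ceil(cell_bytes / min_chunk_bytes)


def max_useful_cores_concurrent(cells_in_flight: int) -> int:
    """Approach [B]: one core can be kept busy per cell in flight."""
    return cells_in_flight


CELL_BYTES = 512  # assumed cell size, for illustration

print(max_useful_cores_split(CELL_BYTES, 16))  # 32
print(max_useful_cores_split(CELL_BYTES, 1))   # 512
print(max_useful_cores_concurrent(1_000_000))  # 1000000
```

Under these assumptions, approach [A] is capped by the size of one cell, while the cap for approach [B] grows with the amount of traffic.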
What do you think?
Regards,
On Wed, Jan 09, 2019 at 08:42:18PM +1100, Todd Hubers wrote:
[...]
You'll have trouble if cells *on the same circuit* try to be processed in parallel on different cores, at least with the current circuit-level crypto. But, once circuits are established, handing each circuit to a different thread/core (or a more clever worker structure) is something that I think at least broadly makes sense, and indeed something I have been proposing to have my students work on.
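One way to picture "handing each circuit to a different thread/core" is to pin every circuit to a fixed worker, so that cells on the same circuit are always processed one after another in arrival order. The following Python sketch is only an assumption about how that could look: `CircuitWorkers`, the modulo pinning rule, and the per-circuit sequence counter (a stand-in for per-circuit cipher state) are all hypothetical.

```python
import queue
import threading


class CircuitWorkers:
    """Pins each circuit to one worker thread with its own FIFO queue."""

    def __init__(self, n_workers: int):
        self.queues = [queue.Queue() for _ in range(n_workers)]
        self.outputs = [[] for _ in range(n_workers)]
        self.threads = [
            threading.Thread(target=self._run, args=(q, out))
            for q, out in zip(self.queues, self.outputs)
        ]
        for t in self.threads:
            t.start()

    def submit(self, circuit_id: int, cell: bytes) -> None:
        # The same circuit always maps to the same worker.
        self.queues[circuit_id % len(self.queues)].put((circuit_id, cell))

    def _run(self, q, out):
        next_seq = {}  # per-circuit state, touched by this worker only
        while True:
            item = q.get()
            if item is None:  # shutdown marker
                return
            circuit_id, cell = item
            seq = next_seq.get(circuit_id, 0)
            next_seq[circuit_id] = seq + 1
            out.append((circuit_id, seq, cell))

    def close(self) -> None:
        for q in self.queues:
            q.put(None)
        for t in self.threads:
            t.join()


workers = CircuitWorkers(n_workers=3)
for n in range(5):
    for circuit_id in (7, 8, 9):
        workers.submit(circuit_id, f"c{circuit_id}-cell{n}".encode())
workers.close()
processed = [entry for out in workers.outputs for entry in out]
```

Because each circuit's state is only ever touched by one worker, no locking is needed around it in this sketch; the cost is that one very busy circuit cannot use more than one core.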
On Wed, Jan 09, 2019 at 08:17:15AM -0500, Ian Goldberg wrote:
[...]
(Of course, this is only even relevant for the very highest-bandwidth nodes; my own node, for example, running on 5-year-old hardware with no special configuration, was pushing 400 Mbps last month, with one core at 80%, one at 11%, one at 6%, and the rest trivially small.)
Understood and agreed. I suspected there would be circuit state to maintain. As you say, concurrent cells on the same circuit should be queued or thread-locked. I suspect thread-locking will be simple enough to be the best approach.
And given it's only a problem for the biggest nodes, a design should be chosen that is efficient and focuses on achieving the goals of such users.
I believe this is that efficient and focused design.
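To make the thread-locking idea concrete, here is a minimal Python sketch with one lock per circuit. It is an illustration under assumptions, not Tor's design: the `Circuit` class is hypothetical, and the `cells_processed` counter is a stand-in for whatever per-circuit cipher state must not be updated by two cores at once.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class Circuit:
    """Per-circuit state guarded by its own lock."""

    def __init__(self, circuit_id: int):
        self.circuit_id = circuit_id
        self.lock = threading.Lock()
        self.cells_processed = 0  # stand-in for per-circuit cipher state

    def process(self, cell: bytes) -> int:
        # Only one thread at a time may advance this circuit's state;
        # other circuits are unaffected because each has its own lock.
        with self.lock:
            seq = self.cells_processed
            self.cells_processed = seq + 1
            return seq


circuits = {cid: Circuit(cid) for cid in (1, 2)}
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [
        pool.submit(circuits[cid].process, b"cell")
        for cid in (1, 2)
        for _ in range(1000)
    ]
    seqs = [f.result() for f in futures]
```

One caveat worth keeping in mind: a lock on its own serializes access but does not guarantee that cells are handled in arrival order, so a real design would probably also need to preserve ordering, for example by queueing per circuit.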
On Thu, 10 Jan 2019 at 00:54, Ian Goldberg <iang@cs.uwaterloo.ca> wrote:
[...]
(Of course, this is only even relevant for the very highest-bandwidth nodes; my own node, for example, running on 5-year-old hardware with no special configuration, was pushing 400 Mbps last month, with one core at 80%, one at 11%, one at 6%, and the rest trivially small.)
--
Ian Goldberg
Professor and University Research Chair
Cheriton School of Computer Science
University of Waterloo