onion_router stress-testing...

Roger Dingledine arma at mit.edu
Mon Nov 25 06:59:33 UTC 2002

On Fri, Nov 22, 2002 at 03:13:59PM -0500, Bruce Montrose wrote:
> test#2 was only slightly worse than test#1 probably because privoxy does 
> data scrubbing and httpap did not.

httpap is obsolete (I've stopped maintaining it, and I know there were
bugs there when I abandoned it). So when doing timing tests you should use
privoxy -- but first go to http://config.privoxy.org/toggle?set=disable
to turn off all the data scrubbing.

> Both test#1 and test#2 perform horribly under great stress.

Yes, this makes sense. The problem is that we do too many public key
ops when we're being flooded with creates. That is, we use almost all of
each second processing create cells, so we don't process data cells very
often. Our congestion control algorithm makes it even worse, because
each circuit can push at most 1200 bytes before needing a sendme cell
to work its way back through the circuit.

I've raised the receive windows by a factor of 10, so now we can send
12000 bytes between sendmes. It won't help much, though (I don't think
it will hurt either), because nodes are simply too starved for time to
get to all of the data cells.

> Can you think of any instrumentation of the onion_router code that might be 
> used to give us information that might shed some light on what the hold up 
> is? I could rerun test #2 with this instrumentation in place to gather the data.

The 'holdup' is not going to be completely solved -- each 1024-bit RSA
decrypt takes 6-7ms on a 1GHz Athlon machine. So best-case we're looking
at a max of ~150 new circuits each second on such a machine. Onion
routing will never compare to straight web surfing in terms of number
of new connections we can handle each second.

Can you give me the details about the experiments you're running? How
many nodes are there, what sort of processors? Are there machines with
more than one node? How large are the files you're pulling with each
request? Can you do experiments where you ramp up file size or something
and have only a few new requests each second?

Right now I've got 4 nodes running on moria, and from playing with
httperf, it seems it can handle between 10 and 15 requests per second
sustained. Each request is processed by 2 or 3 nodes. Each onion takes
between 20 and 80ms to process.

I've implemented an "OnionsPerSecond" entry for config files. It currently
defaults to 50. I imagine if you tune this value well (e.g. try 5 to
start, and then see if you can raise it), the tor network should be
able to handle the increasing load from your tests. Performance will
stay about the same rather than going to nothing as load increases.

I also made nodes output a per-second summary of how long it took
processing each class of cell. So when I do a wget of a 500KB file,
output from one of the nodes (run at -l info) might look like this:

Nov 25 00:35:08.183 [info] Starting to process create cell.
Nov 25 00:35:08.202 [info] connection_twin_get_by_addr_port(): Found exact match.
Nov 25 00:35:08.202 [info] command_time_process_cell(): That call just took 18 ms.
Nov 25 00:35:08.229 [info] connection_send_connected(): passing back cell (aci 57523).
Nov 25 00:35:09.000 [info] At end of second:
Nov 25 00:35:09.000 [info] Create:    2 (18 ms)
Nov 25 00:35:09.000 [info] Data:      1631 (70 ms)
Nov 25 00:35:09.000 [info] Destroy:   0 (0 ms)
Nov 25 00:35:09.000 [info] Sendme:    7 (0 ms)
Nov 25 00:35:09.000 [info] Connected: 1 (0 ms)
Nov 25 00:35:10.039 [info] At end of second:
Nov 25 00:35:10.039 [info] Create:    0 (0 ms)
Nov 25 00:35:10.039 [info] Data:      1173 (50 ms)
Nov 25 00:35:10.039 [info] Destroy:   0 (0 ms)
Nov 25 00:35:10.039 [info] Sendme:    11 (0 ms)
Nov 25 00:35:10.039 [info] Connected: 0 (0 ms)
Nov 25 00:35:11.048 [info] At end of second:
Nov 25 00:35:11.048 [info] Create:    0 (0 ms)
Nov 25 00:35:11.048 [info] Data:      600 (25 ms)
Nov 25 00:35:11.048 [info] Destroy:   0 (0 ms)
Nov 25 00:35:11.049 [info] Sendme:    6 (0 ms)
Nov 25 00:35:11.049 [info] Connected: 0 (0 ms)
Nov 25 00:35:11.992 [info] connection_send_destroy(): Sending destroy (aci 57523).
Nov 25 00:35:12.057 [info] At end of second:
Nov 25 00:35:12.057 [info] Create:    0 (0 ms)
Nov 25 00:35:12.057 [info] Data:      1092 (47 ms)
Nov 25 00:35:12.057 [info] Destroy:   1 (0 ms)
Nov 25 00:35:12.057 [info] Sendme:    11 (0 ms)
Nov 25 00:35:12.057 [info] Connected: 0 (0 ms)

When I hammer the network, we see something like:

Nov 25 00:39:02.042 [info] Create:    2 (18 ms)
Nov 25 00:39:02.042 [info] Data:      2 (0 ms)
Nov 25 00:39:02.042 [info] Destroy:   0 (0 ms)
Nov 25 00:39:02.042 [info] Sendme:    0 (0 ms)
Nov 25 00:39:02.042 [info] Connected: 0 (0 ms)
Nov 25 00:39:03.139 [info] Create:    22 (984 ms)
Nov 25 00:39:03.139 [info] Data:      19 (33 ms)
Nov 25 00:39:03.140 [info] Destroy:   0 (0 ms)
Nov 25 00:39:03.140 [info] Sendme:    0 (0 ms)
Nov 25 00:39:03.140 [info] Connected: 1 (0 ms)
Nov 25 00:39:04.072 [info] Create:    24 (917 ms)
Nov 25 00:39:04.072 [info] Data:      50 (7 ms)
Nov 25 00:39:04.072 [info] Destroy:   1 (0 ms)
Nov 25 00:39:04.072 [info] Sendme:    0 (0 ms)
Nov 25 00:39:04.072 [info] Connected: 4 (0 ms)
Nov 25 00:39:05.031 [info] Create:    30 (596 ms)
Nov 25 00:39:05.031 [info] Data:      78 (13 ms)
Nov 25 00:39:05.031 [info] Destroy:   0 (0 ms)
Nov 25 00:39:05.031 [info] Sendme:    0 (0 ms)
Nov 25 00:39:05.031 [info] Connected: 11 (1 ms)

I think onions (2 create cells comprise one onion) *should* take about
20ms to process on moria (at least, when it's got 3-4 nodes running
in parallel). But when there are quite a few of them they seem to take
longer. Some reasons for this:

* Exit connections take longer to process. I think this is because the
node needs to process the onion and also do a gethostbyname and then
a connect. I'm going to do some more tests to find out how much of the
delay is caused by gethostbyname. Eventually we need to separate out the
dns resolve into its own process; but I'm still trying to avoid that as
long as I can.

* Some functions are inefficient. For example, the basic algorithm of
get_unique_aci_by_addr_port is "pick a random 1-byte aci; see if it's
used; if so, repeat". So when there are many circuits up, it repeats a lot
before finding a free one. Most of the time it uses far less than 1ms,
but at one point I saw it use 40ms. Clearly 1 byte isn't enough here. It's
also not clear we need them to be *random*... need to think more. Probably
there are more such functions too.

What other statistics would it be useful to have?


More information about the tor-dev mailing list