Proposal 168: Reduce default circuit window

Tue Sep 1 11:50:45 UTC 2009

Hi all,

this sounds like an interesting proposal! However, I'd like to point you to 
some implications and side effects. Over the past months, one of my students 
(Daniel Marks, on CC) has worked on congestion control in Tor. One of his 
results is an implementation of Tor circuits in a network simulation framework 
(ns-2). We use this implementation to simulate Tor networks of different size 
and with different traffic patterns, play with various parameters and measure 
throughput, cell latencies, etc. Currently, our aim is to gain a better 
understanding of congestion effects in Tor and their implications and 
interrelations.

Over the weekend, Daniel has run some simulations with the proposed reduction 
of the window size. They are certainly still a bit preliminary, but 
nevertheless some trends are clear.

> 1. Overview
>
>   We should reduce the starting circuit "package window" from 1000 to
>   101. The lower package window will mean that clients will only be able
>   to receive 101 cells (~50KB) on a circuit before they need to send a
>   'sendme' acknowledgement cell to request 100 more.
>
>   Starting with a lower package window on exit relays should save on
>   buffer sizes (and thus memory requirements for the exit relay), and
>   should save on queue sizes (and thus latency for users).

A central problem currently is the huge queues that build up in the onion 
routers. A reduction of the window size will certainly result in shorter 
queues. Whether these translate into lower latencies depends on the definition 
of "latency", however. If "latency" is the time that a given cell spends 
within the Tor overlay, certainly yes. If "latency", however, is the time that 
it takes to download, e.g., a certain web page, this does not necessarily hold 
true (see below).

>   Lowering the package window will induce an extra round-trip for every
>   additional 50298 bytes of the circuit. This extra step is clearly a
>   slow-down for large streams, but ultimately we hope that a) clients
>   fetching smaller streams will see better response, and b) slowing
>   down the large streams in this way will produce lower e2e latencies,
>   so the round-trips won't be so bad.

As far as we see, the current Tor design largely decouples the queues of 
individual circuits (which is good!). If some given circuit's queue in an 
onion router is short, then cells from this circuit will experience 
low queueing delays, even if the queues of other circuits in the same router 
are long - the round robin scheduler will pick cells from the short queue for 
transmission not long after they arrived. (The outgoing TCP buffer also has a 
certain impact here, but the by far largest fraction of the queueing delay is 
apparently within Tor itself.) This also implies: slowing down the other 
streams does not necessarily and automatically help to achieve shorter 
download times!

> 2. Motivation
>
>   Karsten's torperf graphs show that the median download time for a 50KB
>   file over Tor in mid 2009 is 7.7 seconds, whereas the median download
>   time for 1MB and 5MB are around 50s and 150s respectively. The 7.7
>   second figure is way too high, whereas the 50s and 150s figures are
>   surprisingly low.

Your statement implies that you are looking at something like the 
"transmission latency per KB" (which is, by the way, identical to the inverse 
throughput). This metric can be treacherous. To illustrate my point, imagine a 
(very) long communication link, e.g., going from the earth to the moon 
(ok...). Assume it has enormously large available bandwidth, but also a very 
high link latency (say, 7.7 seconds ;-) ). If you send a file of 50 KB over 
that link, it this will never arrive after less than 7.7 seconds. If you send 
a 5 MB file (without waiting for ACKs), it will likewise arrive after 7.7 
seconds - so you get a much lower "transmission latency per KB of data" (= 
bandwidth!). Artificially limiting the 5 MB transmissions' bandwidth will not 
help the 50 KB transmissions to deliver their data any faster.

This may be a bit (ok, not just a bit) simplistic, but it underlines that 
throughput and latency are not independent and can only to a certain extent be 
optimized individually or traded off against each other. And if there are 
multiple interfering circuits, things become really interesting and often 
behave differently than one would initially expect.

The key question here is what the 7.7 seconds are spent for; they are a 
symptom, not the cause. Our current impression is that the key problem in Tor 
is not as much the maximum window size as the fact that the window is opened 
in huge chunks. This makes the traffic extremely bursty and results in 
unnecessarily long queues. If you just reduce the maximum window size to 101 
but still release 100 cells at once into the overlay, this burstiness will 
remain (or it will become even worse).

>   The median round-trip latency appears to be around 2s, with 25% of
>   the data points taking more than 5s. That's a lot of variance.

That's also what we see in the simulations. However, it appears that the 
variance gets worse if you reduce the window size. The reason is that a "stop-
and-go" traffic pattern emerges: after 101 cells, the source must wait for the 
sendme; during that time, the circuit becomes completely empty. Potentially 
free bandwidth at the routers is not used, while readily prepared cells must 
wait for a long time in the circuit's source. After the sendme arrives, the 
first cells make it through the overlay very quickly. The differences between 
individual cells' latencies are therefore huge.

>   We designed Tor originally with the original goal of maximizing
>   throughput. We figured that would also optimize other network properties
>   like round-trip latency. Looks like we were wrong.

Well - cutting down the bandwidth (at the risk of underutilizing resources in 
the Tor overlay) does not automatically improve other performance aspects. I 
have a feeling this could turn out to be a case of "out of the frying pan into 
the fire".

> (...)
>
> 3.3. What about variable circuit windows?
>
>  Once upon a time we imagined adapting the circuit package window to
>  the network conditions. That is, we would start the window small,
>  and raise it based on the latency and throughput we see.
>
>  In theory that crude imitation of TCP's windowing system would allow
>  us to adapt to fill the network better. In practice, I think we want
>  to stick with the small window and never raise it. The low cap reduces
>  the total throughput you can get from Tor for a given circuit. But
>  that's a feature, not a bug.

Note that a hard window size limit does not put a cap on the throughput, but 
on the bandwidth-delay product. Then, circuits with longer RTTs would 
experience a dramatic breakdown in throughput, because they can only send ~50 
KB and must then wait for the sendme to arrive. Circuits with low RTT are much 
less affected. Applications will also suffer from extremely bursty traffic ("stop-
and-go"), probably worse than in the current Tor overlay. This is exactly what 
we observe in our simulations.

(BTW: Could this have implications on security? From the maximum throughput of 
a circuit, or from the burst pattern, it might be possible to infer 
information on the circuit's length/delay, or on an onion router's position 
within the circuit.)

> 4. Evaluation
>
>   How do we know this change is actually smart? It seems intuitive that
>   it's helpful, and some smart systems people have agreed that it's
>   a good idea (or said another way, they were shocked at how big the
>   default package window was before).
>
>   To get a more concrete sense of the benefit, though, Karsten has been
>   running torperf side-by-side on exit relays with the old package window
>   vs the new one. The results are mixed currently -- it is slightly faster
>   for fetching 40KB files, and slightly slower for fetching 50KB files.

Note that file sizes of 40 or 50 KB cannot tell you much about such a 
modification. As you state above, 100 cells is equivalent to ~50 KB of payload. 
I.e., you do the whole transmission within the initial transfer window. So, 
the data transfer will be over before the circuit's window mechanisms have had 
a chance to become active at all! Consequently, modifications to the window 
mechanism will not impact transfers that are below that threshold directly.

>   I think it's going to be tough to get a clear conclusion that this is
>   a good design just by comparing one exit relay running the patch. The
>   trouble is that the other hops in the circuits are still getting bogged
>   down by other clients introducing too much traffic into the network.

Maybe I'm getting something royally wrong - but what exactly are the 
mechanisms within an onion router that would result in one circuit 
experiencing much longer cell queuing delays just because *another* circuit 
has a long queue in that router? In terms of fair resource sharing between 
circuits (and the opportunity to forward cells quickly), our impression is 
that the round-robin scheduler does its job pretty well. It will be hard to 
improve on that just by introducing intentional resource underutilization.
But you guys know the Tor sources much better than we do - are we missing 
something important here?

Cheers

Björn

-- 
Jun.-Prof. Dr. Björn Scheuermann
Mobile and Decentralized Networks Group
Heinrich Heine University
Universitätsstr. 1, D-40225 Düsseldorf, Germany

Building 25.12, Room 02.42
Tel: +49 211 81 11692
Fax: +49 211 81 11638 
scheuermann at cs.uni-duesseldorf.de
http://www.cn.uni-duesseldorf.de