[tor-dev] Towards Congestion Control Deployment in Tor-like Networks

Tue Jun 2 21:52:36 UTC 2020

This novelette-length mail is meant to provide the background
information necessary to understand choices in congestion control
deployment for Tor. Much of this background information won't be
included in my planned Tor Proposal, which will hopefully be much
shorter, but the choices made there will be informed by this material.

This material has been enhanced through review and conversations with
Toke Høiland-Jørgensen, g burdell, Rob Jansen, and George Kadianakis.
Immense credit goes to Toke in particular, for detailed and thoughtful
review, and input on queue management.

Like my last mail on this topic, this mail comes with a playlist. Fair
warning: it is not exactly easy listening. Still, hopefully it gives you
strength and inspiration in these trying times:

https://people.torproject.org/~mikeperry/borders/everybody-loves-them/TakeAKneeAndMeditateInThePaint.txt

Section -I: Recap [RECAP]

  https://wizardzines.com/zines/networking

I strongly believe that congestion control will provide the single
largest performance win for the Tor network. Specifically, it will
considerably improve both latency and throughput metrics in the best,
average, and worst case. In addition to performance, congestion control
will also provide the second-largest scalability win on our research
horizon (behind Walking Onions), by making the queue length memory
requirements of Tor relays independent of the number of circuits carried
on that relay.

I also believe we can achieve the vast majority of the benefits by
deploying an RTT-based system that only clients and Exits need to
support, specifically BOOTLEG_RTT_TOR. After this, we might consider
more advanced options based on ECN, but their benefits are far less
clear, and fraught with peril.

This mail assumes that you have read "Options for Congestion Control in
Tor-like Networks":
  https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html

Even if you have already read the Options mail, I recommend reviewing
the the SENDME section and the sections marked with TCP_HISTORY,
BOOTLEG_RTT_TOR, BACKWARD_ECN_TOR, and perhaps also FORWARD_ECN_TOR, as
understanding this mail requires an understanding of those sections. The
summary of the other sections from the Options mail is basically this:
adding ECN signals to Tor seems straight-forward, but is actually very
tricky.

Section @: Introduction and summary [INTRO_SUMMARY]

Deep understanding of the congestion control problem space is essential
because shitty congestion control will not be a huge win, and will
reduce the benefit of other performance improvements. Worse still, it
may introduce side channels and other anonymity attacks.

In particular, our level of congestion latency (and associated relay
queue lengths) will be very sensitive to congestion signal response
time, to our congestion window update rules, and also to our choice of
queue management algorithm (ie: which circuits send cells and/or
congestion signals).

Our average circuit bandwidth on the network will similarly be very
sensitive to the ability of the congestion window to quickly update and
converge on the available Bandwidth-Delay Product (BDP) of its path.

In other words, we need to make good design choices in the following
three different areas:
    I. Which congestion signal to use [CONGESTION_SIGNALS]
   II. How to update congestion window [CONTROL_ALGORITHMS]
  III. What circuits to send the signal on [QUEUE_MANAGEMENT]

The next three sections of this mail correspond to those three areas.

As before, I have tagged the specific sub-sections that will be used in
the Tor Proposal with [PROPOSAL_CANDIDATE]. I have also given tags to
section in this document. References to things in this document are
enclosed in brackets.  References to the Options mail are simply all
caps, without brackets.

The high-level plan is that we use RTT as a primary congestion signal,
as per BOOTLEG_RTT_TOR from the Options mail, and try it out with a few
different congestion algorithms, such as [TCP_VEGAS_TOR] and
[TCP_WESTWOOD_TOR].

It turns out that when RTT is used as a congestion signal, our current
Circuit-EWMA is sufficient to serve as [QUEUE_MANAGEMENT]. It also turns
out that Circuit-EWMA actually should protect us from fairness issues in
the control algorithms, and from cheaters who try to run different
congestion control algorithms. Circuit-EWMA will also deliver RTT
congestion signals more heavily to bulk circuits (because it prioritizes
their cells below light circuits), causing them to back off in favor of
lighter traffic, even as we relay fewer of their cells. Thanks goes to
Toke Høiland-Jørgensen for pointing this out.

The combination of RTT-based congestion signaling, a decent but simple
control algorithm, and Circuit-EWMA will get us the most if not all of
the benefits we seek, and only requires clients and Exits to upgrade to
use it. Once it is deployed, circuit bandwidth caps will no longer be
capped at ~500kb/sec by the fixed window sizes of SENDME; queue latency
will fall significantly; memory requirements at relays should plummet;
and bottlenecks in the network should dissipate.

However, we can still improve upon this further, after this initial
deployment. Because the "slow start" phase of TCP can take a long time,
we will also want to play around with various initial window sizes,
window growth rate parameters, and experiment with using a single
BACKWARD_ECN_TOR cell_t.command ECN-style signal. This signal will be
sent on circuits as determined by [CODEL_TOR]. An instance of
[CODEL_TOR] can be added to each TLS connection, after Circuit-EWMA is
applied.

While BACKWARD_ECN_TOR is complex and slow to roll out (all relays must
support this new cell_t.command), this optimization will likely be worth
it because BACKWARD_ECN_TOR can send a signal on a circuit to stop the
exponential growth of slow start in *less than* RTT/2 time, which is
significantly faster than implicitly measuring an RTT-based signal
(which takes RTT plus some measurement lag to act upon). This faster
signal means faster termination of slow start, which means less buffer
bloat.

For stream flow control, on the client side, we should just ack all
streams cells with a circuit-level SENDME even though they were not
delivered to their blocked stream. The local Tor client should keep
queuing them until the application reads them. Thus, a blocked client
application stream will not block other streams, but will cause Tor
client stream-level queues to build up. For exit-side streams, we should
*not* send SENDMEs until the cell has been sent on its exiting upstream
connection. This allows TCP pushback on one stream to block the entire
circuit, but this is desirable to avoid queuing, and it will only happen
in the upload direction.

Section I: Selecting Congestion Signals [CONGESTION_SIGNALS]

The "Options for Congestion Control in Tor-like Networks" mail probably
should have been titled "Options for Congestion Signals in Tor-like
Networks", as its primary focus was on the ways Tor endpoints can
receive a signal about the congestion status of a circuit on bottleneck
relays. In that post, I postponed discussion of queue management, and
deliberately under-specified how to react to a congestion signal (going
no further than saying "use SlowStart and AIMD").

I chose that scoping because the most difficult barrier we have faced in
the past decade of work on this topic is the side channels enabled by a
congestion signal (and information leaks in general).

However, I should have made our constraints more explicit. Our choice of
acceptable congestion signal mechanism is restricted by these design
constraints: 1. No side channels between Guard and Exit, or between
circuits 2. Incremental upgrade path 3. Fast response time 4. Cheater
prevention

These design constraints restrict our choice of congestion signal, but
they are not our only design constraints in this problem space. When we
get to the later sections on window control update algorithm, and queue
management algorithm, I will explicitly enumerate additional design
constraints there.  Hopefully this will provide more clarity than the
uncategorized list of "causes of death" in the Options post.

For your convenience, here's the summary table of all the options for
congestion signal again:

______________________________________________________________________
|        MIX           | CHEATING  | RESPONSE|     SIDE   |  UPGRADE |
|       TRACK          | DETECTION |   TIME  |   CHANNELS |   PATH?  |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR       | Exit RTT  |100c+RTT |     RTT    |   Exits  |
|FORWARD_ECN_TOR       | Circ OOM  |   RTT   |    None?   | Full Net |
|FORWARD_ECN_RTT_TOR   | Exit RTT  |   RTT   |     RTT    |Exits;Full|
|BACKWARD_ECN_TOR      |Middles;OOM|  RTT/2  | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR  |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR |  Middles  |  RTT/2  |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR  |Middles;OOM| 2*RTT/3 |    None?   | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT|  Low/None? | Full Net |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

My short-term favorite is BOOTLEG_RTT_TOR, because it only requires
Exit relay support to use.

Recall that BOOTLEG_RTT_TOR defines an RTT latency threshold as its
congestion signal, and was first described in section 4.1 of the N23
Defenestrator paper.  Even though the N23 paper didn't bother to name
this system, RTT use as a congestion signal in this way also resembles
TCP Vegas, as we shall see.
  https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf

Note that because Circuit-EWMA will add additional delay to loud
circuits, "cheaters" who use alternate congestion control algorithms
should end up with more delay than those who do not. This means that if
we use RTT as a congestion signal, Circuit-EWMA should provide *more*
RTT congestion signal to cheaters, and cheaters would only be punishing
themselves with more Circuit-EWMA delay. So cheater resilience for an
RTT signal is actually more well handled that I previously thought.
Again, thanks goes to Toke Høiland-Jørgensen for pointing this out.

For completeness, we will still explore some examples of alternate
"cheating" control algorithms in Section II, though.

Unfortunately, RTT is itself a side channel, as per:
  https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
  https://www.robgjansen.com/publications/howlow-pets2013.pdf

There is also literature on shaping circuit bandwidth to create a side
channel. This can be done regardless of the use of congestion control,
and is not an argument against using congestion control. In fact, the
Backlit defense may be an argument in favor of endpoints monitoring
circuit bandwidth and latency more closely, as a defense:
  https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
  https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
  https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf

Recall that BACKWARD_ECN_TOR used circuit-level cell_t.command to signal
congestion. This allows all relays in the path to signal congestion in
under RTT/2 in either direction, and it can be flipped on existing relay
cells already in transit, without introducing any overhead. However,
because cell_t.command is visible and malleable to all relays, it can
also be used as a side channel. So we must limit its use to a couple of
cells per circuit, at most.

Recall that FORWARD_ECN_TOR adds encrypted end-to-end relay cells that
explicitly signal congestion to either the client or the Exit, but
requires the client to mediate and echo this cell to protect against
side channels. Because congestion signals must be relayed off the
client, it still takes the Exit a full RTT to learn about congestion at
earlier nodes in the circuit.  Additionally, because this cell itself
must travel the full path in each direction, it is further delayed by
congestion on the path, and it also adds yet more congestion to
congested paths. Congestion in Exit to client direction is the most
common, as this is the download direction. It will be particularly
expensive and slow to bounce the congestion signal off the client to the
Exit.

For this reason, I think all forms of FORWARD_ECN_TOR are not likely to
be as efficient as RTT signaling, despite the FORWARD_ECN_RTT_TOR
variant initially being flagged as a PROPOSAL_CANDIDATE in the Options
mail. However, we should still keep it in mind, since RTT may turn out
to be an unreliable signal by itself.

Therefore, since RTT is already measurable with our current SENDME flow
control, I propose that we use it via BOOTLEG_RTT_TOR (or TCP Vegas),
and then add START_BACKWARD_ECN_TOR only exactly once (to exit slow
start). If RTT proves unreliable, then we can experiment with the more
heavyweight FORWARD_ECN_TOR full-cell signal.

As we shall see in the following control algorithm section, RTT-based
systems like TCP Vegas largely avoid the need to send any congestion
signal (drop or ECN). In other words: if we can rely on RTT, we should
not need any explicit signals at all (though they would improve
responsiveness).

Section II: Congestion Window Control Algorithms [CONTROL_ALGORITHMS]

Recall that the congestion window is the number of packets TCP will
allow to be sent on a connection without any acknowledgment. The
congestion window is meant to match the bandwidth-delay product of the
connection as close as possible. The bandwidth-delay product can be
thought of as the total amount of data that can be in-transit on a link
at one time without using any queue memory (bytes per second of
bandwidth multiplied by seconds of transit time delay = total bytes in
transit on the wire at one time).

On the Internet, if the congestion window exceeds the bandwidth-delay
product of a transmission path, then packets must build up in router
queues, and eventually get dropped.  If the congestion window is below
the bandwidth-delay product of the path, then all routers on the path
spend time unused and idle, and thus utilization suffers, even if users
are trying to use full bandwidth.

TCP variants basically compete on their ability to collectively track
the bandwidth-delay product of a path under a wider of a variety of
network conditions and other competing traffic.

In the interest of brevity, my "Options for Congestion Control in
Tor-like Networks" mail suggested that we update the congestion window
according to Slow Start with AIMD, in response to a congestion signal
(or lack thereof).

However, that is not the only option, and if we retain the ability to
measure RTT, it may not even be the most efficient one. So to help us
make this decision, I did a review of historical TCP and TCP-like
systems in the research literature, and will present the highlights
here.

But first, let's define the control requirements that we want to satisfy:
  1. Full bandwidth utilization of slowest TLS link in every circuit
  2. Queue length minimization at the relay sending on this TLS link
  3. Fairness between circuits (esp partially overlapping ones)
  4. Rapid convergence to full circuit capacity

These requirements mirror the requirements used in TCP evaluation. TCP
implementations compete over metrics in each of these areas, and
algorithm design is far more art than formally proved optimal or even
correct.

That does not mean that formalism is impossible, however. For a great
formal breakdown of this design space without all of TCP's baggage, see
Section 3 of the XCP paper. Note that we can't use XCP as-is because its
congestion signal fields provide way too much information for use in
potential side-channels, but it is worth a read because it clarifies the
control theory behind congestion control:
  http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf

In terms of how this design space is approached in TCP: utilization and
queue minimization are typically handled simultaneously - sometimes
explicitly and sometimes implicitly. Similarly, fairness is sometimes
explicitly designed for, but more often fairness is an emergent property
that is analyzed when competing against a benchmark (historically: TCP
Reno). Fairness abuse is also an area where cheating can occur that
enables cheaters to get more than their fair share of bandwidth, and
such cheating can be very hard to detect and prevent on the Internet.

Interestingly, many of the design goals of TCP are *also* improved via
queue management algorithms at routers (Section III of this mail). In
fact, our Circuit-EWMA circuit scheduler will effectively hand out RTT
congestion signals (delay) to loud circuits before quiet ones, and will
*also* enforce fairness and impede cheating, by delaying a cheater's
cells even more.

In this sense, we do not need to worry quite so much about fairness and
cheating as security properties, but a lack of fairness at the TCP
algorithm will *still* increase memory use in relays to queue these
unfair/loud circuits. So we should still be mindful of these properties
in selecting our congestion algorithm, to minimize relay memory use, if
nothing else.

For a good summary of several TCP congestion control algorithms, see:
  http://intronetworks.cs.luc.edu/1/html/reno.html ("Slow Start+AIMD")
  http://intronetworks.cs.luc.edu/1/html/newtcps.html (modern TCPs)

As you can see, the TCP design space is huge.

A good way to break down the TCP space is to separate the drop-based
algorithms from the delay-based algorithms. TCP drop-based algorithms
require either packet drops or ECN packet marking to signal congestion.
Delay based algorithms infer congestion from RTT measurements only.

But as we shall see, this distinction is artificial for us.
BOOTLEG_RTT_TOR effectively converts delay *into* an explicit congestion
signal, as far as congestion window control algorithms are concerned.
Similarly, either of our ECN signals can be used in place of the drop
signal in the algorithms below.

I'm going to review four options that strike me as most relevant to our
needs.  Two are drop-based: TCP Reno, and TCP CUBIC. Two are
delay-based: TCP Vegas, TCP Westwood.

It should be noted that there are also hybrid algorithms that use
drops/ECN in addition to delay as congestion signals, but since we are
focusing on primarily using RTT only as our signal, I won't be covering
them. For a good overview of those options, see:
  https://arxiv.org/pdf/1903.03852.pdf

Toke Høiland-Jørgensen suggested that we also investigate BBR, which is
a hybrid algorithm currently being evaluated by DropBox and Google.
However, BBR involves packet send time scheduling, which will require
replacement of Circuit-EWMA and other extensive changes. Furthermore,
BBR is still under active development, experimentation, and tuning. In
fact, while v2 has been tested by Dropbox, no IETF draft specification
exists yet. Still, for completeness:
  https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00
  https://queue.acm.org/detail.cfm?id=3022184
  https://datatracker.ietf.org/doc/slides-106-iccrg-update-on-bbrv2/

https://dropbox.tech/infrastructure/evaluating-bbrv2-on-the-dropbox-edge-network

I am also going to simplify and omit treatment of packet loss and
duplicate acks, since we will not be dropping any packets. Additionally,
I have normalized all formulas below to be in terms of Tor cells, rather
than bytes.

The hope is that these simplifications help us get to the essence of
what these algorithms would look like when deployed on Tor, without
getting distracted by irrelevant drop and ack synchronization management
details from TCP/IP.

However, as we move closer to specifying an actual design proposal, we
should double-check my simplifications below, for the algorithm(s) that
we actually specify.

Section II.A: TCP Reno [TCP_RENO_TOR]

  https://tools.ietf.org/html/rfc5681
  http://intronetworks.cs.luc.edu/1/html/reno.html

Until recently, TCP Reno was the most popular TCP on the Internet, and
it is still used as a fairness benchmark. While we won't be using it
directly, it is thus important to understand.

Like all TCPs, TCP Reno satisfies its utilization and queue minimization
design goals by trying to make its congestion window closely match the
bandwidth-delay product of the link. More importantly, when multiple TCP
connections are present competing for the bandwidth of a link, TCP Reno
fairly distributes the bandwidth-delay product of the link equally among
all competing TCP Reno connections.
  http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
  https://www.cse.wustl.edu/~jain/papers/ftp/cong_av.pdf

TCP Reno probes the bandwidth-delay product by increasing its congestion
window until it overflows the the queue capacity of the slowest router
on a path, causing a packet drop or ECN signal, upon which TCP Reno
reduces its congestion window.

The initial phase of TCP Reno is called "Slow Start". For our purposes,
Slow Start in TCP Reno means that every SENDME ack, the congestion
window cwnd becomes:

   cwnd = cwnd + SENDME_RATE   (recall Tor's SENDME_RATE is 100 cells)

Because increasing the congestion window by 100 every SENDME allows the
sender to send 100 *more* 512 byte cells every 100 cells, this is an
exponential function that causes cwnd to double every cwnd cells. TCP
Reno keeps on doubling until it experiences a congestion signal.

Once a congestion signal is experienced, Slow Start is exited, and the
Additive-Increase-Multiplicative-Decrease (AIMD) steady-state phase
begins.  AIMD algorithms can be generalized with two parameters: the add
quantity, and the multiply quantity. This is written as AIMD(a,m). TCP
Reno uses AIMD(1, 0.5).  It should be noted that most later TCP's
include an AIMD component of some kind, but tend to use 0.7 for m
instead of 0.5. This allows them to recover faster from congestion.

After slow start exits, in steady-state, after every ack (SENDME)
response without a congestion signal, the window is updated as:

   cwnd = cwnd + SENDME_RATE/cwnd

This comes out to increasing cwnd by 1, every time cwnd cells are
successfully sent without a congestion signal occurring. Thus this is
additive linear growth, not exponential growth.

If there was a congestion signal, cwnd is updated as:

   cwnd  = cwnd * m   (m=0.5 or 0.7)

This is called "Fast Recovery". If you dig into the RFCs, actual TCP
Reno has some additional details for responding to multiple packet
losses that in some cases can fall back into slow-start. However, for
our purposes, the congestion window should only be updated once per cwnd
cells, in either the congestion case, or the non-congestion case.

TCP Reno (and all TCPs) also revert to slow start when the connection
has been idle, in case bottleneck capacity has changed in this time. We
may or may not want to do this, as slow start has a dramatic impact on
page load times.

TCP Reno is still used as the benchmark to determine fairness. If an
algorithm is "TCP Friendly", this basically means that when competing
with TCP Reno to use spare path capacity, it does not get its ass
kicked, and it does not kick Reno's ass. Several algorithms contain
hacks or special cases to preserve this property. If this fairness
property holds, a bunch of math can prove convergence rates,
utilization, congestion window sizes, and even queue sizes.  Modern
queue management also relies on this math to auto-tune queue parameters,
as we will see in Section III.
  http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
  https://www.icir.org/tfrc/aimd.pdf

Section II.B: TCP Cubic [TCP_CUBIC_TOR]

  http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-cubic
  https://pandorafms.com/blog/tcp-congestion-control/
  https://tools.ietf.org/html/rfc8312

TCP Reno unfortunately does not perform optimally for networks where the
natural packet loss rate starts to become frequent, which happens for
both high bandwidth and high latency networks. This shortcoming has lead
to a huge explosion of competition among replacements and improvements
for this use case, and TCP Cubic has won out (for now).

TCP Cubic basically takes TCP Reno and tweaks the additive portion of
AIMD behavior to be a cubic function (y=x^3) instead of linearly
additive. The cubic function was chosen because it is concave, then
flat, and then convex.  The concave piece allows for quickly recovering
after congestion (or non-congestion-related packet loss), and the convex
piece allows for gently probing for fresh spare link capacity.

Because of adoption by both the Linux kernel and the Windows kernel as
the default TCP, TCP Cubic is now the most popular TCP on the Internet.

Unfortunately, TCP Cubic is complicated. It has multiple magic numbers,
and is defined piece wise depending on how its congestion window would
compare to the equivalent TCP Reno window (to ensure the "TCP Friendly"
property -- without this hack, TCP Cubic is "too polite" and TCP Reno
eats its lunch). It additionally has time-based components that must be
clocked relative to current measured RTT (though it does not use RTT as
a congestion signal).

The cubic congestion window is defined as:

   W_max = previous window size before last congestion signal
   W_cubic(t) = C*(t-K)^3 + W_max

With magic number constants:
   C = 0.4
   beta_cubic = 0.7
   K = cubic_root(W_max*(1-beta_cubic)/C)

To maintain competitiveness with TCP Reno, Cubic keeps a running
estimate of an equivalent TCP Reno's window size, using the AIMD math
cited for Reno above:

    W_est(t) = W_max*beta_cubic +
                   [3*(1-beta_cubic)/(1+beta_cubic)] * (t/RTT)

    if W_est(t) > W_cubic(t):
        cwnd = W_est(t)
    else:
        cwnd = cwnd + (W_cubic(t+RTT) - cwnd)/cwnd

Upon congestion signal, Cubic uses its beta_cubic parameter (0.7), and
thus does not back off quite as far as Reno:

    W_max = cwnd
    cwnd = cwnd * beta_cubic

It is pretty easy to convince yourself that Cubic will do as least as
well as TCP Reno (thanks to the AIMD Reno window comparison hack), but
it is very hard to conceptualize the above and convince yourself that it
is the best we could possibly do for our use case. Especially since Tor
is *not* affected by an increasing inherent drop rate as bandwidth
increases, unlike the Internet.

For these reasons, the complexity of TCP Cubic does not strike me as
worthwhile, as compared to other, simpler candidates.

Section II.C: TCP Vegas and [TCP_VEGAS_TOR] [PROPOSAL_CANDIDATE]

  http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
  ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf (lol ftp, sorry)

Similar to BOOTLEG_RTT_TOR from my Options mail, TCP Vegas was designed
to make use of RTT and bandwidth estimation to avoid queue congestion,
and thus avoid any dropped packets or ECN signals. However, TCP Vegas
actually directly estimates queue length of the bottleneck router, and
aims to minimize the packets in that queue.

To accomplish this, TCP Vegas maintains two RTT measurements:
RTT_current and RTT_min. It also maintains a bandwidth estimate.

The bandwidth estimate is the current congestion window size divided by
the RTT estimate:

   BWE = cwnd/RTT_current

(Note: In later TCP Vegas relatives like TCP Westwood, this bandwidth
estimate was improved to use an explicit count of the unacked packets in
transit, instead of cwnd).

Recall that in order to utilize the full capacity of the link, all TCPs
attempt to maintain its congestion window as close to the
bandwidth-delay product (ie: the "memory capacity") of the transmission
path as possible.

The extra queue use along a path can thus be estimated by first
estimating the path's bandwidth-delay product:

   BDP = BWE*RTT_min

TCP Vegas then estimates the queue use caused by congestion as:

   queue_use = cwnd - BDP
             = cwnd - cwnd*RTT_min/RTT_current
             = cwnd * (1 - RTT_min/RTT_current)

If queue_use drops below a threshold alpha (typically 2-3 packets for
TCP, but perhaps double or triple that for our smaller cells), then the
congestion window is increased. If queue_use exceeds a threshold beta
(typically 4-6 packets, but again we should probably double or triple
this), then the congestion window is decreased. This update happens only
once per cwnd packets:

   if queue_use < alpha:
     cwnd = cwnd + 1
   elif queue_use > beta:
     cwnd = cwnd - 1

As a backstop, TCP Vegas still will half its congestion window in the
event that a congestion signal (packet drop or ECN) still occurs.
However, such packet drops should not actually even happen, unless Vegas
is fighting with a less fair control algorithm that is causing drops
instead of backing off based on queue delay.

Vegas also uses this queue_use estimate to govern its initial slow start
behavior. During slow start, the Vegas congestion window grows
exponentially, but only half as fast as Reno (to avoid excessive
congestion). For us, this means that cwnd is updated every SENDME ack
by:

   if queue_use < gamma:
     cwnd = cwnd + SENDME_RATE/2

This exponential increase continues until the queue_use estimate exceeds
"gamma" (which is determined experimentally and sadly not listed in the
TCP Vegas paper). If I had to guess, gamma is probably close to beta
(4-6 packets).

Because Vegas tries much harder to avoid congestion than TCP Reno, Vegas
is vulnerable to cheating through fairness abuse. When TCP Vegas
competes against a TCP Reno-style TCP that responds only to packet drops
(or ECN signals in our case) and ignores RTT, TCP Vegas detects
congestion earlier than TCP Reno would, and ends up yielding bandwidth
to TCP Reno. This means that Vegas does *not* have the "TCP Friendly"
property that modern AQM algorithms depend on for their
parameterization.

Note that we should expect the same thing to happen if TCP Vegas
competed with BOOTLEG_RTT_TOR algorithm run by cheaters. However, since
RTT is being used as the congestion signal in both cases, Circuit-EWMA
should still balance this out, but we should experiment with this to
confirm. There may still be additional queuing memory costs in the
presence of cheaters, compared to either pure [TCP_VEGAS_TOR] or pure
BOOTLEG_RTT_TOR. Even so: this form of cheating is also pretty easy for
honest Exit nodes to detect, and cheating clients only get benefit in
the upload direction, even if they did cheat.

To see this a bit more concretely, if we sub in TCP Reno in
BOOTLEG_RTT_TOR, during slow start we have:

    T=(1−a)*rtt_min+a*rtt_max

    If rtt < T:
      cwnd = cwnd + SENDME
    else:
      exit slow start

And then for steady state:

    If rtt < T:
      cwnd = cwnd + SENDME_RATE/cwnd
    else:
      cwnd = cwnd * 0.5

The 'a' parameter from BOOTLEG_RTT_TOR is operating similarly to the
queue_len criterion of [TCP_VEGAS_TOR], but the window update rules are
[TCP_RENO_TOR], which are more aggressive than TCP Vegas.

So effectively, this is still [TCP_VEGAS_TOR] with a different target
condition for queue_len. This means we could expect BOOTLEG_RTT_TOR to
drive the congestion to some higher rate of queue use than
[TCP_VEGAS_TOR], and senders who use it may out-compete Vegas much like
TCP Reno out-competes Vegas, until Circuit-EWMA starts punishing them
(at the expense of more queue memory waste).

This also shows us a crucial testing consideration: [TCP_VEGAS_TOR] may
not compete well with our current SENDME congestion control, since it's
ability to set queue_target may cause it to de-prioritize itself below
competing SENDME flows. BOOTLEG_RTT_TOR, on the other hand, will allow
us to set its 'a' parameter at whatever point makes it competitive with
SENDME. For this reason, BOOTLEG_RTT_TOR may yield better performance
characteristics until all clients upgrade.

Section II.D: TCP Westwood [TCP_WESTWOOD_TOR] [PROPOSAL_CANDIDATE]

  https://tools.ietf.org/html/rfc8290#section-4.1
  http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood

TCP Westwood is essentially TCP Reno, but if the current BDP estimate is
greater than half the current congestion window, TCP Westwood uses that
instead of halving the congestion window, during a congestion signal.

In other words, upon a congestion signal, TCP Westood does:

   cwnd = max(cwnd * 0.5, BDP_estimate)

This has the effect of recovering from congestion events much quicker
than TCP Reno, and thus utilizing more of the path capacity.

Westwood+ also has some improvements on TCP Vegas's BDP_estimate in
order to smooth it in the presence of delayed and/or dropped acks:

    BWE_k = a*BWE_k-1 + (1−a)*BWM_k

We may not need this smoothing as much, because we do not have any
dropped acks, and dropped acks were far more damaging to TCP Vegas than
slightly delayed acks.

Section III: Queue Management Algorithms [QUEUE_MANAGEMENT]

On the Internet, Queue Management Algorithms are used by routers to
decide how long their queues should be, which packets they should drop
and when, and which connections to send ECN signals on and when. Queue
management is almost as large of a research area as congestion control.

Thanks to bittorrent and video streaming, queue management also grew to
consider fairness or QoS between flows. This is done by separating
queues or imposing a scheduling algorithm on packet ordering. For an
excellent overview of fairness vs queue management, see RFC 7806:
   https://tools.ietf.org/html/rfc7806

Since Tor relays cannot drop packets, our design goals are slightly
different than Internet queue management. We care about the following
design goals:
  1. Prioritize well-behaved/light circuits over loud/heavy ones
  2. Decide when to send ECN signals
  3  Decide which circuits to send ECN signals on
  4. Detect which circuits are queuing too much (or otherwise cheating)

Recall that Tor has already deployed Circuit-EWMA in order to meet goal
#1.  Circuit-EWMA tries to give priority to less loud circuits in the
circuit scheduler, so that interactive applications experience less
latency than bulk transfers.
  https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf

Circuit-EWMA can be viewed as a fairness algorithm as per RFC 7806
Section 2, since it schedules flows independently with the target
fairness property of prioritizing quiet circuits cells over loud ones.
When Circuit-EWMA is used in combination with RTT as a congestion
signal, it is effectively *also* a marking algorithm. This is because it
*also* distributes RTT congestion signals more frequently to loud
circuit:
  https://tools.ietf.org/html/rfc7806#section-2
  https://tools.ietf.org/html/rfc7806#section-3

Recall that Tor also had to deploy KIST, to move queues from the Linux
kernel into Tor, for Circuit-EWMA to properly prioritize them. See
Section 4 and 5 of the full-length KIST paper for details on this
interaction, and see page 5 for a good graphical depiction of all the
queuing points in Tor and the Kernel:
  https://matt.traudt.xyz/static/papers/kist-tops2018.pdf

It is likely the case that Circuit-EWMA and KIST are all we need if we
only use RTT signaling. But if we add in ECN, at minimum we will need to
update Circuit-EWMA to inform us which circuit(s) to send ECN signals
on. In fact, thanks to RFC7806's distinction between fairness and
management, modern Internet queue management algorithms compose well
with Circuit-EWMA, allowing us to still use Circuit-EWMA for the
fairness portion, but use an existing Internet algorithm for the ECN
marking. So it is useful to review Internet history here, as well as
current state-of-the-art.

On the Internet, queue management algorithms arose fairly early for one
primary reason: early routers performed what is called "tail dropping".
Tail dropping means that routers allowed their queues to get full, and
then started dropping all packets that arrived while queue remained
full. This has all kinds of negative consequences, but the easiest to
intuitively understand is that if you wait until your queue is full
before you start dropping packets, and then drop all of the packets that
arrive from then on, competing TCP connections will all back off at the
same time, leading to lulls in network utilization.

The key innovation to address tail dropping was to randomly drop (or ECN
mark) packets early, before the queue was full, with a per-packet drop
probability.  This probability was computed based on many factors of the
link, and sometimes based on individual TCP connection tracking.

Until very recently, both queue management and fair queuing algorithms
tended to have many knobs that must be tuned for specific kinds of links
and load levels. The Internet seems to be converging on three main
contenders for "knobless" queue management: PIE, ARED, and CoDel. Here's
some recent review material:

http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_EUNICE_14_40255.pdf
  https://datatracker.ietf.org/meeting/88/materials/slides-88-iccrg-4/

However, be aware that these knobless systems make assumptions about the
congestion control algorithms and the latency of the connections that
may not apply to us. The assumptions are most clearly specified in
Section 3.2 of the CoDel RFC, and are closely related to the "TCP
Fairness vs TCP Reno" assumption from our congestion control review:
  https://tools.ietf.org/html/rfc8289#section-3.2

Ok, so let's review the top "knobless" contenders.

Section III.A: PIE and ARED [ARED_PIE_TOR]

  https://tools.ietf.org/html/rfc8033
  http://www.icir.org/floyd/papers/adaptiveRed.pdf

As a quick summary, both PIE and ARED manage their queues by dropping
packets probabalistically. In both systems, the per-packet drop
probability is updated in proportion to average queue transit time --
longer queues mean the drop probability is increased.

Both systems decide to drop individual packets from the queue in using
this probability. This has the effect that louder flows (with more
packets) are also more likely to experience drops (or ECN congestion
marking). But otherwise they have no explicit notion of fairness.

This makes them marking algorithms as per RFC7806's classification. If
explicit QoS or fairness is desired, they have to be deployed in
combination with a fairness algorithm to prioritize between separate
queues, again as per RFC 7806 Section 3.3:
  https://tools.ietf.org/html/rfc7806#section-3.3

Recall that Tor has multiple levels of queues: Circuit, TLS connections,
and kernel buffers, as per Figure 1 on page 5 in the KIST paper:
  https://matt.traudt.xyz/static/papers/kist-tops2018.pdf

Even if we keep Circuit-EWMA for scheduling (and thus fairness), we can
still add in PIE or ARED per each TLS connection. If the average total
queue length for all circuits on a TLS connection grows beyond the
marking threshold for PIE/ARED, we could then begin sending congestion
signals on circuits, as they send cells, as per the cell drop
probability of these algorithms.

However, because we only want to send one BACKWARD_ECN_TOR signal per
circuit, we will have to save these signals for circuits that have not
yet gotten one.  It is an open question how we should manage the mark
probability if we have to pass over a circuit because we already sent a
backward ECN signal on it.

Section III.B: CoDel and FQ_CoDel [CODEL_TOR] [PROPOSAL_CANDIDATE]

  https://queue.acm.org/detail.cfm?id=2209336
  https://tools.ietf.org/html/rfc8289
  https://tools.ietf.org/html/rfc8290

CoDel is a relatively recent algorithm that is gaining traction because
of its simple elegance. Like PIE and ARED, CoDel itself is a marking
algorithm, and can be combined with any fairness/QoS algorithm. It also
comes with its own fairness algorithm, called FQ_CoDel.
  https://tools.ietf.org/html/rfc7806#section-3.2

Despite being touted as parameterless, CoDel does have two parameters:
  inspection_interval <- set above max RTT (but not too high)
  min_queue_target <- set to 5% of min RTT

CoDel tags packets with timestamps as soon as they enter the queue.
Every inspection_interval, it examines the queue to see if any packets
have waited longer than target_delay. If they have, it begins dropping
packets (sending ECN signals in our case).

Because CoDel is accurately tracking the time *each* packet spends in
the queue, it is able to differentiate between "good" queues (not
congested), vs "bad" queues (congested), on a packet-by-packet basis.
This good vs bad distinction is determined by comparing the packet
transit time to min_queue_target. If a packet exceeds this transit time,
then the queue is bad, and the packet must be dropped or ECN marked.

It is this per-packet distinction that significantly separates CoDel
from PIE and ARED, which both use average queue times. This was the
source of some controversy, as per-packet timestamping was initially
argued to be too expensive. But on modern commodity hardware, this
per-packet timestamping is actually incredibly efficient. It also allows
CoDel to converge on the min_queue_target very rapidly, with high link
utilization.

However, it would seem that because we can't drop packets, sojourn times
will *not* decrease down to min_queue_target immediately upon congestion
for Tor.  We will have to wait for the actual backoff signal to
propagate (1 RTT) before the queue size changes due to backoff.
According to Toke Høiland-Jørgensen (one of the FQ_CoDel spec authors),
this should not be that significant, but it should be tested.

FQ_CoDel extends CoDel to provide fairness in a very similar way as
Circuit-EWMA does, by separating out TCP connections into "generated a
queue recently" and "did not generate a queue recently". This allows
packets from flows that are not causing congestion to be sent more
quickly than flows that are. This also has the side effect that when
connections build long FQ queues, their packets spend more time in
queues, and have an increased probability of being ECN marked or
dropped.
  https://ieeexplore.ieee.org/document/8469111

Like CoDel, FQ_CoDel relies on the assumption that it can drop a lot of
packets at once to correct queue sizes instantly:
  https://tools.ietf.org/html/rfc8290#section-4.1

This makes me think we should not rush to replace Circuit-EWMA with
FQ_CoDel, but we may get benefit from doing CoDel on a per-TLS
connection basis, just as is described for PIE and ARED above. In this
case, it is more clear which circuits to ECN mark. Whenever a cell has a
Circuit-EWMA circuitmux queue transit time longer than the target queue
time threshold (min_queue_target), that circuit gets an ECN mark. If it
has already been ECN marked, no biggie.  In other words: The decision to
ECN mark circuits is completely dependent only on how long their cells
traverse our queues, and does not require any capacity estimates or
other measurements. Very clean.

So if we decide we want to try ECN, I think Circuit-EWMA with CoDel is a
smart first choice. We would then use CoDel to deliver exactly one
BACKWARD_ECN_TOR signal when it detects a circuit's queues are too long.
This RTT/2 signal should cause the circuit to exit slow start more
quickly than RTT measurement by endpoints will.

Section IV: Next Steps [NEXT_STEPS]

After this extended review, I am convinced that an RTT-based approach
can provide a comprehensive solution to all of our design requirements,
and can be enhanced based on experimentation and further additions. Such
an approach also only requires clients and Exits (or just clients and
onion services) to upgrade to initially deploy.

However, we can still improve on this with a few enhancements. In
particular, we can compare [TCP_VEGAS_TOR] to [TCP_WESTWOOD_TOR] to
BOOTLEG_RTT_TOR, and see how they perform in competition with each
other, and in competition with Tor's currently fixed-size SENDME
windows. We can also implement CoDel in circuitmux to track transit
times and determine which circuits to send a single BACKWARD_ECN_TOR
signal, to exit slow start on circuits more quickly, and we can play
around with higher initial window sizes. We can do this experimentation
first in Shadow, and also on the live network via consensus parameters.

The immediate next steps here are to create a proper Tor Proposal from
the sections tagged [PROPOSAL_CANDIDATE] above.

-- 
Mike Perry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20200602/d3ad532a/attachment-0001.sig>