[tor-dev] Options for Congestion Control in Tor-like Networks

Thu Jan 30 13:33:26 UTC 2020

This near-book-length mail is a pre-proposal to help us review previous
congestion control ideas for Tor, and brainstorm new ones. My goal is to
turn this material into one or more proposals and provide ways of
evaluating these options. With proper voodoo, maybe we can resurrect a
couple oldies from the graveyard of mixnet lore, and remix them into
some Vaporwave bangers. If we're real lucky with some fresh ideas, maybe
we can even make a hip hop Internet streaming chart buster from this mix.

But long mail is looong, ok? It's sooo long it has its own soundtrack.
Grab your favorite links from the references at the end, maybe print
everything out, find the nearest sofa or bed, and curl up with it all.
Flip on some Pretty Lights. Let's get that flow Finally Moving. Turn off
the tv! The flow is More Important Than Michael Jordan. Listen: the flow
is Hot Like Sauce. I hope you're an Organ Donor - Extended Overhaul
might be necessary. We're going on a mental Midnight Voyage by Ghostland
Observatory. But don't worry, we'll make it by 4 AM, without being
trapped there, I swear. Hopefully that's not Pushing Time with ya Ample
Mammal(s). Relax: there will still be time for some Codeine(g) Dreaming.
If you finish reading all of it before you go ROCKABYE BABY, you're
definitely a High Roller. All of this is almost exactly an hour of
Bumpin' In The Voodoo with Manic Focus.

If you'd rather take your time and chillax with something Muy Tranquilo
but less Gramatik, just put on some Blank Banshee. But don't fall
asleep, or George Clanton will Kill You In Bed.

Ready? Ok, back to serious business.

Motivation: Load balancing (TorFlow/sbws), circuit build timeout cutoffs
(CBT), and QoS (KIST+EWMA) have provided huge perf wins for Tor. Tuning
and improving these systems will continue to provide latency
improvements in the average case, but congestion control is the missing
piece we need for the network to operate properly at high utilization
levels. Congestion control is also necessary to increase the throughput
of the network at even at low utilization levels. Historically, Tor
performance shows high correlation to network utilization levels, and I
believe this is largely due to the effects of ephemeral congestion:
https://lists.torproject.org/pipermail/tor-scaling/2019-June/000051.html

In fact, Section 6.3 of
https://www.freehaven.net/anonbib/cache/murdoch-pet2008.pdf uses
Pollaczek-Khinchin queuing theory to show that expected Tor queue
latency is proportional to network utilization divided by spare
capacity, if queues are not otherwise bounded somehow (by congestion
control).

Summary: The rest of this post is organized as follows: First, I go over
TCP and ECN, because I build upon those for a couple new ideas. Then, I
summarize Tor's current SENDME flow control and review its many
failings. Then, I review past attempts at congestion control for Tor in
the research literature. Then, I propose four new candidate ideas.
Finally, I conclude the post with some ideas for evaluation and further
analysis.

Unless you are a congestion control expert who is also deeply familiar
with Tor's many failed attempts in this area, I strongly recommend
wading through this post in order. The summaries are meant to get you up
to speed without having to do quite as much reading as I did, and they
are filled with in-line URL references in case you want to dig deeper.

If you still want to skip ahead to the highlights, search this mail for
[PROPOSAL_CANDIDATE]. There are four such sections. Those sections build
on other ideas, though. For those, search for [TCP_HISTORY],
[BOOTLEG_RTT_TOR], [FORWARD_ECN_TOR], and [BACKWARD_ECN_TOR]. Search for
[TRACK_LISTING] to find a summary table of all the new ideas from this
post, with their associated key properties.

Crucially absent from this pre-proposal post is the closely related
discussion of QoS/scheduling, which we need to make decisions about
which circuit(s) to select for delivering congestion control signals,
when congestion occurs. For now, this post just handwaves this piece as
"use either EWMA or something RED-like". There is much literature on
these mechanisms, enough to warrant separate treatment, but thankfully
the choice of which to use is orthogonal to the choice of congestion
control signal delivery mechanism.

-I. History of Internet Congestion Control [TCP_HISTORY]

In TCP, each endpoint maintains send and receive buffers of packets,
called windows. The receive window holds packets until a contiguous
chunk is received, at which point data is delivered to the application
and an acknowledgement packet is sent to the other end. The send window
size is doubled every ack until the first packet is dropped, after which
it is increased by 1 for each acked packet, and halved for each drop.
This is called Slow Start with AIMD (Additive-Increase
Multiplicative-Decrease). Packets that are not acked are retransmitted
from the send window buffer after a timeout of one RTT (Round Trip
Time). Drops are caused by intermediate routers' fixed-size queues being
full, but can also be caused by poor network conditions (which leads to
sub-optimal throughput). In this way, TCP responds to congestion
anywhere on the path, and takes around one RTT to detect said congestion.

Even this detailed summary is a simplified model. In reality, TCP is way
more complicated than that. The real thing has like 23 states, and a
whole bunch of kruft from the 80s and 90s that we'll want to cut out of
our mix. We just need the key hooks.

Decades later, Explicit Congestion Notification (ECN) was proposed to
allow routers to explicitly set a flag on TCP packets to signal that
their queue is getting full, instead of dropping packets. For abstruse
compatibility and reliability reasons, when a router adds this flag to a
packet, it is sent all the way to one endpoint. That endpoint *then*
re-adds the flag to acks going in the other direction, all the way to
the other endpoint, who then backs off as if it detected a packet drop.
This is called Forward ECN: https://tools.ietf.org/html/rfc3168#section-6.1

Unfortunately, because of the requirement to bounce the congestion
notification flag all the way off the other endpoint, it *still* takes
Forward ECN at least one RTT to respond to congestion, and so almost no
intermediate routers have bothered to deploy it -- the gains are only
marginal.

Also interesting for our purposes is the vastly simplified Backwards ECN
that uses separate ICMP signaling to eliminate the endpoint echo RTT for
much faster responsiveness to congestion:
https://tools.ietf.org/html/draft-salim-jhsbnns-ecn-00

Backwards ECN was never widely deployed either, because ICMP is easy to
spoof+spam, requires extra packet overhead, many routers filter it, and
no strong authentication binds it to the TCP connection.

But enough history! Onward!

@. Status quo of the Tor flow: SENDME sadness and pain

Tor has a window-based flow control system called SENDMEs. This system
does not provide congestion control, as window sizes are hard-coded
constants. I believe it is important to review this system in detail, so
we can avoid making any of the same mistakes again. Bear with me.

Recall that Tor circuits are 3 hops long when contacting the Internet
(Guard, Middle, Exit), and 7 hops long when a client contacts an onion
service. TCP is used for connection in between these hops. One or more
reliable streams are multiplexed inside each circuit.

Each endpoint of a circuit (client and Exit, or client and onion
service) maintains a remaining-to-send cell count for each circuit and
stream, which is refilled when it receives a SENDME from the other end.
Each time it sends a cell, it decrements this count. When this count
reaches 0 without receiving corresponding SENDMEs, it will stop sending
data. The initial value of these counts is 1000 cells for circuits and
500 cells for streams, and each SENDME refills 100 cell counts for
circuits, and 50 cell counts for streams. Crucially: no backpressure
happens on interior relay connections for this windowing system.
Instead, we just queue.

If you have a TCP background, you may find it easier to mentally replace
"SENDME" with "ACK", and you won't be far off. Its just that our acks
always ack a fixed amount of cells, and the window size is also fixed.

In the steady state stage, window updates of 50 or 100 cells per SENDME
results in an upper bound of performance inversely proportional to half
the circuit RTT. This is 2*CELL_SIZE*50/RTT for streams, and
2*CELL_SIZE*100/RTT for circuits. For streams with a ~100msec RTT, this
is ~500Kbytes/sec.

This means that once the Tor network has enough capacity for RTT to be
close to link transit time (ie: no queue delay), adding more Tor relays
will *not* make Tor faster. This also means that onion service
throughput will almost always be much lower than Exit throughput. The
RTT is much much higher for 7 hop onion circuits than for 3 hop exits,
so even if there is plenty of spare network capacity, Onion service
sites will *never* download as fast as their clearweb equivalents with
our current flow control.

All of this is done just to provide flow control, so that data flow can
stop in the event of endpoint stalls and/or interior relay failure. It
does not actually provide fairness or limit congestion when multiple
clients are used. Additionally, because nothing proves the endpoint has
read the data, a client that does not read data but keeps sending
SENDMEs to the Exit can deliberately force queues to build up in the
Guard node (due to the client-blocked TCP layer), to induce OOM
conditions and crashes. We need to upgrade all clients and all Exit
relays to fix this problem, which will take quite some time:
https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt

Even with honest or authenticated behavior, the SENDME protocol means
that each circuit can queue up to 1000 cells at an overloaded bottleneck
Tor router, which load balancing can help alleviate, but can't correct
in transient and degenerate cases. This is why high capacity Exits have
obscene memory requirements when Exits are scarce (ie: below ~1/3 total
network throughput), despite there being no memory leaks.

This also tells us that Tor relays *cannot* scale to support larger
numbers of users without a corresponding linear increase in memory
requirements.

As a side effect of this protocol, it is possible for the client and the
Exit to compute the RTT of a circuit by measuring time between every
50th or 100th sent cell and the associated SENDME arrival. Some previous
work makes use of this to try to detect congestion, but it is not a
property we want to keep, if we can avoid it. No deployed code uses it,
and its presence is worrisome:
http://people.cs.ksu.edu/~eyv/papers/latency_leak-ccs07.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf

To further complicate things, these SENDME windows only apply to
end-to-end data cells on the circuit. Tor also supports other non-data
end-to-end cells, as well as non-end-to-end data cells (so-called "leaky
pipe topology", which is used for circuit padding), which are not
counted against SENDME windows. Both of these details have also caused
us problems in the SENDME design. Still, for now, we will ignore both of
these details. Full proposal treatment will need to consider them, though.

I. Vaporwaving to ghosts in the anonymity graveyard

Ok, so let's party like it's 2010 and go chase some ghosts of past
anonymity literature and lore.

In my review of this area, I have identified the following causes of
death for previous congestion control attempts in Tor:
  * Side channels and other anonymity risks
  * Slow or limited responsiveness to queue pressure
  * Poor fairness properties
  * Poor throughput
  * Lack of stream flow control
  * Endpoint fairness cheating ability
  * Deployment costs

Ghost 1: Drop Signaling
Cause of Death: Side channel issues; deployment cost

Early proposals for congestion control for Tor attempted to simply
tunnel OS TCP streams. This uses the OS's native implementation of TCP
Slow Start and AIMD as congestion control methods. These attempts all
quickly died, due to OS TCP stack fingerprinting problems (ie: nmap):
https://murdoch.is/papers/tor11datagramcomparison.pdf

Two years ago, I tried to make the case for using QUIC as a drop-in
replacement for Tor circuits, and use QUIC's streams to carry Tor
streams. Such a network could leverage QUIC's TCP-like drop-signaled
congestion control algorithm internally, and terminate QUIC at the Exit
node, transforming the QUIC streams into TCP connections initiated from
the Exit. This avoids the TCP fingerprinting issues. In fact, even
without a full datagram network, Tor could still deploy relay-only drop
signaling if we added windowing and retransmission at the circuit layer:
https://lists.torproject.org/pipermail/tor-dev/2018-March/013026.html

Unfortunately, the present/absent bit vector of missing packets in this
window is a communication channel between malicious Tor relays or
Internet routers and the Exit node. This effect was an obscure piece of
ancient mix network lore until Nick Mathewson publicly documented it on
tor-dev a little over a year ago:
https://lists.torproject.org/pipermail/tor-dev/2018-November/013562.html

I was the voice of optimism in that grim post, but there is not a lot of
hope to be had. It still seems to me that adding cover traffic will
reduce the bandwidth of this side channel, but it is a deeply unsolved
traffic analysis problem to quantify the level of cover traffic needed,
on top of the engineering expense of protocol and cryptographic
revamping needed to support cell drops at intermediate relays.

It may also be possible to add a MAC scheme that prevents excessive
drops by intermediate relays and/or internet routers, but then we may
lose a lot of congestion signaling capability and responsiveness.

Aside: This drop side channel is much worse for VPN systems that blindly
tunnel OS-level TCP. Intermediate Internet routers between the VPN
client and the VPN server can use this side channel to send information
to any Internet router after the VPN server, via exposed TCP sequence
numbers. In a Tor-like system with Exit stream termination, the Exit
relay must be one of the side channel participants.

For more fun VPN TCP side channels, see also
https://seclists.org/oss-sec/2019/q4/122 and
https://tools.ietf.org/html/rfc6040.

Ghost 2: Datagram Tor with uTP/LEDBAT Signaling
Cause of Death: Responsiveness; fairness/cheating; side channels

uTP/LEDBAT is the bittorrent transport that measures the RTT of a
connection, and backs off if high latency is detected. The theory is
that latency above the maximum acceptable value (100ms by spec)
indicates that router queues have formed, due to competing traffic
causing a bottleneck. Typically these queues form at poor-quality
consumer edge routers that queue way too much for their link capacity
and user counts.

https://tools.ietf.org/html/rfc6817
https://research.torproject.org/techreports/libutp-2013-10-30.pdf

Because LEDBAT's goal is to use the link fully, and yield capacity in
presence of competing traffic, LEDBAT congestion control requires that
target latency upper bounds be known, and also expects some queuing to
occur to cause this latency. If network drops still occur, it also uses
TCP-like behavior, with similar side channel issues for our usecase. If
inherent path latency ever exceeds the LEDBAT target max, throughput
will plummet to near-zero. This is bad for Tor, as our path latency is
highly variable, especially between 7 hop onion circuits vs 3 hop exit
circuits.

Additionally, malicious LEDBAT clients can cheat by tolerating a
*larger* max delay than other clients. They will thus accept larger
queue sizes, which allow larger windows, which allow marginally better
throughput, at the expense of more congestion latency for everyone.

Ghost 3: Congestion Aware Tor
Cause of Death: Hazy anonymity analysis; responsiveness; throughput

Congestion Aware Tor uses more detailed per-node latency estimates
to measure transient congestion-induced delay and respond to it by
altering path selection:
https://www.cypherpunks.ca/~iang/pubs/Congestion_Aware_FC12.pdf

The downside of Congestion Aware Tor is that rather than managing the
congestion window, it suggests simply migrating circuits and changing
path weights. This has hazy effects on anonymity, as the set of viable
paths in the network is reduced by some hard-to-predict amount. Our
Circuit Build Timeout code (CBT) is able to do this kind of path pruning
in a significantly more precise manner by setting the circuit timeout
such that nearly exactly 80% of all possible network paths are used, but
it does not continually perform this timing measurement once the circuit
is built.

We could build a hybrid system with CBT-style math, with continual
measurements, with circuit migration, and with conflux multipath, but
this will still require retransmission buffers for reliable circuit
migration, will be slower to respond than true window control, and can't
actually cap congestion. Worse, circuit migration won't reduce
congestion at Exit nodes (you have to keep the same Exit or streams will
die).

Even with these changes, it also will not remove the throughput cap, or
improve onion service throughput.

Ghost 4: N23 Tor (aka DefenestraTor)
Cause of Death: Stream control; side channels; poor throughput

N23 Tor is described in Section 4.2 of the DefenestraTor paper:
https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf

Basically, each router limits queue length for each circuit to N2 + N3,
which are consensus and circuit parameters, respectively. N3 is tuned
per circuit by latency measures similar to Congestion Aware Tor, but is
still capped at 500 per circuit.

This system died primarily because we could not figure out how to bolt
stream flow control back on to it. But that is a solvable problem (and
any circuit-level congestion control system we deploy must solve it):
https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html
https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html

I was not involved in the evaluation of N23 because I was designing and
building Tor Browser at the time, but upon diving into it now, I have
many additional concerns.

First: Because queues are per-circuit but backpressure is
per-connection, the backpressure suffers from multiplexing information
leak issues. If a circuit manages to fill its N23 limit, the only way
for the system to have backpressure to enforce N23 queue sizes is to
stop reading entirely on that circuit's upstream TCP connection,
stalling *all* other circuits on *only* that particular connection. This
property is extremely worrisome from an anonymity standpoint, as it
gives a client adversary a mechanism to experimentally stall pairwise
router connections to probe for the path taken by a long-lived active
target circuit (such as an onion service intro point).

Second: It's not clear to me how credits are sent in both directions on
a circuit, so that the congestion control can be applied in both
directions. Does anyone remember? (Does anyone care?)

Third: At the end of the day, the system still has total queue lengths
that scale in proportion to the number of circuits on the relay, rather
than being globally fixed and controlled through fairness properties. In
my view, this is the final nail in the coffin: it's not an improvement
for scalability, actually decreases throughput (due to the new lower
window cap), and can't control congestion enough to reduce the long-tail
latency.

Indeed, the perf CDFs from the paper show all of these effects, if you
look closely.

Ghost 5: RTT Tor [BOOTLEG_RTT_TOR]
Cause of Death: Sender cheating; Small max window size

Buried in the Defenestrator paper is an unnamed system that uses RTT to
estimate congestion and update windows accordingly, much like LEDBAT. It
is described in Section 4.1 of
https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf

I'm going to call this system RTT Tor. RTT Tor works by tracking the min
and max RTT observed on a circuit, using the SENDME cell counting side
effect. It then picks a threshold timeout T, at a tuneable point in
between the min and max observed RTT. If the most recently measured RTT
exceeds T, the circuit is declared congested. Unlike LEDBAT, because RTT
Tor dynamically chooses T rather than using a hardcoded target RTT, it
is possible to use on Tor.

The window starts at 100 cells. So long as the RTT stays below T, the
window grows by an additional 100 cells every SENDME. If the RTT exceeds
T, the window is cut in half.

Window size max is capped at 1000 cells. The window cannot go below 100
cells.

This system depends upon the Exit node's ability to measure RTT to the
client in order to have a downstream window. This capability has been
shown to impact client anonymity:
http://people.cs.ksu.edu/~eyv/papers/latency_leak-ccs07.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf

Furthermore, endpoints can cheat by choosing a higher T value or
otherwise acting as if their window size is increasing when it is not.
However, while the paper does not describe a solution for such cheating,
it is possible to detect if one endpoint is honest. Because both
endpoints can measure the RTT, both endpoints can track what their
peer's window should be growing to based on this algorithm, and compare
that to an estimate of their peer's window size. The peer's window size
can be estimated by counting the number of cells that arrive in RTT/2
amount of time.

If the RTT is asymmetric or highly variable, this detection mechanism
will have false positives, but we can at least reduce the ability and
incentive to cheat so long as one of the endpoints is honest.

But if both endpoints cheat, intermediate routers have no way of
limiting the data that is in flight or can queue up, except for closing
the circuit through the circuit OOM killer.

Hence the system still must impose a safe cell window max. It must also
impose this max size because some circuits may have an RTT_min that is
not much different from the RTT_max, even though the circuit is very
congested (because it is continually very congested). In this scenario,
the system would keep adding more congestion until the circuit OOM
killer kicks in.

But, as far as options go, this is not a bad one. Plus, it only requires
the client and the Exit to support it.

II. Fresh Remixes and New Ideas

So what if we took some hits from the 90s and remixed their hooks, Tor
style?

Let's start with the simplest, most TCP-like scheme we can come up with,
and then discuss potential variants and improvements to them, and see
which of these ideas mix together well.

Remix 1: Forward ECN [FORWARD_ECN_TOR]

What if we tried to make something as close to TCP ECN as possible for Tor?

Let's use an initial window size of 100 cells, and make our SENDMEs into
100 cell ACKs.

Let's create a RELAY_COMMAND_QUEUE_PRESSURE relay cell command that can
be sent by a relay towards a client whenever the sum total of queued
cells exceeds some limit. Which circuit gets the cell can be chosen by
EWMA or some other QoS algorithm (ie: RED-like weighted random choice,
as per TCP ECN). The cell would indicate the direction that the circuit
was loudest on (ie: client-bound or exit-bound). When the client gets
the cell, if the direction is exit-bound, the client halves its outbound
window. If the direction is client-bound, the client sends the cell back
to the Exit node, who then halves its client-bound SENDME send window.

(Recall that Tor's "relay commands" are encrypted to/from the client,
with Tor's per-hop circuit keys. This means that extra relay cells can
be injected by any hop in the circuit towards the client, and
intermediate relays cannot read these cells' relay commands. As we will
soon see in [BACKWARD_ECN_TOR] and related ideas, using cleartext "cell
commands" instead can help make things more efficient by allowing
signaling of congestion without requiring extra cell injection, but this
also introduces more side channel risk).

We can use Slow Start with AIMD here: Before the first
RELAY_COMMAND_QUEUE_PRESSURE, outbound windows would double every
window. After the first RELAY_COMMAND_QUEUE_PRESSURE, they would
increase by say 100, every window. Or something similar.

To reduce the potential for side channel injection, the client can
ensure that RELAY_COMMAND_QUEUE_PRESSURE do not arrive more often than
once per window. To prevent patterns from trivially being encoded on
these cells, the client can delay relaying them until they are on a K %
M == 0 boundary, for M=5 or 10 or so (high M will reduce responsiveness
to congestion). The client can also refuse to relay these cells if they
would cause the Exit's window to drop below some minimum size (say 128).

Intuition tells me the relays should use total outbound queue size for
deciding *when* to send these cells, and use specific circuit queue
length and/or activity to decide *which* circuit to send them on. In
this way, we globally limit queue size (per hop latency) regardless of
the number of circuits, and still enforce fairness among these circuits.
However, picking parameters and algorithms for all of this is a research
problem, or at least requires much more QoS RFC and research literature
review. There may be additional optimizations too, like sending these
cells on circuits of any TCP connection that starts blocking or
exceeding per-connection queue limits.

While this will work if clients are well-behaved, this system has a few
drawbacks.

First, it will be slow to respond to client-induced download congestion
at middle nodes, since we have to wait a full circuit RTT for the client
to relay cells to the Exit. Since the Exit is responsible for the
download window for web-like clients, this is also the direction that
needs the most congestion control, which is unfortunate.

Second, clients can easily cheat, even by malfunction: all they have to
do is refuse or forget to relay cells to an Exit, and they get faster
downloads at the expense of more network congestion for everyone. We
could do what RTT Tor did and cap the window size at 1000, but then
we're back in the situation where throughput is artificially capped at
some bizarre function of circuit RTT, and worst-case latency will not
improve as much as we want. We could rely more heavily on the circuit
OOM killer, and kill circuits that queue too much on the assumption that
they are cheating/malfunctioning, but tuning this to avoid false
positives may be tricky.

Third, while this system could technically be deployed even if some
nodes do not support it yet, this is risky, as nodes who do not support
it will experience excessive congestion at worst, and no improvement at
best.

Fourth, onion services are tricky to handle. Because of the way
rendezvous circuits are joined, this system will send
RELAY_COMMAND_QUEUE_PRESSURE towards the client for the client-side of
the rend circuit, and towards the service for the service end of the
circuit. Then, either of these sides would echo that signal as a
RELAY_COMMAND_QUEUE_PRESSURE *all* the way to the other side. Either or
both could cheat in this case, and again, the only recourse we have is
for each relay to enforce strict circuit queue length limits via the
circuit OOM killer.

Remix 2: Forward ECN mixed with RTT Tor [FORWARD_ECN_RTT_TOR]
[PROPOSAL_CANDIDATE]

Ok, [FORWARD_ECN_TOR] sounds decent, but because the client mediates
everything, it will be slow to respond to client-destined download
congestion, and clients can cheat. Also, if some relays don't support
the signal yet, the window may become too large for their congested status.

What if we *also* mix in [BOOTLEG_RTT_TOR], so that an endpoint's send
window can only grow if *both* conditions hold: the measured circuit RTT
stays below RTT Tor's T threshold, *and* there are no congestion signals
coming from [FORWARD_ECN_TOR]?

The addition of the signaling cell from [FORWARD_ECN_TOR] also allows us
to be more robust in the cheating detection described in
[BOOTLEG_RTT_TOR], and eliminate false positives. If either endpoint
detects a window that is growing too large for their measured circuit
RTT (by counting the number of cells arriving in that RTT, and comparing
that to what the window should be based on the RTT window growth
algorithm), it can send a warning shot RELAY_COMMAND_QUEUE_PRESSURE,
explicitly telling the cheater to back off. If the window still grows
because the other endpoint ignores this warning shot, it can close the
circuit.

This system is fully incrementally deployable: it can be used if only
the client and Exit support it. It is resilient to cheating so long as
Exits are honest, without false positives, and even without intermediate
relay support.

While we still aren't doing better than 1 RTT response to congestion, we
now have a good answer for cheating that doesn't have any false positive
risk for good actors. Furthermore, because we also have explicit
congestion signaling, we no longer have to worry as much about imposing
a max window size, to protect circuits for which RTT_min is close to
RTT_max due to persistent congestion. Good actors on these circuits will
back off, and for actors acting badly enough to add additional
congestion, RTT_max will increase enough for them to eventually be detected.

Downside: Since onion services have no Exit node, if we want to use the
RTT mechanism to mitigate cheating for them, we must terminate
congestion control at the Rendezvous point for this proposal idea, so
that onion service clients and services can't collude to get faster
service. The RP would then be responsible for examining each half of the
spliced circuit's congestion windows, and sending ECN signals down
whichever side had a larger window. This may or may not be worth the
additional complexity, as opposed to just employing the circuit OOM
killer if circuit queues get too long (due to cheating).

The only remaining downside I see is that we are relying on baking the
ability to make RTT measurements into the protocol.

Still, I think this is worth turning into its own proposal. Does anyone
see any other issues, or have any suggestions?

Remix 3: Backward ECN [BACKWARD_ECN_TOR]

Ok so the [FORWARD_ECN_RTT_TOR] mix is pretty tight. We've got an
incrementally deployable system that is somewhat resistant to cheaters,
and responds to congestion within an RTT. Can we do better? Yes.
Remember that Backward ECN IETF draft that we covered in [TCP_HISTORY]
earlier? It had that RTT/2 response with a hyphy hook. Let's sample that.

First: just like the other proposals, we still need to use SENDMEs, to
ack received packets. However, since we are not measuring RTT here, it
does not have to be exactly every 100 cells. The SENDME can be sent at a
randomized cell count to obscure RTT, with a different randomized ack
cell count value included in the SENDME.

In addition to RELAY_COMMAND_QUEUE_PRESSURE, which is sent to the client
and can't be read by any other intermediate relays, let's make a new
*cell* command type, CELL_COMMAND_RELAY_QUEUE_PRESSURE. Because this is
a cell command, it can be read by intermediate relays, and can be set by
an intermediate relay on any cell heading towards the Exit, simply by
flipping cell_t.command from CELL_COMMAND_RELAY to
CELL_COMMAND_RELAY_QUEUE_PRESSURE, so long as all relays in the circuit
understand this new command. This also avoids sending any additional
empty cells when congestion occurs in the common Web download direction,
which is nice.

We should still respond to congestion with Slow Start AIMD: when the
client or exit gets this relay cell or cell command, it cuts its send
window in half in that direction. Otherwise windows grow during
transmission similar to TCP (ie double the window size each window until
the first congestion cell, then linear).

The downside of these cells being a new end-to-end cell_t.command is
that it opens up side channel attacks we saw with the RELAY_EARLY attack:
https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack

The attack is to flip this cell_t.command field on a sequence of cells
to encode a bitstring, to enable the Exit to send data to the Guard. So
in this mix, let's still use RELAY_COMMAND_QUEUE_PRESSURE towards the
client. We will only use the relay-visible cell_t.command
CELL_COMMAND_RELAY_QUEUE_PRESSURE towards the Exit. This means we only
need to worry about Guard to Exit side channels, which require much more
information to be useful (ie: they must encode client IP address or some
other unique client identifier).

Note that because all intermediate relays can see the cell_t.command,
they can enforce similar properties as clients did in [FORWARD_ECN_TOR],
to limit side channels in the Guard to Exit direction. For instance,
they can ensure that these commands do not get sent more often than once
per window of client-bound incoming cells. They can also enforce that
the Exit-bound cell_t.command = CELL_COMMAND_RELAY_QUEUE_PRESSURE in the
outgoing direction is only set on K % M == 0 cell count boundaries, for
low M=5 or 10. Note that this K %  M spacing actually works better for
longer circuits, since there is more chance for other relays to flip the
congestion bit at each M spacing position, which damages the reliability
of the side channel signal.

When middle relays enforce that the window is not allowed to drop below
win_min=128, a malicious Guard can only inject
log2(win_size)-log2(win_min) of these cells in a burst. Once the middle
node detects that the window size would be below the legal minimum, it
knows either cheating is happening, or a side channel is being used. For
a typical win_size of ~8k (ie: 16X faster throughput than current Tor)
and a win_min=128, this detection will be triggered in ~6 properly
spaced fake CELL_COMMAND_RELAY_QUEUE_PRESSURE commands.

However, in the AIMD steady state for this side channel, a malicious
guard can attempt to keep the window size as close to win_min as
possible, without going below it. Since the window size grows linearly
after the first ECN signal, the guard gets to send an additional
CELL_COMMAND_RELAY_PRESSURE cell_t.command around once every win_min
client-bound incoming cells. So long as no other relay is also sending
congestion control signals, the guard can keep flipping bits roughly
this often (though the K % M == 0 requirement would restrict the
placement of these cells).

Downsides: Even just 6 bits of reliable information can be damaging if
target client anonymity sets are already small, and so long as
client-bound incoming cells keep arriving, there are still more
opportunities to flip the cell_t.commands from the Guard to the Exit.
The big question here is can these 6+ cell_t.command flips be turned
into a large side channel, despite K % M == 0 placement enforcement? And
can this side channel be made reliable? If so (and it seems likely that
this is so), then this whole idea needs revision (see
[EDGE_FORWARD_ECN_TOR] and [START_BACKWARD_ECN_TOR] for possible revisions).

We also need all intermediate relays to upgrade to properly forward the
CELL_COMMAND_RELAY_QUEUE_PRESSURE cell command just like
CELL_COMMAND_RELAY.  Also, clients can still cheat by ignoring the
encrypted RELAY_COMMAND_QUEUE_PRESSURE. On the plus side, clients who
cheat can at best get faster upload speeds. They cannot get faster
download speeds unless the Exits also decide to cheat.

Finally, the story for cheating onion service endpoints is similarly
complicated: Do we have the RP mediate, or fall back to the circuit OOM
killer yet again?

Remix 4: Backward ECN mixed with RTT Tor [BACKWARD_ECN_RTT_TOR]

If we really are worried about upload cheating, we can mix RTT Tor back
in, just like we did for [FORWARD_ECN_RTT_TOR], and use its cheating
detection. An endpoint that detects abnormally large window sizes (based
on arrival cell counts per RTT) could send the
RELAY_COMMAND_QUEUE_PRESSURE warning shot to reduce the window.

So now we have a system that responds in RTT/2 when supported by all
intermediate relays, and falls back to RTT response in the presence of
one cheater endpoint or lack of support by intermediate relays.

Unfortunately, this still has the side channel issues of
[BACKWARD_ECN_TOR], and so I have not tagged it as a proposal candidate.

Remix 5: Backward ECN with Transparent Trap [BACKWARD_ECN_TRAP_TOR]

Ok so the "SENDME ack exactly 100 cells" 2-step beat is getting a little
tired. Can we devise any ways to trap cheaters without having a
predicable RTT measurement baked into the protocol, and still respond to
congestion in RTT/2?

There is a way to defeat the wizard. But there be dragons in this trap
house, and they really want to soul bond using side channels while
everybody is watching. Hence I have not flagged this as a proposal
candidate. Plus, as these mixed metaphors and obscure references
suggest, this remix is very busy and  complicated. Still, let's consider
it because it may be useful to remix later.

Let's take [BACKWARD_ECN_TOR], and instead of using
RELAY_COMMAND_QUEUE_PRESSURE towards the client, let's use
cell_t.command=CELL_COMMAND_RELAY_QUEUE_PRESSURE in both directions.
This allows us to perform the same kinds of side channel and cheating
checks at the middle node in the Exit to Guard direction, and also
avoids sending any extra relay cells during congestion.

Just as in [BACKWARD_ECN_TOR], let's randomize the cell count at which
we send each SENDME, and include a different randomized ack count in the
SENDME, so that each endpoint can add cells back to the window for that
SENDME, but cannot compute RTT. Let's also still use Slow Start with
AIMD (double the window if there are no congestion signals, half the
window on congestion signals, and grow it linear after that).

However, this is still not sufficient to stop Exit to Guard side
channels! The Exit to Guard direction is different than the Guard to
Exit direction: relays can inject arbitrary amounts of full relay cells
for the client in the Guard direction, whereas they could not do so in
the Guard to Exit direction. In Tor (and other mixnets), this property
is called "leaky pipe topology", and is what allows us to send
RELAY_COMMAND_DROP padding cells from the middle relay to the client
(and vice-versa).

This means that the combination of these injected cells plus congestion
control can be used to encode a pattern from Exit to Guard, and subvert
the checks at the middle node, because relays can inject dummy cells
sent towards the client to meet the win_size and M spacing requirements.

The good news is that the client-side circuit padding system only allows
RELAY_COMMAND_DROP from a hop for which the client has negotiated a
padding machine. This means that Exit nodes cannot inject padding for
assistance in any side channel usage, regardless of the congestion
control scheme we choose.

The bad news is that vanilla Tor allows the presence of invalid
protocol-violating cells, and does not close the circuit for these
cases. Best case, it issues a log message. Worst case: it silently drops
them. Thankfully, the vanguards addon is designed such that any cell
that is not specifically marked as valid by Tor counts as a dropped
cell, after which the circuit will be closed. This is a fail-closed
defense. If new code is added for cell processing that forgets to flag
cells as properly processed by calling circuit_read_valid_data(), those
cells count as false positive dropped cells, rather than false
negatives. This impacts reliability, but has the advantage that we won't
shoot ourselves in the foot by adding new code that is too permissive
without noticing that this code simply does not work properly because
circuits get closed when it is exercised.

All this means we will need to port this defense into Tor itself,
though, or dummy cells can be injected by the Exit for side channel
assistance in congestion control schemes or other cases:
https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md#the-bandguards-subsystem

But what about legit padding causing the Guard to think that win_size
and M spacing requirement are violated? Well, so long as any relay with
an active padding machine stores the congestion signals so that it can
delay them such that they are sent on the next K % M == 0 boundary after
padding is applied, these limits can still be enforced, without
revealing the quantity of padding. Additionally, any guard node
congestion caused by the sum of real+padding traffic can be controlled
under this scheme, because the middle relay can observe the guard's
congestion signals and decrease padding approperiately.

Downsides: If preventing *any* side channel bits from flowing from the
Exit to the Guard is important, this scheme likely does not accomplish
that. It does seem that Exit to Guard side channels are far more
dangerous than Guard to Exit, since all the Exit has to say is "hey this
client looked at something I think is interesting", where as the Guard
has to communicate a unique identifier for the anonymity set, so that
concern alone may kill this idea.

This scheme is also complicated, and any implementation errors will
bring back the RELAY_EARLY side channel attack in the Exit to Guard
direction, even if the math might claim otherwise.

Remix 6: Edge-Forward ECN [EDGE_FORWARD_ECN_TOR]
[PROPOSAL_CANDIDATE]

The common problem with both [BACKWARD_ECN_TOR] and
[BACKWARD_ECN_TRAP_TOR] is that they enable side channels by one or both
of the edges of the circuit (Guard to Exit, or Exit to Guard).

Forward ECN mitigates these side channels, because the edge no longer
gets to encode its congestion signal at specific points in a traffic
pattern. This works better than win_size and M position limiting by
itself, and it can be used in combination with client-side window size
checks and M position limiting, too.

If middle nodes could somehow reliably detect the Guard and Exit
endpoints of a circuit, then the middle node(s) could force one or both
endpoints to use Forward ECN [FORWARD_ECN_TOR]. Then the circuit edges
would not be able to communicate with each other by directly encoding
congestion signals in deterministic traffic patterns.

This is easier said than done. In practice, middles can tell that they
are middles because they do not get any authentication cells from the
client, and they are not asked to open any streams. But what we really
need is for middles to be able to tell if their *neighbors* are the
circuit edges, because these edges are the only ones that we actually
need to forbid from using cell_t.command from BACKWARD_ECN_TOR; it is
fine if multiple middles use cell_t.command (for example, in onion
service circuits).

The only way I can think of to reliably detect circuit edges from the
middle position is to forbid either Exits or Guards nodes from being
used as middles, so that middles could know when they were next to one
of these node types, and therefore next to the edge of the circuit. Then
middles could forbid these edges' use of the cleartext
CELL_COMMAND_RELAY_QUEUE_PRESSURE. This only-middles-as-middles
stratification requirement has been proposed before for several other
reasons, so maybe it is not a horrible idea?

This is more tricky for onion service circuits, though. For those, we
could require that sensitive positions (such as HSDIR, RP, and IP) also
be Exit-flagged nodes?

Or maybe there is another clever way to detect just the edges of a
circuit from the middle position?

Even if there is, this proposal still has some cheating risk, as clients
can refuse to relay these edge congestion signals to get marginally
better throughput. Again, RTT measurements can be preserved here as a
backstop against cheating, but for the additional anonymity risk.
Otherwise we have to fall back to the circuit OOM killer.

Remix 7: Start Backward ECN [START_BACKWARD_ECN_TOR]
[PROPOSAL_CANDIDATE]

The Slow Start with AIMD window management for all of these systems
means that we most need the quick RTT/2 response to the congestion early
on in the Slow Start phase, while the window size is growing
exponentially. This is also when the network is most vulnerable to cheaters.

So let's allow only 1 (or maybe 2) CELL_COMMAND_RELAY_QUEUE_PRESSURE
cell_t.commands per circuit in the Guard to Exit direction as per
[BACKWARD_ECN_TOR], and maybe even 1 CELL_COMMAND_RELAY_QUEUE_PRESSURE
cell_t.command in the Exit to Guard direction from
[BACKWARD_ECN_TRAP_TOR]. After these cell_t.command limits are hit,
middle nodes forbid any further use of this cell type on that circuit,
and enforce that all circuit member relays switch to [FORWARD_ECN_TOR]
or [FORWARD_ECN_RTT_TOR] from then on (depending on how concerned we are
about cheating vs disclosing circuit RTT).

This is my all-around favorite of the options. The only downside is that
we have to choose between disclosing RTT, or risking some cheating, but
since the cheating can now only happen after slow start has finished, it
will provide less speed advantages. If clients cheat with the goal of
causing memory exhaustion at relays, the circuit OOM killer can kick in.

Remix 8: Your idea here!

Anyone have any other old lore to resurrect? Any new ideas?

Bonus B-Side Track: Stream Flow Control [PROPOSAL_CANDIDATE]

All of the systems discussed in this post address the problem of
circuit-level congestion control, but do not perform any stream-level
flow control. While we were prototyping N23, Andreas Krey pointed out
that a single blocked stream could consume the entire circuit window
with unread packets, causing the surprising result that other active
streams on the same circuit will also stall:
https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html

He then went on to propose some alternatives: XON/XOFF, bundling stream
window updates in circuit SENDMEs, or just killing any blocked streams
that have too many unread cells:
https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html

XON/XOFF is the simplest option, but we will need heuristics to avoid
excessive XON/XOFF chatter on application streams that frequently block
in normal operation. It also will have RTT/2 delay before it actually
causes the other end to stop sending. If the blocked stream is sending
cells at the full bandwidth of the circuit, this could very well be a
full circuit window worth of cells before the XOFF makes it across.

We can just ack those cells with a circuit-level SENDME even though they
were not delivered to their blocked stream, but we have to be careful to
authenticate what we read in this case, so we do not re-introduce the
unauthenticated SENDME attack:
https://gitweb.torproject.org/torspec.git/tree/proposals/289-authenticated-sendmes.txt

Can we learn anything from the layered flow control in QUIC? It is
excessively complicated:
https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/edit

III. Evaluation

First, does anyone have any strong objections to any of the remixes
tagged with PROPOSAL_CANDIDATE? It will save time if we can decide right
from the start that any are completely unacceptable. Bonus points if we
can mod them back into being acceptable, or mod one of the remixes that
are not tagged with PROPOSAL_CANDIDATE into being acceptable.

We also need some criteria to help us determine how we're going to
compare the remaining ones. Here is a condensed [TRACK_LISTING], for
quick review/comparison and criteria brainstorming:
______________________________________________________________________
|        MIX           | CHEATING  | RESPONSE|     SIDE   |  UPGRADE |
|       TRACK          | DETECTION |   TIME  |   CHANNELS |   PATH?  |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR       | Exit RTT  |100c+RTT |     RTT    |   Exits  |
|FORWARD_ECN_TOR       | Circ OOM  |   RTT   |    None?   | Full Net |
|FORWARD_ECN_RTT_TOR   | Exit RTT  |   RTT   |     RTT    |Exits;Full|
|BACKWARD_ECN_TOR      |Middles;OOM|  RTT/2  | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR  |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR |  Middles  |  RTT/2  |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR  |Middles;OOM| 2*RTT/3 |    None?   | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT|  Low/None? | Full Net |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

What are the things we need to decide if a scheme is acceptable? Here's
a laundry list. Please do add any others you think of:
  - Side channel accounting
  - Interaction with non-data traffic, padding traffic, and leaky-pipes
  - Anonymity analysis as per
    https://www.robgjansen.com/publications/howlow-pets2013.pdf
  - No/minimal cheating
  - No throughput cap
  - Quick responsiveness (at least RTT; RTT/2 is better)
  - Fairness analysis (may be a separate proposal on QoS/EWMA)
  - Onion service deployment
  - Simulation methodology that will reflects live deployment
  - Deployment costs
  - Upgrade path

Must-Read References:

Cultural etymology of Tor's end-of-decade donations campaign aesthetic:
https://en.wikipedia.org/wiki/Vaporwave
https://blog.torproject.org/better-internet-possible-ive-seen-it

Tor congestion latency is proportional to network utilization divided by
spare_capacity:
Section 6.3 of https://www.freehaven.net/anonbib/cache/murdoch-pet2008.pdf

RTT-based Congestion Control for Tor:
Section 4.1 of https://cseweb.ucsd.edu/~savage/papers/PETS11.pdf

TCP ECN: https://tools.ietf.org/html/rfc3168#section-6.1

Backward ECN: https://tools.ietf.org/html/draft-salim-jhsbnns-ecn-00

Stream Flow Control Issues and Ideas:
https://lists.torproject.org/pipermail/tor-dev/2012-November/004138.html
https://lists.torproject.org/pipermail/tor-dev/2012-November/004143.html

Side channels in circ command cells:
https://blog.torproject.org/tor-security-advisory-relay-early-traffic-confirmation-attack

-- 
Mike Perry

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20200130/587314ed/attachment-0001.sig>