This novelette-length mail is meant to provide the background
information necessary to understand choices in congestion control
deployment for Tor. Much of this background information won't be
included in my planned Tor Proposal, which will hopefully be much
shorter, but the choices made there will be informed by this material.
This material has been enhanced through review and conversations with
Toke Høiland-Jørgensen, g burdell, Rob Jansen, and George Kadianakis.
Immense credit goes to Toke in particular, for detailed and thoughtful
review, and input on queue management.
Like my last mail on this topic, this mail comes with a playlist. Fair
warning: it is not exactly easy listening. Still, hopefully it gives you
strength and inspiration in these trying times:
https://people.torproject.org/~mikeperry/borders/everybody-loves-them/TakeA…
Section -I: Recap [RECAP]
https://wizardzines.com/zines/networking
I strongly believe that congestion control will provide the single
largest performance win for the Tor network. Specifically, it will
considerably improve both latency and throughput metrics in the best,
average, and worst case. In addition to performance, congestion control
will also provide the second-largest scalability win on our research
horizon (behind Walking Onions), by making the queue length memory
requirements of Tor relays independent of the number of circuits carried
on that relay.
I also believe we can achieve the vast majority of the benefits by
deploying an RTT-based system that only clients and Exits need to
support, specifically BOOTLEG_RTT_TOR. After this, we might consider
more advanced options based on ECN, but their benefits are far less
clear, and fraught with peril.
This mail assumes that you have read "Options for Congestion Control in
Tor-like Networks":
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html
Even if you have already read the Options mail, I recommend reviewing
the SENDME section and the sections marked with TCP_HISTORY,
BOOTLEG_RTT_TOR, BACKWARD_ECN_TOR, and perhaps also FORWARD_ECN_TOR, as
understanding this mail requires an understanding of those sections. The
summary of the other sections from the Options mail is basically this:
adding ECN signals to Tor seems straight-forward, but is actually very
tricky.
Section @: Introduction and summary [INTRO_SUMMARY]
Deep understanding of the congestion control problem space is essential
because shitty congestion control will not be a huge win, and will
reduce the benefit of other performance improvements. Worse still, it
may introduce side channels and other anonymity attacks.
In particular, our level of congestion latency (and associated relay
queue lengths) will be very sensitive to congestion signal response
time, to our congestion window update rules, and also to our choice of
queue management algorithm (ie: which circuits send cells and/or
congestion signals).
Our average circuit bandwidth on the network will similarly be very
sensitive to the ability of the congestion window to quickly update and
converge on the available Bandwidth-Delay Product (BDP) of its path.
In other words, we need to make good design choices in the following
three different areas:
I. Which congestion signal to use [CONGESTION_SIGNALS]
II. How to update congestion window [CONTROL_ALGORITHMS]
III. What circuits to send the signal on [QUEUE_MANAGEMENT]
The next three sections of this mail correspond to those three areas.
As before, I have tagged the specific sub-sections that will be used in
the Tor Proposal with [PROPOSAL_CANDIDATE]. I have also given tags to
sections in this document. References to things in this document are
enclosed in brackets. References to the Options mail are simply all
caps, without brackets.
The high-level plan is that we use RTT as a primary congestion signal,
as per BOOTLEG_RTT_TOR from the Options mail, and try it out with a few
different congestion algorithms, such as [TCP_VEGAS_TOR] and
[TCP_WESTWOOD_TOR].
It turns out that when RTT is used as a congestion signal, our current
Circuit-EWMA is sufficient to serve as [QUEUE_MANAGEMENT]. It also turns
out that Circuit-EWMA actually should protect us from fairness issues in
the control algorithms, and from cheaters who try to run different
congestion control algorithms. Circuit-EWMA will also deliver RTT
congestion signals more heavily to bulk circuits (because it prioritizes
their cells below light circuits), causing them to back off in favor of
lighter traffic, even as we relay fewer of their cells. Thanks goes to
Toke Høiland-Jørgensen for pointing this out.
The combination of RTT-based congestion signaling, a decent but simple
control algorithm, and Circuit-EWMA will get us most, if not all, of the
benefits we seek, and only requires clients and Exits to upgrade to
use it. Once it is deployed, circuit bandwidth will no longer be capped
at ~500kb/sec by the fixed window sizes of SENDME; queue latency
will fall significantly; memory requirements at relays should plummet;
and bottlenecks in the network should dissipate.
However, we can still improve upon this further, after this initial
deployment. Because the "slow start" phase of TCP can take a long time,
we will also want to play around with various initial window sizes,
window growth rate parameters, and experiment with using a single
BACKWARD_ECN_TOR cell_t.command ECN-style signal. This signal will be
sent on circuits as determined by [CODEL_TOR]. An instance of
[CODEL_TOR] can be added to each TLS connection, after Circuit-EWMA is
applied.
While BACKWARD_ECN_TOR is complex and slow to roll out (all relays must
support this new cell_t.command), this optimization will likely be worth
it because BACKWARD_ECN_TOR can send a signal on a circuit to stop the
exponential growth of slow start in *less than* RTT/2 time, which is
significantly faster than implicitly measuring an RTT-based signal
(which takes RTT plus some measurement lag to act upon). This faster
signal means faster termination of slow start, which means less buffer
bloat.
For stream flow control, on the client side, we should just ack all
stream cells with a circuit-level SENDME even though they were not
delivered to their blocked stream. The local Tor client should keep
queuing them until the application reads them. Thus, a blocked client
application stream will not block other streams, but will cause Tor
client stream-level queues to build up. For exit-side streams, we should
*not* send SENDMEs until the cell has been sent on its exiting upstream
connection. This allows TCP pushback on one stream to block the entire
circuit, but this is desirable to avoid queuing, and it will only happen
in the upload direction.
Section I: Selecting Congestion Signals [CONGESTION_SIGNALS]
The "Options for Congestion Control in Tor-like Networks" mail probably
should have been titled "Options for Congestion Signals in Tor-like
Networks", as its primary focus was on the ways Tor endpoints can
receive a signal about the congestion status of a circuit on bottleneck
relays. In that post, I postponed discussion of queue management, and
deliberately under-specified how to react to a congestion signal (going
no further than saying "use SlowStart and AIMD").
I chose that scoping because the most difficult barrier we have faced in
the past decade of work on this topic is the side channels enabled by a
congestion signal (and information leaks in general).
However, I should have made our constraints more explicit. Our choice of
acceptable congestion signal mechanism is restricted by these design
constraints:
1. No side channels between Guard and Exit, or between circuits
2. Incremental upgrade path
3. Fast response time
4. Cheater prevention
These design constraints restrict our choice of congestion signal, but
they are not our only design constraints in this problem space. When we
get to the later sections on window control update algorithm, and queue
management algorithm, I will explicitly enumerate additional design
constraints there. Hopefully this will provide more clarity than the
uncategorized list of "causes of death" in the Options post.
For your convenience, here's the summary table of all the options for
congestion signal again:
______________________________________________________________________
| MIX | CHEATING | RESPONSE| SIDE | UPGRADE |
| TRACK | DETECTION | TIME | CHANNELS | PATH? |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR | Exit RTT |100c+RTT | RTT | Exits |
|FORWARD_ECN_TOR | Circ OOM | RTT | None? | Full Net |
|FORWARD_ECN_RTT_TOR | Exit RTT | RTT | RTT |Exits;Full|
|BACKWARD_ECN_TOR |Middles;OOM| RTT/2 | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR | Middles | RTT/2 |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR |Middles;OOM| 2*RTT/3 | None? | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT| Low/None? | Full Net |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My short-term favorite is BOOTLEG_RTT_TOR, because it only requires
Exit relay support to use.
Recall that BOOTLEG_RTT_TOR defines an RTT latency threshold as its
congestion signal, and was first described in section 4.1 of the N23
Defenestrator paper. Even though the N23 paper didn't bother to name
this system, RTT use as a congestion signal in this way also resembles
TCP Vegas, as we shall see.
https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf
Note that because Circuit-EWMA will add additional delay to loud
circuits, "cheaters" who use alternate congestion control algorithms
should end up with more delay than those who do not. This means that if
we use RTT as a congestion signal, Circuit-EWMA should provide *more*
RTT congestion signal to cheaters, and cheaters would only be punishing
themselves with more Circuit-EWMA delay. So cheater resilience for an
RTT signal is actually better handled than I previously thought.
Again, thanks goes to Toke Høiland-Jørgensen for pointing this out.
For completeness, we will still explore some examples of alternate
"cheating" control algorithms in Section II, though.
Unfortunately, RTT is itself a side channel, as per:
https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf
There is also literature on shaping circuit bandwidth to create a side
channel. This can be done regardless of the use of congestion control,
and is not an argument against using congestion control. In fact, the
Backlit defense may be an argument in favor of endpoints monitoring
circuit bandwidth and latency more closely, as a defense:
https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf
Recall that BACKWARD_ECN_TOR used circuit-level cell_t.command to signal
congestion. This allows all relays in the path to signal congestion in
under RTT/2 in either direction, and it can be flipped on existing relay
cells already in transit, without introducing any overhead. However,
because cell_t.command is visible and malleable to all relays, it can
also be used as a side channel. So we must limit its use to a couple of
cells per circuit, at most.
Recall that FORWARD_ECN_TOR adds encrypted end-to-end relay cells that
explicitly signal congestion to either the client or the Exit, but
requires the client to mediate and echo this cell to protect against
side channels. Because congestion signals must be relayed off the
client, it still takes the Exit a full RTT to learn about congestion at
earlier nodes in the circuit. Additionally, because this cell itself
must travel the full path in each direction, it is further delayed by
congestion on the path, and it also adds yet more congestion to
congested paths. Congestion in the Exit-to-client direction is the most
common, as this is the download direction. It will be particularly
expensive and slow to bounce the congestion signal off the client to the
Exit.
For this reason, I think all forms of FORWARD_ECN_TOR are not likely to
be as efficient as RTT signaling, despite the FORWARD_ECN_RTT_TOR
variant initially being flagged as a PROPOSAL_CANDIDATE in the Options
mail. However, we should still keep it in mind, since RTT may turn out
to be an unreliable signal by itself.
Therefore, since RTT is already measurable with our current SENDME flow
control, I propose that we use it via BOOTLEG_RTT_TOR (or TCP Vegas),
and then add START_BACKWARD_ECN_TOR exactly once (to exit slow
start). If RTT proves unreliable, then we can experiment with the more
heavyweight FORWARD_ECN_TOR full-cell signal.
As we shall see in the following control algorithm section, RTT-based
systems like TCP Vegas largely avoid the need to send any congestion
signal (drop or ECN). In other words: if we can rely on RTT, we should
not need any explicit signals at all (though they would improve
responsiveness).
Section II: Congestion Window Control Algorithms [CONTROL_ALGORITHMS]
Recall that the congestion window is the number of packets TCP will
allow to be sent on a connection without any acknowledgment. The
congestion window is meant to match the bandwidth-delay product of the
connection as close as possible. The bandwidth-delay product can be
thought of as the total amount of data that can be in-transit on a link
at one time without using any queue memory (bytes per second of
bandwidth multiplied by seconds of transit time delay = total bytes in
transit on the wire at one time).
On the Internet, if the congestion window exceeds the bandwidth-delay
product of a transmission path, then packets must build up in router
queues, and eventually get dropped. If the congestion window is below
the bandwidth-delay product of the path, then all routers on the path
spend time unused and idle, and thus utilization suffers, even if users
are trying to use full bandwidth.
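To make the bandwidth-delay product concrete, here is a small arithmetic sketch in terms of Tor's 512-byte cells. The link speed and RTT figures are illustrative assumptions, not measurements:

```python
CELL_SIZE = 512   # bytes per fixed-size Tor cell

def bdp_in_cells(bandwidth_bps, rtt_sec):
    """Cells that can be in flight filling the pipe without queuing."""
    bdp_bytes = (bandwidth_bps / 8) * rtt_sec
    return int(bdp_bytes / CELL_SIZE)

# e.g. a 10 Mbit/s bottleneck with a 300 ms circuit RTT:
# 1,250,000 bytes/s * 0.3 s = 375,000 bytes = 732 whole cells
print(bdp_in_cells(10_000_000, 0.3))
```

A congestion window above this cell count forces queuing at the bottleneck relay; below it, the circuit underutilizes the path.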
TCP variants basically compete on their ability to collectively track
the bandwidth-delay product of a path under a wide variety of network
conditions and other competing traffic.
In the interest of brevity, my "Options for Congestion Control in
Tor-like Networks" mail suggested that we update the congestion window
according to Slow Start with AIMD, in response to a congestion signal
(or lack thereof).
However, that is not the only option, and if we retain the ability to
measure RTT, it may not even be the most efficient one. So to help us
make this decision, I did a review of historical TCP and TCP-like
systems in the research literature, and will present the highlights
here.
But first, let's define the control requirements that we want to satisfy:
1. Full bandwidth utilization of slowest TLS link in every circuit
2. Queue length minimization at the relay sending on this TLS link
3. Fairness between circuits (esp partially overlapping ones)
4. Rapid convergence to full circuit capacity
These requirements mirror the requirements used in TCP evaluation. TCP
implementations compete over metrics in each of these areas, and
algorithm design is far more art than formal proof of optimality or
even correctness.
That does not mean that formalism is impossible, however. For a great
formal breakdown of this design space without all of TCP's baggage, see
Section 3 of the XCP paper. Note that we can't use XCP as-is because its
congestion signal fields provide way too much information for use in
potential side-channels, but it is worth a read because it clarifies the
control theory behind congestion control:
http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf
In terms of how this design space is approached in TCP: utilization and
queue minimization are typically handled simultaneously - sometimes
explicitly and sometimes implicitly. Similarly, fairness is sometimes
explicitly designed for, but more often fairness is an emergent property
that is analyzed when competing against a benchmark (historically: TCP
Reno). Fairness abuse is also an area where cheating can occur that
enables cheaters to get more than their fair share of bandwidth, and
such cheating can be very hard to detect and prevent on the Internet.
Interestingly, many of the design goals of TCP are *also* improved via
queue management algorithms at routers (Section III of this mail). In
fact, our Circuit-EWMA circuit scheduler will effectively hand out RTT
congestion signals (delay) to loud circuits before quiet ones, and will
*also* enforce fairness and impede cheating, by delaying a cheater's
cells even more.
In this sense, we do not need to worry quite so much about fairness and
cheating as security properties, but a lack of fairness in the TCP
algorithm will *still* increase memory use in relays to queue these
unfair/loud circuits. So we should still be mindful of these properties
in selecting our congestion algorithm, to minimize relay memory use, if
nothing else.
For a good summary of several TCP congestion control algorithms, see:
http://intronetworks.cs.luc.edu/1/html/reno.html ("Slow Start+AIMD")
http://intronetworks.cs.luc.edu/1/html/newtcps.html (modern TCPs)
As you can see, the TCP design space is huge.
A good way to break down the TCP space is to separate the drop-based
algorithms from the delay-based algorithms. TCP drop-based algorithms
require either packet drops or ECN packet marking to signal congestion.
Delay based algorithms infer congestion from RTT measurements only.
But as we shall see, this distinction is artificial for us.
BOOTLEG_RTT_TOR effectively converts delay *into* an explicit congestion
signal, as far as congestion window control algorithms are concerned.
Similarly, either of our ECN signals can be used in place of the drop
signal in the algorithms below.
I'm going to review four options that strike me as most relevant to our
needs. Two are drop-based: TCP Reno and TCP CUBIC. Two are delay-based:
TCP Vegas and TCP Westwood.
It should be noted that there are also hybrid algorithms that use
drops/ECN in addition to delay as congestion signals, but since we are
focusing primarily on RTT as our signal, I won't be covering them. For
a good overview of those options, see:
https://arxiv.org/pdf/1903.03852.pdf
Toke Høiland-Jørgensen suggested that we also investigate BBR, which is
a hybrid algorithm currently being evaluated by Dropbox and Google.
However, BBR involves packet send time scheduling, which will require
replacement of Circuit-EWMA and other extensive changes. Furthermore,
BBR is still under active development, experimentation, and tuning. In
fact, while v2 has been tested by Dropbox, no IETF draft specification
exists yet. Still, for completeness:
https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00
https://queue.acm.org/detail.cfm?id=3022184
https://datatracker.ietf.org/doc/slides-106-iccrg-update-on-bbrv2/
https://dropbox.tech/infrastructure/evaluating-bbrv2-on-the-dropbox-edge-ne…
I am also going to simplify and omit treatment of packet loss and
duplicate acks, since we will not be dropping any packets. Additionally,
I have normalized all formulas below to be in terms of Tor cells, rather
than bytes.
The hope is that these simplifications help us get to the essence of
what these algorithms would look like when deployed on Tor, without
getting distracted by irrelevant drop and ack synchronization management
details from TCP/IP.
However, as we move closer to specifying an actual design proposal, we
should double-check my simplifications below, for the algorithm(s) that
we actually specify.
Section II.A: TCP Reno [TCP_RENO_TOR]
https://tools.ietf.org/html/rfc5681
http://intronetworks.cs.luc.edu/1/html/reno.html
Until recently, TCP Reno was the most popular TCP on the Internet, and
it is still used as a fairness benchmark. While we won't be using it
directly, it is thus important to understand.
Like all TCPs, TCP Reno satisfies its utilization and queue minimization
design goals by trying to make its congestion window closely match the
bandwidth-delay product of the link. More importantly, when multiple TCP
connections are present competing for the bandwidth of a link, TCP Reno
fairly distributes the bandwidth-delay product of the link equally among
all competing TCP Reno connections.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.cse.wustl.edu/~jain/papers/ftp/cong_av.pdf
TCP Reno probes the bandwidth-delay product by increasing its congestion
window until it overflows the queue capacity of the slowest router
on a path, causing a packet drop or ECN signal, upon which TCP Reno
reduces its congestion window.
The initial phase of TCP Reno is called "Slow Start". For our purposes,
Slow Start in TCP Reno means that every SENDME ack, the congestion
window cwnd becomes:
cwnd = cwnd + SENDME_RATE (recall Tor's SENDME_RATE is 100 cells)
Because increasing the congestion window by 100 every SENDME allows the
sender to send 100 *more* 512 byte cells every 100 cells, this is an
exponential function that causes cwnd to double every cwnd cells. TCP
Reno keeps on doubling until it experiences a congestion signal.
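As a hedged sketch of this doubling behavior, using the per-SENDME update above (the starting window value is illustrative):

```python
SENDME_RATE = 100   # one SENDME ack arrives per 100 cells

def slow_start_acks(cwnd, num_acks):
    """Grow the window by SENDME_RATE for each SENDME ack received."""
    for _ in range(num_acks):
        cwnd += SENDME_RATE
    return cwnd

# With cwnd = 1000, a full window in flight generates 10 SENDME acks,
# and processing them doubles the window:
print(slow_start_acks(1000, 10))   # 2000
```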
Once a congestion signal is experienced, Slow Start is exited, and the
Additive-Increase-Multiplicative-Decrease (AIMD) steady-state phase
begins. AIMD algorithms can be generalized with two parameters: the add
quantity, and the multiply quantity. This is written as AIMD(a,m). TCP
Reno uses AIMD(1, 0.5). It should be noted that most later TCPs include
an AIMD component of some kind, but tend to use 0.7 for m instead of
0.5. This allows them to recover faster from congestion.
After slow start exits, in steady-state, after every ack (SENDME)
response without a congestion signal, the window is updated as:
cwnd = cwnd + SENDME_RATE/cwnd
This comes out to increasing cwnd by 1, every time cwnd cells are
successfully sent without a congestion signal occurring. Thus this is
additive linear growth, not exponential growth.
If there was a congestion signal, cwnd is updated as:
cwnd = cwnd * m (m=0.5 or 0.7)
This is called "Fast Recovery". If you dig into the RFCs, actual TCP
Reno has some additional details for responding to multiple packet
losses that in some cases can fall back into slow-start. However, for
our purposes, the congestion window should only be updated once per cwnd
cells, in either the congestion case, or the non-congestion case.
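The Reno-style rules above can be collected into a single per-SENDME update function. This is a simplified illustration of the rules as stated in this mail, not Tor's actual implementation:

```python
SENDME_RATE = 100   # cells per SENDME ack

def reno_update(cwnd, congested, m=0.5):
    """Apply one SENDME ack's worth of Reno-style window update."""
    if congested:
        return cwnd * m                  # multiplicative decrease
    return cwnd + SENDME_RATE / cwnd     # +1 per cwnd cells acked

cwnd = 1000.0
cwnd = reno_update(cwnd, congested=False)   # additive linear growth
cwnd = reno_update(cwnd, congested=True)    # back off by half
print(cwnd)
```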
TCP Reno (and all TCPs) also revert to slow start when the connection
has been idle, in case bottleneck capacity has changed in this time. We
may or may not want to do this, as slow start has a dramatic impact on
page load times.
TCP Reno is still used as the benchmark to determine fairness. If an
algorithm is "TCP Friendly", this basically means that when competing
with TCP Reno to use spare path capacity, it does not get its ass
kicked, and it does not kick Reno's ass. Several algorithms contain
hacks or special cases to preserve this property. If this fairness
property holds, a bunch of math can prove convergence rates,
utilization, congestion window sizes, and even queue sizes. Modern
queue management also relies on this math to auto-tune queue parameters,
as we will see in Section III.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.icir.org/tfrc/aimd.pdf
Section II.B: TCP Cubic [TCP_CUBIC_TOR]
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-cubic
https://pandorafms.com/blog/tcp-congestion-control/
https://tools.ietf.org/html/rfc8312
TCP Reno unfortunately does not perform optimally for networks where the
natural packet loss rate starts to become frequent, which happens for
both high bandwidth and high latency networks. This shortcoming has led
to a huge explosion of competition among replacements and improvements
for this use case, and TCP Cubic has won out (for now).
TCP Cubic basically takes TCP Reno and tweaks the additive portion of
AIMD behavior to be a cubic function (y=x^3) instead of linearly
additive. The cubic function was chosen because it is concave, then
flat, and then convex. The concave piece allows for quickly recovering
after congestion (or non-congestion-related packet loss), and the convex
piece allows for gently probing for fresh spare link capacity.
Because of adoption by both the Linux kernel and the Windows kernel as
the default TCP, TCP Cubic is now the most popular TCP on the Internet.
Unfortunately, TCP Cubic is complicated. It has multiple magic numbers,
and is defined piecewise depending on how its congestion window would
compare to the equivalent TCP Reno window (to ensure the "TCP Friendly"
property -- without this hack, TCP Cubic is "too polite" and TCP Reno
eats its lunch). It additionally has time-based components that must be
clocked relative to current measured RTT (though it does not use RTT as
a congestion signal).
The cubic congestion window is defined as:
W_max = previous window size before last congestion signal
W_cubic(t) = C*(t-K)^3 + W_max
With magic number constants:
C = 0.4
beta_cubic = 0.7
K = cubic_root(W_max*(1-beta_cubic)/C)
To maintain competitiveness with TCP Reno, Cubic keeps a running
estimate of an equivalent TCP Reno's window size, using the AIMD math
cited for Reno above:
W_est(t) = W_max*beta_cubic +
           [3*(1-beta_cubic)/(1+beta_cubic)] * (t/RTT)

if W_est(t) > W_cubic(t):
    cwnd = W_est(t)
else:
    cwnd = cwnd + (W_cubic(t+RTT) - cwnd)/cwnd
Upon congestion signal, Cubic uses its beta_cubic parameter (0.7), and
thus does not back off quite as far as Reno:
W_max = cwnd
cwnd = cwnd * beta_cubic
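The cubic window function above can be sketched directly. This is a minimal illustration of the RFC 8312 formula with the constants given, ignoring the Reno-friendliness comparison and per-ack update logic:

```python
# Magic-number constants from RFC 8312, as given above:
C = 0.4
BETA_CUBIC = 0.7

def w_cubic(t, w_max):
    """Cubic window t seconds after the last congestion signal."""
    k = ((w_max * (1 - BETA_CUBIC)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

# Right after backoff (t=0) the window sits at beta_cubic * W_max...
print(round(w_cubic(0.0, 1000), 1))   # 700.0
# ...then climbs back (concave), flattens at W_max when t reaches K,
# and finally probes beyond it (convex).
```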
It is pretty easy to convince yourself that Cubic will do at least as
well as TCP Reno (thanks to the AIMD Reno window comparison hack), but
it is very hard to conceptualize the above and convince yourself that it
is the best we could possibly do for our use case. Especially since Tor
is *not* affected by an increasing inherent drop rate as bandwidth
increases, unlike the Internet.
For these reasons, the complexity of TCP Cubic does not strike me as
worthwhile, as compared to other, simpler candidates.
Section II.C: TCP Vegas and [TCP_VEGAS_TOR] [PROPOSAL_CANDIDATE]
http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf (lol ftp, sorry)
Similar to BOOTLEG_RTT_TOR from my Options mail, TCP Vegas was designed
to make use of RTT and bandwidth estimation to avoid queue congestion,
and thus avoid any dropped packets or ECN signals. However, TCP Vegas
actually directly estimates queue length of the bottleneck router, and
aims to minimize the packets in that queue.
To accomplish this, TCP Vegas maintains two RTT measurements:
RTT_current and RTT_min. It also maintains a bandwidth estimate.
The bandwidth estimate is the current congestion window size divided by
the RTT estimate:
BWE = cwnd/RTT_current
(Note: In later TCP Vegas relatives like TCP Westwood, this bandwidth
estimate was improved to use an explicit count of the unacked packets in
transit, instead of cwnd).
Recall that in order to utilize the full capacity of the link, all TCPs
attempt to keep their congestion window as close to the
bandwidth-delay product (ie: the "memory capacity") of the transmission
path as possible.
The extra queue use along a path can thus be estimated by first
estimating the path's bandwidth-delay product:
BDP = BWE*RTT_min
TCP Vegas then estimates the queue use caused by congestion as:
queue_use = cwnd - BDP
= cwnd - cwnd*RTT_min/RTT_current
= cwnd * (1 - RTT_min/RTT_current)
If queue_use drops below a threshold alpha (typically 2-3 packets for
TCP, but perhaps double or triple that for our smaller cells), then the
congestion window is increased. If queue_use exceeds a threshold beta
(typically 4-6 packets, but again we should probably double or triple
this), then the congestion window is decreased. This update happens only
once per cwnd packets:
if queue_use < alpha:
    cwnd = cwnd + 1
elif queue_use > beta:
    cwnd = cwnd - 1
As a backstop, TCP Vegas will still halve its congestion window in the
event that a congestion signal (packet drop or ECN) does occur.
However, such packet drops should not actually even happen, unless Vegas
is fighting with a less fair control algorithm that is causing drops
instead of backing off based on queue delay.
Vegas also uses this queue_use estimate to govern its initial slow start
behavior. During slow start, the Vegas congestion window grows
exponentially, but only half as fast as Reno (to avoid excessive
congestion). For us, this means that cwnd is updated every SENDME ack
by:
if queue_use < gamma:
    cwnd = cwnd + SENDME_RATE/2
This exponential increase continues until the queue_use estimate exceeds
"gamma" (which is determined experimentally and sadly not listed in the
TCP Vegas paper). If I had to guess, gamma is probably close to beta
(4-6 packets).
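The Vegas rules above can be collected into one sketch. The alpha/beta/gamma values here are my illustrative guesses, scaled up for 512-byte cells as suggested above; the Vegas paper does not fix gamma:

```python
SENDME_RATE = 100
ALPHA, BETA, GAMMA = 6, 12, 12   # cells; assumed, not from the paper

def queue_use(cwnd, rtt_min, rtt_current):
    """Estimated cells sitting in the bottleneck queue."""
    return cwnd * (1 - rtt_min / rtt_current)

def vegas_update(cwnd, rtt_min, rtt_current, in_slow_start):
    """One window update per cwnd of acked cells. Returns
    (new_cwnd, still_in_slow_start)."""
    q = queue_use(cwnd, rtt_min, rtt_current)
    if in_slow_start:
        if q < GAMMA:
            return cwnd + SENDME_RATE / 2, True   # keep growing
        return cwnd, False                        # exit slow start
    if q < ALPHA:
        return cwnd + 1, False
    if q > BETA:
        return cwnd - 1, False
    return cwnd, False

# cwnd=500 while RTT grew from 100ms to 110ms: ~45 cells estimated
# queued, which exceeds beta, so back off by one:
cwnd, _ = vegas_update(500, 0.100, 0.110, in_slow_start=False)
print(cwnd)   # 499
```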
Because Vegas tries much harder to avoid congestion than TCP Reno, Vegas
is vulnerable to cheating through fairness abuse. When TCP Vegas
competes against a TCP Reno-style TCP that responds only to packet drops
(or ECN signals in our case) and ignores RTT, TCP Vegas detects
congestion earlier than TCP Reno would, and ends up yielding bandwidth
to TCP Reno. This means that Vegas does *not* have the "TCP Friendly"
property that modern AQM algorithms depend on for their
parameterization.
Note that we should expect the same thing to happen if TCP Vegas
competed with the BOOTLEG_RTT_TOR algorithm run by cheaters. However, since
RTT is being used as the congestion signal in both cases, Circuit-EWMA
should still balance this out, but we should experiment with this to
confirm. There may still be additional queuing memory costs in the
presence of cheaters, compared to either pure [TCP_VEGAS_TOR] or pure
BOOTLEG_RTT_TOR. Even so: this form of cheating is also pretty easy for
honest Exit nodes to detect, and cheating clients only benefit in the
upload direction.
To see this a bit more concretely, if we sub in TCP Reno in
BOOTLEG_RTT_TOR, during slow start we have:
  T = (1-a)*rtt_min + a*rtt_max
  If rtt < T:
    cwnd = cwnd + SENDME_RATE
  else:
    exit slow start
And then for steady state:
  If rtt < T:
    cwnd = cwnd + SENDME_RATE/cwnd
  else:
    cwnd = cwnd * 0.5
The 'a' parameter from BOOTLEG_RTT_TOR is operating similarly to the
queue_len criterion of [TCP_VEGAS_TOR], but the window update rules are
[TCP_RENO_TOR], which are more aggressive than TCP Vegas.
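Putting the threshold and both window update rules together, a minimal
runnable sketch looks like the following (SENDME_RATE and the 'a' value
are illustrative, not specified anywhere above):

```python
# Sketch of BOOTLEG_RTT_TOR with TCP Reno window rules, per the
# pseudocode above. SENDME_RATE and the 'a' value are illustrative.
SENDME_RATE = 100

def rtt_threshold(rtt_min, rtt_max, a=0.25):
    """T = (1-a)*rtt_min + a*rtt_max -- the congestion threshold."""
    return (1 - a) * rtt_min + a * rtt_max

def slow_start_update(cwnd, rtt, threshold):
    """Returns (new_cwnd, still_in_slow_start)."""
    if rtt < threshold:
        return cwnd + SENDME_RATE, True
    return cwnd, False          # rtt >= T: exit slow start

def steady_state_update(cwnd, rtt, threshold):
    """Reno-style AIMD, driven by the RTT threshold instead of drops."""
    if rtt < threshold:
        return cwnd + SENDME_RATE / cwnd
    return cwnd * 0.5
```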
So effectively, this is still [TCP_VEGAS_TOR] with a different target
condition for queue_len. This means we could expect BOOTLEG_RTT_TOR to
drive the congestion to some higher rate of queue use than
[TCP_VEGAS_TOR], and senders who use it may out-compete Vegas much like
TCP Reno out-competes Vegas, until Circuit-EWMA starts punishing them
(at the expense of more queue memory waste).
This also shows us a crucial testing consideration: [TCP_VEGAS_TOR] may
not compete well with our current SENDME congestion control, since its
ability to set queue_target may cause it to de-prioritize itself below
competing SENDME flows. BOOTLEG_RTT_TOR, on the other hand, will allow
us to set its 'a' parameter at whatever point makes it competitive with
SENDME. For this reason, BOOTLEG_RTT_TOR may yield better performance
characteristics until all clients upgrade.
Section II.D: TCP Westwood [TCP_WESTWOOD_TOR] [PROPOSAL_CANDIDATE]
https://tools.ietf.org/html/rfc8290#section-4.1
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood
TCP Westwood is essentially TCP Reno, except that upon a congestion
signal, if the current BDP estimate is greater than half the current
congestion window, TCP Westwood backs off to that estimate instead of
halving the congestion window. In other words, upon a congestion
signal, TCP Westwood does:
  cwnd = max(cwnd * 0.5, BDP_estimate)
This has the effect of recovering from congestion events much quicker
than TCP Reno, and thus utilizing more of the path capacity.
Westwood+ also has some improvements on TCP Vegas's BDP_estimate in
order to smooth it in the presence of delayed and/or dropped acks:
  BWE_k = a*BWE_(k-1) + (1-a)*BWM_k
We may not need this smoothing as much, because we do not have any
dropped acks, and dropped acks were far more damaging to TCP Vegas than
slightly delayed acks.
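A sketch of the Westwood backoff and the Westwood+ smoothing above (the
smoothing weight 'a' here is an assumed illustrative value):

```python
# Sketch of the Westwood backoff and the Westwood+ bandwidth-estimate
# smoothing described above. The smoothing weight 'a' is illustrative.
def westwood_on_congestion(cwnd, bdp_estimate):
    """On a congestion signal, back off to the larger of cwnd/2 and
    the current BDP estimate (instead of always halving)."""
    return max(cwnd * 0.5, bdp_estimate)

def westwood_plus_bwe(prev_bwe, bwm_sample, a=0.9):
    """BWE_k = a*BWE_(k-1) + (1-a)*BWM_k: smooth raw bandwidth samples."""
    return a * prev_bwe + (1 - a) * bwm_sample
```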
Section III: Queue Management Algorithms [QUEUE_MANAGEMENT]
On the Internet, Queue Management Algorithms are used by routers to
decide how long their queues should be, which packets they should drop
and when, and which connections to send ECN signals on and when. Queue
management is almost as large of a research area as congestion control.
Thanks to BitTorrent and video streaming, queue management also grew to
consider fairness or QoS between flows. This is done by separating
queues or imposing a scheduling algorithm on packet ordering. For an
excellent overview of fairness vs queue management, see RFC 7806:
https://tools.ietf.org/html/rfc7806
Since Tor relays cannot drop packets, our design goals are slightly
different than Internet queue management. We care about the following
design goals:
1. Prioritize well-behaved/light circuits over loud/heavy ones
2. Decide when to send ECN signals
3. Decide which circuits to send ECN signals on
4. Detect which circuits are queuing too much (or otherwise cheating)
Recall that Tor has already deployed Circuit-EWMA in order to meet goal
#1. Circuit-EWMA tries to give priority to less loud circuits in the
circuit scheduler, so that interactive applications experience less
latency than bulk transfers.
https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf
Circuit-EWMA can be viewed as a fairness algorithm as per RFC 7806
Section 2, since it schedules flows independently with the target
fairness property of prioritizing quiet circuits' cells over loud ones.
When Circuit-EWMA is used in combination with RTT as a congestion
signal, it is effectively *also* a marking algorithm. This is because it
*also* distributes RTT congestion signals more frequently to loud
circuits:
https://tools.ietf.org/html/rfc7806#section-2
https://tools.ietf.org/html/rfc7806#section-3
Recall that Tor also had to deploy KIST, to move queues from the Linux
kernel into Tor, for Circuit-EWMA to properly prioritize them. See
Sections 4 and 5 of the full-length KIST paper for details on this
interaction, and see page 5 for a good graphical depiction of all the
queuing points in Tor and the Kernel:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
It is likely the case that Circuit-EWMA and KIST are all we need if we
only use RTT signaling. But if we add in ECN, at minimum we will need to
update Circuit-EWMA to inform us which circuit(s) to send ECN signals
on. In fact, thanks to RFC7806's distinction between fairness and
management, modern Internet queue management algorithms compose well
with Circuit-EWMA, allowing us to still use Circuit-EWMA for the
fairness portion, but use an existing Internet algorithm for the ECN
marking. So it is useful to review Internet history here, as well as
current state-of-the-art.
On the Internet, queue management algorithms arose fairly early for one
primary reason: early routers performed what is called "tail dropping".
Tail dropping means that routers allowed their queues to get full, and
then started dropping all packets that arrived while the queue remained
full. This has all kinds of negative consequences, but the easiest to
intuitively understand is that if you wait until your queue is full
before you start dropping packets, and then drop all of the packets that
arrive from then on, competing TCP connections will all back off at the
same time, leading to lulls in network utilization.
The key innovation to address tail dropping was to randomly drop (or ECN
mark) packets early, before the queue was full, with a per-packet drop
probability. This probability was computed based on many factors of the
link, and sometimes based on individual TCP connection tracking.
Until very recently, both queue management and fair queuing algorithms
tended to have many knobs that must be tuned for specific kinds of links
and load levels. The Internet seems to be converging on three main
contenders for "knobless" queue management: PIE, ARED, and CoDel. Here's
some recent review material:
http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_EUNICE_…
https://datatracker.ietf.org/meeting/88/materials/slides-88-iccrg-4/
However, be aware that these knobless systems make assumptions about the
congestion control algorithms and the latency of the connections that
may not apply to us. The assumptions are most clearly specified in
Section 3.2 of the CoDel RFC, and are closely related to the "TCP
Fairness vs TCP Reno" assumption from our congestion control review:
https://tools.ietf.org/html/rfc8289#section-3.2
Ok, so let's review the top "knobless" contenders.
Section III.A: PIE and ARED [ARED_PIE_TOR]
https://tools.ietf.org/html/rfc8033
http://www.icir.org/floyd/papers/adaptiveRed.pdf
As a quick summary, both PIE and ARED manage their queues by dropping
packets probabilistically. In both systems, the per-packet drop
probability is updated in proportion to average queue transit time --
longer queues mean the drop probability is increased.
Both systems decide to drop individual packets from the queue using
this probability. This has the effect that louder flows (with more
packets) are also more likely to experience drops (or ECN congestion
marking). But otherwise they have no explicit notion of fairness.
This makes them marking algorithms as per RFC7806's classification. If
explicit QoS or fairness is desired, they have to be deployed in
combination with a fairness algorithm to prioritize between separate
queues, again as per RFC 7806 Section 3.3:
https://tools.ietf.org/html/rfc7806#section-3.3
Recall that Tor has multiple levels of queues: Circuit, TLS connections,
and kernel buffers, as per Figure 1 on page 5 in the KIST paper:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
Even if we keep Circuit-EWMA for scheduling (and thus fairness), we can
still add in PIE or ARED per each TLS connection. If the average total
queue length for all circuits on a TLS connection grows beyond the
marking threshold for PIE/ARED, we could then begin sending congestion
signals on circuits, as they send cells, as per the cell drop
probability of these algorithms.
However, because we only want to send one BACKWARD_ECN_TOR signal per
circuit, we will have to save these signals for circuits that have not
yet gotten one. It is an open question how we should manage the mark
probability if we have to pass over a circuit because we already sent a
backward ECN signal on it.
Section III.B: CoDel and FQ_CoDel [CODEL_TOR] [PROPOSAL_CANDIDATE]
https://queue.acm.org/detail.cfm?id=2209336
https://tools.ietf.org/html/rfc8289
https://tools.ietf.org/html/rfc8290
CoDel is a relatively recent algorithm that is gaining traction because
of its simple elegance. Like PIE and ARED, CoDel itself is a marking
algorithm, and can be combined with any fairness/QoS algorithm. It also
comes with its own fairness algorithm, called FQ_CoDel.
https://tools.ietf.org/html/rfc7806#section-3.2
Despite being touted as parameterless, CoDel does have two parameters:
inspection_interval <- set above max RTT (but not too high)
min_queue_target <- set to 5% of min RTT
CoDel tags packets with timestamps as soon as they enter the queue.
Every inspection_interval, it examines the queue to see if any packets
have waited longer than min_queue_target. If they have, it begins
dropping packets (sending ECN signals in our case).
Because CoDel is accurately tracking the time *each* packet spends in
the queue, it is able to differentiate between "good" queues (not
congested), vs "bad" queues (congested), on a packet-by-packet basis.
This good vs bad distinction is determined by comparing the packet
transit time to min_queue_target. If a packet exceeds this transit time,
then the queue is bad, and the packet must be dropped or ECN marked.
It is this per-packet distinction that significantly separates CoDel
from PIE and ARED, which both use average queue times. This was the
source of some controversy, as per-packet timestamping was initially
argued to be too expensive. But on modern commodity hardware, this
per-packet timestamping is actually incredibly efficient. It also allows
CoDel to converge on the min_queue_target very rapidly, with high link
utilization.
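To make the per-packet sojourn tracking concrete, here is a minimal
CoDel-flavored sketch. The min_queue_target value is illustrative, and
the full CoDel state machine (interval tracking, drop-rate ramping) is
omitted:

```python
# Minimal sketch of CoDel-style per-cell sojourn tracking: tag each cell
# with its arrival time, then compare its queue time to min_queue_target
# at dequeue. Values are illustrative; the full CoDel state machine
# (interval tracking, drop-rate ramping) is omitted.
from collections import deque

MIN_QUEUE_TARGET_MS = 5   # hypothetical target, e.g. 5% of min RTT

class SojournQueue:
    def __init__(self):
        self.q = deque()

    def enqueue(self, cell, now_ms):
        self.q.append((cell, now_ms))   # timestamp on entry

    def dequeue(self, now_ms):
        """Returns (cell, should_mark): mark (ECN in our case) when the
        cell's sojourn time exceeded the target."""
        cell, arrived = self.q.popleft()
        return cell, (now_ms - arrived) > MIN_QUEUE_TARGET_MS
```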
However, because we can't drop packets, it would seem that sojourn
times in Tor will *not* decrease down to min_queue_target immediately
upon congestion. We will have to wait for the actual backoff signal to
propagate (1 RTT) before the queue size changes due to backoff.
According to Toke Høiland-Jørgensen (one of the FQ_CoDel spec authors),
this should not be that significant, but it should be tested.
FQ_CoDel extends CoDel to provide fairness in a very similar way as
Circuit-EWMA does, by separating out TCP connections into "generated a
queue recently" and "did not generate a queue recently". This allows
packets from flows that are not causing congestion to be sent more
quickly than flows that are. This also has the side effect that when
connections build long FQ queues, their packets spend more time in
queues, and have an increased probability of being ECN marked or
dropped.
https://ieeexplore.ieee.org/document/8469111
Like CoDel, FQ_CoDel relies on the assumption that it can drop a lot of
packets at once to correct queue sizes instantly:
https://tools.ietf.org/html/rfc8290#section-4.1
This makes me think we should not rush to replace Circuit-EWMA with
FQ_CoDel, but we may get benefit from doing CoDel on a per-TLS
connection basis, just as is described for PIE and ARED above. In this
case, it is more clear which circuits to ECN mark. Whenever a cell has a
Circuit-EWMA circuitmux queue transit time longer than the target queue
time threshold (min_queue_target), that circuit gets an ECN mark. If it
has already been ECN marked, no biggie. In other words: the decision to
ECN mark circuits depends only on how long their cells spend in our
queues, and does not require any capacity estimates or other
measurements. Very clean.
So if we decide we want to try ECN, I think Circuit-EWMA with CoDel is a
smart first choice. We would then use CoDel to deliver exactly one
BACKWARD_ECN_TOR signal when it detects a circuit's queues are too long.
This RTT/2 signal should cause the circuit to exit slow start more
quickly than RTT measurement by endpoints will.
Section IV: Next Steps [NEXT_STEPS]
After this extended review, I am convinced that an RTT-based approach
can provide a comprehensive solution to all of our design requirements,
and can be enhanced based on experimentation and further additions. Such
an approach also only requires clients and Exits (or just clients and
onion services) to upgrade to initially deploy.
However, we can still improve on this with a few enhancements. In
particular, we can compare [TCP_VEGAS_TOR] to [TCP_WESTWOOD_TOR] to
BOOTLEG_RTT_TOR, and see how they perform in competition with each
other, and in competition with Tor's currently fixed-size SENDME
windows. We can also implement CoDel in circuitmux to track transit
times and determine which circuits to send a single BACKWARD_ECN_TOR
signal on, to exit slow start on circuits more quickly, and we can play
around with higher initial window sizes. We can do this experimentation
first in Shadow, and also on the live network via consensus parameters.
The immediate next steps here are to create a proper Tor Proposal from
the sections tagged [PROPOSAL_CANDIDATE] above.
--
Mike Perry
Filename: xxx-flashflow.txt
Title: FlashFlow: A Secure Speed Test for Tor (Parent Proposal)
Author: Matthew Traudt, Aaron Johnson, Rob Jansen, Mike Perry
Created: 23 April 2020
Status: Draft
1. Introduction
FlashFlow is a new distributed bandwidth measurement system for Tor that
consists of a single authority node ("coordinator") instructing one or
more measurement nodes ("measurers") when and how to measure Tor relays.
A measurement consists of the following steps:
1. The measurement nodes demonstrate to the target relay that they
have permission to perform measurements.
2. The measurement nodes open many TCP connections to the target relay
and create a one-hop circuit to the target relay on each one.
3. For 30 seconds the measurement nodes send measurement cells to the
target relay and verify that the cells echoed back match the ones
sent. During this time the relay caps the amount of background
traffic it transfers. Background and measurement traffic are
handled separately at the relay. Measurement traffic counts towards
all the standard existing relay statistics.
4. For every second during the measurement, the measurement nodes
report to the authority node how much traffic was echoed back. The
target relay also reports the amount of per-second background
(non-measurement) traffic.
5. The authority node sums the per-second reported throughputs into 30
sums (one for each second) and calculates the median. This is the
estimated capacity of the relay.
FlashFlow performs a measurement of every relay according to a schedule
described later in this document. Periodically it produces relay
capacity estimates in the form of a v3bw file, which is suitable for
direct consumption by a Tor directory authority. Alternatively an
existing load balancing system such as Simple Bandwidth Scanner could be
modified to use FlashFlow's v3bw file as input.
It is envisioned that each directory authority that wants to use
FlashFlow will run their own FlashFlow deployment consisting of a
coordinator that they run and one or more measurers that they trust
(e.g. because they run them themselves), similar to how each runs their
own Torflow/sbws. Section 5.2 of this proposal describes long term plans
involving multiple FlashFlow deployments.
FlashFlow is more performant than Torflow: FlashFlow takes 5 hours to
measure the entire existing Tor network from scratch (with 3 Gbit/s
measurer capacity) while Torflow takes 2 days; FlashFlow measures relays
it hasn't seen recently as soon as it learns about them (i.e. every new
consensus) while Torflow can take a day or more; and FlashFlow
accurately measures new high-capacity relays the first time and every
time while Torflow takes days/weeks to assign them their full fair share
of bandwidth (especially for non-exits). FlashFlow is more secure than
Torflow: FlashFlow allows a relay to inflate its measured capacity by up
to 1.33x (configured by a parameter) while Torflow allows weight
inflation by a factor of 89x [0] or even 177x [1].
After an overview in section 2 of the planned deployment stages,
sections 3, 4, and 5 discuss the short, medium, and long term
deployment plans in more detail.
2. Deployment Stages
FlashFlow's deployment shall be broken up into three stages.
In the short term we will implement a working FlashFlow measurement
system. This requires code changes in little-t tor and an external
FlashFlow codebase. The majority of the implementation work will be
done in the short term, and the product is a complete FlashFlow
measurement system. Remaining pieces (e.g. better authentication) are
added later for enhanced security and network performance.
In the medium term we will begin collecting data with a FlashFlow
deployment. The intermediate results and v3bw files produced will be
made available (semi?) publicly for study.
In the long term experiments will be performed to study ways of using FF
v3bw files to improve load balancing. Two examples: (1) using FF v3bw
files instead of sbws's (and eventually phasing out torflow/sbws), and
(2) continuing to run sbws but use FF's results as a better estimate of
relay capacity than observed bandwidth. Authentication and other
FlashFlow features necessary to make it completely ready for full
production deployment will be worked on during this long term phase.
3. FlashFlow measurement system: Short term
The core measurement mechanics will be implemented in little-t tor, but
a separate codebase for the FlashFlow side of the measurement system
will also be created. This section is divided into three parts: first a
discussion of changes/additions that logically reside entirely within
tor (essentially: relay-side modifications), second a discussion of the
separate FlashFlow code that also requires some amount of tor changes
(essentially: measurer-side and coordinator-side modifications), and
third a security discussion.
3.1 Little-T Tor Components
The primary additions/changes that entirely reside within tor on the
relay side:
- New torrc options/consensus parameters.
- New cell commands.
- Pre-measurement handshaking (with a simplified authentication
scheme).
- Measurement mode, during which the relay will echo traffic with
measurers, set a cap on the amount of background traffic it
transfers, and report the amount of transferred background traffic.
3.1.1 Parameters
FlashFlow will require some consensus parameters/torrc options. Each has
some default value if nothing is specified; the consensus parameter
overrides this default value; the torrc option overrides both.
FFMeasurementsAllowed: A global toggle on whether or not to allow
measurements. Even if all other settings would allow a measurement, if
this is turned off, then no measurement is allowed. Possible values: 0,
1. Default: 0 (disallowed).
FFAllowedCoordinators: The list of coordinator TLS certificate
fingerprints that are allowed to start measurements. Relays check their
torrc when they receive a connection from a FlashFlow coordinator to see
if it's on the list. If they have no list, they check the consensus
parameter. If neither exists, then no FlashFlow deployment is allowed to
measure this relay. Default: empty list.
FFMeasurementPeriod: A relay should expect, on average, to be measured by
each FlashFlow deployment once each measurement period. A relay will not
allow itself to be measured more than twice by a FlashFlow deployment in
any time window of this length. Relays should not change this option
unless they really know what they're doing. Changing it at the relay
will not change how often FlashFlow will attempt to measure the relay.
Possible values are in the range [1 hour, 1 month] inclusive. Default: 1
day.
FFBackgroundTrafficPercent: The maximum amount of regular
non-measurement traffic a relay should handle while being measured, as a
percent of total traffic (measurement + non-measurement). This
parameter is a trade off between having to limit background traffic and
limiting how much a relay can inflate its result by handling no
background traffic but reporting that it has done so. Possible values
are in the range [0, 99] inclusive. Default: 25 (a maximum inflation
factor of 1.33).
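The 1.33 figure follows directly from the percentage: if up to p
percent of reported traffic may be (unverifiable) background, a relay
that fabricates its background report inflates its estimate by at most
1/(1 - p/100):

```python
# Maximum inflation factor implied by FFBackgroundTrafficPercent: if up
# to p percent of total reported traffic may be unverifiable background,
# a relay that fabricates its background report inflates its estimate by
# at most 1/(1 - p/100).
def max_inflation_factor(background_percent):
    return 1 / (1 - background_percent / 100)
```

With the default of 25, this gives 1/(1 - 0.25) = 1.33, matching the
factor stated above.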
FFMaxMeasurementDuration: The maximum amount of time, in seconds, that
is allowed to pass from the moment the relay is notified that a
measurement will begin soon and the end of the measurement. If this
amount of time passes, the relay shall close all measurement connections
and exit its measurement mode. Note this duration includes handshake
time, thus it necessarily is larger than the expected actual measurement
duration. Possible values are in the range [10, 120] inclusive.
Default: 45.
3.1.2 New Cell Types
FlashFlow will introduce a new cell command MEASURE.
The payload of each MEASURE cell consists of:
Measure command [1 byte]
Length [2 bytes]
Data [Length-3 bytes]
The measure commands are:
0 -- MSM_PARAMS [forward]
1 -- MSM_PARAMS_OK [backward]
2 -- MSM_ECHO [forward and backward]
3 -- MSM_BG [backward]
4 -- MSM_ERR [forward and backward]
Forward cells are sent from the measurer/coordinator to the relay.
Backward cells are sent from the relay to the measurer/coordinator.
MSM_PARAMS and MSM_PARAMS_OK are used during the pre-measurement stage
to tell the target what to expect and for the relay to positively
acknowledge the message. MSM_ECHO cells are the measurement traffic;
the measurer generates them, sends them to the target, and the target
echoes them back. The target sends a MSM_BG cell once per second to report
the amount of background traffic it is handling. MSM_ERR cells are used
to signal to the other party that there has been some sort of problem
and that the measurement should be aborted. These measure commands are
described in more detail in the next section.
The only cell that sometimes undergoes cell encryption is MSM_ECHO; no
other cell ever gets cell encrypted. (All cells are transmitted on a
regular TLS-wrapped OR connection; that encryption still exists.)
The relay "decrypts" MSM_ECHO cells before sending them back to the
measurer; this mirrors the way relays decrypt/encrypt RELAY_DATA cells
in order to induce realistic cryptographic CPU load. The measurer
usually skips encrypting MSM_ECHO cells to reduce its own CPU load;
however, to verify the relay is actually correctly decrypting all cells,
the measurer will choose random outgoing cells, encrypt them, remember
the ciphertext, and verify the corresponding incoming cell matches.
3.1.3 Pre-Measurement Handshaking/Starting a Measurement
The coordinator connects to the target relay and sends it a MSM_PARAMS
cell. If the target is unwilling to be measured at this time or if the
coordinator didn't use a TLS certificate that the target trusts, it
responds with an error cell and closes the connection. Otherwise it
checks that the parameters of the measurement are acceptable (e.g. the
version is acceptable, the duration isn't too long, etc.). If the
target is happy, it sends a MSM_PARAMS_OK, otherwise it sends a MSM_ERR
and closes the connection.
Upon learning the IP addresses of the measurers from the coordinator in
the MSM_PARAMS cell, the target whitelists their IPs in its DoS
detection subsystem until the measurement ends (successfully or
otherwise), at which point the whitelist is cleared.
Upon receiving a MSM_PARAMS_OK from the target, the coordinator will
instruct the measurers to open their TCP connections with the target. If
the coordinator or any measurer receives a MSM_ERR, it reports the error
to the coordinator and considers the measurement a failure. It is also a
failure if any measurer is unable to open at least half of its TCP
connections with the target.
The payload of MSM_PARAMS cells [XXX more may need to be added]:
  - version [1 byte]
  - msm_duration [1 byte]
  - num_measurers [1 byte]
  - measurer_info [num_measurers times]
    - ipv4_addr [4 bytes]
    - num_conns [2 bytes]
version dictates how this MSM_PARAMS cell shall be parsed. msm_duration
is the duration, in seconds, that the actual measurement will last.
num_measurers is how many measurer_info structs follow. For each
measurer, the ipv4_addr it will use when connecting to the target is
provided, as is num_conns, the number of TCP connections that measurer
will open with the target. Future versions of FlashFlow and MSM_PARAMS
will use TLS certificates instead of IP addresses.
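As an illustration, a parser for the version-1 layout above might look
like the following. Note the proposal does not specify byte order for
num_conns here, so network byte order (big-endian) is assumed:

```python
# Hypothetical parser for the version-1 MSM_PARAMS payload laid out
# above. The proposal does not specify byte order for num_conns, so
# network byte order (big-endian) is assumed.
import struct

def parse_msm_params(payload: bytes):
    version, msm_duration, num_measurers = struct.unpack_from("!BBB", payload, 0)
    measurers, off = [], 3
    for _ in range(num_measurers):
        # Each measurer_info: ipv4_addr [4 bytes] + num_conns [2 bytes]
        a, b, c, d, num_conns = struct.unpack_from("!4BH", payload, off)
        measurers.append(("%d.%d.%d.%d" % (a, b, c, d), num_conns))
        off += 6
    return version, msm_duration, measurers
```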
MSM_PARAMS_OK has no payload: it's just padding bytes to make the cell
514 bytes long.
The payload of MSM_ECHO cells:
- arbitrary bytes [max to fill up 514 byte cell]
The payload of MSM_BG cells:
- second [1 byte]
- sent_bg_bytes [4 bytes]
- recv_bg_bytes [4 bytes]
second is the number of seconds since the measurement began. MSM_BG
cells are sent once per second from the relay to the FlashFlow
coordinator. The first cell will have this set to 1, and each
subsequent cell will increment it by one. sent_bg_bytes is the number of
background traffic bytes sent in the last second (since the last MSM_BG
cell). recv_bg_bytes is the same but for received bytes.
The payload of MSM_ERR cells:
- err_code [1 byte]
- err_str [possibly zero-len null-terminated string]
The error code is one of:
[... XXX TODO ...]
255 -- OTHER
The error string is optional in all cases. It isn't present if the first
byte of err_str is null, otherwise it is present. It ends at the first
null byte or the end of the cell, whichever comes first.
3.1.4 Measurement Mode
The relay considers the measurement to have started the moment it
receives the first MSM_ECHO cell from any measurer. At this point, the
relay
- Starts a repeating 1s timer on which it will report the amount of
background traffic to the coordinator over the coordinator's
connection.
- Enters "measurement mode" and limits the amount of background
traffic it handles according to the torrc option/consensus
parameter.
The relay decrypts and echoes back all MSM_ECHO cells it receives on
measurement connections until it has reported its amount of background
traffic the same number of times as there are seconds in the measurement
(e.g. 30 per-second reports for a 30 second measurement). After sending
the last MSM_BG cell, the relay drops all buffered MSM_ECHO cells,
closes all measurement connections, and exits measurement mode.
During the measurement the relay targets a ratio of background traffic
to measurement traffic as specified by a consensus parameter/torrc
option. For a given ratio r, if the relay has handled x cells of
measurement traffic recently, Tor then limits itself to y = xr/(1-r)
cells of non-measurement traffic this scheduling round. The target will
enforce that a minimum of 10 Mbit/s of measurement traffic is recorded
since the last background traffic scheduling round to ensure it always
allows some minimum amount of background traffic.
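The ratio rule above works out as follows (a sketch; the cell counts
and r values are illustrative):

```python
# The background-traffic limit above: for target ratio r, having handled
# x measurement cells recently, allow y = x*r/(1-r) background cells in
# this scheduling round. Values are illustrative.
def allowed_background_cells(measurement_cells, r):
    return int(measurement_cells * r / (1 - r))
```

For example, with r = 0.25 and 300 recent measurement cells, 100
background cells are allowed, preserving the 25/75 split.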
3.2 FlashFlow Components
The FF coordinator and measurer code will reside in a FlashFlow
repository separate from little-t tor.
There are three notable parameters for which a FF deployment must choose
values. They are:
- The number of sockets, s, the measurers should open, in aggregate,
with the target relay. We suggest s=160 based on the FF paper.
- The bandwidth multiplier, m. Given an existing capacity estimate for
a relay, z, the coordinator will instruct the measurers to, in
aggregate, send m*z Mbit/s to the target relay. We recommend m=2.25.
- The measurement duration, d. Based on the FF paper, we recommend
d=30 seconds.
The rest of this section first discusses notable functions of the
FlashFlow coordinator, then goes on to discuss FF measurer code that
will require supporting tor code.
3.2.1 FlashFlow Coordinator
The coordinator is responsible for scheduling measurements, aggregating
results, and producing v3bw files. It needs continuous access to new
consensus files, which it can obtain by running an accompanying Tor
process in client mode.
The coordinator has the following functions, which will be described in
this section:
- result aggregation.
- schedule measurements.
- v3bw file generation.
3.2.1.1 Aggregating Results
Every second during a measurement, the measurers send the amount of
verified measurement traffic they have received back from the relay.
Additionally, the relay sends a MSM_BG cell each second to the
coordinator with the amount of non-measurement background traffic it is
sending and receiving.
For each second's reports, the coordinator sums the measurers' reports.
The coordinator takes the minimum of the relay's reported sent and
received background traffic. If, when compared to the measurers' reports
for this second, the relay's claimed background traffic is more than
what's allowed by the background/measurement traffic ratio, then the
coordinator further clamps the relay's report down. The coordinator adds
this final adjusted amount of background traffic to the sum of the
measurers' reports.
Once the coordinator has done the above for each second in the
measurement (e.g. 30 times for a 30 second measurement), the coordinator
takes the median of the 30 per-second throughputs and records it as the
estimated capacity of the target relay.
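The aggregation described above can be sketched as follows (an
illustrative structure, not the actual coordinator code):

```python
# Sketch of the aggregation above: per second, sum the measurers'
# reports, clamp the relay's claimed background traffic to the allowed
# ratio r, then take the median across seconds. Illustrative structure,
# not the actual coordinator code.
import statistics

def estimate_capacity(measurer_reports, bg_reports, r):
    """measurer_reports: per-second lists of measurer throughputs.
    bg_reports: per-second (sent_bg, recv_bg) tuples from the relay."""
    per_second = []
    for reports, (sent_bg, recv_bg) in zip(measurer_reports, bg_reports):
        measured = sum(reports)
        bg = min(sent_bg, recv_bg)
        max_bg = measured * r / (1 - r)        # ratio-implied ceiling
        per_second.append(measured + min(bg, max_bg))
    return statistics.median(per_second)
```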
3.2.1.2 Measurement Schedule
The short term implementation of measurement scheduling will be simpler
than the long term one due to (1) there only being one FlashFlow
deployment, and (2) there being very few relays that support being
measured by FlashFlow. In fact the FF coordinator will maintain a list
of the relays that have updated to support being measured and have opted
in to being measured, and it will only measure them.
The coordinator divides time into a series of 24 hour periods, commonly
referred to as days. Each period has measurement slots that are longer
than a measurement lasts (30s), say 60s, to account for pre- and
post-measurement work. Thus with 60s slots there are 1,440 slots in a
day.
At the start of each day the coordinator considers the list of relays
that have opted in to being measured. From this list of relays, it
repeatedly takes the relay with the largest existing capacity estimate.
It selects a random slot. If the slot has existing relays assigned to
it, the coordinator makes sure there is enough additional measurer
capacity to handle this relay. If so, it assigns this relay to this
slot. If not, it keeps picking new random slots until one has sufficient
additional measurer capacity.
Relays without existing capacity estimates are assumed to have the 75th
percentile capacity of the current network.
If a relay is not online when it's scheduled to be measured, it doesn't
get measured that day.
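The largest-first random slot assignment can be sketched as follows.
The capacity units are illustrative, and the unbounded retry loop is a
simplification; a real scheduler would bound retries and handle the
no-fit case:

```python
# Sketch of the largest-first random slot assignment above. Capacity
# units and the unbounded retry loop are illustrative; a real scheduler
# would bound retries and handle the no-fit case.
import random

def schedule(relay_caps, measurer_capacity, m, num_slots, rng=random):
    slots = [[] for _ in range(num_slots)]
    reserved = [0.0] * num_slots
    for cap in sorted(relay_caps, reverse=True):   # largest first
        need = cap * m                             # measurer capacity required
        while True:
            s = rng.randrange(num_slots)           # pick a random slot
            if reserved[s] + need <= measurer_capacity:
                slots[s].append(cap)
                reserved[s] += need
                break
    return slots
```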
3.2.1.2.1 Example
Assume the FF deployment has 1 Gbit/s of measurer capacity. Assume the
chosen multiplier m=2. Assume there are only 5 slots in a measurement
period.
Consider a set of relays with the following existing capacity estimates
and that have opted in to being measured by FlashFlow.
- 500 Mbit/s
- 300 Mbit/s
- 250 Mbit/s
- 200 Mbit/s
- 100 Mbit/s
- 50 Mbit/s
The coordinator takes the largest relay, 500 Mbit/s, and picks a random
slot for it. It picks slot 3. The coordinator takes the next largest,
300, and randomly picks slot 2. The slots are now:
   0   |   1   |   2   |   3   |   4
-------|-------|-------|-------|-------
       |       |  300  |  500  |
       |       |       |       |
The coordinator takes the next largest, 250, and randomly picks slot 2.
Slot 2 already has 600 Mbit/s of measurer capacity reserved (300*m);
given just 1000 Mbit/s of total measurer capacity, there is just 400
Mbit/s of spare capacity while this relay requires 500 Mbit/s. There is
not enough room in slot 2 for this relay. The coordinator picks a new
random slot, 0.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | | |
The next largest is 200 and the coordinator randomly picks slot 2 again
(wow!). As there is just enough spare capacity, the coordinator assigns
this relay to slot 2.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | 200 | |
The coordinator randomly picks slot 4 for each of the two remaining
relays, largest first.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 | 100
| | 200 | | 50
3.2.1.3 Generating V3BW files
Every hour the FF coordinator produces a v3bw file in which it stores
the latest capacity estimate for every relay it has measured in the last
week. The coordinator will create this file on the host's local file
system. Previously-generated v3bw files will not be deleted by the
coordinator. A symbolic link at a static path will always point to the
latest v3bw file.
$ ls -l
v3bw -> v3bw.2020-03-01-05-00-00
v3bw.2020-03-01-00-00-00
v3bw.2020-03-01-01-00-00
v3bw.2020-03-01-02-00-00
v3bw.2020-03-01-03-00-00
v3bw.2020-03-01-04-00-00
v3bw.2020-03-01-05-00-00
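The hourly publish step above (write a timestamped file, never delete
old ones, keep a `v3bw` symlink pointing at the newest) could be
sketched like this. The function name is illustrative; using a
temporary symlink plus an atomic rename so readers never see a broken
link is an assumption on our part, not something the proposal mandates.

```python
import os
import tempfile
from datetime import datetime, timezone

def publish_v3bw(dirpath, contents):
    """Write a timestamped v3bw file and atomically repoint the 'v3bw'
    symlink at it, leaving previously generated files in place."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    fname = "v3bw." + stamp
    with open(os.path.join(dirpath, fname), "w") as f:
        f.write(contents)
    # Create the new symlink under a temporary name, then rename it over
    # the old one: os.replace() is atomic on POSIX, so the 'v3bw' path
    # always resolves to some complete file.
    tmp = os.path.join(dirpath, ".v3bw.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(fname, tmp)  # relative target within dirpath
    os.replace(tmp, os.path.join(dirpath, "v3bw"))
    return fname
```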
3.2.2 FlashFlow Measurer
The measurers take commands from the coordinator, connect to target
relays with many sockets, send them traffic, and verify the received
traffic is the same as what was sent. Measurers need access to a lot of
internal tor functionality. One strategy is to house as much logic as
possible inside a compile-time-optional control port module that calls
into other parts of tor. Alternatively, FlashFlow could link against tor
and call internal tor functions directly.
[XXX for now I'll assume that an optional little-t tor control port
module housing a lot of this code is the best idea.]
Notable new things that internal tor code will need to do on the
measurer (client) side:
1. Open many TLS+TCP connections to the same relay on purpose.
2. Verify echo cells.
3.2.2.1 Open many connections
FlashFlow prototypes needed to "hack in" a flag in the
open-a-connection-with-this-relay function call chain that indicated
whether or not we wanted to force a new connection to be created. Most
of Tor doesn't care if it reuses an existing connection, but FF does
want to create many different connections. The cleanest way to
accomplish this will be investigated.
On the relay side, these measurer connections do not count towards DoS
detection algorithms.
3.2.2.2 Verify echo cells
A parameter will exist to tell the measurers with what frequency they
shall verify that cells echoed back to them match what was sent. This
parameter does not need to exist outside of the FF deployment (e.g. it
doesn't need to be a consensus parameter).
The parameter instructs the measurers to check 1 out of every N cells.
The measurer keeps a count of how many measurement cells it has sent. It
also logically splits its output stream of cells into buckets of size N.
At the start of each bucket (when num_sent % N == 0), the measurer
chooses a random index in the bucket. Upon sending the cell at that
index (num_sent % N == chosen_index), the measurer records the cell.
The measurer also counts cells that it receives. When it receives a cell
at an index that was recorded, it verifies that the received cell
matches the recorded sent cell. If they match, no special action is
taken. If they don't match, the measurer indicates failure to the
coordinator and target relay and closes all connections, ending the
measurement.
3.2.2.2.1 Example
Consider a bucket_size of 1000. For the moment ignore cell encryption.
We start at idx=0 and pick an idx in [0, 1000) to record, say 640. At
idx=640 we record the cell. At idx=1000 we choose a new idx in [1000,
2000) to record, say 1236. At idx=1236 we record the cell. At idx=2000
we choose a new idx in [2000, 3000). Etc.
There are 2000+ cells in flight and the measurer has recorded two items:
- (640, contents_of_cellA)
- (1236, contents_of_cellB)
Consider the receive side now. It counts the cells it receives. At
receive idx=640, it checks the received cell matches the saved cell from
before. At receive idx=1236, it again checks the received cell matches.
Etc.
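The sampling and verification logic walked through above can be
sketched as follows. This is an illustrative Python sketch under the
assumption that cells arrive back in send order, as the proposal's
index-matching scheme implies; the class and method names are made up.

```python
import random

class EchoVerifier:
    """Record one randomly chosen cell per bucket of N sent cells, and
    check that the echoed cell at the same index matches."""

    def __init__(self, bucket_size, rng=random):
        self.n = bucket_size
        self.rng = rng
        self.sent = 0
        self.recv = 0
        self.pick = None     # absolute index chosen within current bucket
        self.recorded = {}   # absolute send index -> cell contents

    def on_send(self, cell):
        if self.sent % self.n == 0:
            # Start of a new bucket: choose a random index inside it.
            self.pick = self.sent + self.rng.randrange(self.n)
        if self.sent == self.pick:
            self.recorded[self.sent] = cell
        self.sent += 1

    def on_recv(self, cell):
        """Return False if a sampled echoed cell does not match what was
        sent, which should end the measurement."""
        idx = self.recv
        self.recv += 1
        if idx in self.recorded:
            return self.recorded.pop(idx) == cell
        return True  # unsampled cells are not checked
```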
3.2.2.2.2 Motivation
A malicious relay may want to skip decryption of measurement cells to
save CPU cycles and obtain a higher capacity estimate. More generally,
it could generate fake measurement cells locally, ignore the measurement
traffic it is receiving, and flood the measurer with more traffic than
it (the measurer) is even sending.
The security of echo cell verification is discussed in section 3.3.1.
3.3 Security
In this section we discuss the security of various aspects of FlashFlow
and the tor changes it requires.
3.3.1 Echo Cell Verification: Bucket Size
A smaller bucket size means more cells are checked and FF is more likely
to detect a malicious target. It also means more bookkeeping overhead
(CPU/RAM).
An adversary that knows bucket_size and cheats on one item out of every
bucket_size items will have a 1/bucket_size chance of getting caught in
the first bucket. This is the worst case adversary. While cheating on
just a single item per bucket yields very little advantage, cheating on
more items per bucket increases the likelihood the adversary gets
caught. Thus only the worst case is considered here.
In general, the odds the adversary can successfully cheat in a single
bucket are
(bucket_size-1)/bucket_size
Thus the odds the adversary can cheat in X consecutive buckets are
[(bucket_size-1)/bucket_size]^X
In our case, X will be highly varied: Slow relays won't see very many
buckets, but fast relays will. The damage to the network a very slow
relay can do by faking being only slightly faster is limited.
Nonetheless, for now we motivate the selection of bucket_size with a
slow relay:
- Assume a very slow relay of 1 Mbit/s capacity that will cheat 1 cell
in each bucket. Assume a 30 second measurement.
- The relay will handle 1*30 = 30 Mbit of traffic during the
measurement, or 3.75 MB, or 3.75 million bytes.
- Cells are 514 bytes. Approximately (e.g. ignoring TLS) 7300 cells
will be sent/recv over the course of the measurement.
- A bucket_size of 50 results in about 146 buckets over the course of
the 30s measurement.
- Therefore, the odds of the adversary cheating successfully are
(49/50)^(146), or about 5.2%.
This sounds high, but a relay capable of double the bandwidth (2 Mbit/s)
will have (49/50)^(2*146) or 0.2% odds of success, which is quite low.
Wanting a <1% chance that a 10 Mbit/s relay can successfully cheat
results in a bucket size of approximately 125:
- 10*30 = 300 Mbit of traffic during 30s measurement. 37.5 million
bytes.
- 37,500,000 bytes / 514 bytes/cell = ~73,000 cells
- bucket_size of 125 cells means 73,000 / 125 = 584 buckets
- (124/125)^(584) = 0.918% chance of successfully cheating
Slower relays can cheat more easily but the amount of extra weight they
can obtain is insignificant in absolute terms. Faster relays are
essentially unable to cheat.
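The arithmetic above can be reproduced with a short helper. This is a
sketch for checking the numbers, not part of the proposal; as in the
text, TLS overhead is ignored and the worst-case adversary cheats on
exactly one cell per bucket.

```python
def cheat_success_odds(relay_mbit, bucket_size, secs=30, cell_bytes=514):
    """Probability that a relay faking one cell per bucket evades
    detection for an entire measurement."""
    # Total cells exchanged: Mbit/s * seconds -> bytes -> cells.
    cells = relay_mbit * secs * 1_000_000 / 8 / cell_bytes
    buckets = cells / bucket_size
    # Odds of surviving one bucket, raised to the number of buckets.
    return ((bucket_size - 1) / bucket_size) ** buckets
```

This matches the figures in the text: roughly 5.2% for a 1 Mbit/s relay
with bucket_size 50, about 0.2% at 2 Mbit/s, and just under 1% for a
10 Mbit/s relay with bucket_size 125.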
3.3.2 Weight Inflation
Target relays are an active part of the measurement process; they know
they are getting measured. While a relay cannot fake the measurement
traffic, it can trivially stop transferring client background traffic
for the duration of the measurement yet claim it carried some. More
generally, there is no verification of the claimed amount of background
traffic during the measurement. The relay can claim whatever it wants,
but it will not be trusted above the ratio the FlashFlow deployment is
configured to know. This places an easy-to-understand, firm, and (if set
as we suggest) low cap on how much a relay can inflate its measured
capacity.
Consider a background/measurement ratio of 1/4, or 25%. Assume the relay
in question has a hard limit on capacity (e.g. from its NIC) of 100
Mbit/s. The relay is supposed to use up to 25% of its capacity for
background traffic and the remaining 75%+ capacity for measurement
traffic. Instead the relay ceases carrying background traffic, uses all
100 Mbit/s of capacity to handle measurement traffic, and reports ~33
Mbit/s of background traffic (33/133 = ~25%). FlashFlow would trust this
and consider the relay capable of 133 Mbit/s. (If the relay were to
report more than ~33 Mbit/s, FlashFlow limits it to just ~33 Mbit/s.)
With r=25%, FlashFlow only allows 1.33x weight inflation.
Prior work shows that Torflow allows weight inflation by a factor of 89x
[0] or even 177x [1].
The ratio chosen is a trade-off between impact on background traffic and
security: r=50% allows a relay to double its weight but won't impact
client traffic for relays with steady state throughput below 50%, while
r=10% allows a very low inflation factor but will cause throttling of
client traffic at far more relays. We suggest r=25% (and thus
1/(1-0.25)=1.33x inflation) for a reasonable trade-off between
performance and security.
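The inflation bound discussed above follows directly from the ratio: a
relay that reports the maximum believable background traffic b against
its real capacity c must satisfy b/(b+c) = r, so b = c*r/(1-r) and the
trusted total is c/(1-r). A tiny helper makes the trade-off explicit
(function name is ours, for illustration):

```python
def max_inflation(r):
    """Maximum weight-inflation factor for background ratio r.

    A relay that drops all background traffic, uses its full capacity c
    for measurement, and claims c*r/(1-r) of background traffic is
    credited with c + c*r/(1-r) = c/(1-r) total capacity."""
    return 1 / (1 - r)
```

For r=0.25 this gives the 1.33x bound used in the example (100 Mbit/s
real capacity, ~33 Mbit/s claimed background, 133 Mbit/s credited), and
for r=0.5 it gives the 2x doubling mentioned above.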
It may be possible to catch relays performing this attack, especially if
they literally drop all background traffic during the measurement: have
the measurer (or some party on its behalf) create a regular stream
through the relay and measure the throughput on the stream
before/during/after the measurement. This can be explored longer term.
3.3.3 Incomplete Authentication
The short term FlashFlow implementation has the relay set two torrc
options if they would like to allow themselves to be measured: a flag
allowing measurement, and the list of coordinator TLS certificates that
are allowed to start a measurement.
The relay drops MSM_PARAMS cells from coordinators it does not trust,
and immediately closes the connection after that. A FF coordinator
cannot convince a relay to enter measurement mode unless the relay
trusts its TLS certificate.
A trusted coordinator specifies in the MSM_PARAMS cell the IP addresses
of the measurers the relay shall expect to connect to it shortly. The
target adds the measurer IP addresses to a whitelist in the DoS
connection limit system, exempting them from any configured connection
limit. If a measurer is behind a NAT, an adversary behind the same NAT
can DoS the relay's available sockets until the end of the measurement.
The adversary could also pretend to be the measurer. Such an adversary
could induce measurement failures and inaccuracies. (Note: the whitelist
is cleared after the measurement is over.)
4. FlashFlow measurement system: Medium term
The medium term deployment stage begins after FlashFlow has been
implemented and relays are starting to update to a version of Tor that
supports it.
We plan to host a FlashFlow deployment consisting of a FF coordinator
and a single FF measurer on a single 1 Gbit/s machine. Data produced by
this deployment will be made available (semi?) publicly, including both
v3bw files and intermediate results.
Any development changes needed during this time would go through
separate proposals.
5. FlashFlow measurement system: Long term
In the long term, finishing-touch development work will be done,
including adding better authentication and measurement scheduling, and
experiments will be run to determine the best way to integrate FlashFlow
into the Tor ecosystem.
Any development changes needed during this time would go through
separate proposals.
5.1 Authentication to Target Relay
Short term deployment already had FlashFlow coordinators using TLS
certificates when connecting to relays, but in the long term, directory
authorities will vote on the consensus parameter for which coordinators
should be allowed to perform measurements. The voting is done in the
same way they currently vote on recommended tor versions.
FlashFlow measurers will be updated to use TLS certificates when
connecting to relays too. FlashFlow coordinators will update the
contents of MSM_PARAMS cells to contain measurer TLS certificates
instead of IP addresses, and relays will update to expect this change.
5.2 Measurement Scheduling
Short term deployment only has one FF deployment running. Long term this
may no longer be the case because, for example, more than one directory
authority decides to adopt it and they each want to run their own
deployment. FF deployments will need to coordinate between themselves
to not measure the same relay at the same time, and to handle new relays
as they join during the middle of a measurement period (during the day).
The following is quoted from Section 4.3 of the FlashFlow paper.
To measure all relays in the network, the BWAuths periodically
determine the measurement schedule. The schedule determines when and
by whom a relay should be measured. We assume that the BWAuths have
sufficiently synchronized clocks to facilitate coordinating their
schedules. A measurement schedule is created for each measurement
period, the length p of which determines how often a relay is
measured. We use a measurement period of p = 24 hours.
To help avoid active denial-of-service attacks on targeted relays,
the measurement schedule is randomized and known only to the
BWAuths. Before the next measurement period starts, the BWAuths
collectively generate a random seed (e.g. using Tor’s
secure-randomness protocol). Each BWAuth can then locally determine
the shared schedule using pseudorandom bits extracted from that
seed. The algorithm to create the schedule considers each
measurement period to be divided into a sequence of t-second
measurement slots. For each old relay, slots for each BWAuth to
measure it are selected uniformly at random without replacement
from all slots in the period that have sufficient unallocated
measurement capacity to accommodate the measurement. When a new
relay appears, it is measured separately by each BWAuth in the first
slots with sufficient unallocated capacity. Note that this design
ensures that old relays will continue to be measured, with new
relays given secondary priority in the order they arrive.
5.3 Experiments
[XXX todo]
5.4 Other Changes/Investigations/Ideas
- How can FlashFlow data be used in a way that doesn't lead to poor load
balancing given the following items that lead to non-uniform client
behavior:
- Guards that high-traffic HSs choose (for 3 months at a time)
- Guard vs middle flag allocation issues
- New Guard nodes (Guardfraction)
- Exit policies other than default/all
- Directory activity
- Total onion service activity
- Super long-lived circuits
- Add a cell that the target relay sends to the coordinator indicating
its CPU and memory usage, whether it has a shortage of sockets, how
much bandwidth load it has been experiencing lately, etc. Use this
information to lower a relay's weight, but never to increase it.
- If FlashFlow and sbws work together (as opposed to FlashFlow replacing
sbws), consider logic for how much sbws can increase/decrease FF
results
- Coordination of multiple FlashFlow deployments: scheduling of
measurements, seeding schedule with shared random value.
- Other background/measurement traffic ratios. Dynamic? (known slow
relay => more allowed bg traffic?)
- Catching relays inflating their measured capacity by dropping
background traffic.
- What to do about co-located relays. Can they be detected reliably?
Should we just add a torrc option a la MyFamily for co-located relays?
- What is the explanation for dennis.jackson's scary graphs in this [2]
ticket? Was it because of the speed test? Why? Will FlashFlow produce
the same behavior?
6. Citations
[0] F. Thill. Hidden Service Tracking Detection and Bandwidth Cheating
in Tor Anonymity Network. Master’s thesis, Univ. Luxembourg, 2014.
[1] A. Johnson, R. Jansen, N. Hopper, A. Segal, and P. Syverson.
PeerFlow: Secure Load Balancing in Tor. Proceedings on Privacy
Enhancing Technologies (PoPETs), 2017(2), April 2017.
[2] Mike Perry: Graph onionperf and consensus information from Rob's
experiments https://trac.torproject.org/projects/tor/ticket/33076
```
Filename: 320-tap-out-again.md
Title: Removing TAP usage from v2 onion services
Author: Nick Mathewson
Created: 11 May 2020
Status: Open
```
(This proposal is part of the Walking Onions spec project. It updates
proposal 245.)
# Removing TAP from v2 onion services
As we implement walking onions, we're faced with a problem: what to do
with TAP keys? They are bulky and insecure, and not used for anything
besides v2 onion services. Keeping them in SNIPs would consume
bandwidth, and keeping them indirectly would consume complexity. It
would be nicer to remove TAP keys entirely.
But although v2 onion services are obsolescent and their
cryptographic parameters are disturbing, we do not want to drop
support for them as part of the Walking Onions migration. If we did
so, then we would force some users to choose between Walking Onions
and v2 onion services, which we do not want to do.
Instead, we describe here a phased plan to replace TAP in v2 onion
services with ntor. This change improves the forward secrecy of
_some_ of the session keys used with v2 onion services, but does not
improve their authentication, which is strongly tied to truncated
SHA1 hashes of RSA1024 keys.
Implementing this change is more complex than similar changes
elsewhere in the Tor protocol, since we do not want clients or
services to leak whether they have support for this proposal, until
support is widespread enough that revealing it is no longer a
privacy risk.
## Ntor keys, link specifiers, and SNIPs in v2 descriptors.
We define these entries that may appear in v2 onion service
descriptors, once per introduction point.
"identity-ed25519"
"ntor-onion-key"
[at most once each per intro point.]
These values are in the same format as and follow the same
rules as their equivalents in router descriptors.
"link-specifiers"
[at most once per introduction point]
This value is the same as the link specifiers in a v3 onion
service descriptor, and follows the same rules.
Services should not include any of these fields unless a new network
parameter, "hsv2-intro-updated" is set to 1. Clients should not parse
these fields or use them unless "hsv2-use-intro-updated" is set to 1.
We define a new field that can be used for hsv2 descriptors with
walking onions:
"snip"
[at most once]
This value is the same as the snip field introduced to a v3
onion service descriptor by proposal (XXX) and follows the
same rules.
Services should not include this field unless a new network parameter,
"hsv2-intro-snip" is set to 1. Clients should not parse this field or use it
unless the parameter "hsv2-use-intro-snip" is set to 1.
Additionally, relays SHOULD omit the following legacy intro point
parameters when a new network parameter, "hsv2-intro-legacy" is set
to 0: "ip-address", "onion-port", and "onion-key". Clients should
treat them as optional when "hsv2-tolerate-no-legacy" is set to 1.
## INTRODUCE cells, RENDEZVOUS cells, and ntor.
We allow clients to specify the rendezvous point's ntor key in the
INTRODUCE2 cell instead of the TAP key. To do this, the client
simply sets KLEN to 32, and includes the ntor key for the relay.
Clients should only use ntor keys in this way if the network parameter
"hsv2-client-rend-ntor" is set to 1, and if the entry "allow-rend-ntor"
is present in the onion service descriptor.
Services should only advertise "allow-rend-ntor" in this way if the
network parameter "hsv2-service-rend-ntor" is set to 1.
## Migration steps
First, we implement all of the above, but set it to be disabled by
default. We use torrc fields to selectively enable them for testing
purposes, to make sure they work.
Once all non-LTS versions of Tor without support for this proposal are
obsolete, we can safely enable "hsv2-client-rend-ntor",
"hsv2-service-rend-ntor", "hsv2-intro-updated", and
"hsv2-use-intro-updated".
Once all non-LTS versions of Tor without support for walking onions
are obsolete, we can safely enable "hsv2-intro-snip",
"hsv2-use-intro-snip", and "hsv2-tolerate-no-legacy".
Once all non-LTS versions of Tor without support for _both_ of the
above implementations are finally obsolete, we can finally set
"hsv2-intro-legacy" to 0.
## Future work
There is a final TAP-like protocol used for v2 hidden services: the
client uses RSA1024 and DH1024 to send information about the
rendezvous point and to start negotiating the session key to be used
for end-to-end encryption.
In theory we could get a benefit to forward secrecy by using ntor
instead of TAP here, but we would get no corresponding benefit for
authentication, since authentication is still ultimately tied to
HSv2's scary RSA1024-plus-truncated-SHA1 combination.
Given that, it might be just as good to allow the client to insert
a curve25519 key in place of their DH1024 key, and use that for the
DH handshake instead. That would be a separate proposal, though:
this proposal is enough to allow all relays to drop TAP support.
```
Filename: 322-dirport-linkspec.md
Title: Extending link specifiers to include the directory port
Author: Nick Mathewson
Created: 27 May 2020
Status: Open
```
## Motivation
Directory ports remain the only way to contact a (non-bridge) Tor
relay that isn't expressible as a Link Specifier. We haven't
specified a link specifier of this kind so far, since it isn't a way
to contact a relay to create a channel.
But authorities still expose directory ports, and encourage relays
to use them preferentially for uploading and downloading. And with
Walking Onions, it would be convenient to try to make every kind of
"address" a link specifier -- we'd like authorities to be able
to specify a list of link specifiers that can be used to contact
them for uploads and downloads.
> It is possible that after revision, Walking Onions won't need a way
> to specify this information. If so, this proposal should be moved
> to "Reserve" status as generally unuseful.
## Proposal
We reserve a new link specifier type "dir-url", for use with the
directory system. This is a variable-length link specifier, containing
a URL prefix. The only currently supported URL schema is "http://".
Implementations SHOULD ignore unrecognized schemas. IPv4 and IPv6
addresses MAY be used directly; hostnames are also allowed.
Implementations MAY ignore hostnames and only use raw addresses.
The URL prefix includes everything through the string "tor" in the
directory hierarchy.
A dir-url link specifier SHOULD NOT appear in an EXTEND cell;
implementations SHOULD reject them if they do appear.
```
Filename: 321-happy-families.md
Title: Better performance and usability for the MyFamily option (v2)
Author: Nick Mathewson
Created: 27 May 2020
Status: Open
```
## Problem statement.
The current family mechanism allows well-behaved relays to
identify that they all belong to the same 'family', and should
not be used in the same circuits.
Right now, families work by having every family member list every
other family member in its server descriptor. This winds up using
O(n^2) space in microdescriptors and server descriptors. (For RAM,
we can de-duplicate families which sometimes helps.) Adding or
removing a server from the family requires all the other servers to
change their torrc settings.
This growth in size is not just a theoretical problem. Family
declarations currently make up a little over 55% of the
microdescriptors in the directory--around 24% after compression.
The largest family has around 270 members. With Walking Onions, 270
members times a 160-bit hashed identifier leads to over 5 kilobytes
per SNIP, which is much greater than we'd want to use.
This is an updated version of proposal 242. It differs by clarifying
requirements and providing a more detailed migration plan.
## Design overview.
In this design, every family has a master ed25519 "family key". A node
is in the family iff its server descriptor includes a certificate of its
ed25519 identity key with the family key. The certificate format is
the one in the tor-certs.txt spec; we would allocate a new
certificate type for this usage. These certificates would need to
include the signing key in the appropriate extension.
Note that because server descriptors are signed with the node's
ed25519 signing key, this creates a bidirectional relationship
between the two keys, so that nodes can't be put in families without
their consent.
## Changes to router descriptors
We add a new entry to server descriptors:
"family-cert" NL
"-----BEGIN FAMILY CERT-----" NL
cert
"-----END FAMILY CERT-----".
This entry contains a base64-encoded certificate as described
above. It may appear any number of times; authorities MAY reject
descriptors that include it more than three times.
## Changes to microdescriptors
We add a new entry to microdescriptors: `family-keys`.
This line contains one or more space-separated strings describing
families to which the node belongs. These strings MUST be sorted in
lexicographic order. Clients MUST NOT depend on any particular property
of these strings.
## Changes to voting algorithm
We allocate a new consensus method number for voting on these keys.
When generating microdescriptors using a suitable consensus method,
the authorities include a "family-keys" line if the underlying
server descriptor contains any valid family-cert lines. For each
valid family-cert in the server descriptor, they add a
base-64-encoded string of that family-cert's signing key.
> See also "deriving family lines from family-keys?" below for an
> interesting but more difficult extension mechanism that I would
> not recommend.
## Relay configuration
There are several ways that we could configure relays to let them
include family certificates in their descriptors.
The easiest would be putting the private family key on each relay,
so that the relays could generate their own certificates. This is
easy to configure, but slightly risky: if the private key is
compromised on any relay, anybody can claim membership in the
family. That isn't so very bad, however -- all the relays would
need to do in this event would be to move to a new private family
key.
A more orthodox method would be to keep the private key somewhere
offline, and use it to generate a certificate for each relay in
the family as needed. These certificates should be made with
long-enough lifetimes, and relays should warn when they are going to
expire soon.
## Changes to relay behavior
Each relay should track which other relays they have seen using the
same family-key as itself. When generating a router descriptor,
each relay should list all of these relays on the legacy 'family'
line. This keeps the "family" lines up-to-date with "family-keys"
lines for compliant relays.
Relays should continue listing relays in their family lines if they
have seen a relay with that identity using the same family-key at
any time in the last 7 days.
The presence of this line should be configured by a network
parameter, `derive-family-line`.
Relays whose family lines do not stay at least mostly in sync with
their family keys should be marked invalid by the authorities.
## Client behavior
Clients should treat node A and node B as belonging to the same
family if ANY of these is true:
* The client has descriptors for A and B, and A's descriptor lists B
in its family line, and B's descriptor lists A in its family line.
* The client has descriptors for A and B, and they both contain the
same entry in their family-keys or family-cert fields.
## Migration
For some time, existing relays and clients will not support family
certificates. Because of this, we try above to make sure that
well-behaved relays will list the same entries in both places.
Once enough clients have migrated to using family
certificates, authorities SHOULD disable `derive-family-line`.
## Security
Listing families remains as voluntary in this design as in today's
Tor, though bad-relay hunters can continue to look for families that
have not adopted a family key.
A hostile relay family could list a "family" line that did not match
its "family-certs" values. However, the only reason to do so would
be in order to launch a client partitioning attack, which is
probably less valuable than the kinds of attacks that they could run
by simply not listing families at all.
## Appendix: deriving family lines from family-keys?
As an alternative, we might declare that _authorities_ should keep
family lines in sync with family-certs. Here is a design sketch of
how we might do that, but I don't think it's actually a good idea,
since it would require major changes to the data flow of the
voting system.
In this design, authorities would include a "family-keys" line in
each router section in their votes corresponding to a relay with any
family-cert. When generating final microdescriptors using this
method, the authorities would use these lines to add entries to the
microdescriptors' family lines:
1. For every relay appearing in a routerstatus's family-keys, the
authorities calculate a consensus family-keys value by including
all those keys that are listed by a majority of the voters listing
the same router with the same descriptor. (This is the algorithm we
use for voting on other values derived from the descriptor.)
2. The authorities then compute a set of "expanded families": one
for each family key. Each "expanded family" is a set containing
every router in the consensus associated with that key in its consensus
family-keys value.
3. The authorities discard all "expanded families" of size 1 or
smaller.
4. Every router listed for the "expanded family" has every other
router added to the "family" line in its microdescriptor. (The
"family" line is then re-canonicalized according to the rules of
proposal 298.)
5. Note that the final microdescriptor consensus will include the
digest of the derived microdescriptor in step 4, rather than the
digest of the microdescriptor listed in the original votes. (This
calculation is deterministic.)
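Steps 2-4 of the sketch above can be expressed compactly. This is our
own illustrative Python, not the proposal's code; it assumes the input
is a mapping from router identity to its consensus family-keys set,
with step 1 (majority voting) already done.

```python
from collections import defaultdict

def derive_family_lines(family_keys_by_router):
    """Group routers by family key, discard singleton families, and
    give each router the union of its co-members (steps 2-4)."""
    members = defaultdict(set)
    for router, keys in family_keys_by_router.items():
        for k in keys:
            members[k].add(router)
    # Step 3: drop "expanded families" of size 1 or smaller.
    families = [m for m in members.values() if len(m) > 1]
    # Step 4: each router's family line lists every co-member.
    lines = defaultdict(set)
    for fam in families:
        for r in fam:
            lines[r] |= fam - {r}
    return {r: sorted(others) for r, others in lines.items()}
```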
The problem with this approach is that authorities would have
to fetch microdescriptors they do not have in order to replace their
family lines. Currently, voting never requires an authority to
fetch a microdescriptor from another authority. If we implement
vote compression and diffs as in the Walking Onions proposal,
however, we might suppose that votes could include microdescriptors
directly.
Still, this is likely more complexity than we want for a transition
mechanism.
## Appendix: Deriving family-keys from families??
We might also imagine that authorities could infer which families
exist from the graph of family relationships, and then include
synthetic "family-keys" entries for routers that belong to the same
family.
This has two challenges: first, to compute these synthetic family
keys, the authorities would need to have the same graph of family
relationships to begin with, which once again would require them to
include the complete list of families in their votes.
Secondly, finding all the families is equivalent to finding all
maximal cliques in a graph. This problem is NP-hard in its general
case. Although polynomial solutions exist for nice well-behaved
graphs, we'd still need to worry about hostile relays including
strange family relationships in order to drive the algorithm into
its exponential cases.
## Appendix: New assigned values
We need a new assigned value for the certificate type used for
family signing keys.
We need a new consensus method for placing family-keys lines in
microdescriptors.
## Appendix: New network parameters
* `derive-family-line`: If 1, relays should derive family lines from
observed family-keys. If 0, they do not. Min: 0, Max: 1. Default: 1.
Message: 1
Date: Tue, 19 May 2020 13:29:54 -0400
From: Nick Mathewson
To: tor-dev(a)lists.torproject.org
Subject: Re: [tor-dev] Proposal 319: RELAY_FRAGMENT cells

On Thu, May 14, 2020 at 3:15 PM David Goulet wrote:
>
> On 11 May (16:47:24), Nick Mathewson wrote:
[...]
> > # Onion service concerns.
> >
> > We allocate a new extension for use in the ESTABLISH_INTRO by onion services,
> > to indicate that they can receive a wide INTRODUCE2 cell. This extension
> > contains:
> >
> >    struct wide_intro2_ok {
> >       u16 max_len;
> >    }
> >
> > We allocate a new extension for use in the 'ESTABLISH_RENDEZVOUS'
> > cell, to indicate acceptance of wide 'RENDEZVOUS2' cells. This
> > extension contains:
> >
> >    struct wide_rend2_ok {
> >       u16 max_len;
> >    }
> >
> > (Note that 'ESTABLISH_RENDEZVOUS' cells do not currently have an
> > extension mechanism. They should be extended to use the same
> > extension format as 'ESTABLISH_INTRO' cells, with extensions placed
> > after the rendezvous cookie.)
>
> Why would a client need to announce wide cells in the ESTABLISH phase as
> opposed to using protover "Relay=N"?

This is not for announcing support of wide cells -- this is for
reporting a setting for how wide fragmented cells should be.

> The maximum length of a fragmented cell is capped to 2^16 (u16) so we don't
> really need the establish process to inform us of the maximum expected length
> but rather use the max_len in the first fragment?

This all comes back to an earlier part of the proposal:

   Not all lengths up to 65535 are valid lengths for a fragmented cell. Any
   length under 499 bytes SHOULD cause the circuit to close, since that could
   fit into a non-fragmented RELAY cell. Parties SHOULD enforce maximum
   lengths for cell types that they understand.

In other words, I'm imagining that there is a maximum length for each
cell type that is much shorter than 65535, even though we're using two
bytes for the length field. The extension in the establish_intro cell
is to tell the intro point the longest introduce1 cell that it should
accept; the extension in the establish_rend cell is to tell the
rendezvous point the longest rendezvous1 cell that it should accept.

Another way we could do this would be with a set of network parameters
to describe the maximum length of each fragmented cell. Do you think
that would be simpler? (I can't quite remember why I specified it this
way in the first place.)

> Furthermore, ESTABLISH_INTRO has extensions (only 1 as of today) so they could
> also be fragments themselves and thus I'm not sure I see the point of having
> two different ways of "expecting" fragments for the ESTABLISH_* cells and the
> INTRO/RENDEZVOUS cells?

The difference here is that everybody can tell which protocols a relay
supports, but there is no automatic way to tell which protocols an
onion service or client supports. Since INTRODUCE2/RENDEZVOUS2 cells
are handled by these clients, they need to get opted into by the
relays. (I'm not sure I understood the question completely.)

> > # Compatibility
> >
> > This proposal will require the allocation of a new 'Relay' protocol version,
> > to indicate understanding of the RELAY_FRAGMENTED command.
>
> Here is a thought about a DoS vector. Here goes:
>
> As an upper limit of 65KB total fragment size, it represents ~126 cells in
> total so I could basically send *125* cells and then stop, which will put in
> memory a bit more than 64KB, and it will stay there until the last fragment
> is received.
>
> And then I do that on 1000 different circuits, bringing the total count in
> memory to 64GB. All stuck there, all "waiting" for the last fragment.
>
> Our OOM would kick in killing circuits, but it just seems to me a very easy
> way to continuously kick the OOM of a _service_, which is a pretty bad side
> channel.

A few responses here:

First, we shouldn't allow 65535-byte fragmented cells. The actual
maximum length should be something more like 1024 or 4096 bytes.

Second, we should make sure that when we are reassembling cells, we
use the same buf_t buffers that we use for other stuff. Our buffers
are timestamped, so we can tell which buffer has had data stalling
for the longest, and we should use that to make sure we're killing
off the right circuits preferentially.

Third, fragments should only be allowed at an onion service for
INTRODUCE2, and those should only come one at a time from each
introduction point, so the number that it's reassembling at a time
will be limited by the number of intro circuits it has open. It'll
be the intro points that have to keep a bunch of cells in assembly
at once, and be ready to kill off circuits that dawdle too long.

Does this make more sense? If so I'll try to clarify it in the proposal.
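As a sanity check on the quoted attack arithmetic, here is a short illustrative calculation. It assumes roughly 498 usable body bytes per RELAY_FRAGMENT (suggested by the proposal's 499-byte threshold, not fixed by it); note that the stalled-memory total for 1000 circuits comes out on the order of 65 MB rather than GB, which is still ample for an OOM side channel on a service.

```python
import math

CELL_BODY_MAX = 65535   # u16 total_len upper bound
FRAG_PAYLOAD = 498      # assumed usable body bytes per fragment (illustrative)

def fragments_needed(total_len, payload=FRAG_PAYLOAD):
    """Fragments required to carry a cell of total_len bytes."""
    return math.ceil(total_len / payload)

# ~132 fragments to carry a maximal 65535-byte cell at 498 bytes each:
assert fragments_needed(CELL_BODY_MAX) == 132

# Worst case: an attacker stalls just short of completion on many circuits.
circuits = 1000
stalled_bytes = circuits * CELL_BODY_MAX   # ~65.5 MB held "waiting"
assert stalled_bytes < 100 * 2**20

# Capping fragmented cells at 4096 bytes, as suggested, shrinks the
# per-circuit exposure by roughly 16x:
assert circuits * 4096 < 5 * 2**20
```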
```
Filename: 319-wide-everything.md
Title: RELAY_FRAGMENT cells
Author: Nick Mathewson
Created: 11 May 2020
Status: Open
```
(This proposal is part of the Walking Onions spec project.)
# Introduction
Proposal 249 described a system for `CREATE` cells to become wider, in order to
accommodate hybrid crypto. And in order to send those cell bodies across
circuits, it described a way to split `CREATE` cells into multiple `EXTEND`
cells.
But there are other cell types that can need to be wider too. For
example, `INTRODUCE` and `RENDEZVOUS` cells also contain key material
used for a handshake: if handshakes need to grow larger, then so do
these cells.
This proposal describes an encoding for arbitrary "wide" relay cells
that can be used to send a wide variant of anything.
To be clear, although this proposal describes a way that all relay
cells can become "wide", I do not propose that wide cells should
actually be _allowed_ for all relay cell types.
# Proposal
We add a new relay cell type: `RELAY_FRAGMENT`. This cell type contains part
of another relay cell. A `RELAY_FRAGMENT` cell can either introduce a new
fragmented cell, or continue one that is already in progress.
The format of a RELAY_FRAGMENT body is one of the following:

    // First body in a series
    struct fragment_begin {
        // What relay_command is in use for the underlying cell?
        u8 relay_command;
        // What will the total length of the cell be once it is reassembled?
        u16 total_len;
        // Bytes for the cell body
        u8 body[];
    }

    // All other bodies in the series
    struct fragment_continued {
        // More bytes for the cell body.
        u8 body[];
    }
To send a fragmented cell, a party first sends a RELAY_FRAGMENT cell
containing a "fragment_begin" payload. This payload describes the total
length of the cell and the relay command, and carries the first bytes of
the cell body. The party then sends the rest of the body in
RELAY_FRAGMENT cells containing "fragment_continued" payloads.
Every fragmented cell other than the last one in a sequence MUST be
filled with as much data as possible. Parties SHOULD close a circuit if
they receive a non-full fragmented cell that is not the last fragment in
a sequence.
Fragmented cells MUST NOT be interleaved with other relay cells on a circuit,
other than cells used for flow control. (Currently, this is only SENDME
cells.) If any party receives any cell on a circuit, other than a flow
control cell or a RELAY_FRAGMENT cell, before the fragmented cell is
complete, then it SHOULD close the circuit.
Parties MUST NOT send extra data in fragmented cells beyond the amount given
in the first 'total_len' field.
Not every relay command may be sent in a fragmented cell. In this proposal,
we allow the following cell types to be fragmented: EXTEND2, EXTENDED2,
INTRODUCE1, INTRODUCE2, RENDEZVOUS. Any party receiving a command that they
believe should not be fragmented should close the circuit.
Not all lengths up to 65535 are valid lengths for a fragmented cell. Any
length under 499 bytes SHOULD cause the circuit to close, since that could
fit into a non-fragmented RELAY cell. Parties SHOULD enforce maximum lengths
for cell types that they understand.
All `RELAY_FRAGMENT` cells for the fragmented cell must have the
same Stream ID. (For those cells allowed above, the Stream ID is
always zero.) Implementations SHOULD close a circuit if they
receive fragments with mismatched Stream ID.
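The fragmentation and reassembly rules above can be sketched as follows. This is an illustrative model, not Tor code: `FRAG_PAYLOAD` is an assumed per-fragment body capacity (the proposal does not fix this value), and the u8/u16 fields are big-endian as in other Tor cell encodings.

```python
import struct

FRAG_PAYLOAD = 498   # assumed per-fragment body capacity (illustrative)
MIN_TOTAL_LEN = 499  # anything shorter fits in a non-fragmented RELAY cell

def fragment(relay_command, body, payload=FRAG_PAYLOAD):
    """Split a cell body into fragment_begin + fragment_continued payloads."""
    total_len = len(body)
    if total_len < MIN_TOTAL_LEN:
        raise ValueError("fits in a non-fragmented RELAY cell; do not fragment")
    head_room = payload - 3  # 1 byte relay_command + 2 bytes total_len
    first = struct.pack(">BH", relay_command, total_len) + body[:head_room]
    rest = [body[i:i + payload] for i in range(head_room, total_len, payload)]
    return [first] + rest

def reassemble(frags, payload=FRAG_PAYLOAD):
    """Rebuild a cell, enforcing total_len and the full-fragment rule."""
    # Every fragment except the last MUST be full.
    if any(len(f) != payload for f in frags[:-1]):
        raise ValueError("non-full interior fragment: close the circuit")
    relay_command, total_len = struct.unpack(">BH", frags[0][:3])
    body = b"".join([frags[0][3:]] + frags[1:])
    if len(body) != total_len:
        raise ValueError("length mismatch: close the circuit")
    return relay_command, body

# Round-trip a 1024-byte EXTEND2 (relay command 14) body:
cmd, body = 14, bytes(1024)
assert reassemble(fragment(cmd, body)) == (cmd, body)
```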
# Onion service concerns.
We allocate a new extension for use in the ESTABLISH_INTRO by onion services,
to indicate that they can receive a wide INTRODUCE2 cell. This extension
contains:
    struct wide_intro2_ok {
        u16 max_len;
    }
We allocate a new extension for use in the `ESTABLISH_RENDEZVOUS`
cell, to indicate acceptance of wide `RENDEZVOUS2` cells. This
extension contains:
    struct wide_rend2_ok {
        u16 max_len;
    }
(Note that `ESTABLISH_RENDEZVOUS` cells do not currently have an
extension mechanism. They should be extended to use the same
extension format as `ESTABLISH_INTRO` cells, with extensions placed
after the rendezvous cookie.)
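Both extension bodies above are a single big-endian u16, so the encoding is trivial; a quick sketch:

```python
import struct

def encode_wide_ok(max_len):
    """Encode a wide_intro2_ok / wide_rend2_ok extension body: one u16 max_len."""
    return struct.pack(">H", max_len)

def decode_wide_ok(data):
    """Decode the two-byte extension body back into max_len."""
    (max_len,) = struct.unpack(">H", data)
    return max_len

assert len(encode_wide_ok(4096)) == 2
assert decode_wide_ok(encode_wide_ok(4096)) == 4096
```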
# Handling RELAY_EARLY
The first fragment of each EXTEND cell should be tagged with `RELAY_EARLY`.
The remaining fragments should not. Relays should accept `EXTEND` cells if and
only if their _first_ fragment is tagged with `RELAY_EARLY`.
> Rationale: We could allow any fragment to be tagged, but that would give
> hostile guards an opportunity to move RELAY_EARLY tags around and build a
> covert channel. But if we later move to a relay encryption method that
> lets us authenticate RELAY_EARLY, we could then require only that _any_
> fragment has RELAY_EARLY set.
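The acceptance rule above reduces to a one-line predicate; an illustrative sketch:

```python
def accept_extend_fragment(fragment_index, relay_early_flag):
    """An EXTEND fragment is acceptable iff RELAY_EARLY appears on the
    first fragment and on no other fragment (illustrative sketch)."""
    return relay_early_flag == (fragment_index == 0)

# RELAY_EARLY on the first fragment only: accepted.
ok = [True, False, False]
assert all(accept_extend_fragment(i, f) for i, f in enumerate(ok))

# A hostile guard moving the tag to a later fragment: rejected.
bad = [False, True, False]
assert not all(accept_extend_fragment(i, f) for i, f in enumerate(bad))
```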
# Compatibility
This proposal will require the allocation of a new 'Relay' protocol version,
to indicate understanding of the RELAY_FRAGMENTED command.
On Fri, 15 May 2020 11:20:54 +1000
teor <teor(a)riseup.net> wrote:
> You could try deleting functions, and then running tor's
> "make test-network-all". If that test still passes, then the code is probably
> unused (in practice).
>
> Chutney's CI does a few more tests for different tor versions, but I'm pretty
> sure they all use the same code.
This is what I will end up doing. I planned on it, but humans tend to
know the purpose of the code better, and computers just run what is
given to them :) so I figured it was best to check here first, with
people more experienced with the codebase.
Thanks,
Caitlin
On Thu, 14 May 2020 19:58:45 -0400
Nick Mathewson <nickm(a)freehaven.net> wrote:
> isInExpectedDirInfoDocs looks like it might once have done something useful.
My inclination is to remove it unless we anticipate needing it in the
near future.
> configure and restart and wait_for_bootstrap are all in use; they are
> invoked by name, from the command line, at the end of runConfigFile,
> where it says `return getattr(network, verb)()`.
Ah, I see now, thank you. I think it would be wise to comment the
functions as such to indicate that they are used on the command line.
> I think the rest are likely to be unused.
I did grep most of these and found no signs of use, so this is my
impression as well.
For now I will make a mental note of all this while I work on more
pertinent improvements to the code, but I do believe it would be nice
to deal with truly unused code sooner rather than later.
Caitlin