This novelette-length mail is meant to provide the background
information necessary to understand choices in congestion control
deployment for Tor. Much of this background information won't be
included in my planned Tor Proposal, which will hopefully be much
shorter, but the choices made there will be informed by this material.
This material has been enhanced through review and conversations with
Toke Høiland-Jørgensen, g burdell, Rob Jansen, and George Kadianakis.
Immense credit goes to Toke in particular, for detailed and thoughtful
review, and input on queue management.
Like my last mail on this topic, this mail comes with a playlist. Fair
warning: it is not exactly easy listening. Still, hopefully it gives you
strength and inspiration in these trying times:
https://people.torproject.org/~mikeperry/borders/everybody-loves-them/TakeA…
Section -I: Recap [RECAP]
https://wizardzines.com/zines/networking
I strongly believe that congestion control will provide the single
largest performance win for the Tor network. Specifically, it will
considerably improve both latency and throughput metrics in the best,
average, and worst case. In addition to performance, congestion control
will also provide the second-largest scalability win on our research
horizon (behind Walking Onions), by making the queue length memory
requirements of Tor relays independent of the number of circuits carried
on that relay.
I also believe we can achieve the vast majority of the benefits by
deploying an RTT-based system that only clients and Exits need to
support, specifically BOOTLEG_RTT_TOR. After this, we might consider
more advanced options based on ECN, but their benefits are far less
clear, and fraught with peril.
This mail assumes that you have read "Options for Congestion Control in
Tor-like Networks":
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html
Even if you have already read the Options mail, I recommend reviewing
the SENDME section and the sections marked with TCP_HISTORY,
BOOTLEG_RTT_TOR, BACKWARD_ECN_TOR, and perhaps also FORWARD_ECN_TOR, as
understanding this mail requires an understanding of those sections. The
summary of the other sections from the Options mail is basically this:
adding ECN signals to Tor seems straight-forward, but is actually very
tricky.
Section @: Introduction and summary [INTRO_SUMMARY]
Deep understanding of the congestion control problem space is essential
because shitty congestion control will not be a huge win, and will
reduce the benefit of other performance improvements. Worse still, it
may introduce side channels and other anonymity attacks.
In particular, our level of congestion latency (and associated relay
queue lengths) will be very sensitive to congestion signal response
time, to our congestion window update rules, and also to our choice of
queue management algorithm (ie: which circuits send cells and/or
congestion signals).
Our average circuit bandwidth on the network will similarly be very
sensitive to the ability of the congestion window to quickly update and
converge on the available Bandwidth-Delay Product (BDP) of its path.
In other words, we need to make good design choices in the following
three different areas:
I. Which congestion signal to use [CONGESTION_SIGNALS]
II. How to update congestion window [CONTROL_ALGORITHMS]
III. What circuits to send the signal on [QUEUE_MANAGEMENT]
The next three sections of this mail correspond to those three areas.
As before, I have tagged the specific sub-sections that will be used in
the Tor Proposal with [PROPOSAL_CANDIDATE]. I have also given tags to
sections in this document. References to things in this document are
enclosed in brackets. References to the Options mail are simply all
caps, without brackets.
The high-level plan is that we use RTT as a primary congestion signal,
as per BOOTLEG_RTT_TOR from the Options mail, and try it out with a few
different congestion algorithms, such as [TCP_VEGAS_TOR] and
[TCP_WESTWOOD_TOR].
It turns out that when RTT is used as a congestion signal, our current
Circuit-EWMA is sufficient to serve as [QUEUE_MANAGEMENT]. It also turns
out that Circuit-EWMA actually should protect us from fairness issues in
the control algorithms, and from cheaters who try to run different
congestion control algorithms. Circuit-EWMA will also deliver RTT
congestion signals more heavily to bulk circuits (because it prioritizes
their cells below light circuits), causing them to back off in favor of
lighter traffic, even as we relay fewer of their cells. Thanks goes to
Toke Høiland-Jørgensen for pointing this out.
The combination of RTT-based congestion signaling, a decent but simple
control algorithm, and Circuit-EWMA will get us most, if not all, of the
benefits we seek, and only requires clients and Exits to upgrade to
use it. Once it is deployed, circuit bandwidth will no longer be capped
at ~500kb/sec by the fixed window sizes of SENDME; queue latency
will fall significantly; memory requirements at relays should plummet;
and bottlenecks in the network should dissipate.
However, we can still improve upon this further, after this initial
deployment. Because the "slow start" phase of TCP can take a long time,
we will also want to play around with various initial window sizes,
window growth rate parameters, and experiment with using a single
BACKWARD_ECN_TOR cell_t.command ECN-style signal. This signal will be
sent on circuits as determined by [CODEL_TOR]. An instance of
[CODEL_TOR] can be added to each TLS connection, after Circuit-EWMA is
applied.
While BACKWARD_ECN_TOR is complex and slow to roll out (all relays must
support this new cell_t.command), this optimization will likely be worth
it because BACKWARD_ECN_TOR can send a signal on a circuit to stop the
exponential growth of slow start in *less than* RTT/2 time, which is
significantly faster than implicitly measuring an RTT-based signal
(which takes RTT plus some measurement lag to act upon). This faster
signal means faster termination of slow start, which means less buffer
bloat.
For stream flow control, on the client side, we should just ack all
stream cells with a circuit-level SENDME even though they were not
delivered to their blocked stream. The local Tor client should keep
queuing them until the application reads them. Thus, a blocked client
application stream will not block other streams, but will cause Tor
client stream-level queues to build up. For exit-side streams, we should
*not* send SENDMEs until the cell has been sent on its exiting upstream
connection. This allows TCP pushback on one stream to block the entire
circuit, but this is desirable to avoid queuing, and it will only happen
in the upload direction.
Section I: Selecting Congestion Signals [CONGESTION_SIGNALS]
The "Options for Congestion Control in Tor-like Networks" mail probably
should have been titled "Options for Congestion Signals in Tor-like
Networks", as its primary focus was on the ways Tor endpoints can
receive a signal about the congestion status of a circuit on bottleneck
relays. In that post, I postponed discussion of queue management, and
deliberately under-specified how to react to a congestion signal (going
no further than saying "use SlowStart and AIMD").
I chose that scoping because the most difficult barrier we have faced in
the past decade of work on this topic is the side channels enabled by a
congestion signal (and information leaks in general).
However, I should have made our constraints more explicit. Our choice of
acceptable congestion signal mechanism is restricted by these design
constraints:
1. No side channels between Guard and Exit, or between circuits
2. Incremental upgrade path
3. Fast response time
4. Cheater prevention
These design constraints restrict our choice of congestion signal, but
they are not our only design constraints in this problem space. When we
get to the later sections on window control update algorithm, and queue
management algorithm, I will explicitly enumerate additional design
constraints there. Hopefully this will provide more clarity than the
uncategorized list of "causes of death" in the Options post.
For your convenience, here's the summary table of all the options for
congestion signal again:
______________________________________________________________________
| MIX | CHEATING | RESPONSE| SIDE | UPGRADE |
| TRACK | DETECTION | TIME | CHANNELS | PATH? |
|~~~~~~~~~~~~~~~~~~~~~~|~~~~~~~~~~~|~~~~~~~~~|~~~~~~~~~~~~|~~~~~~~~~~|
|BOOTLEG_RTT_TOR | Exit RTT |100c+RTT | RTT | Exits |
|FORWARD_ECN_TOR | Circ OOM | RTT | None? | Full Net |
|FORWARD_ECN_RTT_TOR | Exit RTT | RTT | RTT |Exits;Full|
|BACKWARD_ECN_TOR |Middles;OOM| RTT/2 | Guard->Exit| Full Net |
|BACKWARD_ECN_RTT_TOR |Middles;RTT|RTT/2;RTT| Guard->Exit|Exits;Full|
|BACKWARD_ECN_TRAP_TOR | Middles | RTT/2 |Guard<->Exit| Full Net |
|EDGE_FORWARD_ECN_TOR |Middles;OOM| 2*RTT/3 | None? | Full Net |
|START_BACKWARD_ECN_TOR|Middles;OOM|RTT/2;RTT| Low/None? | Full Net |
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My short-term favorite is BOOTLEG_RTT_TOR, because it only requires
Exit relay support to use.
Recall that BOOTLEG_RTT_TOR defines an RTT latency threshold as its
congestion signal, and was first described in section 4.1 of the N23
Defenestrator paper. Even though the N23 paper didn't bother to name
this system, RTT use as a congestion signal in this way also resembles
TCP Vegas, as we shall see.
https://www.cypherpunks.ca/~iang/pubs/defenestrator.pdf
Note that because Circuit-EWMA will add additional delay to loud
circuits, "cheaters" who use alternate congestion control algorithms
should end up with more delay than those who do not. This means that if
we use RTT as a congestion signal, Circuit-EWMA should provide *more*
RTT congestion signal to cheaters, and cheaters would only be punishing
themselves with more Circuit-EWMA delay. So cheater resilience for an
RTT signal is actually better handled than I previously thought.
Again, thanks goes to Toke Høiland-Jørgensen for pointing this out.
For completeness, we will still explore some examples of alternate
"cheating" control algorithms in Section II, though.
Unfortunately, RTT is itself a side channel, as per:
https://www.freehaven.net/anonbib/cache/ccs07-latency-leak.pdf
https://www.robgjansen.com/publications/howlow-pets2013.pdf
There is also literature on shaping circuit bandwidth to create a side
channel. This can be done regardless of the use of congestion control,
and is not an argument against using congestion control. In fact, the
Backlit defense may be an argument in favor of endpoints monitoring
circuit bandwidth and latency more closely, as a defense:
https://www.freehaven.net/anonbib/cache/ndss09-rainbow.pdf
https://www.freehaven.net/anonbib/cache/ndss11-swirl.pdf
https://www.freehaven.net/anonbib/cache/acsac11-backlit.pdf
Recall that BACKWARD_ECN_TOR used circuit-level cell_t.command to signal
congestion. This allows all relays in the path to signal congestion in
under RTT/2 in either direction, and it can be flipped on existing relay
cells already in transit, without introducing any overhead. However,
because cell_t.command is visible and malleable to all relays, it can
also be used as a side channel. So we must limit its use to a couple of
cells per circuit, at most.
Recall that FORWARD_ECN_TOR adds encrypted end-to-end relay cells that
explicitly signal congestion to either the client or the Exit, but
requires the client to mediate and echo this cell to protect against
side channels. Because congestion signals must be relayed off the
client, it still takes the Exit a full RTT to learn about congestion at
earlier nodes in the circuit. Additionally, because this cell itself
must travel the full path in each direction, it is further delayed by
congestion on the path, and it also adds yet more congestion to
congested paths. Congestion in the Exit-to-client direction is the most
common, as this is the download direction. It will be particularly
expensive and slow to bounce the congestion signal off the client to the
Exit.
For this reason, I think all forms of FORWARD_ECN_TOR are not likely to
be as efficient as RTT signaling, despite the FORWARD_ECN_RTT_TOR
variant initially being flagged as a PROPOSAL_CANDIDATE in the Options
mail. However, we should still keep it in mind, since RTT may turn out
to be an unreliable signal by itself.
Therefore, since RTT is already measurable with our current SENDME flow
control, I propose that we use it via BOOTLEG_RTT_TOR (or TCP Vegas),
and then add START_BACKWARD_ECN_TOR exactly once (to exit slow
start). If RTT proves unreliable, then we can experiment with the more
heavyweight FORWARD_ECN_TOR full-cell signal.
As we shall see in the following control algorithm section, RTT-based
systems like TCP Vegas largely avoid the need to send any congestion
signal (drop or ECN). In other words: if we can rely on RTT, we should
not need any explicit signals at all (though they would improve
responsiveness).
Section II: Congestion Window Control Algorithms [CONTROL_ALGORITHMS]
Recall that the congestion window is the number of packets TCP will
allow to be sent on a connection without any acknowledgment. The
congestion window is meant to match the bandwidth-delay product of the
connection as close as possible. The bandwidth-delay product can be
thought of as the total amount of data that can be in-transit on a link
at one time without using any queue memory (bytes per second of
bandwidth multiplied by seconds of transit time delay = total bytes in
transit on the wire at one time).
On the Internet, if the congestion window exceeds the bandwidth-delay
product of a transmission path, then packets must build up in router
queues, and eventually get dropped. If the congestion window is below
the bandwidth-delay product of the path, then all routers on the path
spend time unused and idle, and thus utilization suffers, even if users
are trying to use full bandwidth.
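To make the bandwidth-delay product concrete, here is a small arithmetic sketch in terms of Tor's 512-byte cells. The link speed and RTT figures are illustrative assumptions, not measurements:

```python
CELL_SIZE = 512   # bytes per fixed-size Tor cell

def bdp_in_cells(bandwidth_bps, rtt_sec):
    """Cells that can be in flight filling the pipe without queuing."""
    bdp_bytes = (bandwidth_bps / 8) * rtt_sec
    return int(bdp_bytes / CELL_SIZE)

# e.g. a 10 Mbit/s bottleneck with a 300 ms circuit RTT:
# 1,250,000 bytes/s * 0.3 s = 375,000 bytes = 732 whole cells
print(bdp_in_cells(10_000_000, 0.3))
```

A congestion window above this cell count forces queuing at the bottleneck relay; below it, the circuit underutilizes the path.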
TCP variants basically compete on their ability to collectively track
the bandwidth-delay product of a path under a wide variety of network
conditions and other competing traffic.
In the interest of brevity, my "Options for Congestion Control in
Tor-like Networks" mail suggested that we update the congestion window
according to Slow Start with AIMD, in response to a congestion signal
(or lack thereof).
However, that is not the only option, and if we retain the ability to
measure RTT, it may not even be the most efficient one. So to help us
make this decision, I did a review of historical TCP and TCP-like
systems in the research literature, and will present the highlights
here.
But first, let's define the control requirements that we want to satisfy:
1. Full bandwidth utilization of slowest TLS link in every circuit
2. Queue length minimization at the relay sending on this TLS link
3. Fairness between circuits (esp partially overlapping ones)
4. Rapid convergence to full circuit capacity
These requirements mirror the requirements used in TCP evaluation. TCP
implementations compete over metrics in each of these areas, and
algorithm design is far more art than formal proof of optimality or
even correctness.
That does not mean that formalism is impossible, however. For a great
formal breakdown of this design space without all of TCP's baggage, see
Section 3 of the XCP paper. Note that we can't use XCP as-is because its
congestion signal fields provide way too much information for use in
potential side-channels, but it is worth a read because it clarifies the
control theory behind congestion control:
http://conferences.sigcomm.org/sigcomm/2002/papers/xcp.pdf
In terms of how this design space is approached in TCP: utilization and
queue minimization are typically handled simultaneously - sometimes
explicitly and sometimes implicitly. Similarly, fairness is sometimes
explicitly designed for, but more often fairness is an emergent property
that is analyzed when competing against a benchmark (historically: TCP
Reno). Fairness abuse is also an area where cheating can occur that
enables cheaters to get more than their fair share of bandwidth, and
such cheating can be very hard to detect and prevent on the Internet.
Interestingly, many of the design goals of TCP are *also* improved via
queue management algorithms at routers (Section III of this mail). In
fact, our Circuit-EWMA circuit scheduler will effectively hand out RTT
congestion signals (delay) to loud circuits before quiet ones, and will
*also* enforce fairness and impede cheating, by delaying a cheater's
cells even more.
In this sense, we do not need to worry quite so much about fairness and
cheating as security properties, but a lack of fairness in the TCP
algorithm will *still* increase memory use in relays to queue these
unfair/loud circuits. So we should still be mindful of these properties
in selecting our congestion algorithm, to minimize relay memory use, if
nothing else.
For a good summary of several TCP congestion control algorithms, see:
http://intronetworks.cs.luc.edu/1/html/reno.html ("Slow Start+AIMD")
http://intronetworks.cs.luc.edu/1/html/newtcps.html (modern TCPs)
As you can see, the TCP design space is huge.
A good way to break down the TCP space is to separate the drop-based
algorithms from the delay-based algorithms. TCP drop-based algorithms
require either packet drops or ECN packet marking to signal congestion.
Delay based algorithms infer congestion from RTT measurements only.
But as we shall see, this distinction is artificial for us.
BOOTLEG_RTT_TOR effectively converts delay *into* an explicit congestion
signal, as far as congestion window control algorithms are concerned.
Similarly, either of our ECN signals can be used in place of the drop
signal in the algorithms below.
I'm going to review four options that strike me as most relevant to our
needs. Two are drop-based: TCP Reno and TCP CUBIC. Two are delay-based:
TCP Vegas and TCP Westwood.
It should be noted that there are also hybrid algorithms that use
drops/ECN in addition to delay as congestion signals, but since we are
focusing primarily on RTT as our signal, I won't be covering them. For
a good overview of those options, see:
https://arxiv.org/pdf/1903.03852.pdf
Toke Høiland-Jørgensen suggested that we also investigate BBR, which is
a hybrid algorithm currently being evaluated by Dropbox and Google.
However, BBR involves packet send time scheduling, which will require
replacement of Circuit-EWMA and other extensive changes. Furthermore,
BBR is still under active development, experimentation, and tuning. In
fact, while v2 has been tested by Dropbox, no IETF draft specification
exists yet. Still, for completeness:
https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congestion-control-00
https://queue.acm.org/detail.cfm?id=3022184
https://datatracker.ietf.org/doc/slides-106-iccrg-update-on-bbrv2/
https://dropbox.tech/infrastructure/evaluating-bbrv2-on-the-dropbox-edge-ne…
I am also going to simplify and omit treatment of packet loss and
duplicate acks, since we will not be dropping any packets. Additionally,
I have normalized all formulas below to be in terms of Tor cells, rather
than bytes.
The hope is that these simplifications help us get to the essence of
what these algorithms would look like when deployed on Tor, without
getting distracted by irrelevant drop and ack synchronization management
details from TCP/IP.
However, as we move closer to specifying an actual design proposal, we
should double-check my simplifications below, for the algorithm(s) that
we actually specify.
Section II.A: TCP Reno [TCP_RENO_TOR]
https://tools.ietf.org/html/rfc5681
http://intronetworks.cs.luc.edu/1/html/reno.html
Until recently, TCP Reno was the most popular TCP on the Internet, and
it is still used as a fairness benchmark. While we won't be using it
directly, it is thus important to understand.
Like all TCPs, TCP Reno satisfies its utilization and queue minimization
design goals by trying to make its congestion window closely match the
bandwidth-delay product of the link. More importantly, when multiple TCP
connections are present competing for the bandwidth of a link, TCP Reno
fairly distributes the bandwidth-delay product of the link equally among
all competing TCP Reno connections.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.cse.wustl.edu/~jain/papers/ftp/cong_av.pdf
TCP Reno probes the bandwidth-delay product by increasing its congestion
window until it overflows the queue capacity of the slowest router
on a path, causing a packet drop or ECN signal, upon which TCP Reno
reduces its congestion window.
The initial phase of TCP Reno is called "Slow Start". For our purposes,
Slow Start in TCP Reno means that every SENDME ack, the congestion
window cwnd becomes:
cwnd = cwnd + SENDME_RATE (recall Tor's SENDME_RATE is 100 cells)
Because increasing the congestion window by 100 every SENDME allows the
sender to send 100 *more* 512 byte cells every 100 cells, this is an
exponential function that causes cwnd to double every cwnd cells. TCP
Reno keeps on doubling until it experiences a congestion signal.
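As a hedged sketch of this doubling behavior, using the per-SENDME update above (the starting window value is illustrative):

```python
SENDME_RATE = 100   # one SENDME ack arrives per 100 cells

def slow_start_acks(cwnd, num_acks):
    """Grow the window by SENDME_RATE for each SENDME ack received."""
    for _ in range(num_acks):
        cwnd += SENDME_RATE
    return cwnd

# With cwnd = 1000, a full window in flight generates 10 SENDME acks,
# and processing them doubles the window:
print(slow_start_acks(1000, 10))   # 2000
```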
Once a congestion signal is experienced, Slow Start is exited, and the
Additive-Increase-Multiplicative-Decrease (AIMD) steady-state phase
begins. AIMD algorithms can be generalized with two parameters: the add
quantity, and the multiply quantity. This is written as AIMD(a,m). TCP
Reno uses AIMD(1, 0.5). It should be noted that most later TCPs include
an AIMD component of some kind, but tend to use 0.7 for m instead of
0.5. This allows them to recover faster from congestion.
After slow start exits, in steady-state, after every ack (SENDME)
response without a congestion signal, the window is updated as:
cwnd = cwnd + SENDME_RATE/cwnd
This comes out to increasing cwnd by 1, every time cwnd cells are
successfully sent without a congestion signal occurring. Thus this is
additive linear growth, not exponential growth.
If there was a congestion signal, cwnd is updated as:
cwnd = cwnd * m (m=0.5 or 0.7)
This is called "Fast Recovery". If you dig into the RFCs, actual TCP
Reno has some additional details for responding to multiple packet
losses that in some cases can fall back into slow-start. However, for
our purposes, the congestion window should only be updated once per cwnd
cells, in either the congestion case, or the non-congestion case.
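The Reno-style rules above can be collected into a single per-SENDME update function. This is a simplified illustration of the rules as stated in this mail, not Tor's actual implementation:

```python
SENDME_RATE = 100   # cells per SENDME ack

def reno_update(cwnd, congested, m=0.5):
    """Apply one SENDME ack's worth of Reno-style window update."""
    if congested:
        return cwnd * m                  # multiplicative decrease
    return cwnd + SENDME_RATE / cwnd     # +1 per cwnd cells acked

cwnd = 1000.0
cwnd = reno_update(cwnd, congested=False)   # additive linear growth
cwnd = reno_update(cwnd, congested=True)    # back off by half
print(cwnd)
```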
TCP Reno (and all TCPs) also revert to slow start when the connection
has been idle, in case bottleneck capacity has changed in this time. We
may or may not want to do this, as slow start has a dramatic impact on
page load times.
TCP Reno is still used as the benchmark to determine fairness. If an
algorithm is "TCP Friendly", this basically means that when competing
with TCP Reno to use spare path capacity, it does not get its ass
kicked, and it does not kick Reno's ass. Several algorithms contain
hacks or special cases to preserve this property. If this fairness
property holds, a bunch of math can prove convergence rates,
utilization, congestion window sizes, and even queue sizes. Modern
queue management also relies on this math to auto-tune queue parameters,
as we will see in Section III.
http://pbg.cs.illinois.edu/courses/cs598fa09/slides/05-Ashish.pdf
https://www.icir.org/tfrc/aimd.pdf
Section II.B: TCP Cubic [TCP_CUBIC_TOR]
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-cubic
https://pandorafms.com/blog/tcp-congestion-control/
https://tools.ietf.org/html/rfc8312
TCP Reno unfortunately does not perform optimally for networks where the
natural packet loss rate starts to become frequent, which happens for
both high bandwidth and high latency networks. This shortcoming has led
to a huge explosion of competition among replacements and improvements
for this use case, and TCP Cubic has won out (for now).
TCP Cubic basically takes TCP Reno and tweaks the additive portion of
AIMD behavior to be a cubic function (y=x^3) instead of linearly
additive. The cubic function was chosen because it is concave, then
flat, and then convex. The concave piece allows for quickly recovering
after congestion (or non-congestion-related packet loss), and the convex
piece allows for gently probing for fresh spare link capacity.
Because of adoption by both the Linux kernel and the Windows kernel as
the default TCP, TCP Cubic is now the most popular TCP on the Internet.
Unfortunately, TCP Cubic is complicated. It has multiple magic numbers,
and is defined piecewise depending on how its congestion window would
compare to the equivalent TCP Reno window (to ensure the "TCP Friendly"
property -- without this hack, TCP Cubic is "too polite" and TCP Reno
eats its lunch). It additionally has time-based components that must be
clocked relative to current measured RTT (though it does not use RTT as
a congestion signal).
The cubic congestion window is defined as:
W_max = previous window size before last congestion signal
W_cubic(t) = C*(t-K)^3 + W_max
With magic number constants:
C = 0.4
beta_cubic = 0.7
K = cubic_root(W_max*(1-beta_cubic)/C)
To maintain competitiveness with TCP Reno, Cubic keeps a running
estimate of an equivalent TCP Reno's window size, using the AIMD math
cited for Reno above:
W_est(t) = W_max*beta_cubic +
           [3*(1-beta_cubic)/(1+beta_cubic)] * (t/RTT)

if W_est(t) > W_cubic(t):
    cwnd = W_est(t)
else:
    cwnd = cwnd + (W_cubic(t+RTT) - cwnd)/cwnd
Upon congestion signal, Cubic uses its beta_cubic parameter (0.7), and
thus does not back off quite as far as Reno:
W_max = cwnd
cwnd = cwnd * beta_cubic
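The cubic window function above can be sketched directly. This is a minimal illustration of the RFC 8312 formula with the constants given, ignoring the Reno-friendliness comparison and per-ack update logic:

```python
# Magic-number constants from RFC 8312, as given above:
C = 0.4
BETA_CUBIC = 0.7

def w_cubic(t, w_max):
    """Cubic window t seconds after the last congestion signal."""
    k = ((w_max * (1 - BETA_CUBIC)) / C) ** (1 / 3)
    return C * (t - k) ** 3 + w_max

# Right after backoff (t=0) the window sits at beta_cubic * W_max...
print(round(w_cubic(0.0, 1000), 1))   # 700.0
# ...then climbs back (concave), flattens at W_max when t reaches K,
# and finally probes beyond it (convex).
```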
It is pretty easy to convince yourself that Cubic will do at least as
well as TCP Reno (thanks to the AIMD Reno window comparison hack), but
it is very hard to conceptualize the above and convince yourself that it
is the best we could possibly do for our use case. Especially since Tor
is *not* affected by an increasing inherent drop rate as bandwidth
increases, unlike the Internet.
For these reasons, the complexity of TCP Cubic does not strike me as
worthwhile, as compared to other, simpler candidates.
Section II.C: TCP Vegas and [TCP_VEGAS_TOR] [PROPOSAL_CANDIDATE]
http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf (lol ftp, sorry)
Similar to BOOTLEG_RTT_TOR from my Options mail, TCP Vegas was designed
to make use of RTT and bandwidth estimation to avoid queue congestion,
and thus avoid any dropped packets or ECN signals. However, TCP Vegas
actually directly estimates queue length of the bottleneck router, and
aims to minimize the packets in that queue.
To accomplish this, TCP Vegas maintains two RTT measurements:
RTT_current and RTT_min. It also maintains a bandwidth estimate.
The bandwidth estimate is the current congestion window size divided by
the RTT estimate:
BWE = cwnd/RTT_current
(Note: In later TCP Vegas relatives like TCP Westwood, this bandwidth
estimate was improved to use an explicit count of the unacked packets in
transit, instead of cwnd).
Recall that in order to utilize the full capacity of the link, all TCPs
attempt to keep their congestion window as close to the
bandwidth-delay product (ie: the "memory capacity") of the transmission
path as possible.
The extra queue use along a path can thus be estimated by first
estimating the path's bandwidth-delay product:
BDP = BWE*RTT_min
TCP Vegas then estimates the queue use caused by congestion as:
queue_use = cwnd - BDP
= cwnd - cwnd*RTT_min/RTT_current
= cwnd * (1 - RTT_min/RTT_current)
If queue_use drops below a threshold alpha (typically 2-3 packets for
TCP, but perhaps double or triple that for our smaller cells), then the
congestion window is increased. If queue_use exceeds a threshold beta
(typically 4-6 packets, but again we should probably double or triple
this), then the congestion window is decreased. This update happens only
once per cwnd packets:
if queue_use < alpha:
    cwnd = cwnd + 1
elif queue_use > beta:
    cwnd = cwnd - 1
As a backstop, TCP Vegas will still halve its congestion window in the
event that a congestion signal (packet drop or ECN) does occur.
However, such packet drops should not actually even happen, unless Vegas
is fighting with a less fair control algorithm that is causing drops
instead of backing off based on queue delay.
Vegas also uses this queue_use estimate to govern its initial slow start
behavior. During slow start, the Vegas congestion window grows
exponentially, but only half as fast as Reno (to avoid excessive
congestion). For us, this means that cwnd is updated every SENDME ack
by:
if queue_use < gamma:
    cwnd = cwnd + SENDME_RATE/2
This exponential increase continues until the queue_use estimate exceeds
"gamma" (which is determined experimentally and sadly not listed in the
TCP Vegas paper). If I had to guess, gamma is probably close to beta
(4-6 packets).
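The Vegas rules above can be collected into one sketch. The alpha/beta/gamma values here are my illustrative guesses, scaled up for 512-byte cells as suggested above; the Vegas paper does not fix gamma:

```python
SENDME_RATE = 100
ALPHA, BETA, GAMMA = 6, 12, 12   # cells; assumed, not from the paper

def queue_use(cwnd, rtt_min, rtt_current):
    """Estimated cells sitting in the bottleneck queue."""
    return cwnd * (1 - rtt_min / rtt_current)

def vegas_update(cwnd, rtt_min, rtt_current, in_slow_start):
    """One window update per cwnd of acked cells. Returns
    (new_cwnd, still_in_slow_start)."""
    q = queue_use(cwnd, rtt_min, rtt_current)
    if in_slow_start:
        if q < GAMMA:
            return cwnd + SENDME_RATE / 2, True   # keep growing
        return cwnd, False                        # exit slow start
    if q < ALPHA:
        return cwnd + 1, False
    if q > BETA:
        return cwnd - 1, False
    return cwnd, False

# cwnd=500 while RTT grew from 100ms to 110ms: ~45 cells estimated
# queued, which exceeds beta, so back off by one:
cwnd, _ = vegas_update(500, 0.100, 0.110, in_slow_start=False)
print(cwnd)   # 499
```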
Because Vegas tries much harder to avoid congestion than TCP Reno, Vegas
is vulnerable to cheating through fairness abuse. When TCP Vegas
competes against a TCP Reno-style TCP that responds only to packet drops
(or ECN signals in our case) and ignores RTT, TCP Vegas detects
congestion earlier than TCP Reno would, and ends up yielding bandwidth
to TCP Reno. This means that Vegas does *not* have the "TCP Friendly"
property that modern AQM algorithms depend on for their
parameterization.
Note that we should expect the same thing to happen if TCP Vegas
competed with the BOOTLEG_RTT_TOR algorithm run by cheaters. However, since
RTT is being used as the congestion signal in both cases, Circuit-EWMA
should still balance this out, but we should experiment with this to
confirm. There may still be additional queuing memory costs in the
presence of cheaters, compared to either pure [TCP_VEGAS_TOR] or pure
BOOTLEG_RTT_TOR. Even so: this form of cheating is also pretty easy for
honest Exit nodes to detect, and cheating clients only benefit in the
upload direction.
To see this a bit more concretely, if we sub in TCP Reno in
BOOTLEG_RTT_TOR, during slow start we have:
  T = (1-a)*rtt_min + a*rtt_max
  If rtt < T:
    cwnd = cwnd + SENDME_RATE
  else:
    exit slow start
And then for steady state:
  If rtt < T:
    cwnd = cwnd + SENDME_RATE/cwnd
  else:
    cwnd = cwnd * 0.5
The 'a' parameter from BOOTLEG_RTT_TOR is operating similarly to the
queue_len criterion of [TCP_VEGAS_TOR], but the window update rules are
[TCP_RENO_TOR], which are more aggressive than TCP Vegas.
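Putting the threshold and both window update rules together, a minimal
runnable sketch looks like the following (SENDME_RATE and the 'a' value
are illustrative, not specified anywhere above):

```python
# Sketch of BOOTLEG_RTT_TOR with TCP Reno window rules, per the
# pseudocode above. SENDME_RATE and the 'a' value are illustrative.
SENDME_RATE = 100

def rtt_threshold(rtt_min, rtt_max, a=0.25):
    """T = (1-a)*rtt_min + a*rtt_max -- the congestion threshold."""
    return (1 - a) * rtt_min + a * rtt_max

def slow_start_update(cwnd, rtt, threshold):
    """Returns (new_cwnd, still_in_slow_start)."""
    if rtt < threshold:
        return cwnd + SENDME_RATE, True
    return cwnd, False          # rtt >= T: exit slow start

def steady_state_update(cwnd, rtt, threshold):
    """Reno-style AIMD, driven by the RTT threshold instead of drops."""
    if rtt < threshold:
        return cwnd + SENDME_RATE / cwnd
    return cwnd * 0.5
```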
So effectively, this is still [TCP_VEGAS_TOR] with a different target
condition for queue_len. This means we could expect BOOTLEG_RTT_TOR to
drive the congestion to some higher rate of queue use than
[TCP_VEGAS_TOR], and senders who use it may out-compete Vegas much like
TCP Reno out-competes Vegas, until Circuit-EWMA starts punishing them
(at the expense of more queue memory waste).
This also shows us a crucial testing consideration: [TCP_VEGAS_TOR] may
not compete well with our current SENDME congestion control, since its
ability to set queue_target may cause it to de-prioritize itself below
competing SENDME flows. BOOTLEG_RTT_TOR, on the other hand, will allow
us to set its 'a' parameter at whatever point makes it competitive with
SENDME. For this reason, BOOTLEG_RTT_TOR may yield better performance
characteristics until all clients upgrade.
Section II.D: TCP Westwood [TCP_WESTWOOD_TOR] [PROPOSAL_CANDIDATE]
https://tools.ietf.org/html/rfc8290#section-4.1
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood
TCP Westwood is essentially TCP Reno, except that upon a congestion
signal, if the current BDP estimate is greater than half the current
congestion window, TCP Westwood backs off to that estimate instead of
halving the congestion window. In other words, upon a congestion
signal, TCP Westwood does:
  cwnd = max(cwnd * 0.5, BDP_estimate)
This has the effect of recovering from congestion events much quicker
than TCP Reno, and thus utilizing more of the path capacity.
Westwood+ also has some improvements on TCP Vegas's BDP_estimate in
order to smooth it in the presence of delayed and/or dropped acks:
  BWE_k = a*BWE_(k-1) + (1-a)*BWM_k
We may not need this smoothing as much, because we do not have any
dropped acks, and dropped acks were far more damaging to TCP Vegas than
slightly delayed acks.
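A sketch of the Westwood backoff and the Westwood+ smoothing above (the
smoothing weight 'a' here is an assumed illustrative value):

```python
# Sketch of the Westwood backoff and the Westwood+ bandwidth-estimate
# smoothing described above. The smoothing weight 'a' is illustrative.
def westwood_on_congestion(cwnd, bdp_estimate):
    """On a congestion signal, back off to the larger of cwnd/2 and
    the current BDP estimate (instead of always halving)."""
    return max(cwnd * 0.5, bdp_estimate)

def westwood_plus_bwe(prev_bwe, bwm_sample, a=0.9):
    """BWE_k = a*BWE_(k-1) + (1-a)*BWM_k: smooth raw bandwidth samples."""
    return a * prev_bwe + (1 - a) * bwm_sample
```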
Section III: Queue Management Algorithms [QUEUE_MANAGEMENT]
On the Internet, Queue Management Algorithms are used by routers to
decide how long their queues should be, which packets they should drop
and when, and which connections to send ECN signals on and when. Queue
management is almost as large of a research area as congestion control.
Thanks to BitTorrent and video streaming, queue management also grew to
consider fairness or QoS between flows. This is done by separating
queues or imposing a scheduling algorithm on packet ordering. For an
excellent overview of fairness vs queue management, see RFC 7806:
https://tools.ietf.org/html/rfc7806
Since Tor relays cannot drop packets, our design goals are slightly
different than Internet queue management. We care about the following
design goals:
1. Prioritize well-behaved/light circuits over loud/heavy ones
2. Decide when to send ECN signals
3. Decide which circuits to send ECN signals on
4. Detect which circuits are queuing too much (or otherwise cheating)
Recall that Tor has already deployed Circuit-EWMA in order to meet goal
#1. Circuit-EWMA tries to give priority to less loud circuits in the
circuit scheduler, so that interactive applications experience less
latency than bulk transfers.
https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf
Circuit-EWMA can be viewed as a fairness algorithm as per RFC 7806
Section 2, since it schedules flows independently with the target
fairness property of prioritizing quiet circuits' cells over loud ones.
When Circuit-EWMA is used in combination with RTT as a congestion
signal, it is effectively *also* a marking algorithm. This is because it
*also* distributes RTT congestion signals more frequently to loud
circuits:
https://tools.ietf.org/html/rfc7806#section-2
https://tools.ietf.org/html/rfc7806#section-3
Recall that Tor also had to deploy KIST, to move queues from the Linux
kernel into Tor, for Circuit-EWMA to properly prioritize them. See
Sections 4 and 5 of the full-length KIST paper for details on this
interaction, and see page 5 for a good graphical depiction of all the
queuing points in Tor and the Kernel:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
It is likely the case that Circuit-EWMA and KIST are all we need if we
only use RTT signaling. But if we add in ECN, at minimum we will need to
update Circuit-EWMA to inform us which circuit(s) to send ECN signals
on. In fact, thanks to RFC7806's distinction between fairness and
management, modern Internet queue management algorithms compose well
with Circuit-EWMA, allowing us to still use Circuit-EWMA for the
fairness portion, but use an existing Internet algorithm for the ECN
marking. So it is useful to review Internet history here, as well as
current state-of-the-art.
On the Internet, queue management algorithms arose fairly early for one
primary reason: early routers performed what is called "tail dropping".
Tail dropping means that routers allowed their queues to get full, and
then started dropping all packets that arrived while the queue remained
full. This has all kinds of negative consequences, but the easiest to
intuitively understand is that if you wait until your queue is full
before you start dropping packets, and then drop all of the packets that
arrive from then on, competing TCP connections will all back off at the
same time, leading to lulls in network utilization.
The key innovation to address tail dropping was to randomly drop (or ECN
mark) packets early, before the queue was full, with a per-packet drop
probability. This probability was computed based on many factors of the
link, and sometimes based on individual TCP connection tracking.
Until very recently, both queue management and fair queuing algorithms
tended to have many knobs that must be tuned for specific kinds of links
and load levels. The Internet seems to be converging on three main
contenders for "knobless" queue management: PIE, ARED, and CoDel. Here's
some recent review material:
http://content.ikr.uni-stuttgart.de/Content/Publications/Archive/Wa_EUNICE_…
https://datatracker.ietf.org/meeting/88/materials/slides-88-iccrg-4/
However, be aware that these knobless systems make assumptions about the
congestion control algorithms and the latency of the connections that
may not apply to us. The assumptions are most clearly specified in
Section 3.2 of the CoDel RFC, and are closely related to the "TCP
Fairness vs TCP Reno" assumption from our congestion control review:
https://tools.ietf.org/html/rfc8289#section-3.2
Ok, so let's review the top "knobless" contenders.
Section III.A: PIE and ARED [ARED_PIE_TOR]
https://tools.ietf.org/html/rfc8033
http://www.icir.org/floyd/papers/adaptiveRed.pdf
As a quick summary, both PIE and ARED manage their queues by dropping
packets probabilistically. In both systems, the per-packet drop
probability is updated in proportion to average queue transit time --
longer queues mean the drop probability is increased.
Both systems decide to drop individual packets from the queue using
this probability. This has the effect that louder flows (with more
packets) are also more likely to experience drops (or ECN congestion
marking). But otherwise they have no explicit notion of fairness.
This makes them marking algorithms as per RFC7806's classification. If
explicit QoS or fairness is desired, they have to be deployed in
combination with a fairness algorithm to prioritize between separate
queues, again as per RFC 7806 Section 3.3:
https://tools.ietf.org/html/rfc7806#section-3.3
Recall that Tor has multiple levels of queues: Circuit, TLS connections,
and kernel buffers, as per Figure 1 on page 5 in the KIST paper:
https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
Even if we keep Circuit-EWMA for scheduling (and thus fairness), we can
still add in PIE or ARED per each TLS connection. If the average total
queue length for all circuits on a TLS connection grows beyond the
marking threshold for PIE/ARED, we could then begin sending congestion
signals on circuits, as they send cells, as per the cell drop
probability of these algorithms.
However, because we only want to send one BACKWARD_ECN_TOR signal per
circuit, we will have to save these signals for circuits that have not
yet gotten one. It is an open question how we should manage the mark
probability if we have to pass over a circuit because we already sent a
backward ECN signal on it.
Section III.B: CoDel and FQ_CoDel [CODEL_TOR] [PROPOSAL_CANDIDATE]
https://queue.acm.org/detail.cfm?id=2209336
https://tools.ietf.org/html/rfc8289
https://tools.ietf.org/html/rfc8290
CoDel is a relatively recent algorithm that is gaining traction because
of its simple elegance. Like PIE and ARED, CoDel itself is a marking
algorithm, and can be combined with any fairness/QoS algorithm. It also
comes with its own fairness algorithm, called FQ_CoDel.
https://tools.ietf.org/html/rfc7806#section-3.2
Despite being touted as parameterless, CoDel does have two parameters:
inspection_interval <- set above max RTT (but not too high)
min_queue_target <- set to 5% of min RTT
CoDel tags packets with timestamps as soon as they enter the queue.
Every inspection_interval, it examines the queue to see if any packets
have waited longer than min_queue_target. If they have, it begins
dropping packets (sending ECN signals in our case).
Because CoDel is accurately tracking the time *each* packet spends in
the queue, it is able to differentiate between "good" queues (not
congested), vs "bad" queues (congested), on a packet-by-packet basis.
This good vs bad distinction is determined by comparing the packet
transit time to min_queue_target. If a packet exceeds this transit time,
then the queue is bad, and the packet must be dropped or ECN marked.
It is this per-packet distinction that significantly separates CoDel
from PIE and ARED, which both use average queue times. This was the
source of some controversy, as per-packet timestamping was initially
argued to be too expensive. But on modern commodity hardware, this
per-packet timestamping is actually incredibly efficient. It also allows
CoDel to converge on the min_queue_target very rapidly, with high link
utilization.
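To make the per-packet sojourn tracking concrete, here is a minimal
CoDel-flavored sketch. The min_queue_target value is illustrative, and
the full CoDel state machine (interval tracking, drop-rate ramping) is
omitted:

```python
# Minimal sketch of CoDel-style per-cell sojourn tracking: tag each cell
# with its arrival time, then compare its queue time to min_queue_target
# at dequeue. Values are illustrative; the full CoDel state machine
# (interval tracking, drop-rate ramping) is omitted.
from collections import deque

MIN_QUEUE_TARGET_MS = 5   # hypothetical target, e.g. 5% of min RTT

class SojournQueue:
    def __init__(self):
        self.q = deque()

    def enqueue(self, cell, now_ms):
        self.q.append((cell, now_ms))   # timestamp on entry

    def dequeue(self, now_ms):
        """Returns (cell, should_mark): mark (ECN in our case) when the
        cell's sojourn time exceeded the target."""
        cell, arrived = self.q.popleft()
        return cell, (now_ms - arrived) > MIN_QUEUE_TARGET_MS
```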
However, because we can't drop packets, it would seem that sojourn
times in Tor will *not* decrease down to min_queue_target immediately
upon congestion. We will have to wait for the actual backoff signal to
propagate (1 RTT) before the queue size changes due to backoff.
According to Toke Høiland-Jørgensen (one of the FQ_CoDel spec authors),
this should not be that significant, but it should be tested.
FQ_CoDel extends CoDel to provide fairness in a very similar way as
Circuit-EWMA does, by separating out TCP connections into "generated a
queue recently" and "did not generate a queue recently". This allows
packets from flows that are not causing congestion to be sent more
quickly than flows that are. This also has the side effect that when
connections build long FQ queues, their packets spend more time in
queues, and have an increased probability of being ECN marked or
dropped.
https://ieeexplore.ieee.org/document/8469111
Like CoDel, FQ_CoDel relies on the assumption that it can drop a lot of
packets at once to correct queue sizes instantly:
https://tools.ietf.org/html/rfc8290#section-4.1
This makes me think we should not rush to replace Circuit-EWMA with
FQ_CoDel, but we may get benefit from doing CoDel on a per-TLS
connection basis, just as is described for PIE and ARED above. In this
case, it is more clear which circuits to ECN mark. Whenever a cell has a
Circuit-EWMA circuitmux queue transit time longer than the target queue
time threshold (min_queue_target), that circuit gets an ECN mark. If it
has already been ECN marked, no biggie. In other words: the decision to
ECN mark circuits depends only on how long their cells spend in our
queues, and does not require any capacity estimates or other
measurements. Very clean.
So if we decide we want to try ECN, I think Circuit-EWMA with CoDel is a
smart first choice. We would then use CoDel to deliver exactly one
BACKWARD_ECN_TOR signal when it detects a circuit's queues are too long.
This RTT/2 signal should cause the circuit to exit slow start more
quickly than RTT measurement by endpoints will.
Section IV: Next Steps [NEXT_STEPS]
After this extended review, I am convinced that an RTT-based approach
can provide a comprehensive solution to all of our design requirements,
and can be enhanced based on experimentation and further additions. Such
an approach also only requires clients and Exits (or just clients and
onion services) to upgrade to initially deploy.
However, we can still improve on this with a few enhancements. In
particular, we can compare [TCP_VEGAS_TOR] to [TCP_WESTWOOD_TOR] to
BOOTLEG_RTT_TOR, and see how they perform in competition with each
other, and in competition with Tor's currently fixed-size SENDME
windows. We can also implement CoDel in circuitmux to track transit
times and determine which circuits to send a single BACKWARD_ECN_TOR
signal on, to exit slow start on circuits more quickly, and we can play
around with higher initial window sizes. We can do this experimentation
first in Shadow, and also on the live network via consensus parameters.
The immediate next steps here are to create a proper Tor Proposal from
the sections tagged [PROPOSAL_CANDIDATE] above.
--
Mike Perry
Filename: xxx-flashflow.txt
Title: FlashFlow: A Secure Speed Test for Tor (Parent Proposal)
Author: Matthew Traudt, Aaron Johnson, Rob Jansen, Mike Perry
Created: 23 April 2020
Status: Draft
1. Introduction
FlashFlow is a new distributed bandwidth measurement system for Tor that
consists of a single authority node ("coordinator") instructing one or
more measurement nodes ("measurers") when and how to measure Tor relays.
A measurement consists of the following steps:
1. The measurement nodes demonstrate to the target relay that they
have permission to perform measurements.
2. The measurement nodes open many TCP connections to the target relay
and create a one-hop circuit to the target relay on each one.
3. For 30 seconds the measurement nodes send measurement cells to the
target relay and verify that the cells echoed back match the ones
sent. During this time the relay caps the amount of background
traffic it transfers. Background and measurement traffic are
handled separately at the relay. Measurement traffic counts towards
all the standard existing relay statistics.
4. For every second during the measurement, the measurement nodes
report to the authority node how much traffic was echoed back. The
target relay also reports the amount of per-second background
(non-measurement) traffic.
5. The authority node sums the per-second reported throughputs into 30
sums (one for each second) and calculates the median. This is the
estimated capacity of the relay.
FlashFlow performs a measurement of every relay according to a schedule
described later in this document. Periodically it produces relay
capacity estimates in the form of a v3bw file, which is suitable for
direct consumption by a Tor directory authority. Alternatively an
existing load balancing system such as Simple Bandwidth Scanner could be
modified to use FlashFlow's v3bw file as input.
It is envisioned that each directory authority that wants to use
FlashFlow will run their own FlashFlow deployment consisting of a
coordinator that they run and one or more measurers that they trust
(e.g. because they run them themselves), similar to how each runs their
own Torflow/sbws. Section 5.2 of this proposal describes long term plans
involving multiple FlashFlow deployments.
FlashFlow is more performant than Torflow: FlashFlow takes 5 hours to
measure the entire existing Tor network from scratch (with 3 Gbit/s
measurer capacity) while Torflow takes 2 days; FlashFlow measures relays
it hasn't seen recently as soon as it learns about them (i.e. every new
consensus) while Torflow can take a day or more; and FlashFlow
accurately measures new high-capacity relays the first time and every
time while Torflow takes days/weeks to assign them their full fair share
of bandwidth (especially for non-exits). FlashFlow is more secure than
Torflow: FlashFlow allows a relay to inflate its measured capacity by up
to 1.33x (configured by a parameter) while Torflow allows weight
inflation by a factor of 89x [0] or even 177x [1].
After an overview in section 2 of the planned deployment stages,
sections 3, 4, and 5 discuss the short, medium, and long term
deployment plans in more detail.
2. Deployment Stages
FlashFlow's deployment shall be broken up into three stages.
In the short term we will implement a working FlashFlow measurement
system. This requires code changes in little-t tor and an external
FlashFlow codebase. The majority of the implementation work will be
done in the short term, and the product is a complete FlashFlow
measurement system. Remaining pieces (e.g. better authentication) are
added later for enhanced security and network performance.
In the medium term we will begin collecting data with a FlashFlow
deployment. The intermediate results and v3bw files produced will be
made available (semi?) publicly for study.
In the long term experiments will be performed to study ways of using FF
v3bw files to improve load balancing. Two examples: (1) using FF v3bw
files instead of sbws's (and eventually phasing out torflow/sbws), and
(2) continuing to run sbws but use FF's results as a better estimate of
relay capacity than observed bandwidth. Authentication and other
FlashFlow features necessary to make it completely ready for full
production deployment will be worked on during this long term phase.
3. FlashFlow measurement system: Short term
The core measurement mechanics will be implemented in little-t tor, but
a separate codebase for the FlashFlow side of the measurement system
will also be created. This section is divided into three parts: first a
discussion of changes/additions that logically reside entirely within
tor (essentially: relay-side modifications), second a discussion of the
separate FlashFlow code that also requires some amount of tor changes
(essentially: measurer-side and coordinator-side modifications), and
third a security discussion.
3.1 Little-T Tor Components
The primary additions/changes that entirely reside within tor on the
relay side:
- New torrc options/consensus parameters.
- New cell commands.
- Pre-measurement handshaking (with a simplified authentication
scheme).
- Measurement mode, during which the relay will echo traffic with
measurers, set a cap on the amount of background traffic it
transfers, and report the amount of transferred background traffic.
3.1.1 Parameters
FlashFlow will require some consensus parameters/torrc options. Each has
some default value if nothing is specified; the consensus parameter
overrides this default value; the torrc option overrides both.
FFMeasurementsAllowed: A global toggle on whether or not to allow
measurements. Even if all other settings would allow a measurement, if
this is turned off, then no measurement is allowed. Possible values: 0,
1. Default: 0 (disallowed).
FFAllowedCoordinators: The list of coordinator TLS certificate
fingerprints that are allowed to start measurements. Relays check their
torrc when they receive a connection from a FlashFlow coordinator to see
if it's on the list. If they have no list, they check the consensus
parameter. If neither exists, then no FlashFlow deployment is allowed to
measure this relay. Default: empty list.
FFMeasurementPeriod: A relay should expect, on average, to be measured by
each FlashFlow deployment once each measurement period. A relay will not
allow itself to be measured more than twice by a FlashFlow deployment in
any time window of this length. Relays should not change this option
unless they really know what they're doing. Changing it at the relay
will not change how often FlashFlow will attempt to measure the relay.
Possible values are in the range [1 hour, 1 month] inclusive. Default: 1
day.
FFBackgroundTrafficPercent: The maximum amount of regular
non-measurement traffic a relay should handle while being measured, as a
percent of total traffic (measurement + non-measurement). This
parameter is a trade off between having to limit background traffic and
limiting how much a relay can inflate its result by handling no
background traffic but reporting that it has done so. Possible values
are in the range [0, 99] inclusive. Default: 25 (a maximum inflation
factor of 1.33).
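The 1.33 figure follows directly from the percentage: if up to p
percent of reported traffic may be (unverifiable) background, a relay
that fabricates its background report inflates its estimate by at most
1/(1 - p/100):

```python
# Maximum inflation factor implied by FFBackgroundTrafficPercent: if up
# to p percent of total reported traffic may be unverifiable background,
# a relay that fabricates its background report inflates its estimate by
# at most 1/(1 - p/100).
def max_inflation_factor(background_percent):
    return 1 / (1 - background_percent / 100)
```

With the default of 25, this gives 1/(1 - 0.25) = 1.33, matching the
factor stated above.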
FFMaxMeasurementDuration: The maximum amount of time, in seconds, that
is allowed to pass from the moment the relay is notified that a
measurement will begin soon and the end of the measurement. If this
amount of time passes, the relay shall close all measurement connections
and exit its measurement mode. Note this duration includes handshake
time, thus it necessarily is larger than the expected actual measurement
duration. Possible values are in the range [10, 120] inclusive.
Default: 45.
3.1.2 New Cell Types
FlashFlow will introduce a new cell command MEASURE.
The payload of each MEASURE cell consists of:
Measure command [1 byte]
Length [2 bytes]
Data [Length-3 bytes]
The measure commands are:
0 -- MSM_PARAMS [forward]
1 -- MSM_PARAMS_OK [backward]
2 -- MSM_ECHO [forward and backward]
3 -- MSM_BG [backward]
4 -- MSM_ERR [forward and backward]
Forward cells are sent from the measurer/coordinator to the relay.
Backward cells are sent from the relay to the measurer/coordinator.
MSM_PARAMS and MSM_PARAMS_OK are used during the pre-measurement stage
to tell the target what to expect and for the relay to positively
acknowledge the message. MSM_ECHO cells are the measurement traffic;
the measurer generates them, sends them to the target, and the target
echoes them back. The target sends a MSM_BG cell once per second to report
the amount of background traffic it is handling. MSM_ERR cells are used
to signal to the other party that there has been some sort of problem
and that the measurement should be aborted. These measure commands are
described in more detail in the next section.
The only cell that sometimes undergoes cell encryption is MSM_ECHO; no
other cell ever gets cell encrypted. (All cells are transmitted on a
regular TLS-wrapped OR connection; that encryption still exists.)
The relay "decrypts" MSM_ECHO cells before sending them back to the
measurer; this mirrors the way relays decrypt/encrypt RELAY_DATA cells
in order to induce realistic cryptographic CPU load. The measurer
usually skips encrypting MSM_ECHO cells to reduce its own CPU load;
however, to verify the relay is actually correctly decrypting all cells,
the measurer will choose random outgoing cells, encrypt them, remember
the ciphertext, and verify the corresponding incoming cell matches.
3.1.3 Pre-Measurement Handshaking/Starting a Measurement
The coordinator connects to the target relay and sends it a MSM_PARAMS
cell. If the target is unwilling to be measured at this time or if the
coordinator didn't use a TLS certificate that the target trusts, it
responds with an error cell and closes the connection. Otherwise it
checks that the parameters of the measurement are acceptable (e.g. the
version is acceptable, the duration isn't too long, etc.). If the
target is happy, it sends a MSM_PARAMS_OK, otherwise it sends a MSM_ERR
and closes the connection.
Upon learning the IP addresses of the measurers from the coordinator in
the MSM_PARAMS cell, the target whitelists their IPs in its DoS
detection subsystem until the measurement ends (successfully or
otherwise), at which point the whitelist is cleared.
Upon receiving a MSM_PARAMS_OK from the target, the coordinator will
instruct the measurers to open their TCP connections with the target. If
the coordinator or any measurer receives a MSM_ERR, it reports the error
to the coordinator and considers the measurement a failure. It is also a
failure if any measurer is unable to open at least half of its TCP
connections with the target.
The payload of MSM_PARAMS cells [XXX more may need to be added]:
  - version [1 byte]
  - msm_duration [1 byte]
  - num_measurers [1 byte]
  - measurer_info [num_measurers times]
    - ipv4_addr [4 bytes]
    - num_conns [2 bytes]
version dictates how this MSM_PARAMS cell shall be parsed. msm_duration
is the duration, in seconds, that the actual measurement will last.
num_measurers is how many measurer_info structs follow. For each
measurer, the ipv4_addr it will use when connecting to the target is
provided, as is num_conns, the number of TCP connections that measurer
will open with the target. Future versions of FlashFlow and MSM_PARAMS
will use TLS certificates instead of IP addresses.
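As an illustration, a parser for the version-1 layout above might look
like the following. Note the proposal does not specify byte order for
num_conns here, so network byte order (big-endian) is assumed:

```python
# Hypothetical parser for the version-1 MSM_PARAMS payload laid out
# above. The proposal does not specify byte order for num_conns, so
# network byte order (big-endian) is assumed.
import struct

def parse_msm_params(payload: bytes):
    version, msm_duration, num_measurers = struct.unpack_from("!BBB", payload, 0)
    measurers, off = [], 3
    for _ in range(num_measurers):
        # Each measurer_info: ipv4_addr [4 bytes] + num_conns [2 bytes]
        a, b, c, d, num_conns = struct.unpack_from("!4BH", payload, off)
        measurers.append(("%d.%d.%d.%d" % (a, b, c, d), num_conns))
        off += 6
    return version, msm_duration, measurers
```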
MSM_PARAMS_OK has no payload: it's just padding bytes to make the cell
514 bytes long.
The payload of MSM_ECHO cells:
- arbitrary bytes [max to fill up 514 byte cell]
The payload of MSM_BG cells:
- second [1 byte]
- sent_bg_bytes [4 bytes]
- recv_bg_bytes [4 bytes]
second is the number of seconds since the measurement began. MSM_BG
cells are sent once per second from the relay to the FlashFlow
coordinator. The first cell will have this set to 1, and each
subsequent cell will increment it by one. sent_bg_bytes is the number of
background traffic bytes sent in the last second (since the last MSM_BG
cell). recv_bg_bytes is the same but for received bytes.
The payload of MSM_ERR cells:
- err_code [1 byte]
- err_str [possibly zero-len null-terminated string]
The error code is one of:
[... XXX TODO ...]
255 -- OTHER
The error string is optional in all cases. It isn't present if the first
byte of err_str is null, otherwise it is present. It ends at the first
null byte or the end of the cell, whichever comes first.
3.1.4 Measurement Mode
The relay considers the measurement to have started the moment it
receives the first MSM_ECHO cell from any measurer. At this point, the
relay
- Starts a repeating 1s timer on which it will report the amount of
background traffic to the coordinator over the coordinator's
connection.
- Enters "measurement mode" and limits the amount of background
traffic it handles according to the torrc option/consensus
parameter.
The relay decrypts and echoes back all MSM_ECHO cells it receives on
measurement connections until it has reported its amount of background
traffic the same number of times as there are seconds in the measurement
(e.g. 30 per-second reports for a 30 second measurement). After sending
the last MSM_BG cell, the relay drops all buffered MSM_ECHO cells,
closes all measurement connections, and exits measurement mode.
During the measurement the relay targets a ratio of background traffic
to measurement traffic as specified by a consensus parameter/torrc
option. For a given ratio r, if the relay has handled x cells of
measurement traffic recently, Tor then limits itself to y = xr/(1-r)
cells of non-measurement traffic this scheduling round. The target will
enforce that a minimum of 10 Mbit/s of measurement traffic is recorded
since the last background traffic scheduling round to ensure it always
allows some minimum amount of background traffic.
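The ratio rule above works out as follows (a sketch; the cell counts
and r values are illustrative):

```python
# The background-traffic limit above: for target ratio r, having handled
# x measurement cells recently, allow y = x*r/(1-r) background cells in
# this scheduling round. Values are illustrative.
def allowed_background_cells(measurement_cells, r):
    return int(measurement_cells * r / (1 - r))
```

For example, with r = 0.25 and 300 recent measurement cells, 100
background cells are allowed, preserving the 25/75 split.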
3.2 FlashFlow Components
The FF coordinator and measurer code will reside in a FlashFlow
repository separate from little-t tor.
There are three notable parameters for which a FF deployment must choose
values. They are:
- The number of sockets, s, the measurers should open, in aggregate,
with the target relay. We suggest s=160 based on the FF paper.
- The bandwidth multiplier, m. Given an existing capacity estimate for
a relay, z, the coordinator will instruct the measurers to, in
aggregate, send m*z Mbit/s to the target relay. We recommend m=2.25.
- The measurement duration, d. Based on the FF paper, we recommend
d=30 seconds.
The rest of this section first discusses notable functions of the
FlashFlow coordinator, then goes on to discuss FF measurer code that
will require supporting tor code.
3.2.1 FlashFlow Coordinator
The coordinator is responsible for scheduling measurements, aggregating
results, and producing v3bw files. It needs continuous access to new
consensus files, which it can obtain by running an accompanying Tor
process in client mode.
The coordinator has the following functions, which will be described in
this section:
- result aggregation.
- schedule measurements.
- v3bw file generation.
3.2.1.1 Aggregating Results
Every second during a measurement, the measurers send the amount of
verified measurement traffic they have received back from the relay.
Additionally, the relay sends a MSM_BG cell each second to the
coordinator with the amount of non-measurement background traffic it is
sending and receiving.
For each second's reports, the coordinator sums the measurers' reports.
The coordinator takes the minimum of the relay's reported sent and
received background traffic. If, when compared to the measurers' reports
for this second, the relay's claimed background traffic is more than
what's allowed by the background/measurement traffic ratio, then the
coordinator further clamps the relay's report down. The coordinator adds
this final adjusted amount of background traffic to the sum of the
measurers' reports.
Once the coordinator has done the above for each second in the
measurement (e.g. 30 times for a 30 second measurement), the coordinator
takes the median of the 30 per-second throughputs and records it as the
estimated capacity of the target relay.
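The aggregation described above can be sketched as follows (an
illustrative structure, not the actual coordinator code):

```python
# Sketch of the aggregation above: per second, sum the measurers'
# reports, clamp the relay's claimed background traffic to the allowed
# ratio r, then take the median across seconds. Illustrative structure,
# not the actual coordinator code.
import statistics

def estimate_capacity(measurer_reports, bg_reports, r):
    """measurer_reports: per-second lists of measurer throughputs.
    bg_reports: per-second (sent_bg, recv_bg) tuples from the relay."""
    per_second = []
    for reports, (sent_bg, recv_bg) in zip(measurer_reports, bg_reports):
        measured = sum(reports)
        bg = min(sent_bg, recv_bg)
        max_bg = measured * r / (1 - r)        # ratio-implied ceiling
        per_second.append(measured + min(bg, max_bg))
    return statistics.median(per_second)
```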
3.2.1.2 Measurement Schedule
The short term implementation of measurement scheduling will be simpler
than the long term one due to (1) there only being one FlashFlow
deployment, and (2) there being very few relays that support being
measured by FlashFlow. In fact the FF coordinator will maintain a list
of the relays that have updated to support being measured and have opted
in to being measured, and it will only measure them.
The coordinator divides time into a series of 24 hour periods, commonly
referred to as days. Each period has measurement slots that are longer
than a measurement lasts (30s), say 60s, to account for pre- and
post-measurement work. Thus with 60s slots there are 1,440 slots in a
day.
At the start of each day the coordinator considers the list of relays
that have opted in to being measured. From this list of relays, it
repeatedly takes the relay with the largest existing capacity estimate.
It selects a random slot. If the slot has existing relays assigned to
it, the coordinator makes sure there is enough additional measurer
capacity to handle this relay. If so, it assigns this relay to this
slot. If not, it keeps picking new random slots until one has sufficient
additional measurer capacity.
Relays without existing capacity estimates are assumed to have the 75th
percentile capacity of the current network.
If a relay is not online when it's scheduled to be measured, it doesn't
get measured that day.
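The largest-first random slot assignment can be sketched as follows.
The capacity units are illustrative, and the unbounded retry loop is a
simplification; a real scheduler would bound retries and handle the
no-fit case:

```python
# Sketch of the largest-first random slot assignment above. Capacity
# units and the unbounded retry loop are illustrative; a real scheduler
# would bound retries and handle the no-fit case.
import random

def schedule(relay_caps, measurer_capacity, m, num_slots, rng=random):
    slots = [[] for _ in range(num_slots)]
    reserved = [0.0] * num_slots
    for cap in sorted(relay_caps, reverse=True):   # largest first
        need = cap * m                             # measurer capacity required
        while True:
            s = rng.randrange(num_slots)           # pick a random slot
            if reserved[s] + need <= measurer_capacity:
                slots[s].append(cap)
                reserved[s] += need
                break
    return slots
```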
3.2.1.2.1 Example
Assume the FF deployment has 1 Gbit/s of measurer capacity. Assume the
chosen multiplier m=2. Assume there are only 5 slots in a measurement
period.
Consider a set of relays with the following existing capacity estimates
and that have opted in to being measured by FlashFlow.
- 500 Mbit/s
- 300 Mbit/s
- 250 Mbit/s
- 200 Mbit/s
- 100 Mbit/s
- 50 Mbit/s
The coordinator takes the largest relay, 500 Mbit/s, and picks a random
slot for it. It picks slot 3. The coordinator takes the next largest,
300, and randomly picks slot 2. The slots are now:
   0   |   1   |   2   |   3   |   4
-------|-------|-------|-------|-------
       |       |  300  |  500  |
       |       |       |       |
The coordinator takes the next largest, 250, and randomly picks slot 2.
Slot 2 already has 600 Mbit/s of measurer capacity reserved (300*m);
given just 1000 Mbit/s of total measurer capacity, there is just 400
Mbit/s of spare capacity while this relay requires 500 Mbit/s. There is
not enough room in slot 2 for this relay. The coordinator picks a new
random slot, 0.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | | |
The next largest is 200 and the coordinator randomly picks slot 2 again
(wow!). As there is just enough spare capacity, the coordinator assigns
this relay to slot 2.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 |
| | 200 | |
The coordinator randomly picks slot 4 for each of the two remaining
relays, largest first.
0 | 1 | 2 | 3 | 4
-------|-------|-------|-------|-------
250 | | 300 | 500 | 100
| | 200 | | 50
3.2.1.3 Generating V3BW files
Every hour the FF coordinator produces a v3bw file in which it stores
the latest capacity estimate for every relay it has measured in the last
week. The coordinator will create this file on the host's local file
system. Previously-generated v3bw files will not be deleted by the
coordinator. A symbolic link at a static path will always point to the
latest v3bw file.
$ ls -l
v3bw -> v3bw.2020-03-01-05-00-00
v3bw.2020-03-01-00-00-00
v3bw.2020-03-01-01-00-00
v3bw.2020-03-01-02-00-00
v3bw.2020-03-01-03-00-00
v3bw.2020-03-01-04-00-00
v3bw.2020-03-01-05-00-00
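The hourly publish step above (write a timestamped file, never delete
old ones, keep a `v3bw` symlink pointing at the newest) could be
sketched like this. The function name is illustrative; using a
temporary symlink plus an atomic rename so readers never see a broken
link is an assumption on our part, not something the proposal mandates.

```python
import os
import tempfile
from datetime import datetime, timezone

def publish_v3bw(dirpath, contents):
    """Write a timestamped v3bw file and atomically repoint the 'v3bw'
    symlink at it, leaving previously generated files in place."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H-%M-%S")
    fname = "v3bw." + stamp
    with open(os.path.join(dirpath, fname), "w") as f:
        f.write(contents)
    # Create the new symlink under a temporary name, then rename it over
    # the old one: os.replace() is atomic on POSIX, so the 'v3bw' path
    # always resolves to some complete file.
    tmp = os.path.join(dirpath, ".v3bw.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(fname, tmp)  # relative target within dirpath
    os.replace(tmp, os.path.join(dirpath, "v3bw"))
    return fname
```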
3.2.2 FlashFlow Measurer
The measurers take commands from the coordinator, connect to target
relays with many sockets, send them traffic, and verify the received
traffic is the same as what was sent. Measurers need access to a lot of
internal tor functionality. One strategy is to house as much logic as
possible inside a compile-time-optional control port module that calls
into other parts of tor. Alternatively, FlashFlow could link against tor
and call internal tor functions directly.
[XXX for now I'll assume that an optional little-t tor control port
module housing a lot of this code is the best idea.]
Notable new things that internal tor code will need to do on the
measurer (client) side:
1. Open many TLS+TCP connections to the same relay on purpose.
2. Verify echo cells.
3.2.2.1 Open many connections
FlashFlow prototypes needed to "hack in" a flag in the
open-a-connection-with-this-relay function call chain that indicated
whether or not we wanted to force a new connection to be created. Most
of Tor doesn't care if it reuses an existing connection, but FF does
want to create many different connections. The cleanest way to
accomplish this will be investigated.
On the relay side, these measurer connections do not count towards DoS
detection algorithms.
3.2.2.2 Verify echo cells
A parameter will exist to tell the measurers with what frequency they
shall verify that cells echoed back to them match what was sent. This
parameter does not need to exist outside of the FF deployment (e.g. it
doesn't need to be a consensus parameter).
The parameter instructs the measurers to check 1 out of every N cells.
The measurer keeps a count of how many measurement cells it has sent. It
also logically splits its output stream of cells into buckets of size N.
At the start of each bucket (when num_sent % N == 0), the measurer
chooses a random index in the bucket. Upon sending the cell at that
index (num_sent % N == chosen_index), the measurer records the cell.
The measurer also counts cells that it receives. When it receives a cell
at an index that was recorded, it verifies that the received cell
matches the recorded sent cell. If they match, no special action is
taken. If they don't match, the measurer indicates failure to the
coordinator and target relay and closes all connections, ending the
measurement.
3.2.2.2.1 Example
Consider a bucket_size of 1000. For the moment ignore cell encryption.
We start at idx=0 and pick an idx in [0, 1000) to record, say 640. At
idx=640 we record the cell. At idx=1000 we choose a new idx in [1000,
2000) to record, say 1236. At idx=1236 we record the cell. At idx=2000
we choose a new idx in [2000, 3000). Etc.
There are 2000+ cells in flight and the measurer has recorded two items:
- (640, contents_of_cellA)
- (1236, contents_of_cellB)
Consider the receive side now. It counts the cells it receives. At
receive idx=640, it checks the received cell matches the saved cell from
before. At receive idx=1236, it again checks the received cell matches.
Etc.
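The sampling and verification logic walked through above can be
sketched as follows. This is an illustrative Python sketch under the
assumption that cells arrive back in send order, as the proposal's
index-matching scheme implies; the class and method names are made up.

```python
import random

class EchoVerifier:
    """Record one randomly chosen cell per bucket of N sent cells, and
    check that the echoed cell at the same index matches."""

    def __init__(self, bucket_size, rng=random):
        self.n = bucket_size
        self.rng = rng
        self.sent = 0
        self.recv = 0
        self.pick = None     # absolute index chosen within current bucket
        self.recorded = {}   # absolute send index -> cell contents

    def on_send(self, cell):
        if self.sent % self.n == 0:
            # Start of a new bucket: choose a random index inside it.
            self.pick = self.sent + self.rng.randrange(self.n)
        if self.sent == self.pick:
            self.recorded[self.sent] = cell
        self.sent += 1

    def on_recv(self, cell):
        """Return False if a sampled echoed cell does not match what was
        sent, which should end the measurement."""
        idx = self.recv
        self.recv += 1
        if idx in self.recorded:
            return self.recorded.pop(idx) == cell
        return True  # unsampled cells are not checked
```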
3.2.2.2.2 Motivation
A malicious relay may want to skip decryption of measurement cells to
save CPU cycles and obtain a higher capacity estimate. More generally,
it could generate fake measurement cells locally, ignore the measurement
traffic it is receiving, and flood the measurer with more traffic than
it (the measurer) is even sending.
The security of echo cell verification is discussed in section 3.3.1.
3.3 Security
In this section we discuss the security of various aspects of FlashFlow
and the tor changes it requires.
3.3.1 Echo Cell Verification: Bucket Size
A smaller bucket size means more cells are checked and FF is more likely
to detect a malicious target. It also means more bookkeeping overhead
(CPU/RAM).
An adversary that knows bucket_size and cheats on one item out of every
bucket_size items will have a 1/bucket_size chance of getting caught in
the first bucket. This is the worst case adversary. While cheating on
just a single item per bucket yields very little advantage, cheating on
more items per bucket increases the likelihood the adversary gets
caught. Thus only the worst case is considered here.
In general, the odds the adversary can successfully cheat in a single
bucket are
(bucket_size-1)/bucket_size
Thus the odds the adversary can cheat in X consecutive buckets are
[(bucket_size-1)/bucket_size]^X
In our case, X will be highly varied: Slow relays won't see very many
buckets, but fast relays will. The damage to the network a very slow
relay can do by faking being only slightly faster is limited.
Nonetheless, for now we motivate the selection of bucket_size with a
slow relay:
- Assume a very slow relay of 1 Mbit/s capacity that will cheat 1 cell
in each bucket. Assume a 30 second measurement.
- The relay will handle 1*30 = 30 Mbit of traffic during the
measurement, or 3.75 MB, or 3.75 million bytes.
- Cells are 514 bytes. Approximately (e.g. ignoring TLS) 7300 cells
will be sent/recv over the course of the measurement.
- A bucket_size of 50 results in about 146 buckets over the course of
the 30s measurement.
- Therefore, the odds of the adversary cheating successfully are
(49/50)^(146), or about 5.2%.
This sounds high, but a relay capable of double the bandwidth (2 Mbit/s)
will have (49/50)^(2*146) or 0.2% odds of success, which is quite low.
Wanting a <1% chance that a 10 Mbit/s relay can successfully cheat
results in a bucket size of approximately 125:
- 10*30 = 300 Mbit of traffic during 30s measurement. 37.5 million
bytes.
- 37,500,000 bytes / 514 bytes/cell = ~73,000 cells
- bucket_size of 125 cells means 73,000 / 125 = 584 buckets
- (124/125)^(584) = 0.918% chance of successfully cheating
Slower relays can cheat more easily but the amount of extra weight they
can obtain is insignificant in absolute terms. Faster relays are
essentially unable to cheat.
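The arithmetic above can be reproduced with a short helper. This is a
sketch for checking the numbers, not part of the proposal; as in the
text, TLS overhead is ignored and the worst-case adversary cheats on
exactly one cell per bucket.

```python
def cheat_success_odds(relay_mbit, bucket_size, secs=30, cell_bytes=514):
    """Probability that a relay faking one cell per bucket evades
    detection for an entire measurement."""
    # Total cells exchanged: Mbit/s * seconds -> bytes -> cells.
    cells = relay_mbit * secs * 1_000_000 / 8 / cell_bytes
    buckets = cells / bucket_size
    # Odds of surviving one bucket, raised to the number of buckets.
    return ((bucket_size - 1) / bucket_size) ** buckets
```

This matches the figures in the text: roughly 5.2% for a 1 Mbit/s relay
with bucket_size 50, about 0.2% at 2 Mbit/s, and just under 1% for a
10 Mbit/s relay with bucket_size 125.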
3.3.2 Weight Inflation
Target relays are an active part of the measurement process; they know
they are getting measured. While a relay cannot fake the measurement
traffic, it can trivially stop transferring client background traffic
for the duration of the measurement yet claim it carried some. More
generally, there is no verification of the claimed amount of background
traffic during the measurement. The relay can claim whatever it wants,
but it will not be trusted above the ratio the FlashFlow deployment is
configured to know. This places an easy-to-understand, firm, and (if set
as we suggest) low cap on how much a relay can inflate its measured
capacity.
Consider a background/measurement ratio of 1/4, or 25%. Assume the relay
in question has a hard limit on capacity (e.g. from its NIC) of 100
Mbit/s. The relay is supposed to use up to 25% of its capacity for
background traffic and the remaining 75%+ capacity for measurement
traffic. Instead the relay ceases carrying background traffic, uses all
100 Mbit/s of capacity to handle measurement traffic, and reports ~33
Mbit/s of background traffic (33/133 = ~25%). FlashFlow would trust this
and consider the relay capable of 133 Mbit/s. (If the relay were to
report more than ~33 Mbit/s, FlashFlow limits it to just ~33 Mbit/s.)
With r=25%, FlashFlow only allows 1.33x weight inflation.
Prior work shows that Torflow allows weight inflation by a factor of 89x
[0] or even 177x [1].
The ratio chosen is a trade-off between impact on background traffic and
security: r=50% allows a relay to double its weight but won't impact
client traffic for relays with steady state throughput below 50%, while
r=10% allows a very low inflation factor but will cause throttling of
client traffic at far more relays. We suggest r=25% (and thus
1/(1-0.25)=1.33x inflation) for a reasonable trade-off between
performance and security.
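The inflation bound discussed above follows directly from the ratio: a
relay that reports the maximum believable background traffic b against
its real capacity c must satisfy b/(b+c) = r, so b = c*r/(1-r) and the
trusted total is c/(1-r). A tiny helper makes the trade-off explicit
(function name is ours, for illustration):

```python
def max_inflation(r):
    """Maximum weight-inflation factor for background ratio r.

    A relay that drops all background traffic, uses its full capacity c
    for measurement, and claims c*r/(1-r) of background traffic is
    credited with c + c*r/(1-r) = c/(1-r) total capacity."""
    return 1 / (1 - r)
```

For r=0.25 this gives the 1.33x bound used in the example (100 Mbit/s
real capacity, ~33 Mbit/s claimed background, 133 Mbit/s credited), and
for r=0.5 it gives the 2x doubling mentioned above.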
It may be possible to catch relays performing this attack, especially if
they literally drop all background traffic during the measurement: have
the measurer (or some party on its behalf) create a regular stream
through the relay and measure the throughput on the stream
before/during/after the measurement. This can be explored longer term.
3.3.3 Incomplete Authentication
The short term FlashFlow implementation has the relay set two torrc
options if they would like to allow themselves to be measured: a flag
allowing measurement, and the list of coordinator TLS certificates that
are allowed to start a measurement.
The relay drops MSM_PARAMS cells from coordinators it does not trust,
and immediately closes the connection after that. A FF coordinator
cannot convince a relay to enter measurement mode unless the relay
trusts its TLS certificate.
A trusted coordinator specifies in the MSM_PARAMS cell the IP addresses
of the measurers the relay shall expect to connect to it shortly. The
target adds the measurer IP addresses to a whitelist in the DoS
connection limit system, exempting them from any configured connection
limit. If a measurer is behind a NAT, an adversary behind the same NAT
can DoS the relay's available sockets until the end of the measurement.
The adversary could also pretend to be the measurer. Such an adversary
could induce measurement failures and inaccuracies. (Note: the whitelist
is cleared after the measurement is over.)
4. FlashFlow measurement system: Medium term
The medium term deployment stage begins after FlashFlow has been
implemented and relays are starting to update to a version of Tor that
supports it.
We plan to host a FlashFlow deployment consisting of a FF coordinator
and a single FF measurer on a single 1 Gbit/s machine. Data produced by
this deployment will be made available (semi?) publicly, including both
v3bw files and intermediate results.
Any development changes needed during this time would go through
separate proposals.
5. FlashFlow measurement system: Long term
In the long term, finishing-touch development work will be done,
including adding better authentication and measurement scheduling, and
experiments will be run to determine the best way to integrate FlashFlow
into the Tor ecosystem.
Any development changes needed during this time would go through
separate proposals.
5.1 Authentication to Target Relay
Short term deployment already had FlashFlow coordinators using TLS
certificates when connecting to relays, but in the long term, directory
authorities will vote on the consensus parameter for which coordinators
should be allowed to perform measurements. The voting is done in the
same way they currently vote on recommended tor versions.
FlashFlow measurers will be updated to use TLS certificates when
connecting to relays too. FlashFlow coordinators will update the
contents of MSM_PARAMS cells to contain measurer TLS certificates
instead of IP addresses, and relays will update to expect this change.
5.2 Measurement Scheduling
Short term deployment only has one FF deployment running. Long term this
may no longer be the case because, for example, more than one directory
authority decides to adopt it and they each want to run their own
deployment. FF deployments will need to coordinate between themselves
to not measure the same relay at the same time, and to handle new relays
as they join during the middle of a measurement period (during the day).
The following is quoted from Section 4.3 of the FlashFlow paper.
To measure all relays in the network, the BWAuths periodically
determine the measurement schedule. The schedule determines when and
by whom a relay should be measured. We assume that the BWAuths have
sufficiently synchronized clocks to facilitate coordinating their
schedules. A measurement schedule is created for each measurement
period, the length p of which determines how often a relay is
measured. We use a measurement period of p = 24 hours.
To help avoid active denial-of-service attacks on targeted relays,
the measurement schedule is randomized and known only to the
BWAuths. Before the next measurement period starts, the BWAuths
collectively generate a random seed (e.g. using Tor’s
secure-randomness protocol). Each BWAuth can then locally determine
the shared schedule using pseudorandom bits extracted from that
seed. The algorithm to create the schedule considers each
measurement period to be divided into a sequence of t-second
measurement slots. For each old relay, slots for each BWAuth to
measure it are selected uniformly at random without replacement
from all slots in the period that have sufficient unallocated
measurement capacity to accommodate the measurement. When a new
relay appears, it is measured separately by each BWAuth in the first
slots with sufficient unallocated capacity. Note that this design
ensures that old relays will continue to be measured, with new
relays given secondary priority in the order they arrive.
5.3 Experiments
[XXX todo]
5.4 Other Changes/Investigations/Ideas
- How can FlashFlow data be used in a way that doesn't lead to poor load
balancing given the following items that lead to non-uniform client
behavior:
- Guards that high-traffic HSs choose (for 3 months at a time)
- Guard vs middle flag allocation issues
- New Guard nodes (Guardfraction)
- Exit policies other than default/all
- Directory activity
- Total onion service activity
- Super long-lived circuits
- Add a cell that the target relay sends to the coordinator indicating
its CPU and memory usage, whether it has a shortage of sockets, how
much bandwidth load it has been experiencing lately, etc. Use this
information to lower a relay's weight, but never to increase it.
- If FlashFlow and sbws work together (as opposed to FlashFlow replacing
sbws), consider logic for how much sbws can increase/decrease FF
results
- Coordination of multiple FlashFlow deployments: scheduling of
measurements, seeding schedule with shared random value.
- Other background/measurement traffic ratios. Dynamic? (known slow
relay => more allowed bg traffic?)
- Catching relays inflating their measured capacity by dropping
background traffic.
- What to do about co-located relays. Can they be detected reliably?
Should we just add a torrc option a la MyFamily for co-located relays?
- What is the explanation for dennis.jackson's scary graphs in this [2]
ticket? Was it because of the speed test? Why? Will FlashFlow produce
the same behavior?
6. Citations
[0] F. Thill. Hidden Service Tracking Detection and Bandwidth Cheating
in Tor Anonymity Network. Master’s thesis, Univ. Luxembourg, 2014.
[1] A. Johnson, R. Jansen, N. Hopper, A. Segal, and P. Syverson.
PeerFlow: Secure Load Balancing in Tor. Proceedings on Privacy
Enhancing Technologies (PoPETs), 2017(2), April 2017.
[2] Mike Perry: Graph onionperf and consensus information from Rob's
experiments https://trac.torproject.org/projects/tor/ticket/33076
```
Filename: 320-tap-out-again.md
Title: Removing TAP usage from v2 onion services
Author: Nick Mathewson
Created: 11 May 2020
Status: Open
```
(This proposal is part of the Walking Onions spec project. It updates
proposal 245.)
# Removing TAP from v2 onion services
As we implement walking onions, we're faced with a problem: what to do
with TAP keys? They are bulky and insecure, and not used for anything
besides v2 onion services. Keeping them in SNIPs would consume
bandwidth, and keeping them indirectly would consume complexity. It
would be nicer to remove TAP keys entirely.
But although v2 onion services are obsolescent and their
cryptographic parameters are disturbing, we do not want to drop
support for them as part of the Walking Onions migration. If we did
so, then we would force some users to choose between Walking Onions
and v2 onion services, which we do not want to do.
Instead, we describe here a phased plan to replace TAP in v2 onion
services with ntor. This change improves the forward secrecy of
_some_ of the session keys used with v2 onion services, but does not
improve their authentication, which is strongly tied to truncated
SHA1 hashes of RSA1024 keys.
Implementing this change is more complex than similar changes
elsewhere in the Tor protocol, since we do not want clients or
services to leak whether they have support for this proposal, until
support is widespread enough that revealing it is no longer a
privacy risk.
## Ntor keys, link specifiers, and SNIPs in v2 descriptors.
We define these entries that may appear in v2 onion service
descriptors, once per introduction point.
"identity-ed25519"
"ntor-onion-key"
[at most once each per intro point.]
These values are in the same format as and follow the same
rules as their equivalents in router descriptors.
"link-specifiers"
[at most once per introduction point]
This value is the same as the link specifiers in a v3 onion
service descriptor, and follows the same rules.
Services should not include any of these fields unless a new network
parameter, "hsv2-intro-updated" is set to 1. Clients should not parse
these fields or use them unless "hsv2-use-intro-updated" is set to 1.
We define a new field that can be used for hsv2 descriptors with
walking onions:
"snip"
[at most once]
This value is the same as the snip field introduced to a v3
onion service descriptor by proposal (XXX) and follows the
same rules.
Services should not include this field unless a new network parameter,
"hsv2-intro-snip" is set to 1. Clients should not parse this field or use it
unless the parameter "hsv2-use-intro-snip" is set to 1.
Additionally, relays SHOULD omit the following legacy intro point
parameters when a new network parameter, "hsv2-intro-legacy" is set
to 0: "ip-address", "onion-port", and "onion-key". Clients should
treat them as optional when "hsv2-tolerate-no-legacy" is set to 1.
## INTRODUCE cells, RENDEZVOUS cells, and ntor.
We allow clients to specify the rendezvous point's ntor key in the
INTRODUCE2 cell instead of the TAP key. To do this, the client
simply sets KLEN to 32, and includes the ntor key for the relay.
Clients should only use ntor keys in this way if the network parameter
"hsv2-client-rend-ntor" is set to 1, and if the entry "allow-rend-ntor"
is present in the onion service descriptor.
Services should only advertise "allow-rend-ntor" in this way if the
network parameter "hsv2-service-rend-ntor" is set to 1.
## Migration steps
First, we implement all of the above, but set it to be disabled by
default. We use torrc fields to selectively enable them for testing
purposes, to make sure they work.
Once all non-LTS versions of Tor without support for this proposal are
obsolete, we can safely enable "hsv2-client-rend-ntor",
"hsv2-service-rend-ntor", "hsv2-intro-updated", and
"hsv2-use-intro-updated".
Once all non-LTS versions of Tor without support for walking onions
are obsolete, we can safely enable "hsv2-intro-snip",
"hsv2-use-intro-snip", and "hsv2-tolerate-no-legacy".
Once all non-LTS versions of Tor without support for _both_ of the
above implementations are finally obsolete, we can finally set
"hsv2-intro-legacy" to 0.
## Future work
There is a final TAP-like protocol used for v2 hidden services: the
client uses RSA1024 and DH1024 to send information about the
rendezvous point and to start negotiating the session key to be used
for end-to-end encryption.
In theory we could get a benefit to forward secrecy by using ntor
instead of TAP here, but we would get no corresponding benefit for
authentication, since authentication is still ultimately tied to
HSv2's scary RSA1024-plus-truncated-SHA1 combination.
Given that, it might be just as good to allow the client to insert
a curve25519 key in place of their DH1024 key, and use that for the
DH handshake instead. That would be a separate proposal, though:
this proposal is enough to allow all relays to drop TAP support.
```
Filename: 322-dirport-linkspec.md
Title: Extending link specifiers to include the directory port
Author: Nick Mathewson
Created: 27 May 2020
Status: Open
```
## Motivation
Directory ports remain the only way to contact a (non-bridge) Tor
relay that isn't expressible as a Link Specifier. We haven't
specified a link specifier of this kind so far, since it isn't a way
to contact a relay to create a channel.
But authorities still expose directory ports, and encourage relays
to use them preferentially for uploading and downloading. And with
Walking Onions, it would be convenient to try to make every kind of
"address" a link specifier -- we'd like authorities to be able
to specify a list of link specifiers that can be used to contact
them for uploads and downloads.
> It is possible that after revision, Walking Onions won't need a way
> to specify this information. If so, this proposal should be moved
> to "Reserve" status as generally unuseful.
## Proposal
We reserve a new link specifier type "dir-url", for use with the
directory system. This is a variable-length link specifier, containing
a URL prefix. The only currently supported URL schema is "http://".
Implementations SHOULD ignore unrecognized schemas. IPv4 and IPv6
addresses MAY be used directly; hostnames are also allowed.
Implementations MAY ignore hostnames and only use raw addresses.
The URL prefix includes everything through the string "tor" in the
directory hierarchy.
A dir-url link specifier SHOULD NOT appear in an EXTEND cell;
implementations SHOULD reject them if they do appear.
```
Filename: 321-happy-families.md
Title: Better performance and usability for the MyFamily option (v2)
Author: Nick Mathewson
Created: 27 May 2020
Status: Open
```
## Problem statement.
The current family mechanism allows well-behaved relays to
identify that they all belong to the same 'family', and should
not be used in the same circuits.
Right now, families work by having every family member list every
other family member in its server descriptor. This winds up using
O(n^2) space in microdescriptors and server descriptors. (For RAM,
we can de-duplicate families which sometimes helps.) Adding or
removing a server from the family requires all the other servers to
change their torrc settings.
This growth in size is not just a theoretical problem. Family
declarations currently make up a little over 55% of the
microdescriptors in the directory--around 24% after compression.
The largest family has around 270 members. With Walking Onions, 270
members times a 160-bit hashed identifier leads to over 5 kilobytes
per SNIP, which is much greater than we'd want to use.
This is an updated version of proposal 242. It differs by clarifying
requirements and providing a more detailed migration plan.
## Design overview.
In this design, every family has a master ed25519 "family key". A node
is in the family iff its server descriptor includes a certificate of its
ed25519 identity key with the family key. The certificate format is
the one in the tor-certs.txt spec; we would allocate a new
certificate type for this usage. These certificates would need to
include the signing key in the appropriate extension.
Note that because server descriptors are signed with the node's
ed25519 signing key, this creates a bidirectional relationship
between the two keys, so that nodes can't be put in families without
their consent.
## Changes to router descriptors
We add a new entry to server descriptors:
"family-cert" NL
"-----BEGIN FAMILY CERT-----" NL
cert
"-----END FAMILY CERT-----".
This entry contains a base64-encoded certificate as described
above. It may appear any number of times; authorities MAY reject
descriptors that include it more than three times.
## Changes to microdescriptors
We add a new entry to microdescriptors: `family-keys`.
This line contains one or more space-separated strings describing
families to which the node belongs. These strings MUST be sorted in
lexicographic order. Clients MUST NOT depend on any particular property
of these strings.
## Changes to voting algorithm
We allocate a new consensus method number for voting on these keys.
When generating microdescriptors using a suitable consensus method,
the authorities include a "family-keys" line if the underlying
server descriptor contains any valid family-cert lines. For each
valid family-cert in the server descriptor, they add a
base-64-encoded string of that family-cert's signing key.
> See also "deriving family lines from family-keys?" below for an
> interesting but more difficult extension mechanism that I would
> not recommend.
## Relay configuration
There are several ways that we could configure relays to let them
include family certificates in their descriptors.
The easiest would be putting the private family key on each relay,
so that the relays could generate their own certificates. This is
easy to configure, but slightly risky: if the private key is
compromised on any relay, anybody can claim membership in the
family. That isn't so very bad, however -- all the relays would
need to do in this event would be to move to a new private family
key.
A more orthodox method would be to keep the private key somewhere
offline, and use it to generate a certificate for each relay in
the family as needed. These certificates should be made with
long-enough lifetimes, and relays should warn when they are going to
expire soon.
## Changes to relay behavior
Each relay should track which other relays they have seen using the
same family-key as itself. When generating a router descriptor,
each relay should list all of these relays on the legacy 'family'
line. This keeps the "family" lines up-to-date with "family-keys"
lines for compliant relays.
Relays should continue listing relays in their family lines if they
have seen a relay with that identity using the same family-key at
any time in the last 7 days.
The presence of this line should be configured by a network
parameter, `derive-family-line`.
Relays whose family lines do not stay at least mostly in sync with
their family keys should be marked invalid by the authorities.
## Client behavior
Clients should treat node A and node B as belonging to the same
family if ANY of these is true:
* The client has descriptors for A and B, and A's descriptor lists B
in its family line, and B's descriptor lists A in its family line.
* The client has descriptors for A and B, and they both contain the
same entry in their family-keys or family-cert fields.
## Migration
For some time, existing relays and clients will not support family
certificates. Because of this, we try above to make sure that
well-behaved relays will list the same entries in both places.
Once enough clients have migrated to using family
certificates, authorities SHOULD disable `derive-family-line`.
## Security
Listing families remains as voluntary in this design as in today's
Tor, though bad-relay hunters can continue to look for families that
have not adopted a family key.
A hostile relay family could list a "family" line that did not match
its "family-certs" values. However, the only reason to do so would
be in order to launch a client partitioning attack, which is
probably less valuable than the kinds of attacks that they could run
by simply not listing families at all.
## Appendix: deriving family lines from family-keys?
As an alternative, we might declare that _authorities_ should keep
family lines in sync with family-certs. Here is a design sketch of
how we might do that, but I don't think it's actually a good idea,
since it would require major changes to the data flow of the
voting system.
In this design, authorities would include a "family-keys" line in
each router section in their votes corresponding to a relay with any
family-cert. When generating final microdescriptors using this
method, the authorities would use these lines to add entries to the
microdescriptors' family lines:
1. For every relay appearing in a routerstatus's family-keys, the
authorities calculate a consensus family-keys value by including
all those keys that are listed by a majority of the voters listing
the same router with the same descriptor. (This is the algorithm we
use for voting on other values derived from the descriptor.)
2. The authorities then compute a set of "expanded families": one
for each family key. Each "expanded family" is a set containing
every router in the consensus associated with that key in its consensus
family-keys value.
3. The authorities discard all "expanded families" of size 1 or
smaller.
4. Every router listed for the "expanded family" has every other
router added to the "family" line in its microdescriptor. (The
"family" line is then re-canonicalized according to the rules of
proposal 298.)
5. Note that the final microdescriptor consensus will include the
digest of the derived microdescriptor in step 4, rather than the
digest of the microdescriptor listed in the original votes. (This
calculation is deterministic.)
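Steps 2-4 of the sketch above can be expressed compactly. This is our
own illustrative Python, not the proposal's code; it assumes the input
is a mapping from router identity to its consensus family-keys set,
with step 1 (majority voting) already done.

```python
from collections import defaultdict

def derive_family_lines(family_keys_by_router):
    """Group routers by family key, discard singleton families, and
    give each router the union of its co-members (steps 2-4)."""
    members = defaultdict(set)
    for router, keys in family_keys_by_router.items():
        for k in keys:
            members[k].add(router)
    # Step 3: drop "expanded families" of size 1 or smaller.
    families = [m for m in members.values() if len(m) > 1]
    # Step 4: each router's family line lists every co-member.
    lines = defaultdict(set)
    for fam in families:
        for r in fam:
            lines[r] |= fam - {r}
    return {r: sorted(others) for r, others in lines.items()}
```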
The problem with this approach is that authorities would have
to fetch microdescriptors they do not have in order to replace their
family lines. Currently, voting never requires an authority to
fetch a microdescriptor from another authority. If we implement
vote compression and diffs as in the Walking Onions proposal,
however, we might suppose that votes could include microdescriptors
directly.
Still, this is likely more complexity than we want for a transition
mechanism.
## Appendix: Deriving family-keys from families??
We might also imagine that authorities could infer which families
exist from the graph of family relationships, and then include
synthetic "family-keys" entries for routers that belong to the same
family.
This has two challenges: first, to compute these synthetic family
keys, the authorities would need to have the same graph of family
relationships to begin with, which once again would require them to
include the complete list of families in their votes.
Secondly, finding all the families is equivalent to finding all
maximal cliques in a graph. This problem is NP-hard in its general
case. Although polynomial solutions exist for nice well-behaved
graphs, we'd still need to worry about hostile relays including
strange family relationships in order to drive the algorithm into
its exponential cases.
## Appendix: New assigned values
We need a new assigned value for the certificate type used for
family signing keys.
We need a new consensus method for placing family-keys lines in
microdescriptors.
## Appendix: New network parameters
* `derive-family-line`: If 1, relays should derive family lines from
observed family-keys. If 0, they do not. Min: 0, Max: 1. Default: 1.
Message: 1
Date: Tue, 19 May 2020 13:29:54 -0400
From: Nick Mathewson
To: tor-dev(a)lists.torproject.org
Subject: Re: [tor-dev] Proposal 319: RELAY_FRAGMENT cells

On Thu, May 14, 2020 at 3:15 PM David Goulet wrote:
>
> On 11 May (16:47:24), Nick Mathewson wrote:
[...]
> > # Onion service concerns.
> >
> > We allocate a new extension for use in the ESTABLISH_INTRO by onion services,
> > to indicate that they can receive a wide INTRODUCE2 cell. This extension
> > contains:
> >
> >    struct wide_intro2_ok {
> >       u16 max_len;
> >    }
> >
> > We allocate a new extension for use in the 'ESTABLISH_RENDEZVOUS'
> > cell, to indicate acceptance of wide 'RENDEZVOUS2' cells. This
> > extension contains:
> >
> >    struct wide_rend2_ok {
> >       u16 max_len;
> >    }
> >
> > (Note that 'ESTABLISH_RENDEZVOUS' cells do not currently have an
> > extension mechanism. They should be extended to use the same
> > extension format as 'ESTABLISH_INTRO' cells, with extensions placed
> > after the rendezvous cookie.)
>
> Why would a client need to announce wide cells in the ESTABLISH phase as
> opposed to using protover "Relay=N"?

This is not for announcing support of wide cells -- this is for
reporting a setting for how wide fragmented cells should be.

> The maximum length of a fragmented cell is capped to 2^16 (u16) so we don't
> really need the establish process to inform us of the maximum expected length
> but rather use the max_len in the first fragment?

This all comes back to an earlier part of the proposal:

   Not all lengths up to 65535 are valid lengths for a fragmented cell. Any
   length under 499 bytes SHOULD cause the circuit to close, since that could
   fit into a non-fragmented RELAY cell. Parties SHOULD enforce maximum
   lengths for cell types that they understand.

In other words, I'm imagining that there is a maximum length for each
cell type that is much shorter than 65535, even though we're using two
bytes for the length field. The extension in the establish_intro cell
is to tell the intro point the longest introduce1 cell that it should
accept; the extension in the establish_rend cell is to tell the
rendezvous point the longest rendezvous1 cell that it should accept.

Another way we could do this would be with a set of network parameters
to describe the maximum length of each fragmented cell. Do you think
that would be simpler? (I can't quite remember why I specified it this
way in the first place.)

> Furthermore, ESTABLISH_INTRO has extensions (only 1 as of today) so they could
> also be fragments themselves and thus I'm not sure I see the point of having
> two different ways of "expecting" fragments for the ESTABLISH_* cells and the
> INTRO/RENDEZVOUS cells?

The difference here is that everybody can tell which protocols a relay
supports, but there is no automatic way to tell which protocols an
onion service or client supports. Since INTRODUCE2/RENDEZVOUS2 cells
are handled by these clients, they need to get opted into by the
relays. (I'm not sure I understood the question completely.)

> > # Compatibility
> >
> > This proposal will require the allocation of a new 'Relay' protocol version,
> > to indicate understanding of the RELAY_FRAGMENTED command.
>
> Here is a thought about a DoS vector. Here goes:
>
> As an upper limit of 65KB total fragment size, it represents ~126 cells in
> total so I could basically send *125* cells and then stop, which will put in
> memory a bit more than 64KB, and it will stay there until the last fragment
> is received.
>
> And then I do that on 1000 different circuits, bringing the total count in
> memory to 64GB. All stuck there, all "waiting" for the last fragment.
>
> Our OOM would kick in killing circuits, but it just seems to me a very easy
> way to continuously kick the OOM of a _service_, which is a pretty bad side
> channel.

A few responses here:

First, we shouldn't allow 65535-byte fragmented cells. The actual
maximum length should be something more like 1024 or 4096 bytes.

Second, we should make sure that when we are reassembling cells, we
use the same buf_t buffers that we use for other stuff. Our buffers
are timestamped, so we can tell which buffer has had data stalling
for the longest, and we should use that to make sure we're killing
off the right circuits preferentially.

Third, fragments should only be allowed at an onion service for
INTRODUCE2, and those should only come one at a time from each
introduction point, so the number that it's reassembling at a time
will be limited by the number of intro circuits it has open. It'll
be the intro points that have to keep a bunch of cells in assembly
at once, and be ready to kill off circuits that dawdle too long.

Does this make more sense? If so I'll try to clarify it in the proposal.
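As a sanity check on the quoted attack arithmetic, here is a short illustrative calculation. It assumes roughly 498 usable body bytes per RELAY_FRAGMENT (suggested by the proposal's 499-byte threshold, not fixed by it); note that the stalled-memory total for 1000 circuits comes out on the order of 65 MB rather than GB, which is still ample for an OOM side channel on a service.

```python
import math

CELL_BODY_MAX = 65535   # u16 total_len upper bound
FRAG_PAYLOAD = 498      # assumed usable body bytes per fragment (illustrative)

def fragments_needed(total_len, payload=FRAG_PAYLOAD):
    """Fragments required to carry a cell of total_len bytes."""
    return math.ceil(total_len / payload)

# ~132 fragments to carry a maximal 65535-byte cell at 498 bytes each:
assert fragments_needed(CELL_BODY_MAX) == 132

# Worst case: an attacker stalls just short of completion on many circuits.
circuits = 1000
stalled_bytes = circuits * CELL_BODY_MAX   # ~65.5 MB held "waiting"
assert stalled_bytes < 100 * 2**20

# Capping fragmented cells at 4096 bytes, as suggested, shrinks the
# per-circuit exposure by roughly 16x:
assert circuits * 4096 < 5 * 2**20
```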
```
Filename: 319-wide-everything.md
Title: RELAY_FRAGMENT cells
Author: Nick Mathewson
Created: 11 May 2020
Status: Open
```
(This proposal is part of the Walking Onions spec project.)
# Introduction
Proposal 249 described a system for `CREATE` cells to become wider, in order to
accommodate hybrid crypto. And in order to send those cell bodies across
circuits, it described a way to split `CREATE` cells into multiple `EXTEND`
cells.
But there are other cell types that can need to be wider too. For
example, `INTRODUCE` and `RENDEZVOUS` cells also contain key material
used for a handshake: if handshakes need to grow larger, then so do
these cells.
This proposal describes an encoding for arbitrary "wide" relay cells
that can be used to send a wide variant of anything.
To be clear, although this proposal describes a way that all relay
cells can become "wide", I do not propose that wide cells should
actually be _allowed_ for all relay cell types.
# Proposal
We add a new relay cell type: `RELAY_FRAGMENT`. This cell type contains part
of another relay cell. A `RELAY_FRAGMENT` cell can either introduce a new
fragmented cell, or continue one that is already in progress.
The format of a RELAY_FRAGMENT body is one of the following:

    // First body in a series
    struct fragment_begin {
        // What relay_command is in use for the underlying cell?
        u8 relay_command;
        // What will the total length of the cell be once it is reassembled?
        u16 total_len;
        // Bytes for the cell body
        u8 body[];
    }

    // All other bodies in the series
    struct fragment_continued {
        // More bytes for the cell body.
        u8 body[];
    }
To send a fragmented cell, a party first sends a RELAY_FRAGMENT cell
containing a "fragment_begin" payload. This payload describes the total
length of the cell and the relay command, and carries the first bytes of
the cell body. The party then sends the rest of the body in
RELAY_FRAGMENT cells containing "fragment_continued" payloads.
Every fragmented cell other than the last one in a sequence MUST be
filled with as much data as possible. Parties SHOULD close a circuit if
they receive a non-full fragmented cell that is not the last fragment in
a sequence.
Fragmented cells MUST NOT be interleaved with other relay cells on a circuit,
other than cells used for flow control. (Currently, this is only SENDME
cells.) If any party receives any cell on a circuit, other than a flow
control cell or a RELAY_FRAGMENT cell, before the fragmented cell is
complete, then it SHOULD close the circuit.
Parties MUST NOT send extra data in fragmented cells beyond the amount given
in the first 'total_len' field.
Not every relay command may be sent in a fragmented cell. In this proposal,
we allow the following cell types to be fragmented: EXTEND2, EXTENDED2,
INTRODUCE1, INTRODUCE2, RENDEZVOUS. Any party receiving a command that they
believe should not be fragmented should close the circuit.
Not all lengths up to 65535 are valid lengths for a fragmented cell. Any
length under 499 bytes SHOULD cause the circuit to close, since that could
fit into a non-fragmented RELAY cell. Parties SHOULD enforce maximum lengths
for cell types that they understand.
All `RELAY_FRAGMENT` cells for the fragmented cell must have the
same Stream ID. (For those cells allowed above, the Stream ID is
always zero.) Implementations SHOULD close a circuit if they
receive fragments with mismatched Stream ID.
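The fragmentation and reassembly rules above can be sketched as follows. This is an illustrative model, not Tor code: `FRAG_PAYLOAD` is an assumed per-fragment body capacity (the proposal does not fix this value), and the u8/u16 fields are big-endian as in other Tor cell encodings.

```python
import struct

FRAG_PAYLOAD = 498   # assumed per-fragment body capacity (illustrative)
MIN_TOTAL_LEN = 499  # anything shorter fits in a non-fragmented RELAY cell

def fragment(relay_command, body, payload=FRAG_PAYLOAD):
    """Split a cell body into fragment_begin + fragment_continued payloads."""
    total_len = len(body)
    if total_len < MIN_TOTAL_LEN:
        raise ValueError("fits in a non-fragmented RELAY cell; do not fragment")
    head_room = payload - 3  # 1 byte relay_command + 2 bytes total_len
    first = struct.pack(">BH", relay_command, total_len) + body[:head_room]
    rest = [body[i:i + payload] for i in range(head_room, total_len, payload)]
    return [first] + rest

def reassemble(frags, payload=FRAG_PAYLOAD):
    """Rebuild a cell, enforcing total_len and the full-fragment rule."""
    # Every fragment except the last MUST be full.
    if any(len(f) != payload for f in frags[:-1]):
        raise ValueError("non-full interior fragment: close the circuit")
    relay_command, total_len = struct.unpack(">BH", frags[0][:3])
    body = b"".join([frags[0][3:]] + frags[1:])
    if len(body) != total_len:
        raise ValueError("length mismatch: close the circuit")
    return relay_command, body

# Round-trip a 1024-byte EXTEND2 (relay command 14) body:
cmd, body = 14, bytes(1024)
assert reassemble(fragment(cmd, body)) == (cmd, body)
```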
# Onion service concerns.
We allocate a new extension for use in the ESTABLISH_INTRO by onion services,
to indicate that they can receive a wide INTRODUCE2 cell. This extension
contains:
    struct wide_intro2_ok {
        u16 max_len;
    }
We allocate a new extension for use in the `ESTABLISH_RENDEZVOUS`
cell, to indicate acceptance of wide `RENDEZVOUS2` cells. This
extension contains:
    struct wide_rend2_ok {
        u16 max_len;
    }
(Note that `ESTABLISH_RENDEZVOUS` cells do not currently have an
extension mechanism. They should be extended to use the same
extension format as `ESTABLISH_INTRO` cells, with extensions placed
after the rendezvous cookie.)
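Both extension bodies above are a single big-endian u16, so the encoding is trivial; a quick sketch:

```python
import struct

def encode_wide_ok(max_len):
    """Encode a wide_intro2_ok / wide_rend2_ok extension body: one u16 max_len."""
    return struct.pack(">H", max_len)

def decode_wide_ok(data):
    """Decode the two-byte extension body back into max_len."""
    (max_len,) = struct.unpack(">H", data)
    return max_len

assert len(encode_wide_ok(4096)) == 2
assert decode_wide_ok(encode_wide_ok(4096)) == 4096
```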
# Handling RELAY_EARLY
The first fragment of each EXTEND cell should be tagged with `RELAY_EARLY`.
The remaining fragments should not. Relays should accept `EXTEND` cells if and
only if their _first_ fragment is tagged with `RELAY_EARLY`.
> Rationale: We could allow any fragment to be tagged, but that would give
> hostile guards an opportunity to move RELAY_EARLY tags around and build a
> covert channel. But if we later move to a relay encryption method that
> lets us authenticate RELAY_EARLY, we could then require only that _any_
> fragment has RELAY_EARLY set.
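The acceptance rule above reduces to a one-line predicate; an illustrative sketch:

```python
def accept_extend_fragment(fragment_index, relay_early_flag):
    """An EXTEND fragment is acceptable iff RELAY_EARLY appears on the
    first fragment and on no other fragment (illustrative sketch)."""
    return relay_early_flag == (fragment_index == 0)

# RELAY_EARLY on the first fragment only: accepted.
ok = [True, False, False]
assert all(accept_extend_fragment(i, f) for i, f in enumerate(ok))

# A hostile guard moving the tag to a later fragment: rejected.
bad = [False, True, False]
assert not all(accept_extend_fragment(i, f) for i, f in enumerate(bad))
```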
# Compatibility
This proposal will require the allocation of a new 'Relay' protocol version,
to indicate understanding of the RELAY_FRAGMENTED command.
On Fri, 15 May 2020 11:20:54 +1000
teor <teor(a)riseup.net> wrote:
> You could try deleting functions, and then running tor's
> "make test-network-all". If that test still passes, then the code is probably
> unused (in practice).
>
> Chutney's CI does a few more tests for different tor versions, but I'm pretty
> sure they all use the same code.
This is what I will end up doing. I planned on it, but humans tend to
know the purpose of the code better, and computers just run what is
given to them :) so I figured it was best to check here first, with
people more experienced with the codebase.
Thanks,
Caitlin
On Thu, 14 May 2020 19:58:45 -0400
Nick Mathewson <nickm(a)freehaven.net> wrote:
> isInExpectedDirInfoDocs looks like it might once have done something useful.
My inclination is to remove it unless we anticipate needing it in the
near future.
> configure and restart and wait_for_bootstrap are all in use; they are
> invoked by name, from the command line, at the end of runConfigFile,
> where it says `return getattr(network, verb)()`.
Ah, I see now, thank you. I think it would be wise to comment the
functions as such to indicate that they are used on the command line.
> I think the rest are likely to be unused.
I did grep most of these and found no signs of use, so this is my
impression as well.
For now I will make a mental note of all this while I work on more
pertinent improvements to the code, but I do believe it would be nice
to deal with truly unused code sooner rather than later.
Caitlin