[tor-commits] [torspec/main] Update Prop#324 for Flow Control improvements

dgoulet at torproject.org dgoulet at torproject.org
Mon Oct 4 14:52:17 UTC 2021


commit 311c90ab364ae8a10f7388ea6791e88701a40b7f
Author: Mike Perry <mikeperry-git at torproject.org>
Date:   Sun Aug 22 04:05:56 2021 +0000

    Update Prop#324 for Flow Control improvements
    
     - Specify rate advertisement of edge drain rate in XON, to minimize chatter
     - Limit the frequency of XON/XOFF with consensus params
     - Describe dropmark defenses using XON/XOFF limits
     - Describe how half-closed edge connections are handled with flow control
     - Describe flow control consensus parameters
     - Describe flow control shadow experiments and live comparison
     - Create and describe additional consensus parameters that will influence
       congestion control performance and memory usage
     - Clarify performance metrics involved in experiments
     - Remove some stale XXXs and TODOs
---
 proposals/324-rtt-congestion-control.txt | 579 +++++++++++++++++++++++++------
 1 file changed, 466 insertions(+), 113 deletions(-)

diff --git a/proposals/324-rtt-congestion-control.txt b/proposals/324-rtt-congestion-control.txt
index f4397b6..36e10e7 100644
--- a/proposals/324-rtt-congestion-control.txt
+++ b/proposals/324-rtt-congestion-control.txt
@@ -151,6 +151,12 @@ RTT, or 100 times smaller than the stored EWMA RTT, then we do not record that
 estimate, and do not update BDP or the congestion control algorithms for that
 SENDME ack.
 
+Moreover, if a clock stall is detected by *any* circuit, this fact is
+cached, and this cached value is used on circuits for which we do not
+have enough data to compute the above heuristics. This cached value is
+also exported for use by the edge connection rate calculations done by
+[XON_ADVISORY].
+
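+For illustration, the following minimal Python sketch shows the shape of this
+heuristic. The function and variable names, the module-level cached flag, and
+the zero-RTT stall check are hypothetical, not taken from the C-Tor code:
+
+  RTT_GLITCH_RATIO = 100        # "100 times larger or 100 times smaller"
+  clock_stalled_cached = False  # cached across *all* circuits
+
+  def rtt_estimate_usable(circ_ewma_rtt_usec, raw_rtt_usec):
+      global clock_stalled_cached
+      if raw_rtt_usec <= 0:
+          clock_stalled_cached = True   # treat as a stalled/jumped clock
+          return False
+      if circ_ewma_rtt_usec == 0:
+          # Not enough data on this circuit: fall back to the cached,
+          # network-wide stall indicator (also used by [XON_ADVISORY]).
+          return not clock_stalled_cached
+      if (raw_rtt_usec > circ_ewma_rtt_usec * RTT_GLITCH_RATIO or
+              raw_rtt_usec * RTT_GLITCH_RATIO < circ_ewma_rtt_usec):
+          return False   # do not record; skip BDP/congestion updates
+      return True
+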
 2.2. SENDME behavior changes
 
 We will make four major changes to SENDME behavior to aid in computing
@@ -173,15 +179,6 @@ well, using Proposal 325.
   TODO: Refer to or include final version of the negotiation proposal for
   this: https://gitlab.torproject.org/tpo/core/tor/-/issues/40377
 
-Second, all end-to-end relay cells except RELAY_COMMAND_DROP and SENDME
-itself will count towards SENDME cell counts. The details behind how these
-cells are handled is addressed in section [SYSTEM_INTERACTIONS].
-
-   XXX: Hrmm, this is not strictly necessary for this proposal to function.
-   It may make conflux easier though, but we can defer it to then. The
-   current implementation only counts RELAY_COMMAND_DATA towards sendme
-   acks, which is the same behavior as the old fixed sendme implementation.
-
 Third, authenticated SENDMEs can remain as-is in terms of protocol
 behavior, but will require some implementation updates to account for
 variable window sizes and variable SENDME pacing. In particular, the
@@ -337,7 +334,7 @@ With this, the calculation becomes:
 
 Note that because we are counting the number of cells *between* the first
 and last sendme of the congestion window, we must subtract 1 from the number
-of sendmes actually recieved. Over the time period between the first and last
+of sendmes actually received. Over the time period between the first and last
 sendme of the congestion window, the other endpoint successfully read
 (num_sendmes-1) * cc_sendme_inc cells.
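+
+As a purely illustrative bit of arithmetic (the helper names below are
+hypothetical), the count and the resulting delivery-rate estimate look like:
+
+  def cells_acked_between_sendmes(num_sendmes, cc_sendme_inc):
+      # Cells read by the other endpoint between the first and last SENDME.
+      return (num_sendmes - 1) * cc_sendme_inc
+
+  def cells_per_sec(num_sendmes, cc_sendme_inc, first_usec, last_usec):
+      # Dividing by the elapsed time between those SENDMEs gives a delivery
+      # rate in cells/sec, from which a BDP estimate can be derived.
+      cells = cells_acked_between_sendmes(num_sendmes, cc_sendme_inc)
+      return cells * 1_000_000 / (last_usec - first_usec)
+
+  # Example: 5 SENDMEs with cc_sendme_inc = 50 acknowledge (5-1)*50 = 200 cells.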
 
@@ -447,9 +444,10 @@ each ack. Congestion signals are only evaluated when it reaches 0.
 
 Note that because the congestion signal threshold of TOR_WESTWOOD is a
 function of RTT_max, and excessive queuing can cause an increase in RTT_max,
-TOR_WESTWOOD may have a runaway condition. For this reason, we specify a
-cc_westwodd_rtt_m backoff parameter, which can reduce RTT_max by a percentage
-of the difference between RTT_min and RTT_max, in the event of congestion.
+TOR_WESTWOOD may have runaway conditions. Additionally, if stream activity is
+constant, but of a lower bandwidth than the circuit, this will not drive the
+RTT upwards, and this can result in a congestion window that continues to
+increase in the absence of any other concurrent activity.
 
 Here is the complete congestion window algorithm for Tor Westwood. This will run
 each time we get a SENDME (aka sendme_process_circuit_level()):
@@ -604,6 +602,12 @@ following functions:
    - connection_edge_package_raw_inbuf()
    - circuit_resume_edge_reading()
 
+The decision on when a stream is blocked is performed in:
+  - sendme_note_stream_data_packaged()
+  - sendme_stream_data_received()
+  - sendme_connection_edge_consider_sending()
+  - sendme_process_stream_level()
+
 Tor currently maintains separate windows for each stream on a circuit,
 to provide individual stream flow control. Circuit windows are SENDME
 acked as soon as a relay data cell is decrypted and recognized. Stream
@@ -637,50 +641,161 @@ either have to commit to allocating a full congestion window worth
 memory for each stream, or impose a speed limit on our streams.
 
 Hence, we will discard stream windows entirely, and instead use a
-simpler buffer-based design that uses XON/XOFF as a backstop. This will
-allow us to make full use of the circuit congestion window for every
-stream in combination, while still avoiding buffer buildup inside the
-network.
+simpler buffer-based design that uses XON/XOFF to signal when this
+buffer is too large. Additionally, the XON cell will contain advisory
+rate information based on the rate at which that edge connection can
+write data while it has data to write. The other endpoint can rate limit
+sending data for that stream to the rate advertised in the XON, to avoid
+excessive XON/XOFF chatter and sub-optimal behavior.
+
+This will allow us to make full use of the circuit congestion window for
+every stream in combination, while still avoiding buffer buildup inside
+the network.
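+
+For illustration, a sender honoring the advisory rate could use something as
+simple as a token bucket. This sketch is not part of the proposal; the class,
+its names, and the treatment of a zero rate as "no limit" are assumptions:
+
+  import time
+
+  class StreamRateLimiter:
+      """Token bucket capped at the advisory rate from the last XON."""
+      def __init__(self):
+          self.rate_bytes_sec = None    # None: no advisory rate yet
+          self.tokens = 0.0
+          self.last_refill = time.monotonic()
+
+      def on_xon(self, kbps_ewma):
+          # The XON rate field is in 1000 byte/sec units (see [XON_ADVISORY]).
+          self.rate_bytes_sec = kbps_ewma * 1000 if kbps_ewma else None
+
+      def may_send(self, nbytes):
+          if self.rate_bytes_sec is None:
+              return True          # only congestion control limits apply
+          now = time.monotonic()
+          self.tokens = min(self.rate_bytes_sec,  # at most 1 second of burst
+                            self.tokens +
+                            (now - self.last_refill) * self.rate_bytes_sec)
+          self.last_refill = now
+          if self.tokens >= nbytes:
+              self.tokens -= nbytes
+              return True
+          return False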
 
 4.1. Stream Flow Control Without Windows [WINDOWLESS_FLOW]
 
-Each endpoint (client, Exit, or onion service) should send circuit-level
+Each endpoint (client, Exit, or onion service) sends circuit-level
 SENDME acks for all circuit cells as soon as they are decrypted and
-recognized, but *before* delivery to their edge connections. If the edge
-connection is blocked because an application is not reading, data will
-build up in a queue at that endpoint.
-
-Three consensus parameters will govern the max length of this queue:
-xoff_client, xoff_exit, and xoff_mobile. These will be used for Tor
-clients, exits, and mobile devices, respectively. These cutoffs will be
-a percentage of current 'cwnd' rather than number of cells. Something
-like 5% of 'cwnd' should be plenty, since these edge connections should
-normally drain *much* faster than Tor itself.
-
-If the length of an application stream queue exceeds the size provided
-in the appropriate consensus parameter, a RELAY_COMMAND_STREAM_XOFF will
-be sent, which instructs the other endpoint to stop sending from that
-edge connection. This XOFF cell can optionally contain any available
-stream data, as well.
-
-As soon as the queue begins to drain, a RELAY_COMMAND_STREAM_XON will
-sent, which allows the other end to resume reading on that edge
-connection. Because application streams should drain quickly once they
-are active, we will send the XON command as soon as they start draining.
-If the queues fill again, another XOFF will be sent. If this results in
-excessive XOFF/XON flapping and chatter, we will also use consensus
-parameters xon_client, xon_exit, and xon_mobile to optionally specify
-when to send an XON. These parameters will be defined in terms of cells
-below the xoff_* levels, rather than percentage. The XON cells can also
-contain stream data, if any is available.
-
-Tor's OOM killer will be invoked to close any streams whose application
-buffer grows too large, due to memory shortage, or malicious endpoints.
-
-Note that no stream buffer should ever grow larger than the xoff level
-plus 'cwnd', unless an endpoint is ignoring XOFF. So,
-'xoff_{client,exit,mobile} + cwnd' should be the hard-close stream
-cutoff, regardless of OOM killer status.
+recognized, but *before* delivery to their edge connections.
+
+This means that if the edge connection is blocked because an
+application's SOCKS connection or a destination site's TCP connection is
+not reading, data will build up in a queue at that endpoint,
+specifically in the edge connection's outbuf.
+
+Consensus parameters will govern the queue length at which XON and XOFF
+cells are sent, as well as when advisory XON cells that contain rate
+information can be sent. These parameters are separate for the queue
+lengths of exits, and of clients/services.
+
+(Because clients and services will typically have localhost connections
+for their edges, they will need similar buffering limits. Exits may have
+different properties, since their edges will be remote.)
+
+The trunnel relay cell payload definitions for XON and XOFF are:
+
+struct xoff_cell {
+  u8 version IN [0x00];
+}
+
+struct xon_cell {
+  u8 version IN [0x00];
+
+  u32 kbps_ewma;
+}
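+
+For illustration, the wire encoding of these payloads can be sketched as
+follows, assuming trunnel's network (big-endian) integer encoding; the
+function names are hypothetical:
+
+  import struct
+
+  def encode_xoff_cell():
+      return struct.pack("!B", 0x00)               # version
+
+  def encode_xon_cell(kbps_ewma):
+      return struct.pack("!BI", 0x00, kbps_ewma)   # version, kbps_ewma
+
+  def parse_xon_cell(payload):
+      version, kbps_ewma = struct.unpack_from("!BI", payload)
+      if version != 0x00:
+          raise ValueError("unknown xon_cell version")
+      return kbps_ewma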
+
+4.1.1. XON/XOFF behavior
+
+If the length of an edge outbuf queue exceeds the size provided in the
+appropriate client or exit XOFF consensus parameter, a
+RELAY_COMMAND_STREAM_XOFF will be sent, which instructs the other endpoint to
+stop sending from that edge connection.
+
+Once the queue is expected to empty, a RELAY_COMMAND_STREAM_XON will be sent,
+which allows the other end to resume reading on that edge connection. This XON
+also indicates the average rate of queue drain since the XOFF.
+
+Advisory XON cells are also sent whenever the edge connection's drain
+rate changes by more than 'cc_xon_change_pct' percent compared to
+the previously sent XON cell's value.
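+
+A minimal sketch of these decisions, returning which cell (if any) an
+endpoint would send, is shown below. The function, its arguments, and the
+498-byte cell payload constant are illustrative, not the C-Tor interface:
+
+  RELAY_PAYLOAD_SIZE = 498   # stream bytes per full relay cell
+
+  def flow_control_decision(outbuf_len_bytes, outbuf_expected_to_empty,
+                            xoff_sent, drain_rate_kb_sec, last_xon_rate,
+                            cc_xoff_limit_cells, cc_xon_change_pct):
+      outbuf_cells = outbuf_len_bytes // RELAY_PAYLOAD_SIZE
+
+      if not xoff_sent and outbuf_cells > cc_xoff_limit_cells:
+          return "XOFF"              # tell the other endpoint to stop sending
+
+      if xoff_sent and outbuf_expected_to_empty:
+          return "XON"               # resume; carries drain_rate_kb_sec
+
+      # Advisory XON: the drain rate moved more than cc_xon_change_pct
+      # percent away from the rate carried in the previously sent XON.
+      if last_xon_rate and drain_rate_kb_sec:
+          if (abs(drain_rate_kb_sec - last_xon_rate) * 100 >
+                  cc_xon_change_pct * last_xon_rate):
+              return "XON (advisory)"
+      return None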
+
+4.1.2. Edge bandwidth rate advertisement [XON_ADVISORY]
+
+As noted above, the XON cell provides a field to indicate the N_EWMA rate at
+which edge connections drain their outgoing buffers.
+
+To compute the drain rate, we maintain a timestamp and a byte count of how many
+bytes were written onto the socket from the connection outbuf.
+
+In order to measure the drain rate of a connection, we need to measure the time
+it took between flushing N bytes on the socket and when the socket is available
+for writing again. In other words, we are measuring the time it took for the
+kernel to send N bytes between the first flush on the socket and the next
+poll() write event.
+
+For example, let's say we just wrote 100 bytes on the socket at time t = 0s,
+and at time t = 2s the socket becomes writeable again. We then estimate that
+the rate of the socket is 100 bytes / 2s, or 50 B/sec.
+
+To make this measurement, we start the timer by recording a timestamp as soon
+as data begins to accumulate in an edge connection's outbuf, currently 16KB
+(32 cells). We use this value for now because Tor writes up to 32 cells at
+once onto a connection outbuf, so this burst of data is a good indicator that
+bytes are starting to accumulate.
+
+After 'cc_xon_rate' cells worth of stream data, we use N_EWMA to average this
+rate into a running EWMA average, with N specified by consensus parameter
+'cc_xon_ewma_cnt'. Every EWMA update, the byte count is set to 0 and a new
+timestamp is recorded. In this way, the EWMA counter is averaging N counts of
+'cc_xon_rate' cells worth of bytes each.
+
+If the buffers are non-zero, and we have sent an XON before, and the N_EWMA
+rate has changed more than 'cc_xon_change_pct' since the last XON, we send an
+updated rate. Because the EWMA rate is only updated every 'cc_xon_rate' cells
+worth of bytes, such advisory XON updates cannot be sent more frequently than
+this, and should be sent much less often in practice.
+
+When the outbuf completely drains to 0, and has been 0 for 'cc_xon_rate' cells
+worth of data, we double the EWMA rate. We continue to double it while the
+outbuf is 0, every 'cc_xon_rate' cells. The measurement timestamp is also set
+back to 0.
+
+When an XOFF is sent, the EWMA rate is reset to 0, to allow fresh calculation
+upon drain.
+
+If a clock stall or jump is detected by [CLOCK_HEURISTICS], we also clear the
+measurement byte count and timestamp, without recording a rate into the EWMA.
+
+NOTE: Because our timestamps are microseconds, we chose to compute and
+transmit both of these rates as 1000 byte/sec units, as this reduces the
+number of multiplications and divisions and avoids precision loss.
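+
+A minimal sketch of the rate computation follows. The timestamps are in
+microseconds and the rates in 1000 byte/sec units as described in the NOTE
+above; the function names and the N_EWMA weighting of 2/(N+1) are
+assumptions of this sketch:
+
+  def n_ewma(prev_kb_sec, new_kb_sec, n):
+      if prev_kb_sec == 0:
+          return new_kb_sec
+      return (2 * new_kb_sec + (n - 1) * prev_kb_sec) // (n + 1)
+
+  def update_drain_rate(ewma_kb_sec, bytes_flushed, start_usec, now_usec,
+                        cc_xon_ewma_cnt):
+      # Called once 'cc_xon_rate' cells worth of bytes have been written to
+      # the socket since 'start_usec' (when the outbuf began to accumulate).
+      elapsed_usec = now_usec - start_usec
+      if elapsed_usec <= 0:
+          return ewma_kb_sec               # clock stall/jump: do not record
+      rate_kb_sec = bytes_flushed * 1000 // elapsed_usec  # 1000 byte/sec units
+      return n_ewma(ewma_kb_sec, rate_kb_sec, cc_xon_ewma_cnt)
+
+  # While the outbuf stays empty for further 'cc_xon_rate' cells worth of
+  # data, the estimate is instead doubled:  ewma_kb_sec *= 2.
+  # When an XOFF is sent, it is reset:      ewma_kb_sec = 0.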
+
+4.1.3. Oomkiller behavior
+
+A malicious client can attempt to exhaust memory in an Exit's outbufs, by
+ignoring XOFF and advisory XONs. Implementations MAY choose to close specific
+streams with outbufs that grow too large, but since the exit does not know
+with certainty the client's congestion window, it is non-trivial to determine
+the exact upper limit a well-behaved client might send on a blocked stream.
+
+Implementations MUST close the streams with the oldest chunks present in their
+outbufs, while under global memory pressure, until memory pressure is
+relieved.
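+
+A minimal sketch of that selection, with illustrative names:
+
+  def pick_stream_to_close(streams):
+      # Each stream is assumed to expose the timestamp of the oldest chunk
+      # currently sitting in its outbuf (None if the outbuf is empty).
+      buffered = [s for s in streams if s.oldest_chunk_timestamp is not None]
+      if not buffered:
+          return None
+      return min(buffered, key=lambda s: s.oldest_chunk_timestamp)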
+
+4.1.4. Sidechannel mitigation
+
+In order to mitigate DropMark attacks[28], both XOFF and advisory XON
+transmission must be restricted. Because DropMark attacks are most severe
+before data is sent, clients MUST ensure that an XOFF does not arrive before
+they have sent the appropriate XOFF limit of bytes on a stream ('cc_xoff_exit'
+for exits, 'cc_xoff_client' for onions).
+
+Clients also SHOULD ensure that advisory XONs do not arrive before the
+minimum of the XOFF limit and 'cc_xon_rate' full cells worth of bytes have
+been transmitted.
+
+Clients SHOULD ensure that advisory XONs do not arrive more frequently than
+every 'cc_xon_rate' cells worth of sent data. Clients also SHOULD ensure that
+XOFFs do not arrive more frequently than every XOFF limit worth of sent data.
+
+Implementations SHOULD close the circuit if these limits are violated on the
+client-side, to detect and resist dropmark attacks[28].
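+
+A minimal sketch of the client-side accounting for these checks follows.
+The counters, names, and 498-byte cell payload constant are illustrative;
+as noted below, C-Tor itself only exports the relevant accounting via
+CIRC_BW rather than enforcing it directly:
+
+  RELAY_PAYLOAD_SIZE = 498
+
+  def xoff_is_valid(bytes_sent_on_stream, xoffs_received, cc_xoff_cells):
+      # One XOFF limit worth of sent data is required per received XOFF.
+      limit_bytes = cc_xoff_cells * RELAY_PAYLOAD_SIZE
+      return bytes_sent_on_stream >= (xoffs_received + 1) * limit_bytes
+
+  def advisory_xon_is_valid(bytes_sent_on_stream, xons_received,
+                            cc_xoff_cells, cc_xon_rate):
+      # No XON before min(XOFF limit, cc_xon_rate) cells worth of bytes,
+      # and at most one XON per cc_xon_rate cells of sent data afterwards.
+      min_bytes = min(cc_xoff_cells, cc_xon_rate) * RELAY_PAYLOAD_SIZE
+      rate_bytes = cc_xon_rate * RELAY_PAYLOAD_SIZE
+      return (bytes_sent_on_stream >= min_bytes and
+              bytes_sent_on_stream >= xons_received * rate_bytes)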
+
+Additionally, because edges no longer use stream SENDME windows, we alter the
+half-closed connection handling to be time based instead of data quantity
+based. Half-closed connections are allowed to receive data up to the larger
+value of the congestion control max_rtt field or the circuit build timeout
+(for onion service circuits, we use twice the circuit build timeout). Any data
+or relay cells after this point are considered invalid data on the circuit.
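+
+For illustration, the resulting cutoff can be computed as follows (the names
+and millisecond units are illustrative):
+
+  def half_closed_deadline_ms(closed_at_ms, cc_max_rtt_ms, cbt_ms,
+                              is_onion_service):
+      cbt = 2 * cbt_ms if is_onion_service else cbt_ms
+      return closed_at_ms + max(cc_max_rtt_ms, cbt)
+
+  # Data or relay cells arriving after this deadline are treated as
+  # invalid data on the circuit.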
+
+Recall that all of the dropped cell enforcement in C-Tor is performed by
+accounting data provided through the control port CIRC_BW fields, currently
+enforced only by using the vanguards addon[29].
+
+The C-Tor implementation exposes all of these properties to CIRC_BW for
+vanguards to enforce, but does not enforce them itself. So violations of any
+of these limits do not cause circuit closure unless that addon is used (as
+with the rest of the dropped cell side channel handling in C-Tor).
 
 
 5. System Interactions [SYSTEM_INTERACTIONS]
@@ -788,15 +903,6 @@ delivery of SENDMEs will still allow a full congestion window full of
 data to arrive. This will also require tuning and experimentation, and
 optimum results will vary between simulator and live network.
 
-  TODO: Can we use explicit SENDME sequence number acking to make a
-       connection-resumption conflux, to recover from circuit collapse
-       or client migration? I am having trouble coming up with a design
-       that does not require Exits to maintain a full congestion
-       window full of data as a retransmit buffer in the event of
-       circuit close. Such reconnect activity might require assistance
-       from Guard relays so that they can help clients discover which
-       cells were sent vs lost.
-
 
 6. Performance Evaluation [EVALUATION]
 
@@ -810,13 +916,23 @@ determine how stable the RTT measurements of circuits are across the
 live Tor network, to determine if we need more frequent SENDMEs, and/or
 need to use any RTT smoothing or averaging.
 
-
 These experiments were performed using onion service clients and services on
 the live Tor network. From these experiments, we tuned the RTT and BDP
 estimators, and arrived at reasonable values for EWMA smoothing and the
 minimum number of SENDME acks required to estimate BDP.
 
-We should check small variations in the EWMA smoothing and minimum BDP ack
+Additionally, we specified that the algorithms maintain previous congestion
+window estimates in the event that a circuit goes idle, rather than revert to
+slow start. We experimented with intermittent idle/active live onion clients
+to make sure that this behavior is acceptable, and it appeared to be.
+
+In Shadow experimentation, the primary thing to test will be if the OR conn on
+Exit relays blocks too frequently when under load, thus causing excessive
+congestion signals, and overuse of the Inflight BDP estimator as opposed
+to SENDME or CWND BDP. It may also be the case that this behavior is optimal,
+even if it does happen.
+
+Finally, we should check small variations in the EWMA smoothing and minimum BDP ack
 counts in Shadow experimentation, to check for high variability in these
 estimates, and other surprises.
 
@@ -847,44 +963,93 @@ parameter for them. This will allow us to set more reasonable parameter
 values, without waiting for all clients to upgrade.
 
 Because custom congestion control can be deployed by any Exit or onion
-service that desires better service, we will need to be particularly
-careful about how congestion control algorithms interact with rogue
-implementations that more aggressively increase their window sizes.
-During these adversarial-style experiments, we must verify that cheaters
-do not get better service, and that Tor's circuit OOM killer properly
-closes circuits that seriously abuse the congestion control algorithm,
-as per [SECURITY_ANALYSIS]. This may requiring tuning CircuitEWMAHalfLife,
-as well as the oomkiller cutoffs.
-
-Additionally, we specified that the algorithms maintain previous congestion
-window estimates in the event that a circuit goes idle, rather than revert
-to slow start. We should experiment with intermittent idle/active clients
-to make sure that this behavior is acceptable.
+service that desires better service, we will need to be particularly careful
+about how congestion control algorithms interact with rogue implementations
+that more aggressively increase their window sizes.  During these
+adversarial-style experiments, we must verify that cheaters do not get
+better service, and that Tor's circuit OOM killer properly closes circuits
+that seriously abuse the congestion control algorithm, as per
+[SECURITY_ANALYSIS]. This may require tuning 'circ_max_cell_queue_size',
+and 'CircuitPriorityHalflifeMsec'.
+
+Additionally, we will need to experiment with reducing the cell queue limits
+on OR conns before they are blocked (OR_CONN_HIGHWATER), and study the
+interaction of that with treating the or conn block as a congestion signal.
+
+Finally, we will need to monitor our Shadow experiments for evidence of ack
+compression, which can cause the BDP estimator to over-estimate the congestion
+window. We will instrument our Shadow simulations to alert if they discover
+excessive congestion window values, and tweak 'cc_bwe_min' and
+'cc_sendme_inc' appropriately. We can set the 'cc_cwnd_max' parameter value
+to low values (eg: ~2000 or so) to watch for evidence of this in Shadow, and
+log. Similarly, we should watch for evidence that the 'cc_cwnd_min' parameter
+value is rarely hit in Shadow, as this indicates that the cwnd may be too
+small to measure BDP (for cwnd less than 'cc_sendme_inc'*'cc_bwe_min').
 
 6.3. Flow Control Algorithm Experiments
 
-We will need to tune the xoff_* consensus parameters to minimize the
-amount of edge connection buffering as well as XON/XOFF chatter for
-Exits. This can be done in simulation, but will require fine-tuning on
-the live network.
+Flow control only applies when the edges outside of Tor (SOCKS application,
+onion service application, or TCP destination site) are *slower* than Tor's
+congestion window. This typically means that the application is either
+suspended or reading too slowly from its SOCKS connection, or the TCP
+destination
+site itself is bandwidth throttled on its downstream.
+
+To examine these properties, we will perform live onion service testing, where
+curl is used to download a large file. We will first test with no rate limit,
+and verify that XON/XOFF is never sent. We then suspend this download, verify
+that an XOFF is sent, and transmission stops. Upon resuming this download, the
+download rate should return to normal. We will also use curl's --limit-rate
+option, to exercise that the flow control properly measures the drain rate and
+limits the buffering in the outbuf, modulo kernel socket and localhost TCP
+buffering.
+
+However, flow control can also get triggered at Exits in situations where
+either TCP fairness issues or Tor's mainloop prevent edge uploads from
+getting enough capacity, causing them to be rate limited below the circuit's
+congestion window, even though the TCP destination actually has sufficient
+downstream capacity.
+
+Exits are also most vulnerable to the buffer bloat caused by such uploads,
+since there may be many uploads active at once.
+
+To study this, we will run shadow simulations. Because Shadow does *not*
+rate limit its tgen TCP endpoints, and only rate limits the relays
+themselves, if *any* XON/XOFF activity happens in Shadow *at all*, it is
+evidence that such fairness issues can occur.
+
+Just in case Shadow does not have sufficient edge activity to trigger such
+emergent behavior, when congestion control is enabled on the live network, we
+will also need to instrument a live exit, to verify that XON/XOFF is not
+happening frequently on it. Relays may also report these statistics in
+extra-info descriptor, to help with monitoring the live network conditions, but
+this might also require aggregation or minimization.
+
+If excessive XOFF/XON activity happens at Exits, we will need to investigate
+tuning the libevent mainloop to prioritize edge writes over orconn writes.
+Additionally, we can lower 'cc_xoff_exit'. Linux Exits can also lower the
+'net.ipv[46].tcp_wmem' sysctl value, to reduce the amount of kernel socket
+buffering they do on such streams, which will improve XON/XOFF responsiveness
+and reduce memory usage.
 
-Additionally, we will need to experiment with reducing the cell queue
-limits on OR conns before they are blocked, and study the interaction
-of that with treating the or conn block as a congestion signal.
+6.4. Performance Metrics [EVALUATION_METRICS]
 
-  TODO: We should make the cell queue highwater value into a consensus
-  parameter in the flow control implementation.
+The primary metrics that we will be using to measure the effectiveness
+of congestion control in simulation are TTFB/RTT, throughput, and utilization.
 
-Relays may report these statistics in extra-info descriptor, to help with
-monitoring the live network conditions, but this might also require
-aggregation or minimization.
+We will calibrate the Shadow simulator so that it has similar CDFs for all of
+these metrics as the live network, without using congestion control.
 
-6.4. Performance Metrics [EVALUATION_METRICS]
+Then, we will want to inspect CDFs of these three metrics for various
+congestion control algorithms and parameters. 
+
+The live network testing will also spot-check performance characteristics of
+a couple algorithm and parameter sets, to ensure we see similar results as
+Shadow.
 
-Because congestion control will affect so many aspects of performance,
-from throughput to RTT, to load balancing, queue length, overload, and
-other failure conditions, the full set of performance metrics will be
-required:
+On the live network, because congestion control will affect so many aspects of
+performance, from throughput to RTT, to load balancing, queue length,
+overload, and other failure conditions, the full set of performance metrics
+will be required, to check for any emergent behaviors:
   https://gitlab.torproject.org/legacy/trac/-/wikis/org/roadmaps/CoreTor/PerformanceMetrics
 
 We will also need to monitor network health for relay queue lengths,
@@ -909,7 +1074,7 @@ These are sorted in order of importance to tune, most important first.
           use, as an integer.
     - Range: [0,3]  (0=fixed, 1=Westwood, 2=Vegas, 3=NOLA)
     - Default: 2
-    - Tuning Values: 0-3
+    - Tuning Values: [2,3]
     - Tuning Notes:
            These algorithms need to be tested against percentages of current
            fixed alg client competition, in Shadow. Their optimal parameter
@@ -923,7 +1088,7 @@ These are sorted in order of importance to tune, most important first.
           estimate bandwidth (and thus BDP).
     - Range: [2, 20]
     - Default: 5
-    - Tuning Values: 3-5
+    - Tuning Values: 4-10
     - Tuning Notes:
            The lower this value is, the sooner we can get an estimate of
            the true BDP of a circuit. Low values may lead to massive
@@ -966,7 +1131,7 @@ These are sorted in order of importance to tune, most important first.
   cc_cwnd_min:
     - Description: The minimum allowed cwnd.
     - Range: [cc_sendme_inc, 1000]
-    - Tuning Values: [50, 100, 150]
+    - Tuning Values: [100, 150, 200]
     - Tuning Notes:
            If the cwnd falls below cc_sendme_inc, connections can't send
            enough data to get any acks, and will stall. If it falls below
@@ -974,6 +1139,19 @@ These are sorted in order of importance to tune, most important first.
            estimates. Likely we want to set this around
            cc_bwe_min*cc_sendme_inc, but no lower than cc_sendme_inc.
 
+  cc_cwnd_max:
+    - Description: The maximum allowed cwnd.
+    - Range: [cc_sendme_inc, INT32_MAX]
+    - Default: INT32_MAX
+    - Tuning Values: [5000, 10000, 20000]
+    - Tuning Notes:
+       If cc_bwe_min is set too low, the BDP estimator may over-estimate the
+       congestion window in the presence of large queues, due to SENDME ack
+       compression. Once all clients have upgraded to congestion control,
+       queues large enough to cause ack compression should become rare. This
+       parameter exists primarily to verify this in Shadow, but we preserve it
+       as a consensus parameter for emergency use in the live network, as well.
+
   circwindow:
     - Description: Initial congestion window for legacy Tor clients
     - Range: [100, 1000]
@@ -1024,6 +1202,7 @@ These are sorted in order of importance to tune, most important first.
     - Description: Percentage of the current congestion window to increment
                    by during slow start, every cwnd.
     - Range: [10, 300]
+    - Default: 100
     - Tuning Values: 50,100,200
     - Tuning Notes:
            On the current live network, the algorithms tended to exit slow
@@ -1039,6 +1218,16 @@ These are sorted in order of importance to tune, most important first.
 
 6.5.2. Westwood parameters
 
+  Westwood has runaway conditions. Because the congestion signal threshold of
+  TOR_WESTWOOD is a function of RTT_max, excessive queuing can cause an
+  increase in RTT_max. Additionally, if stream activity is constant, but of
+  a lower bandwidth than the circuit, this will not drive the RTT upwards,
+  and this can result in a congestion window that continues to increase in the
+  absence of any other concurrent activity.
+
+  For these reasons, we are unlikely to spend much time deeply investigating
+  Westwood in Shadow, beyond a simulation or two to check these behaviors.
+
   cc_westwood_rtt_thresh:
     - Description:
               Specifies the cutoff for BOOTLEG_RTT_TOR to deliver
@@ -1084,7 +1273,6 @@ These are sorted in order of importance to tune, most important first.
 
 6.5.3. Vegas Parameters
 
-
   cc_vegas_alpha:
   cc_vegas_beta:
   cc_vegas_gamma:
@@ -1129,26 +1317,179 @@ These are sorted in order of importance to tune, most important first.
             absent any more agressive competition, we do not need to overshoot
             the BDP estimate.
 
-
 6.5.5. Flow Control Parameters
 
- TODO: These may expand, particularly to include cell_queue_highwater
+  As with previous sections, the parameters in this section are sorted with
+  the parameters that are most important to tune, first.
+
+  These parameters have been tuned using onion services. The defaults are
+  believed to be good.
 
-  xoff_client
-  xoff_mobile
-  xoff_exit
-    - Description: Specifies the stream queue size as a percentage of
-                   'cwnd' at an endpoint before an XOFF is sent.
+  cc_xoff_client
+  cc_xoff_exit
+    - Description: Specifies the outbuf length, in relay cell multiples,
+                   before we send an XOFF.
+    - Range: [1, 10000]
+    - Default: 500
+    - Tuning Values: [500, 1000]
+    - Tuning Notes:
+        This threshold plus the sender's cwnd must be greater than the
+        cc_xon_rate value, or a rate cannot be computed. Unfortunately,
+        unless it is sent, the receiver does not know the cwnd. Therefore,
+        this value should always be higher than cc_xon_rate minus 
+        'cc_cwnd_min' (100) minus the xon threshold value (0).
+
+  cc_xon_rate
+    - Description: Specifies how many full packed cells of bytes must arrive
+                   before we can compute a rate, as well as how often we can
+                   send XONs.
+    - Range: [1, 5000]
+    - Default: 500
+    - Tuning Values: [500, 1000]
+    - Tuning Notes:
+        Setting this high will prevent excessive XONs, as well as reduce
+        side channel potential, but it will delay response to queuing,
+        and will hinder our ability to detect rate changes. However, low
+        values will also reduce our ability to accurately measure drain
+        rate. This value should always be lower than 'cc_xoff_*' +
+        'cc_cwnd_min', so that a rate can be computed solely from the outbuf
+        plus inflight data.
+
+ cc_xon_change_pct
+    - Description: Specifies how much the edge drain rate can change before
+                   we send another advisory cell.
     - Range: [1, 100]
-    - Default: 5
+    - Default: 25
+    - Tuning values: [25, 50, 75]
+    - Tuning Notes:
+        Sending advisory updates due to a rate change may help us avoid
+        hitting the XOFF limit, but it may also not help much unless we
+        are already above the advisory limit.
+
+  cc_xon_ewma_cnt
+    - Description: Specifies the N in the N_EWMA of rates.
+    - Range: [2, 100]
+    - Default: 2
+    - Tuning values: [2, 3, 5]
+    - Tuning Notes:
+        Setting this higher will smooth over changes in the rate field,
+        and thus avoid XONs, but will reduce our reactivity to rate changes.
+ 
+
+6.5.6. External Performance Parameters to Tune
+
+  The following parameters are from other areas of Tor, but tuning them
+  will improve congestion control performance. They are again sorted
+  by most important to tune, first.
+
+  cbtquantile
+    - Description: Specifies the percentage cutoff for the circuit build
+                   timeout mechanism.
+    - Range: [60, 80]
+    - Default: 80    
+    - Tuning Values: [70, 75, 80]
+    - Tuning Notes:
+       The circuit build timeout code causes Tor to use only the fastest
+       'cbtquantile' percentage of paths to build through the network.
+       Lowering this value will help avoid congested relays, and improve
+       latency.
+
+  CircuitPriorityHalflifeMsec
+    - Description: The CircEWMA half-life specifies the time period after
+                   which the cell count on a circuit is halved. This allows
+                   circuits to regain their priority if they stop being bursty.
+    - Range: [1, INT32_MAX]
+    - Default: 30000
+    - Tuning Values: [5000, 15000, 30000, 60000]
+    - Tuning Notes:
+       When we last tuned this, it was before KIST[31], so previous values may
+       have little relevance to today. According to the CircEWMA paper[30], values
+       that are too small will fail to differentiate bulk circuits from interactive
+       ones, and values that are too large will allow new bulk circuits to keep
+       priority over interactive circuits for too long. The paper does say
+       that the system was not overly sensitive to specific values, though.
+
+  CircuitPriorityTickSecs
+    - Description: This specifies how often in seconds we readjust circuit
+                   priority based on their EWMA.
+    - Range: [1, 600]
+    - Default: 10
+    - Tuning Values: [1, 5, 10]
+    - Tuning Notes:
+        Even less is known about the optimal value for this parameter. At a
+        guess, it should be more often than the half-life. Changing it also
+        influences the half-life decay, though, at least according to the
+        CircEWMA paper[30].
+
+  KISTSchedRunInterval
+    - If 0, KIST is disabled. (We should also test KIST disabled)
 
-  xon_client
-  xon_mobile
-  xon_exit
-    - Description: Specifies the how many cells below xoff_* before
-                   an XON is sent from an endpoint.
-    - Range: [1, 10000000]
-    - Default: 10000
+
+6.5.7. External Memory Reduction Parameters to Tune
+
+  The following parameters are from other areas of Tor, but tuning them
+  will reduce memory utilization in relays. They are again sorted by most
+  important to tune, first.
+
+  circ_max_cell_queue_size
+    - Description: Specifies the maximum number of cells that are allowed
+                   to accumulate in a relay queue before closing the circuit.
+    - Range: [1000, INT32_MAX]
+    - Default: 50000
+    - Tuning Values: [1000, 2500, 5000]
+    - Tuning Notes:
+       Once all clients have upgraded to congestion control, relay circuit
+       queues should be minimized. We should minimize this value, as any
+       high amount of queueing likely indicates a violation of the algorithm.
+
+  cellq_low
+  cellq_high
+    - Description: Specifies the number of cells that can build up in
+                   a circuit's queue for delivery onto a channel (from edges)
+                   before we either block or unblock reading from streams
+                   attached to that circuit.
+    - Range: [1, 1000]
+    - Default: low=10, high=256
+    - Tuning Values: low=[0, 2, 4, 8]; high=[16, 32, 64]
+    - Tuning Notes:
+        When data arrives from edges into Tor, it gets packaged up into cells
+        and then delivered to the cell queue, and from there is dequeued and
+        sent on a channel. If the channel has blocked (see below params), then
+        this queue grows until the high watermark, at which point Tor stops
+        reading on all edges associated with a circuit, and a congestion
+        signal is delivered to that circuit. At 256 cells, this is ~130k of
+        data for *every* circuit, which is far more than Tor can write in a
+        channel outbuf. Lowering this will reduce latency, reduce memory
+        usage, and improve responsiveness to congestion. However, if it is
+        too low, we may incur additional mainloop invocations, which are
+        expensive. We will need to trace or monitor epoll() invocations in
+        Shadow or on a Tor exit to verify that low values do not lead to
+        more mainloop invocations.
+
+  orconn_high
+  orconn_low
+    - Description: Specifies the number of bytes that can be held in an
+                   orconn's outbuf before we block or unblock the orconn.
+    - Range: [509, INT32_MAX]
+    - Default: low=16k, high=32k
+    - Tuning Notes:
+        When the orconn's outbuf is above the high watermark, cells begin
+        to accumulate in the cell queue as opposed to being added to the
+        outbuf. It may make sense to lower this to be more in-line with the
+        cellq values above. Also note that the low watermark is only used by
+        the vanilla scheduler, so tuning it may be relevant when we test with
+        KIST disabled. Just like the cell queue, if this is set lower, congestion
+        signals will arrive sooner to congestion control when orconns become
+        blocked, and less memory will occupy queues. It will also reduce latency.
+        Note that if this is too low, we may not fill TLS records, and we may
+        incur excessive epoll()/mainloop invocations. Tuning this is likely
+        less beneficial than tuning the above cell_queue, unless KIST is
+        disabled.
+
+  MaxMemInQueues
+    - Should be possible to set much lower, similarly to help with
+      OOM conditions due to protocol violation. Unfortunately, this
+      is just a torrc, and a bad one at that.
 
 
 7. Protocol format specifications [PROTOCOL_SPEC]
@@ -1514,3 +1855,15 @@ as well as review of our versions of the algorithms.
 
 27. Exponentially Weighted Moving Average
     https://corporatefinanceinstitute.com/resources/knowledge/trading-investing/exponentially-weighted-moving-average-ewma/
+
+28. Dropping on the Edge
+    https://www.petsymposium.org/2018/files/papers/issue2/popets-2018-0011.pdf
+
+29. https://github.com/mikeperry-tor/vanguards/blob/master/README_TECHNICAL.md#the-bandguards-subsystem
+
+30. An Improved Algorithm for Tor Circuit Scheduling.
+    https://www.cypherpunks.ca/~iang/pubs/ewma-ccs.pdf
+
+31. KIST: Kernel-Informed Socket Transport for Tor
+    https://matt.traudt.xyz/static/papers/kist-tops2018.pdf
+


