commit df031d7bf33f68948b2c915d681b08ca59c1da37 Author: Mike Perry mikeperry-git@torproject.org Date: Tue Nov 3 16:28:52 2020 -0600
Update path-spec.txt for CBT estimator fixes for #40168.
Also clarify and improve wording of the timeout calculation section. --- path-spec.txt | 202 ++++++++++++++++++++++++++++++++++------------------------ 1 file changed, 117 insertions(+), 85 deletions(-)
diff --git a/path-spec.txt b/path-spec.txt index 483e37d..c16ef50 100644 --- a/path-spec.txt +++ b/path-spec.txt @@ -361,112 +361,147 @@ of their choices.
2.4. Learning when to give up ("timeout") on circuit construction
- Since version 0.2.2.8-alpha, Tor attempts to learn when to give up on - circuits based on network conditions. + Since version 0.2.2.8-alpha, Tor clients attempt to learn when to give + up on circuits based on network conditions.
-2.4.1. Distribution choice and parameter estimation +2.4.1. Distribution choice
Based on studies of build times, we found that the distribution of - circuit build times appears to be a Frechet distribution. However, - estimators and quantile functions of the Frechet distribution are - difficult to work with and slow to converge. So instead, since we - are only interested in the accuracy of the tail, we approximate - the tail of the distribution with a Pareto curve. + circuit build times appears to be a Frechet distribution (and a multi-modal + Frechet distribution, if more than one guard or bridge is used). However, + estimators and quantile functions of the Frechet distribution are difficult + to work with and slow to converge. So instead, since we are only interested + in the accuracy of the tail, clients approximate the tail of the multi-modal + distribution with a single Pareto curve.
- We calculate the parameters for a Pareto distribution fitting the data - using the estimators in equation 4 from: - http://portal.acm.org/citation.cfm?id=1647962.1648139 +2.4.2. How much data to record + + From our observations, the minimum number of circuit build times for a + reasonable fit appears to be on the order of 100. However, to keep a + good fit over the long term, clients store 1000 most recent circuit build + times in a circular array.
- This is: + The Tor client should build test circuits at a rate of one every 'cbttestfreq' + (10 seconds) until 'cbtmincircs' (100 circuits) are built, with a maximum of + 'cbtmaxopencircs' (default: 10) circuits open at once. This allows a fresh + Tor to have a CircuitBuildTimeout estimated within 30 minutes after install + or network change (see section 2.4.5 below).
- alpha_m = s/(ln(U(X)/Xm^n)) + Timeouts are stored on disk in a histogram of 10ms bin width, the same + width used to calculate the Xm value above. This histogram must be shuffled + after being read from disk, to preserve a proper expiration of old values + after restart.
- where s is the total number of completed circuits we have seen, and + Thus, some build time resolution is lost during restart. Implementations may + choose a different persistence mechanism than this histogram, but be aware + that build time binning is still needed for parameter estimation.
- U(X) = x_max^u * Prod_s{x_i} +2.4.3. Parameter estimation
- with x_i as our i-th completed circuit time, x_max as the longest - completed circuit build time we have yet observed, u as the - number of unobserved timeouts that have no exact value recorded, - and n as u+s, the total number of circuits that either timeout or - complete. + Once 'cbtmincircs' build times are recorded, Tor clients update the + distribution parameters and recompute the timeout every circuit completion + (though see section 2.4.5 for when to pause and reset timeout due to + network change or connectivity loss).
- Using log laws, we compute this as the sum of logs to avoid - overflow and ln(1.0+epsilon) precision issues: + Tor clients calculate the parameters for a Pareto distribution fitting the + data using the maximum likelihood estimator. For derivation, see: + https://en.wikipedia.org/wiki/Pareto_distribution#Estimation_of_parameters
- alpha_m = s/(u*ln(x_max) + Sum_s{ln(x_i)} - n*ln(Xm)) + Because build times are not a true Pareto distribution, we alter how Xm is + computed. In a max likelihood estimator, the mode of the distribution is + used directly as Xm.
- This estimator is closely related to the parameters present in: - http://en.wikipedia.org/wiki/Pareto_distribution#Parameter_estimation - except they are adjusted to handle the fact that our samples are - right-censored at the timeout cutoff. + Instead of using the mode of discrete build times directly, Tor clients + compute the Xm parameter using the weighted average of the the midpoints + of the 'cbtnummodes' (10) most frequently occurring 10ms histogram bins. + (The use of 10 modes was found to minimize error from the selected + cbtquantile, with 10ms bins for quantiles 60-80, compared to many other + heuristics).
- Additionally, because this is not a true Pareto distribution, we alter - how Xm is computed. The Xm parameter is computed as the midpoint of the most - frequently occurring 50ms histogram bin, until the point where 1000 - circuits are recorded. After this point, the weighted average of the top - 'cbtnummodes' (default: 3) midpoint modes is used as Xm. All times below - this value are counted as having the midpoint value of this weighted average - bin. + To avoid ln(1.0+epsilon) precision issues, use log laws to rewrite the + estimator for 'alpha' as the sum of logs followed by subtraction, rather + than multiplication and division: + + alpha = n/(Sum_n{ln(MAX(Xm, x_i))} - n*ln(Xm)) + + In this, n is the total number of build times that have completed, x_i is + the ith recorded build time, and Xm is the modes of x_i as above. + + All times below Xm are counted as having the Xm value via the MAX(), + because in Pareto estimators, Xm is supposed to be the lowest value. + However, since clients use mode averaging to estimatre Xm, there can be + values below our Xm. Effectively, the Pareto estimator then treats that + everything smaller than Xm happened at Xm. One can also see that if + clients did not do this, alpha could underflow to become negative, which + results in an exponential curve, not a Pareto probability distribution.
The timeout itself is calculated by using the Pareto Quantile function (the inverted CDF) to give us the value on the CDF such that 80% of the mass - of the distribution is below the timeout value. + of the distribution is below the timeout value (parameter 'cbtquantile').
- Thus, we expect that the Tor client will accept the fastest 80% of - the total number of paths on the network. + The Pareto Quantile Function (inverse CDF) is:
-2.4.2. How much data to record + F(q) = Xm/((1.0-q)^(1.0/alpha))
- From our observations, the minimum number of circuit build times for a - reasonable fit appears to be on the order of 100. However, to keep a - good fit over the long term, we store 1000 most recent circuit build times - in a circular array. + Thus, clients obtain their circuit build timeout by computing:
- The Tor client should build test circuits at a rate of one per - minute up until 100 circuits are built. This allows a fresh Tor to have - a CircuitBuildTimeout estimated within 1.5 hours after install, - upgrade, or network change (see below). + timeout_ms = F(0.8) # 'cbtquantile' == 0.8
- Timeouts are stored on disk in a histogram of 50ms bin width, the same - width used to calculate the Xm value above. This histogram must be shuffled - after being read from disk, to preserve a proper expiration of old values - after restart. + With this, we expect that the Tor client will accept the fastest 80% of the + total number of paths on the network. + + Clients obtain the circuit close time to completely abandon circuits as: + + close_ms = F(0.99) # 'cbtclosequantile' == 0.99 + + To avoid waiting an unreasonably long period of time for circuits that + simply have relays that are down, Tor clients cap timeout_ms at the max + build time actually observed so far, and cap close_ms at twice this max, + but at least 60 seconds:
-2.4.3. How to record timeouts + timeout_ms = MIN(timeout_ms, max_observed_timeout) + close_ms = MAX(MIN(close_ms, 2*max_observed_timeout), 'cbtinitialtimeout')
- Circuits that pass the timeout threshold should be allowed to continue - building until a time corresponding to the point 'cbtclosequantile' - (default 95) on the Pareto curve, or 60 seconds, whichever is greater. +2.4.4. How to record timeouts
- The actual completion times for these circuits should be recorded. - Implementations should completely abandon a circuit and record a value - as an 'unknown' timeout if the total build time exceeds this threshold. + Pareto estimators begin to lose their accuracy if the tail is omitted. + Hence, Tor clients actually calculate two timeouts: a usage timeout, and a + close timeout.
- The reason for this is that right-censored pareto estimators begin to lose - their accuracy if more than approximately 5% of the values are censored. - Since we wish to set the cutoff at 20%, we must allow circuits to continue - building past this cutoff point up to the 95th percentile. + Circuits that pass the usage timeout are marked as measurement circuits, + and are allowed to continue to build until the close timeout corresponding + to the point 'cbtclosequantile' (default 99) on the Pareto curve, or 60 + seconds, whichever is greater.
-2.4.4. Detecting Changing Network Conditions + The actual completion times for these measurements circuits should be + recorded.
- We attempt to detect both network connectivity loss and drastic + Implementations should completely abandon a circuit and ignore the circuit + if the total build time exceeds the close threshold. Such closed circuits + should be ignored, as this typically means one of the relays in the path is + offline. + +2.4.5. Detecting Changing Network Conditions + + Tor clients attempt to detect both network connectivity loss and drastic changes in the timeout characteristics.
- We assume that we've had network connectivity loss if a circuit - times out and we've received no cells or TLS handshakes since that - circuit began. We then temporarily stop counting timeouts until - network activity resumes. + Clients assume that they have had network connectivity loss if a circuit + times out and have received no cells or TLS handshakes since that + circuit began. Clients then temporarily stop counting timeouts until + network activity resumes (ie: until a TLS handshake completes or a cell + arrives at the client).
- To detect changing network conditions, we keep a history of - the timeout or non-timeout status of the past 20 circuits that - successfully completed at least one hop. If more than 90% of - these circuits timeout, we discard all buildtimes history, reset - the timeout to 60, and then begin recomputing the timeout. + To detect changing network conditions, clients keep a history of + the timeout or non-timeout status of the past 'cbtrecentcount' circuits + (20 circuits) that successfully completed at least one hop. If more than + 90% of these circuits timeout, the client discards all buildtimes history, + resets the timeout to 'cbtinitialtimeout' (60 seconds), and then begins + recomputing the timeout.
- If the timeout was already 60 or higher, we double the timeout. + If the timeout was already 60 or higher, the client doubles the timeout.
-2.4.5. Consensus parameters governing behavior +2.4.6. Consensus parameters governing behavior
Clients that implement circuit build timeout learning should obey the following consensus parameters that govern behavior, in order to allow @@ -483,14 +518,13 @@ of their choices. emergency situations only.
cbtnummodes - Default: 3 + Default: 10 Min: 1 Max: 20 Effect: This value governs how many modes to use in the weighted - average calculation of Pareto parameter Xm. A value of 3 introduces - some bias (2-5% of CDF) under ideal conditions, but allows for better - performance in the event that a client chooses guard nodes of radically - different performance characteristics. + average calculation of Pareto parameter Xm. Selecting Xm as the + average of multiple modes improves accuracy of the Pareto tail + for quantile cutoffs from 60-80% (see cbtquantile).
cbtrecentcount Default: 20 @@ -522,7 +556,7 @@ of their choices. timeout value. It is a percent (10-99).
cbtclosequantile - Default: 95 + Default: 99 Min: Value of cbtquantile parameter Max: 99 Effect: This is the position on the quantile curve to use to set the @@ -530,7 +564,7 @@ of their choices. percent (0-99).
cbttestfreq - Default: 60 + Default: 10 Min: 1 Max: 2147483647 (INT32_MAX) Effect: Describes how often in seconds to build a test circuit to @@ -538,12 +572,10 @@ of their choices. have been recorded.
cbtmintimeout - Default: 2000 - Min: 500 + Default: 10 + Min: 10 Max: 2147483647 (INT32_MAX) Effect: This is the minimum allowed timeout value in milliseconds. - The minimum is to prevent rounding to 0 (we only check once - per second).
cbtinitialtimeout Default: 60000
tor-commits@lists.torproject.org