[tor-commits] [torspec/master] Modify 210-faster...consensus-bootstrap for exponential backoff
nickm at torproject.org
nickm at torproject.org
Wed Oct 21 15:55:32 UTC 2015
Author: teor (Tim Wilson-Brown) <teor2345 at gmail.com>
Date: Fri Oct 2 15:46:54 2015 +0200
Modify 210-faster...consensus-bootstrap for exponential backoff
To implement #4483 we need to contact multiple directory mirrors
to increase bootstrap reliability. This patch implements the
exponential backoff suggested in
The patch also analyses the reliability of the new scheme, and
compares it to the current Tor implementation.
.../210-faster-headless-consensus-bootstrap.txt | 124 ++++++++++++++------
1 file changed, 86 insertions(+), 38 deletions(-)
diff --git a/proposals/210-faster-headless-consensus-bootstrap.txt b/proposals/210-faster-headless-consensus-bootstrap.txt
index 6b1502b..42726e5 100644
@@ -1,9 +1,10 @@
Title: Faster Headless Consensus Bootstrapping
-Author: Mike Perry
+Author: Mike Perry, Tim Wilson-Brown, Peter Palfrader
+Last Modified: 02-10-2015
Overview and Motiviation
@@ -19,19 +20,39 @@ Design: Bootstrap Process Changes
parallel during the bootstrap process, and download the consensus from
the first connection that completes.
- Connection attempts will be done in batches of three. Only one
- connection will be performed to one of the canonical directory
- authorities. Two connections will be performed to randomly chosen hard
- coded directory mirrors.
- If no connections complete within 5 seconds, another batch of three
- connections will be launched. Otherwise, the first connection to
- complete will be used to download the consensus document and the others
- will be closed, after which bootstrapping will proceed as normal.
+ Connection attempts will be performed on an exponential backoff basis.
+ Initially, connections will be performed to randomly chosen hard
+ coded directory mirrors. If none of these connections complete within
+ 5 seconds, connections will also be performed to randomly chosen
+ canonical directory authorities.
+ We specify that mirror connections retry after half a second, and then
+ double the retry time with every connection:
+ 0, 0.5, 1, 2, 4, 8, 16, ...
+ We specify that directory authority connections start after a 5 second
+ delay, and retry after 5 seconds, doubling the retry time with every
+ 5, 10, 20, ...
+ The first connection to complete will be used to download the consensus
+ document and the others will be closed, after which bootstrapping will
+ proceed as normal.
+ We expect the vast majority of clients to succeed within 4 seconds,
+ after making up to 5 connection attempts to mirrors. Clients which can't
+ connect in the first 5 seconds, will then try to contact a directory
+ authority. We expect almost all clients to succeed within 10 seconds,
+ after up to 6 connection attempts to mirrors and up to 2 connection
+ attempts to authorities. This is a much better success rate than the
+ current Tor implementation, which fails k/n of clients if k of the n
+ directory authorities are down. (Or, if the connection fails in
+ certain ways, (k/n)^2.)
If at any time, the total outstanding bootstrap connection attempts
- exceeds 15, no new connection attempts are to be launched until existing
- connection attempts experience full timeout.
+ exceeds 10, no new connection attempts are to be launched until an
+ existing connection attempt experiences full timeout. The retry time
+ is not doubled when a connection is skipped.
Design: Fallback Dir Mirror Selection
@@ -43,8 +64,8 @@ Design: Fallback Dir Mirror Selection
This list of fallback dir mirrors should be updated with every
major Tor release. In future releases, the number of dir mirrors
- should be set at 20% of the current Guard nodes, rather than fixed at
+ should be set at 20% of the current Guard nodes (approximately 200 as
+ of October 2015), rather than fixed at 100.
Performance: Additional Load with Current Parameter Choices
@@ -62,19 +83,20 @@ Performance: Additional Load with Current Parameter Choices
The dangerous case is in the event of a prolonged consensus failure
that induces all clients to enter into the bootstrap process. In this
- case, the number of initial TLS connections to the fallback dir mirrors
- would be 2*C/100, or 10,000 for C=500,000 users. If no connections
- complete before the five retries, this could reach as high as 50,000
- connection attempts, but this is extremely unlikely to happen in full
+ case, the number of TLS connections to the fallback dir mirrors within
+ the first second would be 3*C/100, or 60,000 for C=2,000,000 users. If
+ no connections complete before the 10 retries, 7 of which go to
+ mirrors, this could reach as high as 140,000 connection attempts, but
+ this is extremely unlikely to happen in full aggregate.
However, in the no-consensus scenario today, the directory authorities
- would already experience C/9 or 55,555 connection attempts. The
- 5-retry scheme increases their total maximum load to about 275,000
- connection attempts, but again this is unlikely to be reached
- in aggregate. Additionally, with this scheme, even if the dirauths
- are taken down by this load, the dir mirrors should be able to survive
+ would already experience 2*C/9 or 444,444 connection attempts. (Tor
+ currently tries 2 authorities, before delaying the next attempt.) The
+ 10-retry scheme, 3 of which go to authorities, increases their total
+ maximum load to about 666,666 connection attempts, but again this is
+ unlikely to be reached in aggregate. Additionally, with this scheme,
+ even if the dirauths are taken down by this load, the dir mirrors
+ should be able to survive it.
Implementation Notes: Code Modifications
@@ -87,12 +109,16 @@ Implementation Notes: Code Modifications
a single directory server is selected and a connection is
eventually made through directory_initiate_command_rend().
- There appear to be a few options for altering this code to perform
- multiple connections. Without refactoring, one approach would be
- to make multiple calls to directory_initiate_command_routerstatus()
- from directory_get_from_dirserver() if the purpose is
+ There appear to be a few options for altering this code to retry multiple
+ simultaneous connections. Without refactoring, one approach would be to
+ set mirror and authority retry helper function timers in
+ directory_initiate_command_routerstatus() from
+ directory_get_from_dirserver() if the purpose is
DIR_PURPOSE_FETCH_CONSENSUS and the only directory servers available
- are the authorities and the fallback dir mirrors.
+ are the authorities and the fallback dir mirrors. (That is, there is no
+ valid consensus.) The retry helper function would check the list of
+ pending connections and, if it is 10 or greater, skip the connection
+ attempt, and leave the retry time constant.
The code in directory_initiate_command_rend() would then need to be
altered to maintain a list of the dircons created for this purpose as
@@ -104,10 +130,32 @@ Implementation Notes: Code Modifications
altered to examine the list of pending dircons, determine if this one is
the first to complete, and if so, then call directory_send_command() to
download the consensus and close the other pending dircons.
- An additional timer would need to be installed to re-call
- update_consensus_networkstatus_downloads() or a related helper after 5
- seconds. connection_dir_finished_connecting() would cancel this timer.
- The helper would check the list of pending connections and ensure it
- never exceeds 15.
+ connection_dir_finished_connecting() would also cancel both timers.
+ We make the pessimistic assumptions that 50% of connections to directory
+ mirrors fail, and that 20% of connections to authorities fail. (Actual
+ figures depend on relay churn, age of the fallback list, and authority
+ We expect the first 10 connection retry times to be:
+ Mirror: 0s 0.5s 1s 2s 4s 8s 16s
+ Auth: 5s 10s 20s
+ Success: 50% 75% 87% 94% 97% 99.4% 99.7% 99.94% 99.97% 99.99%
+ 97% of clients succeed while only using directory mirrors.
+ 2.4% of clients succeed on their first auth connection.
+ 0.24% of clients succeed after one more mirror and auth connection.
+ 0.05% of clients succeed after two more mirror and auth connections.
+ 0.01% of clients remain, but in this scenario, 3 authorities are down,
+ so the client is most likely blocked from the Tor network.
+ The current implementation makes 1 or 2 authority connections within the
+ first second, depending on exactly how the first connection fails. Under
+ the 20% authority failure assumption, these clients would have a success
+ rate of either 80% or 96% within a few seconds. The scheme above has a
+ similar success rate in the first few seconds, while spreading the load
+ among a larger number of directory mirrors. In addition, if all the
+ authorities are blocked, current clients will inevitably fail, as they
+ do not have a list of directory mirrors.
More information about the tor-commits