# [or-cvs] r18851: {projects} section 4.1 (projects/performance)

arma at seul.org arma at seul.org
Tue Mar 10 10:29:03 UTC 2009

Author: arma
Date: 2009-03-10 06:29:03 -0400 (Tue, 10 Mar 2009)
New Revision: 18851

Modified:
projects/performance/performance.tex
Log:
section 4.1

Modified: projects/performance/performance.tex
===================================================================
--- projects/performance/performance.tex	2009-03-10 09:43:34 UTC (rev 18850)
+++ projects/performance/performance.tex	2009-03-10 10:29:03 UTC (rev 18851)
@@ -52,8 +52,9 @@

\maketitle

-The Tor network's performance is a victim of its own success. As we've
-improved Tor's installers and its usability in general, and as people
+As the Tor project has grown, the performance of the Tor network has
+suffered. We've focused on
+improving Tor's installers and its usability in general, and as people
increasingly realize that having some privacy online is a good move, more
people are using Tor. Further, Tor's flexibility in terms of handling
many protocols has bitten us because a small number of people are loading
@@ -65,7 +66,8 @@
and lays out our options for fixing it.

The Tor network is slow right now for six main reasons. We discuss each
-reason in its own section, starting with the most severe. For each reason,
+reason in its own section, starting with the most severe in terms of
+long-term effect. For each reason,
we explain our current intuition for how to fix it, how effective we
think each fix would be, how much effort and risk is involved, and the
recommended next steps.
@@ -443,7 +445,8 @@
Overall, the challenge of users who want to overload the system will
continue. Tor is not the only system that faces this challenge.

-\section{Simply not enough capacity}
+\section{The Tor network doesn't have enough capacity}
+\label{sec:capacity}

\prettyref{sec:congestion} aims to let web browsing
connections work better in the face of high-volume streams, and
@@ -734,34 +737,73 @@
before we can roll this out, or we'll just make a bunch of Windows
machines crash.

-\section{Choosing paths imperfectly}
+\section{Tor clients choose paths imperfectly}

+Even when we sort out the congestion control issues, the problem of users
+abusing the network with too much traffic, and the question of overall
+capacity, we still face a fourth problem. Users need to choose their
+paths in such a way that everybody is using the network efficiently.
+
+Right now, Tor relays estimate their capacity by observing the largest
+traffic burst they've seen themselves do in the past day. They advertise
+that bandwidth capacity in the directory information, and clients weight
+their path selection by the bandwidth of each relay. For example, a relay
+that advertises 100KB/s peak bandwidth will be chosen twice as often as
+a relay that advertises 50KB/s peak bandwidth.
+
+There are several problems with our current algorithm that are worth
+fixing.
+
\subsection{We don't balance traffic over our bandwidth numbers correctly}

-Currently Tor selects relays with a probability proportional to their bandwidth contribution to the network, however this may not be the optimal algorithm.
-Murdoch and Watson~\cite{murdoch-pet2008} investigated the performance impact of different relay selection algorithms, and derived a formula for estimating average latency $T$:
+Selecting relays with a probability proportional to their bandwidth
+contribution to the network may not be the optimal algorithm. Murdoch
+and Watson~\cite{murdoch-pet2008} investigated the performance impact of
+different relay selection algorithms, and came up with a model to describe
+the optimal path selection strategies based on how loaded the network is.

-$$-T = \sum_{i=1}^n q_i t_i = \sum_{i=1}^n \frac{q_i x_i (2 - q_i x_i \Lambda)}{2 (1 - q_i x_i \Lambda)} -\label{eqn:waiting} -$$
+Tor's current selection strategy is optimal when the network is
+fully loaded. That is, if every single byte is going to be used, then
+weighting by capacity is the right way to do it. But if the network is
+not fully loaded, then the fast relays end up with less load than the
+slow relays. To compensate, clients should pick faster relays with
+higher probability.

-Where $q_i$ is the probability of the $i$th relay (out of $n$ relays) being selected, $t_i$ is the average latency at the $i$th relay, $x_i$ is the reciprocal of the $i$th relay's bandwidth, and $\Lambda$ is the total network load.
+In particular, we can estimate the network load because all Tor relays
+publish both their capacity and usage in their relay descriptor (but
+see the next section for problems that crop up there). The Tor network
+is currently loaded at around 50\%. This level is much higher than most
+reasonable networks, indicating that our plan in \prettyref{sec:capacity}
+to get more overall capacity is a good one. But 50\% is quite far from
+100\% when it becomes to optimal load balancing.

-This calculation is subject to a number of assumptions.
-In particular, it assumes that Tor relays have infinite length queues and input traffic is Poisson distributed.
-Whereas in practise Tor relays have finite length queues (which controls network load), and the distribution of input cells is not known.
-Unfortunately, these assumptions are necessary to apply standard queueing theory results.
+%, and derived a formula
+%for estimating average latency $T$:

-Despite the simplifications made to the network model, results derived from it may still be useful.
-This is especially the case because it models the entire network, whereas experiments can feasibly change only a few of the clients' behaviour.
-The formula is also amenable to mathematical analysis such as non-linear optimization.
+%$$+%T = \sum_{i=1}^n q_i t_i = \sum_{i=1}^n \frac{q_i x_i (2 - q_i x_i \Lambda)}{2 (1 - q_i x_i \Lambda)} +%\label{eqn:waiting} +%$$

-To try and find the optimum relay selection probabilities, I used a hill-climbing algorithm to minimize network latency, with a Tor directory snapshot as input.
-The results (shown in \prettyref{fig:optimum-selection} and \prettyref{fig:relative-selection}) depend on the network load relative to overall capacity.
-As load approaches capacity, the optimum selection probabilities converge to the one used by Tor: relay bandwidth proportional to network capacity.
-However, as load drops, the optimized selection algorithm favours slow relays less and faster relays more; many relays are not used at all.
+%Where $q_i$ is the probability of the $i$th relay (out of $n$ relays) being selected, $t_i$ is the average latency at the $i$th relay, $x_i$ is the reciprocal of the $i$th relay's bandwidth, and $\Lambda$ is the total network load.

+%This calculation is subject to a number of assumptions.
+%In particular, it assumes that Tor relays have infinite length queues and input traffic is Poisson distributed.
+%Whereas in practice Tor relays have finite length queues (which controls network load), and the distribution of input cells is not known.
+%Unfortunately, these assumptions are necessary to apply standard queueing theory results.
+
+To find the optimum relay selection probabilities the model, Steven
+used a hill-climbing algorithm to minimize network latency, with a Tor
+directory snapshot as input.
+The results (shown in \prettyref{fig:optimum-selection} and
+\prettyref{fig:relative-selection}) depend on the network load relative
+to overall capacity.
+As load approaches capacity, the optimum selection probabilities
+converge to the one currently used by Tor: relay bandwidth proportional
+to network capacity.
+However, as load drops, the optimized selection algorithm favors slow
+relays less and faster relays more; many relays are not used at all.
+
\begin{figure}
\includegraphics[width=\textwidth]{node-selection/optimum-selection-probabilities}
\caption{Optimum relay selection probabilities for a variety of network loads. Tor is currently at around 50\% utilization. The relay selection probabilities currently used by Tor are shown in black.}
@@ -774,33 +816,75 @@
\label{fig:relative-selection}
\end{figure}

-The relay selection probabilities discussed above are tuned to a particular level of network load.
-It is possible to estimate network load because all Tor relays report back both their capacity and usage in their descriptor.
-However, this estimate may be inaccurate and it is possible that the network load will change over time.
-In which case the relay selection algorithm chosen will no longer be optimal.
-\prettyref{fig:vary-load} shows how average network latency is affected by relay selection probabilities, for different levels of network load.
+Anecdotal evidence supports the theory that the fast relays in the Tor
+network have more spare capacity than they should. Several users have
+posted that they get much better Tor performance if they hard-code their
+paths to only use the fastest ten relays (and ignore the huge anonymity
+implications, of course).
+%an alternate strategy for relay selection that put even more weight on
+%the fast relays, and justified their design by trying it on the current
+%Tor network. Our current theory is that their alternate strategy puts
+%too much weight on the fast relays if all clients were doing it, and
+%it only looked good because the current network is unbalanced; but as
+%usual more research remains to get better intuition there.

-As can be seen, small changes in network load do not significantly affect the network latency, and for all load levels examined, the optimized selection probabilities do offer lower latency when compared to the Tor selection algorithm.
-However, each probability distribution has a cut-off point at which at least one relay will have a higher load than its capacity, at which its queue length, and hence latency, will become infinite.
-For the optimized selection probability distributions, this cut-off point is a few percent above the level they were designed to operate at.
-For the Tor selection algorithm, it is when the overall network capacity equals the overall network load.
+The relay selection probabilities in these graphs are tuned to a
+average network latency is affected by relay selection probabilities,
+the optimized selection probabilities offer lower latency when compared
+to Tor's current selection algorithm. However, there's a downside to
+tailoring for a particular load level: if we see a much heavier load in
+practice than the one we had in mind when we tuned our selection biases,
+then we end up overbalancing the network in the other direction.

-In this respect the Tor selection algorithm reaches the theoretical optimum, as no network can operate at greater than 100\% utilization while maintaining finite latency.
-However, in a real instantiation of any of these alternative probability distributions, the network latency would not become infinite; instead a connection would time out and a different circuit would be selected.
-So in practise, if the wrong probability distribution was selected, the network would converge at a different one.
-Unfortunately the standard queuing theory models cannot handle this case and it seems likely that a simulation would be required to estimate the effect.
-
\begin{figure}
\caption{Average network latency against network load. Three relay selection probabilities are shown, optimized for 50\%, 75\%, and 90\% network load. The Tor relay selection algorithm is also included (black). The dots on the $x$ axis show the level of network load at which the relay selection probability distributions are optimized for. The line is cut off when the model predicts that at least one relay will have an infinite queue length, which occurs before load $=$ capacity for all relay selection algorithms except for Tor's current one.}
\end{figure}

+Specifically, each probability distribution has a cut-off point at
+which (according to the model) at least one relay will have a higher
+load than its capacity, at which its queue length, and hence latency,
+will become infinite.
+For the optimized selection probability distributions, this cut-off
+point is a few percent above the level they were designed to operate at.
+For Tor's current selection algorithm, it is when the overall network
+capacity equals the overall network load.

-\subsection{The bandwidth numbers we get aren't very accurate either}
+In this respect the Tor selection algorithm reaches the theoretical
+optimum, as no network can operate at greater than 100\% utilization
+while maintaining finite latency.
+However, in a real instantiation of any of these alternative probability
+distributions, the network latency would not become infinite; instead
+a connection would time out and a different circuit would be selected.
+So in practice, if the wrong probability distribution was selected,
+the network would converge at a different one.
+Unfortunately the standard queuing theory models cannot handle this
+case; we need to move to a simulation rather than using equations
+and assumptions, to estimate the real effect.

+{\bf Impact}: Low-medium.
+
+{\bf Effort}: Medium.
+
+{\bf Risk}: Low.
+
+{\bf Plan}: It seems clear that some adjustments should be done in
+terms of biasing selection toward the faster relays. The exact load
+level to anticipate remains an open question though. Fortunately, in
+our new networkstatus algorithm, the directory authorities declare the
+bandwidths for each relay. So we can just reweight them on the fly and
+clients will use the new numbers. That means once enough clients have
+upgraded to using the bandwidths specified in the networkstatus, we can
+start to experiment with shifting the biases and see what results we get.
+
+\subsection{The bandwidth estimates we have aren't very accurate}
+
Peer-to-peer bandwidth estimation

Snader and Borisov~\cite{tuneup} proposed that each Tor relay opportunistically monitor the data rates that it achieves when communicating with other Tor relays.
@@ -808,6 +892,10 @@
If each Tor relay reported their measurements back to a central authority, it would be possible to estimate the capacity of each Tor relay.
This estimate would be difficult to game, when compared to the current self-advertisement of bandwidth capacity.

+Despite the simplifications made to the network model, results derived from it may still be useful.
+This is especially the case because it models the entire network, whereas experiments can feasibly change only a few of the clients' behaviour.
+The formula is also amenable to mathematical analysis such as non-linear optimization.
+
Experiments show that opportunistic bandwidth measurement has a better
systematic error than Tor's current self-advertised measure, although
has a poorer log-log correlation (0.48 vs. 0.57).
@@ -835,14 +923,14 @@
%% Micah Sherr tells me that latency and geolocation distance are
%% pretty much not correlated. -RD

-Micah Sherr is working on a thesis at Penn under Matt Blaze, to explore
+Micah Sherr is working on a thesis at Penn under Matt Blaze, that explores
exactly this issue. He suggests to use a virtual coordinate system --
a three or four dimension space such that distance between relays in
-virtual coordinate space corresponds to the network latency (or other
+the virtual coordinate space corresponds to the network latency (or other
metric) between them.

His experiments show that we could see a significant speedup in the Tor
-network is users choose their paths based on this new relay selection
+network if users choose their paths based on this new relay selection
algorithm.

%A second option would be to actually measure hop latency, and publish
@@ -876,9 +964,9 @@
When selecting an exit relay for a circuit, a Tor client will build a list
of all exit relays which can carry the desired stream, then select from
them with a probability weighted by each relay's capacity\footnote{The
-actual algorithm is slightly more complex, in particular exit relays which
-are also guard relays will be weighted less, and there is also preemptive
-circuit creation}.
+actual algorithm is slightly more complex: in particular, exit relays which
+are also guard relays will be weighted less, and some circuits are
+created preemptively without any destination port in mind.}.
This means that relays with more permissive exit policies will be
candidates for more circuits, and hence will be more heavily loaded
compared to relays with restrictive policies.