# [or-cvs] r18842: {projects} some cleanups, plus a terminology shift node->relay (projects/performance)

arma at seul.org arma at seul.org
Tue Mar 10 04:14:02 UTC 2009

Author: arma
Date: 2009-03-10 00:14:02 -0400 (Tue, 10 Mar 2009)
New Revision: 18842

Modified:
projects/performance/performance.tex
Log:
some cleanups, plus a terminology shift node->relay

Modified: projects/performance/performance.tex
===================================================================
--- projects/performance/performance.tex	2009-03-10 04:02:54 UTC (rev 18841)
+++ projects/performance/performance.tex	2009-03-10 04:14:02 UTC (rev 18842)
@@ -79,12 +79,12 @@

One of Tor's critical performance problems is in how it combines
high-volume streams with low-volume streams. We need to come up with ways
-to let the ``quiet'' streams (\eg web browsing) co-exist better with the
-``loud'' streams (\eg bulk transfer).
+to let the ``quiet'' streams (like web browsing) co-exist better with the
+``loud'' streams (like bulk transfer).

\subsection{TCP backoff slows down every circuit at once}

-Tor combines every circuit going between two Tor relays into a single TCP
+Tor combines all the circuits going between two Tor relays into a single TCP
connection. This approach is a smart idea in terms of anonymity, since
putting all circuits on the same connection prevents an observer from
learning which packets correspond to which circuit. But over the past
@@ -110,8 +110,8 @@

There have been two proposals to resolve this problem, but their
underlying principle is the same: use an unreliable protocol for links
-between Tor nodes, and perform error recovery and congestion management
-between the client and exit node. Tor partially funded Joel Reardon's
+between Tor relays, and perform error recovery and congestion management
+between the client and exit relay. Tor partially funded Joel Reardon's
thesis~\cite{reardon-thesis} under Ian Goldberg. His thesis proposed
using DTLS~\cite{DTLS}
(a UDP variant of TLS) as the link protocol and a cut-down version of
@@ -145,7 +145,7 @@
to the IKE key exchange protocol to support onion routing.
As with the proposal from Reardon, there is a risk of operating system
and machine fingerprinting from exposing the client TCP stack to the
-exit node.
+exit relay.
This could be resolved in a similar way, by implementing a user-mode
IPsec stack, but this would be a substantial effort, and would lose some
of the advantages of making use of existing building blocks.
@@ -213,7 +213,7 @@
for the circuit window, and whether it should vary.
Out of 200, 1\,000 (current value in Tor) and 5\,000, the optimum was
200 for all levels of packet loss.
-However this was only evaluated for a fixed network latency and node
+However this was only evaluated for a fixed network latency and relay
bandwidth.
Therefore, a different optimum may exist for networks with different
characteristics.
@@ -482,7 +482,7 @@
{\bf Plan}: A clear win. We should do as many advocacy aspects as we
can fit in.

+\subsubsection{Talks and trainings}

One of the best ways we've found for getting new relays is to go to
conferences and talk to people in person. There are many thousands of
@@ -493,16 +493,18 @@
Roger and Jake have been working on this angle, and Jake will be ramping
up even more on it in 2009.

-Advocacy and education is particularly important in the context of
-new and quickly-changing policies. In particular, the data retention
-question in Germany is causing some instability in the overall set
-of volunteers running relays. Karsten's latest metrics\footnote{\url
-https://www.torproject.org/projects/metrics} show that while the number
+Advocacy and education is especially important in the context of new and
+quickly-changing government policies. In particular, the data retention
+question in Germany is causing instability in the overall set
+of volunteers willing to run relays. Karsten's latest
+metrics\footnote{\url{https://www.torproject.org/projects/metrics}}
+show that while the number
of relays in other countries held steady or went up during 2008, the
numbers in Germany went down over the course of 2008. On the other hand,
the total amount of bandwidth provided by German relays held steady
during 2008 -- so while other operators picked up the slack, we still
-lose overall diversity of relays.
+lost overall diversity of relays. These results tell us where to focus
+our efforts.

\subsubsection{Better support for relay operators}

@@ -541,19 +543,19 @@
get credit for their contribution to the Tor network. This would raise
awareness for Tor, and encourage others to operate relays.

-Opportunities for expansion include allowing node operators to form
+Opportunities for expansion include allowing relay operators to form
``teams'', and for these teams to be ranked on the contribution to the
network. This competition may give more encouragement for team members to
increase their contribution to the network. Also, when one of the team
-members has their node fail, other team members may notice and provide
+members has their relay fail, other team members may notice and provide
assistance on fixing the problem.

\subsection{Funding more relays directly}

-Another option is to directly pay for hosting fees for fast relays (or
+Another option is to directly pay hosting fees for fast relays (or
to directly sponsor others to run them).

-The main challenge with this approach is that the efficiency is low: at
+The main problem with this approach is that the efficiency is low: at
even cheap hosting rates, the cost of a significant number of new relays
grows quickly. For example, if we can find 100 non-exit relays providing
1MB/s for as low as \$100/mo, that's \$120k per year. Figure some more
@@ -567,8 +569,8 @@

Plus the costs just keep coming, month after month.

-Overall, it seems more sustainable to invest in community outreach and
-education.
+Overall, it seems more sustainable to invest in better code, and community
+outreach and education.

{\bf Impact}: Medium.

@@ -590,8 +592,7 @@
Nick has been adapting libevent so it can handle a buffer-based
abstraction rather than the traditional Unix-style socket-based
abstraction. Then we will modify Tor to use this new abstraction. Nick's
-blog post\footnote{\url
-https://blog.torproject.org/blog/some-notes-progress-iocp-and-libevent}
+blog post\footnote{\url{https://blog.torproject.org/blog/some-notes-progress-iocp-and-libevent}}
provides more detail.

{\bf Impact}: Medium.
@@ -604,7 +605,7 @@
works for Nick) out in September 2009. Then iterate until it works
for everybody.

-\subsection{Node scanning to find overloaded nodes or broken exits}
+\subsection{Relay scanning to find overloaded relays or broken exits}

Part of the reason that Tor is slow is that some of the relays are
advertising more bandwidth than they can realistically handle. These
@@ -615,10 +616,8 @@
their connection attempts.

Mike has been working on tools to
-identify these relays: SpeedRacer\footnote{\url
-and SoaT\footnote{\url
Once the tools are further refined, we should be able to figure out if
there are general classes of problems (load balancing, common usability
problems, etc) that mean we should modify our design to compensate. The
@@ -656,7 +655,7 @@
is, and advertise this in its descriptor.
Clients then ignore relays with volatile IP addresses and old descriptors.
Similarly, directory authorities could prioritise the distribution of
-updated IP addresses for freshly changed nodes.
+updated IP addresses for freshly changed relays.

As a last note here, we currently have some bugs that are causing relays
with dynamic IP addresses to fall out of the network entirely. If a
@@ -674,8 +673,9 @@

\subsection{Incentives to relay}

-Our blog post on this topic\footnote{\url
-https://blog.torproject.org/blog/two-incentive-designs-tor} explains
+Our blog post on this
+topic\footnote{\url{https://blog.torproject.org/blog/two-incentive-designs-tor}}
+explains
our work to-date on this topic. The current situation is that we have
two designs to consider: one that's quite simple but has a serious
anonymity problem, and one that's quite complex.
@@ -736,54 +736,54 @@

\section{Choosing paths imperfectly}

-\subsection{We don't balance the load over our bandwidth numbers correctly}
+\subsection{We don't balance traffic over our bandwidth numbers correctly}

-Currently Tor selects nodes with a probability proportional to their bandwidth contribution to the network, however this may not be the optimal algorithm.
-Murdoch and Watson~\cite{murdoch-pet2008} investigated the performance impact of different node selection algorithms, and derived a formula for estimating average latency $T$:
+Currently Tor selects relays with a probability proportional to their bandwidth contribution to the network, however this may not be the optimal algorithm.
+Murdoch and Watson~\cite{murdoch-pet2008} investigated the performance impact of different relay selection algorithms, and derived a formula for estimating average latency $T$:

$$T = \sum_{i=1}^n q_i t_i = \sum_{i=1}^n \frac{q_i x_i (2 - q_i x_i \Lambda)}{2 (1 - q_i x_i \Lambda)} \label{eqn:waiting}$$

-Where $q_i$ is the probability of the $i$th node (out of $n$ nodes) being selected, $t_i$ is the average latency at the $i$th node, $x_i$ is the reciprocal of the $i$th node's bandwidth, and $\Lambda$ is the total network load.
+Where $q_i$ is the probability of the $i$th relay (out of $n$ relays) being selected, $t_i$ is the average latency at the $i$th relay, $x_i$ is the reciprocal of the $i$th relay's bandwidth, and $\Lambda$ is the total network load.

This calculation is subject to a number of assumptions.
-In particular, it assumes that Tor nodes have infinite length queues and input traffic is Poisson distributed.
-Whereas in practice Tor nodes have finite length queues (which controls network load), and the distribution of input cells is not known.
+In particular, it assumes that Tor relays have infinite length queues and input traffic is Poisson distributed.
+Whereas in practice Tor relays have finite length queues (which controls network load), and the distribution of input cells is not known.
Unfortunately, these assumptions are necessary to apply standard queueing theory results.

Despite the simplifications made to the network model, results derived from it may still be useful.
This is especially the case because it models the entire network, whereas experiments can feasibly change only a few of the clients' behaviour.
The formula is also amenable to mathematical analysis such as non-linear optimization.

-To try and find the optimum node selection probabilities, I used a hill-climbing algorithm to minimize network latency, with a Tor directory snapshot as input.
+To try and find the optimum relay selection probabilities, I used a hill-climbing algorithm to minimize network latency, with a Tor directory snapshot as input.
The results (shown in \prettyref{fig:optimum-selection} and \prettyref{fig:relative-selection}) depend on the network load relative to overall capacity.
-As load approaches capacity, the optimum selection probabilities converge to the one used by Tor: node bandwidth proportional to network capacity.
-However, as load drops, the optimized selection algorithm favours slow nodes less and faster nodes more; many nodes are not used at all.
+As load approaches capacity, the optimum selection probabilities converge to the one used by Tor: relay bandwidth proportional to network capacity.
+However, as load drops, the optimized selection algorithm favours slow relays less and faster relays more; many relays are not used at all.

\begin{figure}
\includegraphics[width=\textwidth]{node-selection/optimum-selection-probabilities}
-\caption{Optimum node selection probabilities for a variety of network loads. Tor is currently at around 50\% utilization. The node selection probabilities currently used by Tor are shown in black.}
+\caption{Optimum relay selection probabilities for a variety of network loads. Tor is currently at around 50\% utilization. The relay selection probabilities currently used by Tor are shown in black.}
\label{fig:optimum-selection}
\end{figure}

\begin{figure}
\includegraphics[width=\textwidth]{node-selection/relative-selection-probabilities}
-\caption{Difference between Tor's current node selection probabilities and the optimum, for a variety of network loads. For Tor's current network load ($\approx 50$\%) shown in pink, the slowest nodes are not used at all, and the slower nodes are favoured less.}
+\caption{Difference between Tor's current relay selection probabilities and the optimum, for a variety of network loads. For Tor's current network load ($\approx 50$\%) shown in pink, the slowest relays are not used at all, and the slower relays are favoured less.}
\label{fig:relative-selection}
\end{figure}

-The node selection probabilities discussed above are tuned to a particular level of network load.
-It is possible to estimate network load because all Tor nodes report back both their capacity and usage in their descriptor.
+The relay selection probabilities discussed above are tuned to a particular level of network load.
+It is possible to estimate network load because all Tor relays report back both their capacity and usage in their descriptor.
However, this estimate may be inaccurate and it is possible that the network load will change over time.
-In which case the node selection algorithm chosen will no longer be optimal.
-\prettyref{fig:vary-load} shows how average network latency is affected by node selection probabilities, for different levels of network load.
+In which case the relay selection algorithm chosen will no longer be optimal.
+\prettyref{fig:vary-load} shows how average network latency is affected by relay selection probabilities, for different levels of network load.

As can be seen, small changes in network load do not significantly affect the network latency, and for all load levels examined, the optimized selection probabilities do offer lower latency when compared to the Tor selection algorithm.
-However, each probability distribution has a cut-off point at which at least one node will have a higher load than its capacity, at which its queue length, and hence latency, will become infinite.
+However, each probability distribution has a cut-off point at which at least one relay will have a higher load than its capacity, at which its queue length, and hence latency, will become infinite.
For the optimized selection probability distributions, this cut-off point is a few percent above the level they were designed to operate at.
For the Tor selection algorithm, it is when the overall network capacity equals the overall network load.
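
As a rough illustration of the latency formula and the cut-off point discussed above, the following Python sketch evaluates $T$ for a given set of selection probabilities and finds the load at which the first relay saturates. The function names, the list-based inputs, and the choice to return infinity at saturation are assumptions made for this example; they are not taken from the paper or from Tor's code.

```python
from typing import Sequence


def average_latency(q: Sequence[float], bandwidth: Sequence[float], load: float) -> float:
    """Evaluate T = sum_i q_i * t_i using the queueing formula in the text.

    q[i] is the selection probability of relay i, bandwidth[i] is its
    capacity, and load is the total network load (Lambda), expressed in
    the same units as bandwidth. Returns infinity once any relay is
    offered at least as much traffic as it can carry.
    """
    total = 0.0
    for q_i, b_i in zip(q, bandwidth):
        if q_i == 0.0:
            continue  # an unused relay contributes nothing to the average
        x_i = 1.0 / b_i          # reciprocal of the relay's bandwidth
        rho = q_i * x_i * load   # fraction of the relay's capacity in use
        if rho >= 1.0:
            return float("inf")  # queue length, and hence latency, diverges
        total += q_i * x_i * (2.0 - rho) / (2.0 * (1.0 - rho))
    return total


def cutoff_load(q: Sequence[float], bandwidth: Sequence[float]) -> float:
    """Smallest total load at which some relay reaches its capacity."""
    return min(b_i / q_i for q_i, b_i in zip(q, bandwidth) if q_i > 0.0)
```

With bandwidth-proportional selection, $q_i x_i$ is the same for every relay (one over the total capacity), so `cutoff_load` returns the total capacity, which matches the statement above about Tor's current algorithm. A distribution that shifts probability towards fast relays makes at least one ratio $b_i / q_i$ smaller, so its cut-off falls below total capacity.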

@@ -794,7 +794,7 @@

\begin{figure}
-\caption{Average network latency against network load. Three node selection probabilities are shown, optimized for 50\%, 75\%, and 90\% network load. The Tor node selection algorithm is also included (black). The dots on the $x$ axis show the level of network load at which the node selection probability distributions are optimized for. The line is cut off when the model predicts that at least one node will have an infinite queue length, which occurs before load $=$ capacity for all node selection algorithms except for Tor's current one.}
+\caption{Average network latency against network load. Three relay selection probabilities are shown, optimized for 50\%, 75\%, and 90\% network load. The Tor relay selection algorithm is also included (black). The dots on the $x$ axis show the level of network load at which the relay selection probability distributions are optimized for. The line is cut off when the model predicts that at least one relay will have an infinite queue length, which occurs before load $=$ capacity for all relay selection algorithms except for Tor's current one.}
\end{figure}

@@ -803,9 +803,9 @@

Peer-to-peer bandwidth estimation

-Snader and Borisov~\cite{tuneup} proposed that each Tor node opportunistically monitor the data rates that it achieves when communicating with other Tor nodes.
-Since currently Tor uses a clique topology, given enough time, all nodes will communicate with all other Tor nodes.
-If each Tor node reported their measurements back to a central authority, it would be possible to estimate the capacity of each Tor node.
+Snader and Borisov~\cite{tuneup} proposed that each Tor relay opportunistically monitor the data rates that it achieves when communicating with other Tor relays.
+Since currently Tor uses a clique topology, given enough time, all relays will communicate with all other Tor relays.
+If each Tor relay reported their measurements back to a central authority, it would be possible to estimate the capacity of each Tor relay.
This estimate would be difficult to game, when compared to the current self-advertisement of bandwidth capacity.

Experiments show that opportunistic bandwidth measurement has a better
@@ -814,20 +814,20 @@
The most accurate scheme is active probing of capacity, with a log-log
correlation of 0.63, but this introduces network overhead.
All three schemes do suffer from fairly poor accuracy, presumably due
-to some nodes with high variance in bandwidth capacity.
+to some relays with high variance in bandwidth capacity.

\subsection{Bandwidth might not even be the right metric to weight by}

-Currently Tor selects paths purely by the random selection of nodes,
-biased by node bandwidth.
+Currently Tor selects paths purely by the random selection of relays,
+biased by relay bandwidth.
This will sometimes cause high latency circuits due to multiple ocean
-An alternative approach would be to not only bias selection of nodes
+An alternative approach would be to not only bias selection of relays
based on bandwidth, but to also bias the selection of hops based on
expected latency.

%One option would be to predict the latency of hops based on geolocation
%measurement database to be published.
%However, it does assume that the geolocation database is accurate and
@@ -851,84 +851,84 @@
%Alternatively, a central authority could perform the measurements and
%publish the results.
%Performing these measurements would be a $O(n^2)$ problem, where $n$
-%is the number of nodes, so does not scale well.
+%is the number of relays, so does not scale well.

%Publishing a latency database would also increase the size of the
%If na\"{\i}vely implemented, the database would scale with $O(n^2)$.
%However, a more efficient versions could be created, such as by dimension
-%reduction, creating a map in which the distance between any two nodes
+%reduction, creating a map in which the distance between any two relays
%is an approximation of the latency of a hop between them.
%Delta compression could be used if the map changes slowly.

Reducing the number of potential paths would also have anonymity
consequences, and these would need to be carefully considered.
For example, an attacker who wishes to monitor traffic could create
-several nodes, on distinct /16 subnets, but with low latency between them.
+several relays, on distinct /16 subnets, but with low latency between them.
A Tor client trying to minimize latency would be more likely to select
-these nodes for both entry and exit than it would otherwise.
+these relays for both entry and exit than it would otherwise.
This particular problem could be mitigated by selecting entry and
-exit node as normal, and only using latency measurements to select the
-middle node.
+exit relay as normal, and only using latency measurements to select the
+middle relay.

-\subsection{Considering exit policy in node selection}
+\subsection{Considering exit policy in relay selection}

-When selecting an exit node for a circuit, a Tor client will build a list
-of all exit nodes which can carry the desired stream, then select from
-them with a probability weighted by each node's capacity\footnote{The
-actual algorithm is slightly more complex, in particular exit nodes which
-are also guard nodes will be weighted less, and there is also preemptive
+When selecting an exit relay for a circuit, a Tor client will build a list
+of all exit relays which can carry the desired stream, then select from
+them with a probability weighted by each relay's capacity\footnote{The
+actual algorithm is slightly more complex, in particular exit relays which
+are also guard relays will be weighted less, and there is also preemptive
circuit creation}.
-This means that nodes with more permissive exit policies will be
+This means that relays with more permissive exit policies will be
candidates for more circuits, and hence will be more heavily loaded
-compared to nodes with restrictive policies.
+compared to relays with restrictive policies.
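
The simplified selection rule described above can be sketched as follows. This is a minimal illustration only: the `Relay` class, the `choose_exit` function, and the representation of an exit policy as a plain set of allowed ports are all hypothetical, and the sketch ignores the guard weighting and preemptive circuit creation mentioned in the footnote.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Relay:
    name: str
    capacity: float             # advertised bandwidth, arbitrary units
    exit_ports: FrozenSet[int]  # simplified stand-in for a real exit policy


def choose_exit(relays: List[Relay], port: int, rng: random.Random) -> Optional[Relay]:
    """Pick an exit for `port`, weighted by capacity among eligible relays."""
    # Step 1: list every relay whose (simplified) policy allows the port.
    candidates = [r for r in relays if port in r.exit_ports]
    if not candidates:
        return None
    # Step 2: choose one candidate with probability proportional to capacity.
    weights = [r.capacity for r in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


relays = [
    Relay("restrictive", 100.0, frozenset({80, 443})),
    Relay("permissive", 100.0, frozenset({22, 80, 443})),
]
rng = random.Random(0)
ssh_exit = choose_exit(relays, 22, rng)   # only "permissive" is eligible
web_exit = choose_exit(relays, 80, rng)   # either relay may be chosen
```

In this toy setup both relays advertise the same capacity, yet the permissive relay is a candidate for every port the restrictive one serves plus port 22, so it ends up carrying a larger share of the traffic.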

\begin{figure}
\includegraphics[width=\textwidth]{node-selection/exit-capacity}
-\caption{Exit node capacity, in terms of number of nodes and advertised
+\caption{Exit relay capacity, in terms of number of relays and advertised
bandwidth for a selection of port numbers.}
\label{fig:exit-capacity}
\end{figure}

-\prettyref{fig:exit-capacity} shows the exit node capacity for a selection
+\prettyref{fig:exit-capacity} shows the exit relay capacity for a selection
of port numbers.
It can be clearly seen that there is a radical difference in the
-availability of nodes for certain ports, generally those not in the
+availability of relays for certain ports, generally those not in the
default exit policy.
Any traffic to these ports will be routed through a small number of exit
-nodes, and if they have a permissive exit policy, they will likely become
+relays, and if they have a permissive exit policy, they will likely become
The extent of this effect will depend on how much traffic in Tor is to
ports which are not in the default exit policy.

-the selection probability of a node based on its exit policy and knowledge
+the selection probability of a relay based on its exit policy and knowledge
While it should improve performance, this modification will make it
-easier for malicious exit nodes to select traffic they wish to monitor.
-For example, an exit node which wants to attack SSH sessions can currently
+easier for malicious exit relays to select traffic they wish to monitor.
+For example, an exit relay which wants to attack SSH sessions can currently
list only port 22 in their exit policy.
Currently they will get a small amount of traffic compared to their
capacity, but with the modification they will get a much larger share
of SSH traffic.
-However a malicious exit node could already do this, by artificially
+However a malicious exit relay could already do this, by artificially

\subsubsection{Further work}

-To properly balance exit node usage, it is necessary to know the usage
+To properly balance exit relay usage, it is necessary to know the usage
of the Tor network, by port.
McCoy \detal~\cite{mccoy-pet2008} have figures for protocol usage in
Tor, but these figures were generated through deep packet inspection,
rather than by port number.
-Furthermore, the exit node they ran used the fairly permissive default
+Furthermore, the exit relay they ran used the fairly permissive default
exit policy.
Therefore, their measurements will underestimate the relative traffic on
ports which are present in the default exit policy, and are also present
in more restrictive policies.
To accurately estimate the Tor network usage by port, it is necessary
-to measure the network usage by port on one or more exit nodes, while
-simultaneously recording the exit policy of all other exit nodes
+to measure the network usage by port on one or more exit relays, while
+simultaneously recording the exit policy of all other exit relays
considered usable.