[tor-relays] Measuring the Accuracy of Tor Relays' Advertised Bandwidths

Fri Jul 26 14:18:24 UTC 2019

Hello relay operators!

I am planning on performing an experiment on the Tor network to try to gauge the accuracy of the advertised bandwidths that relays report in their server descriptors. Briefly, the experiment involves running a speed test on every relay for a short time (about 20 seconds). Details follow.

I plan to run the experiment in about 1 week. Relay operators can opt out of the speed test by replying on this thread, and we will remove your relays from the list of relays to scan.

Peace, love, and positivity,
Rob

---
Measuring the Accuracy of Tor Relays' Advertised Bandwidths

Motivation
----------
The capacity of Tor relays (maximum available goodput) is an important metric. Combined with mean goodput, it allows us to compute the bandwidth utilization of individual relays as well as the entire network in aggregate. Generally, capacity is used to help balance client load across relays, and relay utilization rates help Tor make informed decisions about how to allocate resources and prioritize performance and scalability improvements.

Problem
-------
Currently, Tor uses a heuristic measure of unknown accuracy to estimate Tor relay capacity. Each relay keeps track of the maximum goodput it has achieved over any 10 second window in a 24 hour period. This is called the "observed bandwidth". Relays take the minimum of their "observed bandwidth" and their bandwidth rate-limiting configuration and report the result as the "advertised bandwidth" in their server descriptors. We do not know how well the advertised bandwidth estimates the true relay capacity, but we do know that it represents a lower bound on capacity.
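
To make the heuristic concrete, here is a minimal Python sketch of the calculation as described above. It is not Tor's actual code: the function names, the per-second byte-count input, and the single rate_limit parameter are assumptions made for illustration.

```python
def observed_bandwidth(bytes_per_second, window=10):
    """Highest mean goodput (bytes/s) over any `window` consecutive seconds.

    `bytes_per_second` is a hypothetical list of per-second byte counts
    covering the observation period (24 hours in the description above).
    """
    if len(bytes_per_second) < window:
        return 0.0
    # Slide a fixed-size window across the samples, tracking the largest sum.
    window_sum = sum(bytes_per_second[:window])
    best = window_sum
    for i in range(window, len(bytes_per_second)):
        window_sum += bytes_per_second[i] - bytes_per_second[i - window]
        best = max(best, window_sum)
    return best / window


def advertised_bandwidth(bytes_per_second, rate_limit, window=10):
    """Minimum of the observed bandwidth and the configured rate limit."""
    return min(observed_bandwidth(bytes_per_second, window), rate_limit)
```

Note that a relay whose traffic only peaks for 5 seconds never fills a whole 10 second window at that peak, so its observed bandwidth comes out below the peak. This is the sense in which the advertised bandwidth is only a lower bound.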

Hypothesis
----------
The advertised bandwidth significantly underestimates the true capacity of Tor relays. On average, the degree of underestimation will be positively correlated with true capacity: relays with higher true capacities will be underestimated more (because it will be less likely that fast relays will have sustained their full capacity over a 10 second period).

Experiment
----------
A relay reports its advertised bandwidth in its server descriptor. To test how well these reported numbers represent the true capacity of a relay, we can manually perform a speed test on the relay by initiating the simultaneous download of several large data streams for a period that exceeds 10 seconds. Following our test, the relay will report its updated advertised bandwidth in its server descriptor, and the results will be collected and reported by metrics.torproject.org.

The experiment involves two steps: running the speed test on a relay under our control, and running the speed test on all relays in the Tor network.

We will first run the speed test on at least one relay that we control, in order to test that the method is effective and that we can in fact observe a change in the advertised bandwidth reported on metrics.torproject.org. Once we have confidence that our speed test is functioning correctly, and that the metrics pipeline will allow us to gather the results, we will repeat it on all relays in the network.

We will conduct the speed tests while minimizing network overhead. We will use a custom client that builds 2-relay circuits. The first relay will be the target relay we are speed testing, and the second relay will be a fast exit relay that we control. We will initiate data streams between a speedtest client and server running on the same machine as our exit relay.

The setup will look like:

speedtest-client <--> tor-client <--> target-relay <--> exit-relay <--> speedtest-server

All components will run on the same machine that we control except for the target-relay, which will rotate as we test different relays in the network. For each target relay, we plan to run the speedtest for 20 seconds in order to increase the probability that the 10 second mean goodput will reach the true capacity. We will measure each relay over a few days to ensure that our speedtest effects are reported by every relay.
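
As a rough illustration of why the test runs longer than the 10 second window, the sketch below simulates a transfer that needs a few seconds to ramp up to the relay's capacity. The linear ramp and all of the names are assumptions for illustration only; real transfers will not behave this neatly.

```python
def best_window_mean(samples, window=10):
    """Highest mean over any `window` consecutive per-second samples."""
    if len(samples) < window:
        return 0.0
    window_sum = sum(samples[:window])
    best = window_sum
    for i in range(window, len(samples)):
        window_sum += samples[i] - samples[i - window]
        best = max(best, window_sum)
    return best / window


def simulated_test(capacity, duration, ramp_seconds):
    """Per-second goodput for a test that ramps linearly up to `capacity`.

    The linear ramp is a simplifying assumption, not a model of Tor or TCP.
    """
    if ramp_seconds <= 0:
        return [float(capacity)] * duration
    return [capacity * min(ramp_seconds, t + 1) / ramp_seconds
            for t in range(duration)]


# With a 5 second ramp, a 10 second test never fills a whole window at
# capacity, while a 20 second test does.
ten = best_window_mean(simulated_test(100, duration=10, ramp_seconds=5))
twenty = best_window_mean(simulated_test(100, duration=20, ramp_seconds=5))
```

Under these assumptions the 10 second test yields a best window mean below capacity, and the 20 second test reaches it.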

Although we believe that the overhead of this speed test is in line with regular usage, relay operators can opt out of the speed test by replying on this thread. Those that opt out will be removed from our list of relays to scan.

Analysis
--------
Following our speedtest, we will analyze the data collected and reported by Tor metrics. We will compare the advertised bandwidth that each relay reports before our experiment to that reported during our experiment. This will help us test our hypothesis that relays' advertised bandwidth underestimates the true capacity of relays. We will run a statistical correlation analysis on the data to test the strength of the correlation between the previously reported (estimated) relay capacity and relay capacity underestimation. We will report our results to the Tor community.
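
One possible shape for this analysis is sketched below in Python. The choice of the Pearson coefficient is an assumption (we have not settled on a specific statistic), and the function names and input lists are hypothetical.

```python
import math


def underestimation(before, during):
    """Per-relay gap between advertised bandwidth during and before the test."""
    return [d - b for b, d in zip(before, during)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def capacity_vs_underestimation(before, during):
    """Correlate previously advertised bandwidth with the measured gap."""
    return pearson(before, underestimation(before, during))
```

Here `before` and `during` would be lists of advertised bandwidths for the same relays in the same order, taken from descriptors published before and during the experiment. A rank-based coefficient may turn out to be a better fit if the relationship is not linear.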

We expect that the results of our experiment will help Tor decide how to allocate resources and will help them plan and prioritize performance improvements. It will also provide insight into the operation of the current load balancing system, which uses advertised bandwidth to produce consensus weights.