[network-health] estimating the carrying capacity for Tor relays

Fri Mar 19 09:02:46 UTC 2021

Dear all,

Over the past six years or so, the total population of Tor relays has
capped out around 7,000, even as the total data carried by the Tor network
has increased fourfold.  This is not great, since it suggests an intrinsic
limit to the number of Tor relays: a 'carrying capacity', in the sense of
population dynamics.

It is worth understanding the limiting factor, and there are several valid
hypotheses.  One is that the number of relays is determined by something
extrinsic, for example, not enough people volunteering to run Tor relays
to support continued growth.  However, the increasing competitiveness for
relays and de facto minimum requirements for entering the Tor consensus
would seem to weaken this argument.  Also, if this were true, it would
seem sensible to expect that the number of Tor relays would be a random
walk: sometimes new people choose to run relays, and sometimes people
choose to stop running relays, and it just so happens that we've reached
an equilibrium, with an OU process around that equilibrium point.

An alternative hypothesis is that there is some intrinsic phenomenon that
is limiting the number of relays: that for every relay that is added to
the consensus, it is more difficult for the population of relays to
increase further.  We can test this hypothesis by estimating the Hurst
exponent for the number of relays over time, using the hourly observations
in the consensus records on collector.torproject.org.

We can also drill into this question further: Is the increase in the
number of relays limited by something that is making it harder for new
relays to enter the consensus, is it limited by something that is making
it more likely that existing relays would be removed from the consensus,
or neither?  We can explore this by examining the churn as a function of
the total population.  If churn as a proportion of the total population
increases with the size of the population, then it is harder for new
relays to enter the consensus as the population increases; if the churn
decreases with the size of the population, then it is more likely for
existing relays to fall out of the consensus as the population increases.

We explore both questions below; see [code] and [output].

It turns out that the Hurst exponent is consistently about 0.3 for each of
the past six years, suggesting that the reversion is statistically
significant at the three-sigma level.  So, it is not a random walk: with
every relay that enters the consensus, it is harder for the next relay to
enter the consensus.  This observation supports the argument that the
limit to the number of Tor relays is endogenous, not exogenous.

We also examined the churn as a function of population size.  It turns out
that the relationship between churn and population size is much weaker. 
There is a small positive effect of population size on churn rate,
although our analysis suggests that population size explains only about 6%
of the churn rate and is not a particularly important factor.  Also, the
effect has not been consistent across years; for example, in 2018, the
effect was reversed.  We can conclude that the effect of population size
on churn is small.

So what is causing it to be difficult for new routers to be added to the
consensus?  Personally, I am partial to the conjecture that the cause is
the number of pairwise TCP connections that the network must maintain to
continue functioning.  Every router must connect to every other router, so
the n^2 complexity seems like an obvious limit to scalability.  But this
is not the only possible explanation.  Perhaps there is some limit to the
consensus process wherein the cadence of successive consensuses is simply
too fast to accommodate a sufficiently large number of routers.  Or maybe
something about the way relays are evaluated or selected by clients tends
toward approving only a certain number.

Suggest that we explore this further.  Suggest that if we want to see
10,000 Tor relays some day, we will need to identify the limiting factor
first.

Jeremy Bentham

[code]

#!/usr/bin/python3
import hurst
import pandas
import matplotlib
from numpy import arange
from scipy import stats
matplotlib.use('agg')
import matplotlib.pyplot as plt
def desc(sdf, name):
    # Hurst exponent
    n = len(sdf)
    mu = sdf['n'].mean()
    try:
        H, c, val = hurst.compute_Hc(sdf['n'])
        print("Hurst exponent for %s (n=%d, mu=%f): %0.6f" % (name, n, mu,
H))
    except:
        pass
    # churn scatter plot
    x = sdf['n']
    y = list(map(lambda x: float(x)/2, sdf['in']+sdf['out']))
    x, y = zip(*filter(
        lambda i: i[0] > 5000 and i[1] < 1200,
        zip(x, y)
    ))
    y = list(map(lambda a, b: b/a, x, y))
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    print("slope=%s, intercept=%s, r_value=%s, p_value=%s, std_err=%s\n" % (
        slope,
        intercept,
        r_value,
        p_value,
        std_err
    ))
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    line_x = arange(min(x), max(x))
    line_y = slope*line_x+intercept
    plt.figure(figsize=(10,10))
    plt.scatter(x, y, c=range(len(x)), cmap='cool')
    plt.plot(
        line_x,
        line_y,
        label='$%.2fx + %.2f$, $R^2=%.2f$' % (slope, intercept, r_value**2)
    )
    plt.legend(loc='best')
    plt.title("Tor Relay Churn %s" % name)
    plt.xlabel("Relay Population")
    plt.ylabel("Hourly Churn")
    plt.tight_layout()
    plt.savefig("scatter-%s.png" % name)
    plt.clf()
    plt.close()
df = pandas.read_csv("delta.csv", names=['dt', 'n', 'in', 'out'],
index_col=0)
for y in range(2015, 2022):
    sdf = df["%d-01-01" % y:"%d-01-01" % (y+1)]
    desc(sdf, y)
print("----")
sdf = df["2015-01-01":]
desc(sdf, "2015-2021")

[output]

Hurst exponent for 2015 (n=8751, mu=6688.496400): 0.292922
slope=6.828983625513537e-06, intercept=0.03150033656688232,
r_value=0.24326729749329457, p_value=4.610820027076726e-118,
std_err=2.911196095555647e-07

Hurst exponent for 2016 (n=8773, mu=7118.746153): 0.280950
slope=6.52665957746843e-06, intercept=0.026579771388178977,
r_value=0.13035103343825263, p_value=1.5008838134149767e-34,
std_err=5.300969996645152e-07

Hurst exponent for 2017 (n=8756, mu=6983.947579): 0.363264
slope=-5.6911267269861e-06, intercept=0.11128434282175681,
r_value=-0.3197803122785664, p_value=2.4559989048776613e-207,
std_err=1.802368096605417e-07

Hurst exponent for 2018 (n=8760, mu=6340.653425): 0.273098
slope=-1.8991031221395063e-05, intercept=0.19279021686921058,
r_value=-0.4143375031745955, p_value=0.0, std_err=4.45750735830525e-07

Hurst exponent for 2019 (n=8754, mu=6422.183459): 0.281159
slope=2.4396305737493384e-06, intercept=0.05420159095635396,
r_value=0.1028780872185714, p_value=4.925103966921193e-22,
std_err=2.5213725374993213e-07

Hurst exponent for 2020 (n=8780, mu=6315.970046): 0.253770
slope=3.773262957205024e-06, intercept=0.04167914097261437,
r_value=0.14185475596231883, p_value=5.692359499532742e-39,
std_err=2.874775311046666e-07

Hurst exponent for 2021 (n=1574, mu=6921.173443): 0.361722
slope=-4.583624673579448e-07, intercept=0.07325673164319593,
r_value=-0.008646008325663415, p_value=0.7324417790859385,
std_err=1.3404752250927981e-06

----
Hurst exponent for 2015-2021 (n=54148, mu=6653.012096): 0.301609
slope=1.2316718714554112e-06, intercept=0.06349668094844796,
r_value=0.06418146747221241, p_value=3.5734133468741144e-50,
std_err=8.260683313218318e-08