[network-health] estimating the carrying capacity for Tor relays

Fri Mar 19 13:31:48 UTC 2021

P.S. Apologies: The last sentence of my fourth paragraph is exactly
backwards, although it does not change the meaning of my message.

Also, this is how I generated 'delta.csv' from the consensus files:

#!/bin/zsh
files=("${(@f)$(find . -name "*-consensus" | sort)}")
for (( i=1; i<${#files[@]}-1; i++ )); do
    if (( i == 2 )); then
        a=$(grep "^r " ${files[i]})
    fi
    b=$(grep "^r " ${files[i+1]})
    d=$(diff <(echo $a) <(echo $b))
    echo $(
        sed -e
"s/^.*\/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-consensus/\1-\2-\3
\4:\5:\6/g" <(echo ${files[i+1]})
    ),$(
        echo $b | wc -l
    ),$(
        grep "^> " <(echo $d) | wc -l
    ),$(
        grep "^< " <(echo $d) | wc -l
    )
    a=$b
done

On Fri, March 19, 2021 9:02 am, bentham at danwin1210.me wrote:
> Dear all,
>
>
> Over the past six years or so, the total population of Tor relays has
> capped out around 7,000, even as the total data carried by the Tor network
>  has increased fourfold.  This is not great, since it suggests an
> intrinsic limit to the number of Tor relays: a 'carrying capacity', in the
> sense of population dynamics.
>
> It is worth understanding the limiting factor, and there are several
> valid hypotheses.  One is that the number of relays is determined by
> something extrinsic, for example, not enough people volunteering to run
> Tor relays
> to support continued growth.  However, the increasing competitiveness for
> relays and de facto minimum requirements for entering the Tor consensus
> would seem to weaken this argument.  Also, if this were true, it would
> seem sensible to expect that the number of Tor relays would be a random
> walk: sometimes new people choose to run relays, and sometimes people
> choose to stop running relays, and it just so happens that we've reached an
> equilibrium, with an OU process around that equilibrium point.
>
> An alternative hypothesis is that there is some intrinsic phenomenon that
>  is limiting the number of relays: that for every relay that is added to
> the consensus, it is more difficult for the population of relays to
> increase further.  We can test this hypothesis by estimating the Hurst
> exponent for the number of relays over time, using the hourly
> observations in the consensus records on collector.torproject.org.
>
> We can also drill into this question further: Is the increase in the
> number of relays limited by something that is making it harder for new
> relays to enter the consensus, is it limited by something that is making
> it more likely that existing relays would be removed from the consensus,
> or neither?  We can explore this by examining the churn as a function of
> the total population.  If churn as a proportion of the total population
> increases with the size of the population, then it is harder for new
> relays to enter the consensus as the population increases; if the churn
> decreases with the size of the population, then it is more likely for
> existing relays to fall out of the consensus as the population increases.
>
>
> We explore both questions below; see [code] and [output].
>
>
> It turns out that the Hurst exponent is consistently about 0.3 for each
> of the past six years, suggesting that the reversion is statistically
> significant at the three-sigma level.  So, it is not a random walk: with
> every relay that enters the consensus, it is harder for the next relay to
>  enter the consensus.  This observation supports the argument that the
> limit to the number of Tor relays is endogenous, not exogenous.
>
> We also examined the churn as a function of population size.  It turns
> out that the relationship between churn and population size is much
> weaker. There is a small positive effect of population size on churn rate,
>  although our analysis suggests that population size explains only about
> 6%
> of the churn rate and is not a particularly important factor.  Also, the
> effect has not been consistent across years; for example, in 2018, the
> effect was reversed.  We can conclude that the effect of population size
> on churn is small.
>
> So what is causing it to be difficult for new routers to be added to the
> consensus?  Personally, I am partial to the conjecture that the cause is
> the number of pairwise TCP connections that the network must maintain to
> continue functioning.  Every router must connect to every other router,
> so the n^2 complexity seems like an obvious limit to scalability.  But
> this is not the only possible explanation.  Perhaps there is some limit to
> the consensus process wherein the cadence of successive consensuses is
> simply too fast to accommodate a sufficiently large number of routers.  Or
> maybe something about the way relays are evaluated or selected by clients
> tends toward approving only a certain number.
>
> Suggest that we explore this further.  Suggest that if we want to see
> 10,000 Tor relays some day, we will need to identify the limiting factor
> first.
>
> Jeremy Bentham
>
>
> [code]
>
>
> #!/usr/bin/python3
> import hurst import pandas import matplotlib from numpy import arange from
> scipy import stats matplotlib.use('agg') import matplotlib.pyplot as plt
def
> desc(sdf, name): # Hurst exponent
> n = len(sdf) mu = sdf['n'].mean() try:
> H, c, val = hurst.compute_Hc(sdf['n'])
> print("Hurst exponent for %s (n=%d, mu=%f): %0.6f" % (name, n, mu, H))
> except:
> pass # churn scatter plot
> x = sdf['n'] y = list(map(lambda x: float(x)/2, sdf['in']+sdf['out'])) x, y
> = zip(*filter(
> lambda i: i[0] > 5000 and i[1] < 1200, zip(x, y) ))
> y = list(map(lambda a, b: b/a, x, y)) slope, intercept, r_value, p_value,
> std_err = stats.linregress(x, y) print("slope=%s, intercept=%s,
> r_value=%s, p_value=%s, std_err=%s\n" % ( slope, intercept, r_value,
p_value,
>  std_err ))
> slope, intercept, r_value, p_value, std_err = stats.linregress(x, y) line_x
> = arange(min(x), max(x))
> line_y = slope*line_x+intercept plt.figure(figsize=(10,10)) plt.scatter(x,
> y, c=range(len(x)), cmap='cool') plt.plot( line_x, line_y, label='$%.2fx +
> %.2f$, $R^2=%.2f$' % (slope, intercept, r_value**2)
> )
> plt.legend(loc='best') plt.title("Tor Relay Churn %s" % name)
> plt.xlabel("Relay Population") plt.ylabel("Hourly Churn")
> plt.tight_layout() plt.savefig("scatter-%s.png" % name) plt.clf()
> plt.close() df = pandas.read_csv("delta.csv", names=['dt', 'n', 'in',
> 'out'],
> index_col=0) for y in range(2015, 2022): sdf = df["%d-01-01" % y:"%d-01-01"
> % (y+1)]
> desc(sdf, y) print("----") sdf = df["2015-01-01":] desc(sdf, "2015-2021")
>
> [output]
>
>
> Hurst exponent for 2015 (n=8751, mu=6688.496400): 0.292922
> slope=6.828983625513537e-06, intercept=0.03150033656688232,
> r_value=0.24326729749329457, p_value=4.610820027076726e-118,
> std_err=2.911196095555647e-07
>
> Hurst exponent for 2016 (n=8773, mu=7118.746153): 0.280950
> slope=6.52665957746843e-06, intercept=0.026579771388178977,
> r_value=0.13035103343825263, p_value=1.5008838134149767e-34,
> std_err=5.300969996645152e-07
>
> Hurst exponent for 2017 (n=8756, mu=6983.947579): 0.363264
> slope=-5.6911267269861e-06, intercept=0.11128434282175681,
> r_value=-0.3197803122785664, p_value=2.4559989048776613e-207,
> std_err=1.802368096605417e-07
>
> Hurst exponent for 2018 (n=8760, mu=6340.653425): 0.273098
> slope=-1.8991031221395063e-05, intercept=0.19279021686921058,
> r_value=-0.4143375031745955, p_value=0.0, std_err=4.45750735830525e-07
>
> Hurst exponent for 2019 (n=8754, mu=6422.183459): 0.281159
> slope=2.4396305737493384e-06, intercept=0.05420159095635396,
> r_value=0.1028780872185714, p_value=4.925103966921193e-22,
> std_err=2.5213725374993213e-07
>
> Hurst exponent for 2020 (n=8780, mu=6315.970046): 0.253770
> slope=3.773262957205024e-06, intercept=0.04167914097261437,
> r_value=0.14185475596231883, p_value=5.692359499532742e-39,
> std_err=2.874775311046666e-07
>
> Hurst exponent for 2021 (n=1574, mu=6921.173443): 0.361722
> slope=-4.583624673579448e-07, intercept=0.07325673164319593,
> r_value=-0.008646008325663415, p_value=0.7324417790859385,
> std_err=1.3404752250927981e-06
>
> ----
> Hurst exponent for 2015-2021 (n=54148, mu=6653.012096): 0.301609
> slope=1.2316718714554112e-06, intercept=0.06349668094844796,
> r_value=0.06418146747221241, p_value=3.5734133468741144e-50,
> std_err=8.260683313218318e-08
>
>
> _______________________________________________
> network-health mailing list network-health at lists.torproject.org
> https://lists.torproject.org/cgi-bin/mailman/listinfo/network-health
>
>