Dear all,
Over the past six years or so, the total population of Tor relays has capped out around 7,000, even as the total data carried by the Tor network has increased fourfold. This is not great, since it suggests an intrinsic limit to the number of Tor relays: a 'carrying capacity', in the sense of population dynamics.
It is worth understanding the limiting factor, and there are several plausible hypotheses. One is that the number of relays is determined by something extrinsic: for example, there are simply not enough people volunteering to run Tor relays to support continued growth. However, the increasing competition among relays and the de facto minimum requirements for entering the Tor consensus would seem to weaken this argument. Moreover, if it were true, we would expect the number of Tor relays to behave like a random walk: sometimes new people choose to run relays, sometimes people stop running them, and it just so happens that we have reached an equilibrium, with something like an Ornstein-Uhlenbeck (OU) process around that equilibrium point.
An alternative hypothesis is that there is some intrinsic phenomenon that is limiting the number of relays: that for every relay that is added to the consensus, it is more difficult for the population of relays to increase further. We can test this hypothesis by estimating the Hurst exponent for the number of relays over time, using the hourly observations in the consensus records on collector.torproject.org.
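For intuition about how to read the estimate: a pure random walk has a Hurst exponent near 0.5, while a mean-reverting series comes out well below 0.5. Here is a minimal sketch of that contrast, assuming numpy and the same 'hurst' package used in the analysis below; the OU parameters are arbitrary and only for illustration:

# Sketch: contrast the Hurst exponent of a pure random walk (H near 0.5)
# with that of a strongly mean-reverting (OU-like) series (H well below 0.5).
import numpy
import hurst

rng = numpy.random.default_rng(0)

# random walk: cumulative sum of i.i.d. steps
walk = numpy.cumsum(rng.normal(size=50000))

# discretized Ornstein-Uhlenbeck process: each step is pulled back toward zero
ou = numpy.zeros(50000)
for t in range(1, len(ou)):
    ou[t] = ou[t - 1] - 0.05 * ou[t - 1] + rng.normal()

H_walk, _, _ = hurst.compute_Hc(walk)
H_ou, _, _ = hurst.compute_Hc(ou)
print("random walk: H = %.2f, OU process: H = %.2f" % (H_walk, H_ou))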
We can also drill into this question further: is the growth in the number of relays limited by something that makes it harder for new relays to enter the consensus, by something that makes it more likely that existing relays are removed from the consensus, or neither? We can explore this by examining churn as a function of the total population. If churn as a proportion of the total population increases with the size of the population, then existing relays become more likely to fall out of the consensus as the population grows; if churn decreases with the size of the population, then it becomes harder for new relays to enter the consensus as the population grows.
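Concretely, churn here means the average of the number of relays that entered and the number that left between consecutive hourly consensuses, taken as a fraction of the population. Given the delta.csv format used below (timestamp, population, relays in, relays out), the regression amounts to roughly the following sketch; the full script, with outlier filtering and scatter plots, is under [code]:

# Sketch of the churn-vs-population regression.
import pandas
from scipy import stats

df = pandas.read_csv("delta.csv", names=['dt', 'n', 'in', 'out'], index_col=0)

# hourly churn as a fraction of the relay population
churn = (df['in'] + df['out']) / 2 / df['n']

slope, intercept, r, p, stderr = stats.linregress(df['n'], churn)
print("slope=%g, r^2=%.4f, p=%.3g" % (slope, r ** 2, p))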
We explore both questions below; see [code] and [output].
It turns out that the Hurst exponent is consistently about 0.3 for each of the past six years, well below the 0.5 expected of a random walk, and the spread across years is small enough that the mean reversion appears statistically significant at the three-sigma level. So the series is not a random walk: with every relay that enters the consensus, it becomes harder for the population to grow further. This observation supports the argument that the limit to the number of Tor relays is endogenous, not exogenous.
We also examined churn as a function of population size. That relationship turns out to be much weaker. There is a small positive effect of population size on churn rate overall, but population size explains only about 6% of the variance in the churn rate, so it is not a particularly important factor. The effect has also not been consistent across years; in 2017 and 2018, for example, it was reversed. We conclude that the effect of population size on churn is small.
So what is causing it to be difficult for new relays to be added to the consensus? Personally, I am partial to the conjecture that the cause is the number of pairwise TCP connections that the network must maintain to keep functioning. Every relay must be able to connect to every other relay, so the O(n^2) growth in connections seems like an obvious limit to scalability. But this is not the only possible explanation. Perhaps there is some limit in the consensus process, where the cadence of successive consensuses is simply too fast to accommodate a sufficiently large number of relays. Or maybe something about the way relays are evaluated or selected by clients tends toward approving only a certain number.
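To put rough numbers on that conjecture: a full mesh over n relays implies n(n-1)/2 potential relay pairs, and each relay would need to be prepared to talk to the other n-1. A quick back-of-the-envelope:

# Back-of-the-envelope: pairwise connections in a full mesh of n relays.
for n in (7000, 10000):
    pairs = n * (n - 1) // 2
    print("n=%d: %d relay pairs, up to %d connections per relay" % (n, pairs, n - 1))

Going from roughly 7,000 to 10,000 relays would about double the number of relay pairs.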
I suggest that we explore this further. If we want to see 10,000 Tor relays some day, we will need to identify the limiting factor first.
Jeremy Bentham
[code]
#!/usr/bin/python3
import hurst
import pandas
import matplotlib
from numpy import arange
from scipy import stats

matplotlib.use('agg')
import matplotlib.pyplot as plt


def desc(sdf, name):
    # Hurst exponent of the hourly relay-population series
    n = len(sdf)
    mu = sdf['n'].mean()
    try:
        H, c, val = hurst.compute_Hc(sdf['n'])
        print("Hurst exponent for %s (n=%d, mu=%f): %0.6f" % (name, n, mu, H))
    except Exception:
        # compute_Hc can fail on short or degenerate slices; skip the Hurst line
        pass

    # churn scatter plot: churn is the mean of relays entering and leaving the
    # consensus each hour, expressed as a fraction of the relay population
    x = sdf['n']
    y = list(map(lambda x: float(x)/2, sdf['in'] + sdf['out']))
    # drop outliers (undersized consensuses and extreme churn spikes)
    x, y = zip(*filter(
        lambda i: i[0] > 5000 and i[1] < 1200,
        zip(x, y)
    ))
    y = list(map(lambda a, b: b/a, x, y))

    slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
    print("slope=%s, intercept=%s, r_value=%s, p_value=%s, std_err=%s\n" % (
        slope, intercept, r_value, p_value, std_err
    ))

    line_x = arange(min(x), max(x))
    line_y = slope*line_x + intercept
    plt.figure(figsize=(10, 10))
    plt.scatter(x, y, c=range(len(x)), cmap='cool')
    plt.plot(
        line_x, line_y,
        label='$%.2fx + %.2f$, $R^2=%.2f$' % (slope, intercept, r_value**2)
    )
    plt.legend(loc='best')
    plt.title("Tor Relay Churn %s" % name)
    plt.xlabel("Relay Population")
    plt.ylabel("Hourly Churn")
    plt.tight_layout()
    plt.savefig("scatter-%s.png" % name)
    plt.clf()
    plt.close()


# delta.csv columns: consensus timestamp, population, relays added, relays removed
df = pandas.read_csv("delta.csv", names=['dt', 'n', 'in', 'out'], index_col=0)
for y in range(2015, 2022):
    sdf = df["%d-01-01" % y:"%d-01-01" % (y+1)]
    desc(sdf, y)
print("----")
sdf = df["2015-01-01":]
desc(sdf, "2015-2021")
[output]
Hurst exponent for 2015 (n=8751, mu=6688.496400): 0.292922
slope=6.828983625513537e-06, intercept=0.03150033656688232, r_value=0.24326729749329457, p_value=4.610820027076726e-118, std_err=2.911196095555647e-07
Hurst exponent for 2016 (n=8773, mu=7118.746153): 0.280950
slope=6.52665957746843e-06, intercept=0.026579771388178977, r_value=0.13035103343825263, p_value=1.5008838134149767e-34, std_err=5.300969996645152e-07
Hurst exponent for 2017 (n=8756, mu=6983.947579): 0.363264
slope=-5.6911267269861e-06, intercept=0.11128434282175681, r_value=-0.3197803122785664, p_value=2.4559989048776613e-207, std_err=1.802368096605417e-07
Hurst exponent for 2018 (n=8760, mu=6340.653425): 0.273098
slope=-1.8991031221395063e-05, intercept=0.19279021686921058, r_value=-0.4143375031745955, p_value=0.0, std_err=4.45750735830525e-07
Hurst exponent for 2019 (n=8754, mu=6422.183459): 0.281159
slope=2.4396305737493384e-06, intercept=0.05420159095635396, r_value=0.1028780872185714, p_value=4.925103966921193e-22, std_err=2.5213725374993213e-07
Hurst exponent for 2020 (n=8780, mu=6315.970046): 0.253770
slope=3.773262957205024e-06, intercept=0.04167914097261437, r_value=0.14185475596231883, p_value=5.692359499532742e-39, std_err=2.874775311046666e-07
Hurst exponent for 2021 (n=1574, mu=6921.173443): 0.361722
slope=-4.583624673579448e-07, intercept=0.07325673164319593, r_value=-0.008646008325663415, p_value=0.7324417790859385, std_err=1.3404752250927981e-06
----
Hurst exponent for 2015-2021 (n=54148, mu=6653.012096): 0.301609
slope=1.2316718714554112e-06, intercept=0.06349668094844796, r_value=0.06418146747221241, p_value=3.5734133468741144e-50, std_err=8.260683313218318e-08
P.S. This is how I generated 'delta.csv' from the consensus files:
#!/bin/zsh
# Walk the hourly consensus files in order and emit one CSV row per consensus:
# timestamp, relay count, relays added since the previous consensus, relays removed.
files=("${(@f)$(find . -name "*-consensus" | sort)}")
for (( i=1; i<${#files[@]}-1; i++ )); do
    if (( i == 2 )); then
        a=$(grep "^r " ${files[i]})
    fi
    b=$(grep "^r " ${files[i+1]})
    d=$(diff <(echo $a) <(echo $b))
    # turn .../YYYY-MM-DD-HH-MM-SS-consensus into "YYYY-MM-DD HH:MM:SS"
    echo $(
        sed -e "s/^.*\/\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-consensus/\1-\2-\3 \4:\5:\6/g" <(echo ${files[i+1]})
    ),$(echo $b | wc -l),$(grep "^> " <(echo $d) | wc -l),$(grep "^< " <(echo $d) | wc -l)
    a=$b
done