Mike Perry transcribed 9.3K bytes:
Andrew Lewman:
I had a conversation with a vendor yesterday. They are interested in including Tor as their "private browsing mode" and basically shipping a re-branded tor browser which lets people toggle the connectivity to the Tor network on and off.
They very much like Tor Browser and would like to ship it to their customer base. Their product accounts for 10-20% of the global market of roughly 2.8 billion Internet users.
As Tor Browser is open source, they are already working on it. However, their concern is scaling up to handle some percentage of global users with "tor mode" enabled. They're willing to entertain offering their resources to help us solve the scalability challenges of handling hundreds of millions of users and relays on Tor.
As this question keeps popping up from businesses looking at privacy as the next "must have" feature in their products, I'm trying to compile a list of tasks to solve to help us scale. The old 2008 three-year roadmap looks at performance: https://www.torproject.org/press/2008-12-19-roadmap-press-release.html.en
I've been through the specs, https://gitweb.torproject.org/torspec.git/tree/HEAD:/proposals to see if there are proposals for scaling the network or directory authorities. I didn't see anything directly related.
The most recent research papers I see directly addressing scalability are Torsk (http://www.freehaven.net/anonbib/bibtex.html#ccs09-torsk) and PIR-Tor (http://www.freehaven.net/anonbib/bibtex.html#usenix11-pirtor).
These research papers basically propose a total network overhaul to deal with the problem of Tor relay directory traffic overwhelming the Tor network and/or Tor clients.
However, I believe that with only minor modifications, the current Tor network architecture could support 100M daily directly connecting users, assuming we focus our efforts on higher capacity relays and not simply adding tons of slower relays.
The core problem is that the fraction of network capacity that you spend telling users about the current relays in the network can be written as:
f = D*U/B
D is current Tor relay directory size in bytes per day, U is number of users, and B is the bandwidth per day in bytes provided by this Tor network. Of course, this is a simplification, because of multiple directory fetches per day and partially-connecting/idle clients, but for purposes of discussion it is good enough.
To put some real numbers on this, if you compare https://metrics.torproject.org/bandwidth.html#dirbytes with https://metrics.torproject.org/bandwidth.html#bandwidth, you can see that we're currently devoting about 2% of our network throughput to directory activity (~120MiB/sec out of ~5000MiB/sec). So we're not exactly hurting at this point in terms of our directory bytes per user yet.
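For concreteness, here is the same back-of-envelope ratio in a few lines of Python; the two inputs are just the approximate metrics figures quoted above (the dirbytes graph already aggregates over all users, i.e. it is effectively D*U):

    # Rough check of f = D*U/B using the approximate figures above.
    dir_bytes_per_sec   = 120 * 2**20     # ~120 MiB/s of directory traffic (D*U)
    total_bytes_per_sec = 5000 * 2**20    # ~5000 MiB/s of network throughput (B)

    f = dir_bytes_per_sec / total_bytes_per_sec
    print("fraction spent on directory traffic: %.1f%%" % (100 * f))
    # prints ~2.4% with these inputs, i.e. roughly the 2% quoted above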
But, because this fraction rises with both D and U, these research papers rightly point out that you can't keep adding relays *and* users and expect Tor to scale.
However, when you look at this f=D*U/B formula, what it also says is that if you can reduce the relay directory size by a factor c, and also grow the network capacity by this same factor c, then you can multiply the userbase by c, and have the same fraction of directory bytes.
This means that rather than trying to undertake a major network overhaul like Torsk or PIR-Tor to try to support hundreds of thousands of slow, junky relays, we can scale the network by focusing on improving the situation for high capacity relay operators, so we can provide more network bandwidth for the same number of directory bytes per user.
So, let's look at ways to reduce the size of the Tor relay directory, and each way we can find to do so means a corresponding increase in the number of users we can support:
1. Proper multicore support.
Right now, any relay with more than ~100Mbit of capacity really needs to run an additional tor relay instance on that link to make use of it. If they have AES-NI, this might go up to 300Mbit.
Each of these additional instances is basically wasted directory bytes for those relay descriptors.
But with proper multicore support, such high capacity relays could run only one relay instance on links as fast as 2.5Gbit (assuming an 8 core AES-NI machine).
Result: 2-8X reduction in consensus and directory size, depending on the number of high capacity relays on multicore systems we have.
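One rough way to gauge how much consolidation multicore support could buy is to count relays that share an IP address in the current consensus, since running several instances on one box is the usual workaround today. This is only a heuristic (co-located relays on one IP are not always one operator's multi-instance setup), using Stem's remote descriptor fetching:

    from collections import Counter
    import stem.descriptor.remote

    # Heuristic: relays sharing an IP address are often extra instances run
    # only to work around the single-core bottleneck described above.
    consensus = stem.descriptor.remote.get_consensus()
    per_ip = Counter(router.address for router in consensus)

    total = sum(per_ip.values())
    extra = sum(count - 1 for count in per_ip.values() if count > 1)
    print("relays in the consensus:       %d" % total)
    print("extra instances sharing an IP: %d" % extra)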
2. Cut off relays below the median capacity, and turn them into bridges.
Relays in the top 10% of the network are 164 times faster than relays in the 50-60% range, 1400 times faster than relays in the 70-80% range, and 35000 times faster than relays in the 90-100% range.
In fact, many relays are so slow that they provide less bytes to the network than it costs to tell all of our users about them. There should be a sweet spot where we can set this cutoff such that the overhead from directory activity balances the loss of capacity from these relays, as a function of userbase size.
Result: ~2X reduction in consensus and directory size.
It's super frustrating when I publicly tell people that ― as much as we <3 them for running a relay ― doing so on a home connection, on wimpy hardware like Raspberry Pis, is likely only going to harm the Tor network. And then people point at "If you have at least 100 kilobytes/s each way, please help out Tor by configuring your Tor to be a relay" on our website [0] and stop listening to whatever other relay-running advice I have to give.
So... here's the background on the "sweet spot" Mike was talking about, and why he stated: "[...]many relays are so slow that they provide less bytes to the network than it cost to tell all of our users about them.":
Using Stem on my latest copy of the consensus to run some calculations on the relay advertised bandwidth (RAB), I get:
Average RAB: 3887.222911227154 KB/s
Median RAB: 249.5 KB/s
Combined RABs of all RABs < 249.5 KB/s: 162354 KB/s
Bandwidth used for directory requests [1]: ~125 MB/s
Current total bandwidth usage [2]: ~5700 MB/s
Meaning that, if we cut off all relays below the current median of 250KB/s, we lose 3064 relays, and lose 158 MB/s of network throughput.
Currently, 2.2% of our bandwidth usage goes toward directory requests (125MB/s / 5700MB/s). If we cut off the relays under 250 KB/s, we cut that 2.2% to 1.1%, saving roughly 75 MB/s in directory requests.
Overall, this means that we can halve the size of the current consensus and, rather than losing 158 MB/s, we only actually lose 83 MB/s in throughput. We could easily play with these numbers a bit, and find a "sweet spot" where the bandwidth cutoff rate is determined by whatever makes us net a positive change in overall bandwidth, taking directory requests into account. In other words: "If your relay costs us more to tell users about than the actual traffic it's providing, we don't want it!"
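For anyone who wants to re-run this, here is a minimal sketch of the calculation with Stem. It fetches fresh server descriptors instead of using my cached consensus and approximates advertised bandwidth as min(rate, observed), so its numbers will differ a bit from the snapshot above:

    import statistics
    import stem.descriptor.remote

    descriptors = list(stem.descriptor.remote.get_server_descriptors())

    # Advertised bandwidth, approximated as min(rate, observed), in bytes/s.
    adv = [min(d.average_bandwidth, d.observed_bandwidth) for d in descriptors]

    median = statistics.median(adv)
    below = [bw for bw in adv if bw < median]

    print("median advertised bandwidth: %.1f KB/s" % (median / 1000.0))
    print("relays below the median:     %d" % len(below))
    print("throughput lost if cut:      %.1f MB/s" % (sum(below) / 1e6))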
Long term, I don't think we want to do "only 3000 relays are allowed at any given time", but instead, a compromise where:
2.a. Have a sliding definition of what a "real internet connection" is, by modifying the statistics above to find the "sweet spot", and set this as the cutoff rate for the required minimum bandwidth for being a relay.
2.b. The sliding minimum bandwidth for running a relay is *actually* enforced. If you're below the minimum, no one's going to stop you from running your relay, but it's not going to be in the consensus.
Result: Overall network bandwidth stays the same. The size of the current consensus is roughly chopped in half.
Also, BridgeDB doesn't want your slow relays as bridges. See Footnote [3].
3. Switching to ECC keys only.
We're wasting a lot of directory traffic on uncompressible RSA1024 keys, which are 4X larger than ECC keys, and less secure. Right now, were also listing both. When we finally remove RSA1024 entirely, the directory should get quite a bit smaller.
Result: ~2-4X reduction in consensus and directory size.
I'm going to ignore microdescriptors for now: I don't use them because they're a Bad Idea (see #5968), and I'm too lazy to go fetch some of them. :)
Mike, you said:
were [sic] also listing both
Should we assume, then, that you're only talking about the `onion-key`s, but not the `signing-key`s (which are also currently 1024-bit RSA)?
Also... removing `onion-key`s from the `@type server-descriptor`s would not result in a "~2-4X reduction in [...] directory size". (It might possibly for the cached-microdescriptors, but I'm still ignoring those.)
Taking for example a really small server-descriptor (I removed the contact line and did things like making the bandwidth numbers as small as possible), and one of the largest server descriptors I could find, then making copies of each of these descriptors without the `onion-key`s, and then compressing each one of the four files with `gzip -n -9 $FILE`, I got:
Small server-descriptor, with onion key, compressed: 905 B
Small server-descriptor, without onion key, compressed: 756 B
Large server-descriptor, with onion key, compressed: 1127 B
Large server-descriptor, without onion key, compressed: 980 B
Meaning that, without factoring in potential savings from gzipping multiple descriptors at a time, cutting out `onion-key`s would result in server-descriptors which are only 84% - 87% of the size. 13% savings isn't all that much.
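For reference, a minimal way to reproduce that measurement; the input file name is a placeholder, and Python's gzip.compress is close to, but not byte-identical with, `gzip -n -9`:

    import gzip
    import re

    def gzipped_size(text):
        # gzip.compress defaults to compression level 9.
        return len(gzip.compress(text.encode()))

    descriptor = open('server-descriptor.txt').read()  # placeholder input file

    # Strip the onion-key block (an RSA1024 public key in PEM format).
    stripped = re.sub(
        r'onion-key\n-----BEGIN RSA PUBLIC KEY-----.*?-----END RSA PUBLIC KEY-----\n',
        '', descriptor, flags=re.DOTALL)

    print("with onion-key, compressed:    %d B" % gzipped_size(descriptor))
    print("without onion-key, compressed: %d B" % gzipped_size(stripped))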
Plus, if you are proposing moving everything (including the `signing-key`s) to ECC, I'm not convinced yet that that is a good idea, especially if we're using only one curve. Putting all your eggs in one basket...
4. Consensus diffs.
With proposal 140, we can save 60% of the directory activity if we send diffs of the consensus for regularly connecting clients. Calculating the benefit from this is complicated, since if clients leave the network for just 16 hours, there is very little benefit to this optimization. These numbers are highly dependent on churn though, and it may be that by removing most of the slow junk relays, there is actually less churn in the network, and smaller diffs: https://gitweb.torproject.org/torspec.git/blob/HEAD:/proposals/140-consensus...
Let's just ballpark it at 50% for the typical case.
Result: 2X reduction in directory size.
Not to mention that, by reducing the bytes used in directory fetches, consensus diffs also raise the "sweet spot" in #2, and thereby increase the number of relays which the network can sustainably maintain.
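As a rough illustration of what proposal 140 buys, one can diff two consecutive hourly consensuses and compare the compressed sizes; the file names below are placeholders for two locally cached copies:

    import difflib
    import zlib

    old = open('consensus-previous-hour').readlines()  # placeholder file names
    new = open('consensus-current-hour').readlines()

    diff = ''.join(difflib.unified_diff(old, new))
    full = ''.join(new)

    print("full consensus, compressed: %d bytes" % len(zlib.compress(full.encode())))
    print("consensus diff, compressed: %d bytes" % len(zlib.compress(diff.encode())))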
5. Invest in the Tor network.
Based purely on extrapolating from the Noisebridge relays, we could add ~300 relays, and double the network capacity for $3M/yr, or about $1 per user per year (based on the user counts from: https://metrics.torproject.org/users.html).
Note that this value should be treated as a minimum estimate. We actually want to ensure diversity as we grow the network, which may make this number higher. I am working on better estimates using replies from: https://lists.torproject.org/pipermail/tor-relays/2014-September/005335.html
Automated donation/funding distribution mechanisms such as https://www.oniontip.com/ are especially interesting ways to do this (and can even automatically enforce our diversity goals) but more traditional partnerships are also possible.
Result: 100% capacity increase for each O($3M/yr), or ~$1 per new user per year.
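The per-user figure is just the yearly cost divided by the daily directly-connecting user count; the user count below is a rough assumption read off the metrics page, so treat the output as a ballpark:

    yearly_cost_usd = 3000000   # ~$3M/yr for ~300 additional high-capacity relays
    daily_users = 2500000       # rough daily directly-connecting users (assumption)

    print("cost per user per year: $%.2f" % (yearly_cost_usd / daily_users))
    # lands in the same ballpark as the ~$1/user/year figure above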