[tor-dev] Scaling tor for a global population
isis at torproject.org
Mon Sep 29 22:01:11 UTC 2014
Mike Perry transcribed 9.3K bytes:
> Andrew Lewman:
> > I had a conversation with a vendor yesterday. They are
> > interested in including Tor as their "private browsing mode" and
> > basically shipping a re-branded tor browser which lets people toggle the
> > connectivity to the Tor network on and off.
> > They very much like Tor Browser and would like to ship it to their
> > customer base. Their product is 10-20% of the global market, out of
> > roughly 2.8 billion global Internet users.
> > As Tor Browser is open source, they are already working on it. However,
> > their concern is scaling up to handle some percent of global users
> > with "tor mode" enabled. They're willing to entertain offering their
> > resources to help us solve the scalability challenges of handling
> > hundreds of millions of users and relays on Tor.
> > As this question keeps popping up from the business world looking at
> > privacy as the next "must have" feature in their products, I'm trying to
> > compile a list of tasks to solve to help us scale. The old 2008
> > three-year roadmap looks at performance,
> > https://www.torproject.org/press/2008-12-19-roadmap-press-release.html.en
> > I've been through the specs,
> > https://gitweb.torproject.org/torspec.git/tree/HEAD:/proposals to see if
> > there are proposals for scaling the network or directory authorities. I
> > didn't see anything directly related.
> > The last research paper I see directly addressing scalability is Torsk
> > (http://www.freehaven.net/anonbib/bibtex.html#ccs09-torsk) or PIR-Tor
> > (http://www.freehaven.net/anonbib/bibtex.html#usenix11-pirtor)
> These research papers basically propose a total network overhaul to deal
> with the problem of Tor relay directory traffic overwhelming the Tor
> network and/or Tor clients.
> However, I believe that with only minor modifications, the current Tor
> network architecture could support 100M daily directly connecting users,
> assuming we focus our efforts on higher capacity relays and not simply
> adding tons of slower relays.
> The core problem is that the fraction of network capacity that you spend
> telling users about the current relays in the network can be written as:
> f = D*U/B
> D is current Tor relay directory size in bytes per day, U is number of
> users, and B is the bandwidth per day in bytes provided by this Tor
> network. Of course, this is a simplification, because of multiple
> directory fetches per day and partially-connecting/idle clients, but for
> purposes of discussion it is good enough.
> To put some real numbers on this, if you compare
> https://metrics.torproject.org/bandwidth.html#dirbytes with
> https://metrics.torproject.org/bandwidth.html#bandwidth, you can see
> that we're currently devoting about 2% of our network throughput to
> directory activity (~120MiB/sec out of ~5000MiB/sec). So we're not
> exactly hurting at this point in terms of our directory bytes per user.
> But, because this fraction rises with both D and U, these research
> papers rightly point out that you can't keep adding relays *and* users
> and expect Tor to scale.
> However, when you look at this f=D*U/B formula, what it also says is
> that if you can reduce the relay directory size by a factor c, and also
> grow the network capacity by this same factor c, then you can multiply
> the userbase by c, and have the same fraction of directory bytes.
> This means that rather than trying to undertake a major network overhaul
> like TorSK or PIR-Tor to try to support hundreds of thousands of slow
> junky relays, we can scale the network by focusing on improving the
> situation for high capacity relay operators, so we can provide more
> network bandwidth for the same number of directory bytes per user.
> So, let's look at ways to reduce the size of the Tor relay directory, and
> each way we can find to do so means a corresponding increase in the
> number of users we can support:
> 1. Proper multicore support.
> Right now, any relay with more than ~100Mbit of capacity really
> needs to run an additional tor relay instance on that link to make
> use of it. If they have AES-NI, this might go up to 300Mbit.
> Each of these additional instances is basically wasted directory
> bytes for those relay descriptors.
> But with proper multicore support, such high capacity relays could
> run only one relay instance on links as fast as 2.5Gbit (assuming an 8
> core AES-NI machine).
> Result: 2-8X reduction in consensus and directory size, depending
> on the number of high capacity relays on multicore systems we have.
> 2. Cut off relays below the median capacity, and turn them into bridges.
> Relays in the top 10% of the network are 164 times faster than
> relays in the 50-60% range, 1400 times faster than relays in the
> 70-80% range, and 35000 times faster than relays in the 90-100% range.
> In fact, many relays are so slow that they provide less bytes to the
> network than it costs to tell all of our users about them. There
> should be a sweet spot where we can set this cutoff such that the
> overhead from directory activity balances the loss of capacity from
> these relays, as a function of userbase size.
> Result: ~2X reduction in consensus and directory size.
It's super frustrating when I publicly tell people that ― as much as we <3
them for running a relay ― doing so on a home connection, on wimpy hardware
like Raspberry Pis, is likely only going to harm the Tor network. And then
people point at "If you have at least 100 kilobytes/s each way, please help
out Tor by configuring your Tor to be a relay" on our website and stop
listening to whatever other relay-running advice I have to give.
So... here's the background on the "sweet spot" Mike was talking about, and
why he stated: "[...]many relays are so slow that they provide less bytes to
the network than it cost to tell all of our users about them.":
Using Stem on my latest copy of the consensus to run some calculations on the
relay advertised bandwidth (RAB), I get:
Average RAB: 3887.222911227154 KB/s
Median RAB: 249.5 KB/s
Combined RAB of all relays with RAB < 249.5 KB/s: 162354 KB/s
Bandwidth used for directory requests: ~125 MB/s
Current total bandwidth usage: ~5700 MB/s
Meaning that, if we cut off all relays below the current median of 250KB/s, we
lose 3064 relays, and lose 158 MB/s of network throughput.
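(For anyone who wants to redo these numbers themselves, something like the
following Stem snippet should get you most of the way there. The cache path,
and defining the RAB as min(rate, burst, observed) from the server
descriptors, are my assumptions here, not necessarily exactly what I ran:)

    from statistics import mean, median
    from stem.descriptor import parse_file

    # Point this at the cached server descriptors in your tor DataDirectory
    # (the path is an assumption).
    descriptors = parse_file('/var/lib/tor/cached-descriptors',
                             descriptor_type='server-descriptor 1.0')

    # One way to define the RAB: min(rate, burst, observed), in KB/s.
    bandwidths = [min(d.average_bandwidth, d.burst_bandwidth,
                      d.observed_bandwidth) / 1024.0 for d in descriptors]

    med = median(bandwidths)
    print("Average RAB: %.1f KB/s" % mean(bandwidths))
    print("Median RAB: %.1f KB/s" % med)
    print("Combined RAB of relays below the median: %.0f KB/s"
          % sum(b for b in bandwidths if b < med))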
Currently, 2.2% of our bandwidth usage goes toward directory requests
(125MB/s / 5700MB/s). If we cut off the relays under 250 KB/s, we cut that
2.2% to 1.1%, saving roughly 75 MB/s in directory requests.
Overall, this means that we can halve the size of the current consensus and,
rather than losing 158 MB/s, we only actually lose 83 MB/s in throughput. We
could easily play with these numbers a bit, and find a "sweet spot" where the
bandwidth cutoff rate is determined by whatever makes us net a positive change
in overall bandwidth, taking directory requests into account. In other words:
"If your relay costs us more to tell users about than the actual traffic it's
providing, we don't want it!"
Long term, I don't think we want to do "only 3000 relays are allowed at any
given time", but instead, a compromise where:
2.a. Have a sliding definition of what a "real internet connection" is, by
modifying the statistics above to find the "sweet spot", and set this as
the cutoff rate for the required minimum bandwidth for being a relay.
2.b. The sliding minimum bandwidth for running a relay is *actually*
enforced. If you're below the minimum, no one's going to stop you from
running your relay, but it's not going to be in the consensus.
Result: Overall network bandwidth stays the same. The size of the current
consensus is roughly chopped in half.
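A rough sketch of how the sliding cutoff in 2.a could be computed, under the
simplifying (and possibly wrong) assumption that directory traffic scales
linearly with the number of relays listed in the consensus:

    def sweet_spot(bandwidths, dir_bw):
        """Return the largest cutoff (KB/s) such that dropping every relay
        slower than it still nets a positive change in usable bandwidth."""
        # Approximate the per-relay directory cost by splitting the current
        # directory traffic evenly across all listed relays.
        dir_cost_per_relay = dir_bw / float(len(bandwidths))
        best = 0
        for cutoff in sorted(set(bandwidths)):
            cut = [b for b in bandwidths if b < cutoff]
            saved = dir_cost_per_relay * len(cut)  # directory bytes no longer served
            lost = sum(cut)                        # relay bandwidth given up
            if saved >= lost:
                best = cutoff
        return best

    # e.g., with `bandwidths` from the Stem snippet above and the ~125 MB/s
    # of directory traffic from the metrics page:
    #     print(sweet_spot(bandwidths, 125 * 1024))

Obviously the cutoff this produces slides around as directory traffic and
relay churn change, which is exactly why 2.a needs to be a *sliding*
definition.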
Also, BridgeDB doesn't want your slow relays as bridges. See Footnote .
> 3. Switching to ECC keys only.
> We're wasting a lot of directory traffic on uncompressible RSA1024
> keys, which are 4X larger than ECC keys, and less secure. Right now,
> were also listing both. When we finally remove RSA1024 entirely, the
> directory should get quite a bit smaller.
> Result: ~2-4X reduction in consensus and directory size.
I'm going to ignore microdescriptors for now: I don't use them, because they're
a Bad Idea (see #5968), and I'm too lazy to go fetch some of them. :)
Mike, you said:
> were [sic] also listing both
Should we assume, then, that you're only talking about the `onion-key`s, but
not the `signing-key`s (which are also currently 1024-bit RSA)?
Also... removing `onion-key`s from the `@type server-descriptor`s would not
result in a "~2-4X reduction in [...] directory size". (It might possibly for
the cached-microdescriptors, but I'm still ignoring those.)
Taking for example a really small server-descriptor (I removed the contact
line and did things like making the bandwidth numbers as small as possible),
and one of the largest server descriptors I could find, then making copies of
each of these descriptors without the `onion-key`s, and then compressing each
one of the four files with `gzip -n -9 $FILE`, I got:
Small server-descriptor, with onion key, compressed: 905 B
Small server-descriptor, without onion key, compressed: 756 B
Large server-descriptor, with onion key, compressed: 1127 B
Large server-descriptor, without onion key, compressed: 980 B
Meaning that, without factoring in potential savings from gzipping multiple
descriptors at a time, cutting out `onion-key`s would result in
server-descriptors which are only 84% - 87% of the size. 13% savings isn't
all that much.
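For anyone who wants to repeat that little experiment, here is roughly what it
amounts to in Python (the descriptor filename is a placeholder; I did mine by
hand with `gzip -n -9`):

    import gzip, re

    # A single relay server descriptor saved to a file (placeholder name).
    desc = open('server-descriptor.txt').read()

    # Strip the 'onion-key' line and the RSA key block which follows it.
    without = re.sub(r'onion-key\n-----BEGIN RSA PUBLIC KEY-----\n.*?'
                     r'-----END RSA PUBLIC KEY-----\n', '', desc,
                     flags=re.DOTALL)

    with_key = len(gzip.compress(desc.encode()))
    without_key = len(gzip.compress(without.encode()))
    print("with onion-key: %d B, without: %d B (%.0f%% of original)"
          % (with_key, without_key, 100.0 * without_key / with_key))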
Plus, if you are proposing moving everything (including the `signing-key`s) to
ECC, I'm not convinced yet that that is a good idea, especially if we're using
only one curve. Putting all your eggs in one basket...
> 4. Consensus diffs.
> With proposal 140, we can save 60% of the directory activity if
> we send diffs of the consensus for regularly connecting clients.
> Calculating the benefit from this is complicated, since if clients
> leave the network for just 16 hours, there is very little benefit
> to this optimization. These numbers are highly dependent on churn
> though, and it may be that by removing most of the slow junk relays,
> there is actually less churn in the network, and smaller diffs:
> Let's just ballpark it at 50% for the typical case.
> Result: 2X reduction in directory size.
Not to mention that, by reducing the bytes used in directory fetches,
consensus diffs also improve the "sweet spot" in #2, thereby raising the
number of relays which the network can sustainably maintain.
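If anyone wants a quick, rough feel for how much a diff could save, something
along these lines works (this uses a plain unified diff rather than whatever
format proposal 140 actually specifies, and the filenames are assumptions):

    import difflib, gzip

    # Two consecutively cached consensuses (filenames are placeholders).
    old = open('cached-consensus.old').readlines()
    new = open('cached-consensus').readlines()

    diff = ''.join(difflib.unified_diff(old, new))

    full_size = len(gzip.compress(''.join(new).encode()))
    diff_size = len(gzip.compress(diff.encode()))
    print("diff is %.0f%% the size of the full consensus"
          % (100.0 * diff_size / full_size))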
> 5. Invest in the Tor network.
> Based purely on extrapolating from the Noisebridge relays, we could
> add ~300 relays, and double the network capacity for $3M/yr, or about $1
> per user per year (based on the user counts from:
> Note that this value should be treated as a minimum estimate. We
> actually want to ensure diversity as we grow the network, which may make
> this number higher. I am working on better estimates using replies from:
> Automated donation/funding distribution mechanisms such as
> https://www.oniontip.com/ are especially interesting ways to do this
> (and can even automatically enforce our diversity goals) but more
> traditional partnerships are also possible.
> Result: 100% capacity increase for each O($3M/yr), or ~$1 per new user
> per year.
♥Ⓐ isis agora lovecruft
Current Keys: https://blog.patternsinthevoid.net/isis.txt