[tor-dev] Scaling tor for a global population
isis at torproject.org
Mon Sep 29 22:01:11 UTC 2014
Mike Perry transcribed 9.3K bytes:
> Andrew Lewman:
> > I had a conversation with a vendor yesterday. They are
> > interested in including Tor as their "private browsing mode" and
> > basically shipping a re-branded tor browser which lets people toggle the
> > connectivity to the Tor network on and off.
> > They very much like Tor Browser and would like to ship it to their
> > customer base. Their product is 10-20% of the global market, out of
> > roughly 2.8 billion global Internet users.
> > As Tor Browser is open source, they are already working on it. However,
> > their concern is scaling up to handle some percent of global users
> > with "tor mode" enabled. They're willing to entertain offering their
> > resources to help us solve the scalability challenges of handling
> > hundreds of millions of users and relays on Tor.
> > As this question keeps popping up from the business world looking at
> > privacy as the next "must have" feature in their products, I'm trying to
> > compile a list of tasks to solve to help us scale. The old 2008
> > three-year roadmap looks at performance,
> > https://www.torproject.org/press/2008-12-19-roadmap-press-release.html.en
> > I've been through the specs,
> > https://gitweb.torproject.org/torspec.git/tree/HEAD:/proposals to see if
> > there are proposals for scaling the network or directory authorities. I
> > didn't see anything directly related.
> > The last research paper I see directly addressing scalability is Torsk
> > (http://www.freehaven.net/anonbib/bibtex.html#ccs09-torsk) or PIR-Tor
> > (http://www.freehaven.net/anonbib/bibtex.html#usenix11-pirtor)
> These research papers basically propose a total network overhaul to deal
> with the problem of Tor relay directory traffic overwhelming the Tor
> network and/or Tor clients.
> However, I believe that with only minor modifications, the current Tor
> network architecture could support 100M daily directly connecting users,
> assuming we focus our efforts on higher capacity relays and not simply
> adding tons of slower relays.
> The core problem is that the fraction of network capacity that you spend
> telling users about the current relays in the network can be written as:
> f = D*U/B
> D is current Tor relay directory size in bytes per day, U is number of
> users, and B is the bandwidth per day in bytes provided by this Tor
> network. Of course, this is a simplification, because of multiple
> directory fetches per day and partially-connecting/idle clients, but for
> purposes of discussion it is good enough.
> To put some real numbers on this, if you compare
> https://metrics.torproject.org/bandwidth.html#dirbytes with
> https://metrics.torproject.org/bandwidth.html#bandwidth, you can see
> that we're currently devoting about 2% of our network throughput to
> directory activity (~120MiB/sec out of ~5000MiB/sec). So we're not
> exactly hurting at this point in terms of our directory bytes per user.
> But, because this fraction rises with both D and U, these research
> papers rightly point out that you can't keep adding relays *and* users
> and expect Tor to scale.
> However, when you look at this f=D*U/B formula, what it also says is
> that if you can reduce the relay directory size by a factor c, and also
> grow the network capacity by this same factor c, then you can multiply
> the userbase by c, and have the same fraction of directory bytes.
> This means that rather than trying to undertake a major network overhaul
> like TorSK or PIR-Tor to try to support hundreds of thousands of slow
> junky relays, we can scale the network by focusing on improving the
> situation for high capacity relay operators, so we can provide more
> network bandwidth for the same number of directory bytes per user.
> So, let's look at ways to reduce the size of the Tor relay directory, and
> each way we can find to do so means a corresponding increase in the
> number of users we can support:
> 1. Proper multicore support.
> Right now, any relay with more than ~100Mbit of capacity really
> needs to run an additional tor relay instance on that link to make
> use of it. If they have AES-NI, this might go up to 300Mbit.
> Each of these additional instances is basically wasted directory
> bytes for those relay descriptors.
> But with proper multicore support, such high capacity relays could
> run only one relay instance on links as fast as 2.5Gbit (assuming an 8
> core AES-NI machine).
> Result: 2-8X reduction in consensus and directory size, depending
> on the number of high capacity relays on multicore systems we have.
> 2. Cut off relays below the median capacity, and turn them into bridges.
> Relays in the top 10% of the network are 164 times faster than
> relays in the 50-60% range, 1400 times faster than relays in the
> 70-80% range, and 35000 times faster than relays in the 90-100% range.
> In fact, many relays are so slow that they provide less bytes to the
> network than it costs to tell all of our users about them. There
> should be a sweet spot where we can set this cutoff such that the
> overhead from directory activity balances the loss of capacity from
> these relays, as a function of userbase size.
> Result: ~2X reduction in consensus and directory size.
It's super frustrating when I publicly tell people that ― as much as we <3
them for running a relay ― doing so on a home connection, on wimpy hardware
like Raspberry Pis, is likely only going to harm the Tor network. And then
people point at "If you have at least 100 kilobytes/s each way, please help
out Tor by configuring your Tor to be a relay" on our website and stop
listening to whatever other relay-running advice I have to give.
So... here's the background on the "sweet spot" Mike was talking about, and
why he stated: "[...]many relays are so slow that they provide less bytes to
the network than it cost to tell all of our users about them.":
Using Stem on my latest copy of the consensus to run some calculations on the
relay advertised bandwidth (RAB), I get:
Average RAB: 3887.222911227154 KB/s
Median RAB: 249.5 KB/s
Combined RAB of all relays with RAB < 249.5 KB/s: 162354 KB/s
Bandwidth used for directory requests: ~125 MB/s
Current total bandwidth usage: ~5700 MB/s
Meaning that, if we cut off all relays below the current median of 250KB/s, we
lose 3064 relays, and lose 158 MB/s of network throughput.
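(For anyone who wants to redo these numbers themselves, something like the
following Stem snippet should get you most of the way there. The cache path,
and defining the RAB as min(rate, burst, observed) from the server
descriptors, are my assumptions here, not necessarily exactly what I ran:)

    from statistics import mean, median
    from stem.descriptor import parse_file

    # Point this at the cached server descriptors in your tor DataDirectory
    # (the path is an assumption).
    descriptors = parse_file('/var/lib/tor/cached-descriptors',
                             descriptor_type='server-descriptor 1.0')

    # One way to define the RAB: min(rate, burst, observed), in KB/s.
    bandwidths = [min(d.average_bandwidth, d.burst_bandwidth,
                      d.observed_bandwidth) / 1024.0 for d in descriptors]

    med = median(bandwidths)
    print("Average RAB: %.1f KB/s" % mean(bandwidths))
    print("Median RAB: %.1f KB/s" % med)
    print("Combined RAB of relays below the median: %.0f KB/s"
          % sum(b for b in bandwidths if b < med))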
Currently, 2.2% of our bandwidth usage goes toward directory requests
(125MB/s / 5700MB/s). If we cut off the relays under 250 KB/s, we cut that
2.2% to 1.1%, saving roughly 75 MB/s in directory requests.
Overall, this means that we can halve the size of the current consensus and,
rather than losing 158 MB/s, we only actually lose 83 MB/s in throughput. We
could easily play with these numbers a bit, and find a "sweet spot" where the
bandwidth cutoff rate is determined by whatever makes us net a positive change
in overall bandwidth, taking directory requests into account. In other words:
"If your relay costs us more to tell users about than the actual traffic it's
providing, we don't want it!"
Long term, I don't think we want to do "only 3000 relays are allowed at any
given time", but instead, a compromise where:
2.a. Have a sliding definition of what a "real internet connection" is, by
modifying the statistics above to find the "sweet spot", and set this as
the cutoff rate for the required minimum bandwidth for being a relay.
2.b. The sliding minimum bandwidth for running a relay is *actually*
enforced. If you're below the minimum, no one's going to stop you from
running your relay, but it's not going to be in the consensus.
Result: Overall network bandwidth stays the same. The size of the current
consensus is roughly chopped in half.
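A rough sketch of how the sliding cutoff in 2.a could be computed, under the
simplifying (and possibly wrong) assumption that directory traffic scales
linearly with the number of relays listed in the consensus:

    def sweet_spot(bandwidths, dir_bw):
        """Return the largest cutoff (KB/s) such that dropping every relay
        slower than it still nets a positive change in usable bandwidth."""
        # Approximate the per-relay directory cost by splitting the current
        # directory traffic evenly across all listed relays.
        dir_cost_per_relay = dir_bw / float(len(bandwidths))
        best = 0
        for cutoff in sorted(set(bandwidths)):
            cut = [b for b in bandwidths if b < cutoff]
            saved = dir_cost_per_relay * len(cut)  # directory bytes no longer served
            lost = sum(cut)                        # relay bandwidth given up
            if saved >= lost:
                best = cutoff
        return best

    # e.g., with `bandwidths` from the Stem snippet above and the ~125 MB/s
    # of directory traffic from the metrics page:
    #     print(sweet_spot(bandwidths, 125 * 1024))

Obviously the cutoff this produces slides around as directory traffic and
relay churn change, which is exactly why 2.a needs to be a *sliding*
definition.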
Also, BridgeDB doesn't want your slow relays as bridges. See Footnote .
> 3. Switching to ECC keys only.
> We're wasting a lot of directory traffic on uncompressible RSA1024
> keys, which are 4X larger than ECC keys, and less secure. Right now,
> were also listing both. When we finally remove RSA1024 entirely, the
> directory should get quite a bit smaller.
> Result: ~2-4X reduction in consensus and directory size.
I'm going to ignore microdescriptors for now: I don't use them, because they're
a Bad Idea (see #5968), and I'm too lazy to go fetch some of them. :)
Mike, you said:
> were [sic] also listing both
Should we assume, then, that you're only talking about the `onion-key`s, but
not the `signing-key`s (which are also currently 1024-bit RSA)?
Also... removing `onion-key`s from the `@type server-descriptor`s would not
result in a "~2-4X reduction in [...] directory size". (It might possibly for
the cached-microdescriptors, but I'm still ignoring those.)
Taking for example a really small server-descriptor (I removed the contact
line and did things like making the bandwidth numbers as small as possible),
and one of the largest server descriptors I could find, then making copies of
each of these descriptors without the `onion-key`s, and then compressing each
one of the four files with `gzip -n -9 $FILE`, I got:
Small server-descriptor, with onion key, compressed: 905 B
Small server-descriptor, without onion key, compressed: 756 B
Large server-descriptor, with onion key, compressed: 1127 B
Large server-descriptor, without onion key, compressed: 980 B
Meaning that, without factoring in potential savings from gzipping multiple
descriptors at a time, cutting out `onion-key`s would result in
server-descriptors which are only 84% - 87% of the size. 13% savings isn't
all that much.
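For anyone who wants to repeat that little experiment, here is roughly what it
amounts to in Python (the descriptor filename is a placeholder; I did mine by
hand with `gzip -n -9`):

    import gzip, re

    # A single relay server descriptor saved to a file (placeholder name).
    desc = open('server-descriptor.txt').read()

    # Strip the 'onion-key' line and the RSA key block which follows it.
    without = re.sub(r'onion-key\n-----BEGIN RSA PUBLIC KEY-----\n.*?'
                     r'-----END RSA PUBLIC KEY-----\n', '', desc,
                     flags=re.DOTALL)

    with_key = len(gzip.compress(desc.encode()))
    without_key = len(gzip.compress(without.encode()))
    print("with onion-key: %d B, without: %d B (%.0f%% of original)"
          % (with_key, without_key, 100.0 * without_key / with_key))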
Plus, if you are proposing moving everything (including the `signing-key`s) to
ECC, I'm not convinced yet that that is a good idea, especially if we're using
only one curve. Putting all your eggs in one basket...
> 4. Consensus diffs.
> With proposal 140, we can save 60% of the directory activity if
> we send diffs of the consensus for regularly connecting clients.
> Calculating the benefit from this is complicated, since if clients
> leave the network for just 16 hours, there is very little benefit
> to this optimization. These numbers are highly dependent on churn
> though, and it may be that by removing most of the slow junk relays,
> there is actually less churn in the network, and smaller diffs:
> Let's just ballpark it at 50% for the typical case.
> Result: 2X reduction in directory size.
Not to mention that, by reducing the bytes used in directory fetches,
consensus diffs also improve the "sweet spot" in #2, thereby raising the
number of relays which the network can sustainably maintain.
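If anyone wants a quick, rough feel for how much a diff could save, something
along these lines works (this uses a plain unified diff rather than whatever
format proposal 140 actually specifies, and the filenames are assumptions):

    import difflib, gzip

    # Two consecutively cached consensuses (filenames are placeholders).
    old = open('cached-consensus.old').readlines()
    new = open('cached-consensus').readlines()

    diff = ''.join(difflib.unified_diff(old, new))

    full_size = len(gzip.compress(''.join(new).encode()))
    diff_size = len(gzip.compress(diff.encode()))
    print("diff is %.0f%% the size of the full consensus"
          % (100.0 * diff_size / full_size))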
> 5. Invest in the Tor network.
> Based purely on extrapolating from the Noisebridge relays, we could
> add ~300 relays, and double the network capacity for $3M/yr, or about $1
> per user per year (based on the user counts from:
> Note that this value should be treated as a minimum estimate. We
> actually want to ensure diversity as we grow the network, which may make
> this number higher. I am working on better estimates using replies from:
> Automated donation/funding distribution mechanisms such as
> https://www.oniontip.com/ are especially interesting ways to do this
> (and can even automatically enforce our diversity goals) but more
> traditional partnerships are also possible.
> Result: 100% capacity increase for each O($3M/yr), or ~$1 per new user
> per year.
♥Ⓐ isis agora lovecruft
Current Keys: https://blog.patternsinthevoid.net/isis.txt