TLDR: what do people do to get the max throughput through their boxes?
Hi,
This might be more tor-dev related (due to the Tor internals, eg why it does not use multiple CPU cores effectively etc), but is likely a bit more appropriate here as there are people who are able to get a lot of performance out of their boxes.
I've been playing a bit with setting up a few relays and letting them push as much traffic as possible following amongst others the items at: https://www.torservers.net/wiki/setup/server
Thus making sure it is using AES-NI (toggling it off and then on made a bit of a difference, but primarily in CPU load), and doing some TCP stack and other kernel tweaks.
I am running the current-git Tor on them, thus self-compiled and except for the install path no special configure options (any tips there?).
The boxes are 2-CPU 6-core E5645 @ 2.40GHz, with HT, thus 24 cores visible. Tor is using about 170% CPU (thus effectively 2 cores) on average along with 3G of mem; the box has 70G of mem, thus that is not a problem.
A little snapshot from 'arm' from one of the boxes:
Bandwidth (limit: 3.9 Gb/s, burst: 3.9 Gb/s, measured: 353.9 Kb/s)
Download (45.7 Mb/sec - avg: 27.2 Mb/sec, total: 302.2 GB)
Upload (52.5 Mb/sec - avg: 27.9 Mb/sec, total: 308.4 GB):
Down/Up varies up to 70mbit. The box has full GE, and between them the boxes can easily push a single-stream 900mbit flow (tested with iperf/wgets/scp) next to the running Tor process. Thus there seems to be some significant issue in the Tor portion of things (though tuning might affect it as there are more flows etc). There is no connection tracking on the box, as that would just slow things down.
See also the munged torrc below, in case there are options to be set there.
What else is there to tune except for maybe running multiple Tor nodes on the same box? Which would require multiple IPs, right? As I understand it, one can only run 2 nodes on the same IP.
Would there maybe be a way to run multiple Tor processes with the same key/identity but with a TCP load-balancer in front of it which distributes the incoming connections to the processes? The only thing then is that only one of them should report their details to the authorities and the others should not publish; would that be possible or would it mess up for instance performance stats?
Greets, Jeroen
--
torrc used:
-----
NickName <nick>
ContactInfo <contact>
MyFamily $<othernode>

ControlPort 9051
HashedControlPassword 16:<pass>
CookieAuthentication 1

DirPort <ip>:<port>
DirPortFrontPage /usr/local/tor/etc/tor/tordirport.html

ORPort <ip>:<port>

RelayBandwidthRate 600 MB
RelayBandwidthBurst 606 MB

SocksListenAddress 127.0.0.1
SocksPort 1080

ExitPolicy reject *:*

#Log debug file /usr/local/tor/var/log/tor/debug.log
Log notice file /usr/local/tor/var/log/tor/notices.log
DataDirectory /usr/local/tor/var/lib/tor

RunAsDaemon 1
DisableDebuggerAttachment 0

CellStatistics 1
DirReqStatistics 1
EntryStatistics 1
ExitPortStatistics 1
ExtraInfoStatistics 1
-----
/etc/sysctl.d/tor.conf:
net.ipv4.tcp_syncookies=1
net.ipv4.tcp_synack_retries=2
net.ipv4.tcp_syn_retries=2
net.core.rmem_max=33554432
net.core.wmem_max=33554432
net.ipv4.tcp_rmem=4096 87380 33554432
net.ipv4.tcp_wmem=4096 65536 33554432
net.core.netdev_max_backlog=262144
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_moderate_rcvbuf=1
net.ipv4.tcp_tw_recycle=1
net.ipv4.tcp_max_orphans=262144
net.ipv4.tcp_max_syn_backlog=262144
net.ipv4.tcp_fin_timeout=4
vm.min_free_kbytes=65536
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_intvl=10
net.ipv4.tcp_keepalive_probes=3
net.ipv4.ip_local_port_range=1025 65530
net.core.somaxconn=20480
net.ipv4.tcp_max_tw_buckets=2000000
net.ipv4.tcp_timestamps=0
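A quick way to confirm that tweaks like these are actually live on the running kernel is to compare the file against /proc/sys. The following is only a rough sketch: it assumes the file has one key=value setting per line, and the function names (parse_sysctl_conf, proc_path, find_mismatches, read_live) are made up for this example.

```python
from pathlib import Path


def parse_sysctl_conf(text):
    """Parse 'key=value' lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Collapse whitespace so "4096 87380" and "4096\t87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key):
    """Map a sysctl name such as net.core.somaxconn to its /proc/sys path."""
    return Path("/proc/sys") / key.replace(".", "/")


def read_live(key):
    """Read the running kernel's value, or None if the key does not exist."""
    try:
        return proc_path(key).read_text()
    except OSError:
        return None


def find_mismatches(wanted, read_value=read_live):
    """Return {key: (wanted, actual)} for every setting that differs."""
    mismatches = {}
    for key, want in wanted.items():
        actual = read_value(key)
        if actual is not None:
            actual = " ".join(actual.split())
        if actual != want:
            mismatches[key] = (want, actual)
    return mismatches
```

Calling find_mismatches(parse_sysctl_conf(text_of_the_file)) returns an empty dict when everything is applied; a key the running kernel does not provide shows up with None as its actual value.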
Hi Jeroen,
On 09/11/2013 02:21 PM, Jeroen Massar wrote:
What else is there to tune except for maybe running multiple Tor nodes on the same box? Which would require multiple IPs, right? As I understand it, one can only run 2 nodes on the same IP.
You will start to see error messages in your logs. It is very unlikely that you will be able to saturate the interface with just one Tor process. Best I've seen is 400 MBit/s per Tor process on modern machines with AES-NI.
How long did you leave your relay up and running? Let me quote from Roger's recent great blog post:
"A new relay, assuming it is reliable and has plenty of bandwidth, goes through four phases: the unmeasured phase (days 0-3) where it gets roughly no use, the remote-measurement phase (days 3-8) where load starts to increase, the ramp-up guard phase (days 8-68) where load counterintuitively drops and then rises higher, and the steady-state guard phase (days 68+)."
https://blog.torproject.org/blog/lifecycle-of-a-new-relay
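To make the quoted day ranges easy to apply to a given relay, here is a tiny sketch that maps uptime in days to the phase named in the post. The function name relay_phase is made up, and treating each boundary day (3, 8, 68) as the start of the next phase is my own assumption, since the quote lists the boundaries in both ranges.

```python
def relay_phase(age_days):
    """Return the lifecycle phase from the quoted blog post for a relay age in days."""
    if age_days < 0:
        raise ValueError("age_days must not be negative")
    if age_days < 3:
        return "unmeasured"
    if age_days < 8:
        return "remote-measurement"
    if age_days < 68:
        return "ramp-up guard"
    return "steady-state guard"
```

For example, a relay that has been up for about two weeks falls in the ramp-up guard phase, where load can still be well below what it will eventually carry.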
Would there maybe be a way to run multiple Tor processes with the same key/identity but with a TCP load-balancer in front of it which distributes the incoming connections to the processes? The only thing then is that only one of them should report their details to the authorities and the others should not publish; would that be possible or would it mess up for instance performance stats?
I have not tried such a thing, and I don't think anyone else has. It's very easy to run and manage multiple Tor processes, and so far every ISP was able to provide more than one IP.
Great effort, thanks! Please keep it up!
On 2013-09-11 16:48 , Moritz Bartl wrote:
Hi Jeroen,
On 09/11/2013 02:21 PM, Jeroen Massar wrote:
What else is there to tune except for maybe running multiple Tor nodes on the same box? Which would require multiple IPs, right? As I understand it, one can only run 2 nodes on the same IP.
You will start to see error messages in your logs.
You mean the endless repeats of:
Sep 10 20:37:42.000 [warn] failed to get unique circID.
Sep 10 20:37:43.000 [warn] No unused circ IDs. Failing.
Sep 10 20:37:43.000 [warn] failed to get unique circID.
Sep 10 20:37:43.000 [warn] No unused circ IDs. Failing.
Sep 10 20:37:43.000 [warn] failed to get unique circID.
? :)
They hit those once in a while, but otherwise it is quite quiet.
It is very unlikely that you will be able to saturate the interface with just one Tor process.
Are we aware why this limit gets reached? Is it because Tor does not use all cores or simply because the instructions needed are maxed out?
Best I've seen is 400 MBit/s per Tor process on modern machines with AES-NI.
50mbit down + up is nowhere there yet.
Are boxes that are doing these speeds running at a CPU or a network cap? Or maybe better asked, do they run at 100% usage of their cores or do they just use two/three cores to the max?
If it is a CPU cap, is it because it is using only one proc or only a few cores, or is it because the AES-NI instruction set is fully loaded, in which case we could have partial requests going through that and others through standard CPU/FPU math.... ? :)
For instance the manning2 node and some others are pushing quite a bit more than 50mbit, what config/setup is that? Or is it really limited to the 68 day thing and that traffic slowly picks up as long as cpu limits and network limits allow?
I would at least expect huge bursts in the meantime from clients that are fast; maybe I have to force a Tor client to use my nodes as multi-hops and do a speed test that way.
Thus: me -> box1 -> box2 -> box3 -> hiddenservice
all under my control so that I know the network properties and am not flooding anything.
How long did you leave your relay up and running? Let me quote from Roger's recent great blog post:
That post is great indeed (saw it earlier today too). The relays have been up (that is, active; I restart them once in a while for version upgrades) for more than two weeks now, thus they have not made it to the 70 days+ range. They still need to gain the 'Named' flag too.
But I am wondering if that will add a lot more traffic to the box or not, seeing that two cores are nearly maxed out by Tor.
Btw, Tor has 5G virtual and 2.6G resident memory in use at the moment...
Another nice artifact since the last update to the 0.2.5.x branch is that the number of NTor handshakes is going up:
Sep 11 04:44:33.000 [notice] Circuit handshake stats since last time: 361258/361258 TAP, 418/418 NTor.
Sep 11 05:44:33.000 [notice] Circuit handshake stats since last time: 434515/434515 TAP, 407/407 NTor.
Sep 11 06:44:33.000 [notice] Circuit handshake stats since last time: 564997/564997 TAP, 467/467 NTor.
Sep 11 07:44:33.000 [notice] Circuit handshake stats since last time: 583556/583556 TAP, 607/607 NTor.
Sep 11 08:44:33.000 [notice] Circuit handshake stats since last time: 925657/929808 TAP, 784/784 NTor.
Sep 11 09:44:33.000 [notice] Circuit handshake stats since last time: 1394250/1600424 TAP, 923/923 NTor.
Sep 11 10:45:22.000 [notice] Circuit handshake stats since last time: 1468992/1487163 TAP, 1117/1117 NTor.
Sep 11 10:45:27.000 [notice] Heartbeat: Tor's uptime is 1 day 0:00 hours, with 54130 circuits open. I've sent 287.59 GB and received 281.98 GB.
Sep 11 10:45:27.000 [notice] Average packaged cell fullness: 99.055%
Sep 11 10:45:27.000 [notice] TLS write overhead: 10%
Sep 11 11:44:30.000 [notice] Circuit handshake stats since last time: 1276400/1746616 TAP, 1181/1182 NTor.
Sep 11 12:44:30.000 [notice] Circuit handshake stats since last time: 1150424/1808004 TAP, 1275/1275 NTor.
Sep 11 13:44:30.000 [notice] Circuit handshake stats since last time: 1095543/1959589 TAP, 1429/1429 NTor.
Sep 11 14:44:30.000 [notice] Circuit handshake stats since last time: 1111451/1971801 TAP, 1500/1501 NTor.
Thus it does push traffic, and more NTor is coming in!
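The handshake lines are easier to read as ratios. Below is a small sketch that pulls the two pairs out of each stats line; STATS_RE and handshake_ratios are names made up here, and reading each pair as handled/requested is my assumption, based on the first number falling behind the second as load goes up.

```python
import re

# Matches one stats line; finditer() is used so this also works on a log
# whose line breaks were lost.
STATS_RE = re.compile(
    r"(?P<when>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d\.\d+) \[notice\] "
    r"Circuit handshake stats since last time: "
    r"(?P<tap_a>\d+)/(?P<tap_b>\d+) TAP, (?P<ntor_a>\d+)/(?P<ntor_b>\d+) NTor\."
)


def ratio(first, second):
    """Fraction first/second, treating an empty period as fully served."""
    return first / second if second else 1.0


def handshake_ratios(log_text):
    """Return one dict per stats line with the TAP and NTor ratios."""
    rows = []
    for m in STATS_RE.finditer(log_text):
        rows.append({
            "when": m["when"],
            "tap": ratio(int(m["tap_a"]), int(m["tap_b"])),
            "ntor": ratio(int(m["ntor_a"]), int(m["ntor_b"])),
        })
    return rows
```

On the log above, the TAP ratio is 1.0 until 07:44 and then drops, ending at about 56% in the 14:44 line, while NTor stays at or very near 100% throughout.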
Would there maybe be a way to run multiple Tor processes with the same key/identity but with a TCP load-balancer in front of it which distributes the incoming connections to the processes? The only thing then is that only one of them should report their details to the authorities and the others should not publish; would that be possible or would it mess up for instance performance stats?
I have not tried such a thing, and I don't think anyone else has.
Seems that if I have some cycles left somewhere next week I'll try to see if I can make something like that work. Especially with a non-AES-NI openssl it might show if we can get extra juice out of a box that way.
I do wonder a bit how the bandwidth measurements work and what data is submitted to the DirAuths; I'll have to read up in the source on that to see if these kinds of submissions could be merged between processes.
It's very easy to run and manage multiple Tor processes, and so far every ISP was able to provide more than one IP.
With gigabit-at-home becoming more common and relays not causing any 'exit' and thus possible abusive/"copyright infringing" annoyance, doing it 'at home' might become doable[1], though I am fairly sure one of those ISPs will slap one with a "fair usage" if one uses it that way ;) But in most of those cases 1 IP is all one will get...
Also, from the view of the MyFamily argument, it is easier and possibly better and clearer for the Tor network to have a single Node than having that Node effectively spread over multiple IPs but actually being the same node.
Greets, Jeroen
[1] though definitely not recommended, as losing your shiny link that way is not what you likely want at home...
On Wed, Sep 11, 2013 at 05:13:04PM +0200, Jeroen Massar wrote:
Are boxes that are doing these speeds running at a CPU or a network cap? Or maybe better asked, do they run at 100% usage of their cores or do they just use two/three cores to the max?
There are three main sinks of CPU usage in a well-configured large Tor relay:
1. doing AES and SHA. This scales with the network bandwidth used.
2. doing Montgomery multiplication for circuit creation requests.
3. bookkeeping.
(4. kernel TCP overhead etc.)
Until the August botnet hit, #1 was the primary user of CPU on most relays. A single Xeon core can do about 150 MB/sec of AES, or with AES-NI around 700 MB/sec.
With the vastly increased circuit creation load currently in progress, #2 and #3 have become a larger problem. The bookkeeping, in particular, has grown significantly. On noisetor right now, 17% of all CPU cycles are being spent in a single bookkeeping routine, circuit_unlink_all_from_channel, according to "perf top".
https://trac.torproject.org/projects/tor/ticket/9683
This increased circuit-create-and-destroy CPU load reduces the CPU available to do useful AES, so as a result, currently many Tor relays are showing increased CPU usage with decreased bandwidth usage.
You'll have trouble getting a single Xeon core to run much more than 300 Mbps even with AES-NI, even without the botnet increasing CPU load without increasing throughput usage. In the current state, with so much extra bignum work and bookkeeping, a single daemon will have even more trouble pushing much bandwidth.
Best practice for maximum bandwidth is to run one Tor daemon per physical core, each on a distinct IP address. Plan for each daemon to push about 15 MByte/sec. They can do more like 20 or 30, but planning for lower leaves some headroom.
Your boxes, with 12 cores and 70 GB of RAM, are quite a bit overpowered for running 500 Mbps of Tor. If you ran a Tor daemon per core, you'd be able to push around 2 Gbps of Tor traffic, easily.
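The arithmetic behind these figures can be written out as a short sketch. The function names are invented for this example, and it assumes 1 MByte/sec equals 8 Mbit/sec and that per-daemon throughput simply adds up across cores, which ignores any shared bottleneck.

```python
import math


def planned_throughput_mbps(daemons, mbyte_per_sec_each):
    """Aggregate throughput in Mbit/s for identical daemons."""
    return daemons * mbyte_per_sec_each * 8


def daemons_needed(link_mbps, mbyte_per_sec_each):
    """Smallest number of daemons whose planned rate covers the link."""
    return math.ceil(link_mbps / (mbyte_per_sec_each * 8))


# 12 physical cores at the conservative 15 MByte/sec planning figure
conservative = planned_throughput_mbps(12, 15)   # 1440 Mbit/s
# 12 physical cores at the 20 MByte/sec a daemon can reach
optimistic = planned_throughput_mbps(12, 20)     # 1920 Mbit/s
# daemons required to fill a 1000 Mbit/s link at 15 MByte/sec each
for_gige = daemons_needed(1000, 15)              # 9
```

So the "around 2 Gbps" figure corresponds to roughly 20 MByte/sec per daemon on 12 cores, and even at the conservative planning rate nine daemons would be enough to fill a Gig-E link.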
-andy
On 2013-09-12 11:06 , Andy Isaacson wrote:
On Wed, Sep 11, 2013 at 05:13:04PM +0200, Jeroen Massar wrote:
Are boxes that are doing these speeds running at a CPU or a network cap? Or maybe better asked, do they run at 100% usage of their cores or do they just use two/three cores to the max?
There are three main sinks of CPU usage in a well-configured large Tor relay:
- doing AES and SHA. This scales with the network bandwidth used.
- doing Montgomery multiplication for circuit creation requests.
- bookkeeping.
(4. kernel TCP overhead etc.)
[..]
Thanks that explains a lot!
Your boxes, with 12 cores and 70 GB of RAM, are quite a bit overpowered for running 500 Mbps of Tor. If you ran a Tor daemon per core, you'd be able to push around 2 Gbps of Tor traffic, easily.
Awesome, that is good to hear, as then it should be able to fill the Gig-E pipe at least theoretically.
As I am trying to avoid using too many IPs (IPv4 is constrained, IPv6 is not, but the latter won't get much traffic), I'll see if I can get my TCP-balancer idea set up in the course of next week (low on spare cycles at the moment) and then force each Tor instance to use a specific core.
At least, incoming should be easy that way; the question more becomes what outgoing traffic will do, especially the bit that sends details to the authorities, I'll see how that works though ;)
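The core-pinning part can be sketched independently of the balancer question. The following only builds a per-instance torrc and a launch command that pins each process with taskset; it does not start anything. The helper names, the relayN nicknames, port 9001, the base directory and the 192.0.2.x addresses are all placeholders of mine, and which logical CPU numbers map to distinct physical cores has to be checked on the box itself.

```python
def render_torrc(nickname, address, or_port, data_dir):
    """Return torrc text for one non-exit relay instance."""
    return "\n".join([
        f"Nickname {nickname}",
        f"ORPort {address}:{or_port}",
        f"DataDirectory {data_dir}",   # each instance needs its own directory
        "SocksPort 0",                 # relay only, no local client port
        "ExitPolicy reject *:*",
        "RunAsDaemon 1",
    ]) + "\n"


def launch_command(core, torrc_path):
    """Command line that starts tor pinned to one CPU via taskset."""
    return ["taskset", "-c", str(core), "tor", "-f", torrc_path]


def plan_instances(addresses, cores, base_dir="/usr/local/tor/var/lib"):
    """Pair each address with a core; return (torrc_text, command) tuples."""
    if len(addresses) != len(cores):
        raise ValueError("need exactly one core per address")
    plan = []
    for index, (address, core) in enumerate(zip(addresses, cores)):
        name = f"relay{index}"
        data_dir = f"{base_dir}/{name}"
        torrc = render_torrc(name, address, 9001, data_dir)
        plan.append((torrc, launch_command(core, f"{data_dir}/torrc")))
    return plan
```

For instance, plan_instances(["192.0.2.10", "192.0.2.11"], [0, 1]) yields two configs with separate data directories, each to be launched on its own core.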
Greets, Jeroen
On 11.09.2013 16:48, Moritz Bartl wrote:
and so far every ISP was able to provide more than one IP.
On a side note:
At least one big hosting provider in Germany (Hetzner) has started to charge money for additional IPv4 addresses. IPv6 addresses, on the other hand, are included in sufficient numbers. ;-)
-Stephan
tor-relays@lists.torproject.org