Pardon the thread necromancy, but I'm wondering if this document ever made its way off this mailing list and onto a blog? Or perhaps there is some other modern doc covering this topic?
I've recently set up a relay on a Gb/s fiber connection, and am struggling to understand how to optimize performance. It's not clear, 5 years later, which (if any) of the tweaks listed below are still relevant. I'm running a modern Debian-based system.
Thanks in advance.
After talking to Moritz and Olaf privately and asking them about their nodes, and after running some experiments with some high capacity relays, I've begun to realize that running a fast Tor relay is a pretty black art, with a lot of ad-hoc practice. Only a few people know how to do it, and if you just use Linux and Tor out of the box, your relay will likely underperform on 100Mbit links and above.

In the interest of trying to help grow and distribute the network, my ultimate plan is to try to collect all of this lore, use Science to divine out what actually matters, and then write a more succinct blog post about it. However, that is a lot of work. It's also not totally necessary to do all this work, when you can get a pretty good setup with a rough superset of all of the ad-hoc voodoo. This post is thus about that voodoo.

Hopefully others will spring forth from the darkness to dump their own voodoo in this thread, as I suspect there is one hell of a lot of it out there, some (much?) of which I don't yet know. Likewise, if any blasphemous heretic wishes to apply Science to this voodoo, they should yell out, "Stand Back, I'm Doing Science!" (at home please, not on this list) and run some experiments to try to eliminate options that are useless to Tor performance. Or cite academic research papers. (But that's not Science, that's computerscience - which is a religion like voodoo, but with cathedrals).

Anyway, on with the draft:
== Machine Specs ==

First, you want to run your OS in x64 mode because openssl should do crypto faster in 64bit.

Tor is currently not fully multithreaded, and tends not to benefit beyond 2 cores per process. Even then, the benefit is still marginal beyond just 1 core.

64bit Tor nodes require about one 2GHz Xeon/Core2 core per 100Mbit of capacity. Thus, to fill an 800Mbit link, you need at least a dual socket, quad core cpu config. You may be able to squeeze a full gigabit out of one of these machines. As far as I know, no one has ever done this with Tor, on any one machine.

The i7's also just came out in this form factor, and can do hyperthreading (previous models may list 'ht' in cpuinfo, but actually don't support it). This should give you a decent bonus if you set NumCPUs to 2, since ht tends to work better with pure integer math (like crypto). We have not benchmarked this config yet, but I suspect it should fill a gigabit link fairly easily, possibly approaching 2Gbit.

At full capacity, exit node Tor processes running at this rate consume about 500M of ram. You want to ensure your ram speed is sufficient, but most newish hardware is good. Using this chart:
https://secure.wikimedia.org/wikipedia/en/wiki/List_of_device_bandwidths#Mem...
you can do the math and see that with a dozen memcpys in each direction, you come out needing DDR2 to be able to push 1Gbit full duplex.

As far as ethernet cards go, the Intel e1000e *should* be theoretically good, but they seem to fail at properly irq balancing across multiple CPUs on recent kernels, which can cause you to bottleneck at 100% CPU on one core. At least that has been Moritz's experience. In our experiments, the RTL-8169 works fine (once tweaked, see below).
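The two rules of thumb in this section (one 2GHz core per 100Mbit, and a dozen memcpys in each direction) reduce to simple arithmetic. Here is a rough shell sketch of it. The function names are made up for illustration, and the constants are just the figures quoted above, not measurements.

```shell
#!/bin/sh
# Back-of-the-envelope sizing from the rules of thumb in this post.
# Both function names are hypothetical helpers, not part of Tor.

# cores_needed LINK_MBIT
# Assumes roughly 100Mbit of Tor traffic per 2GHz core, rounded up.
cores_needed() {
    echo $(( ($1 + 99) / 100 ))
}

# memcpy_mbytes_per_sec LINK_MBIT COPIES_PER_DIRECTION
# Memory traffic in MByte/s if every byte crossing a full-duplex link
# is copied COPIES_PER_DIRECTION times in each of the two directions
# (8 bits per byte).
memcpy_mbytes_per_sec() {
    echo $(( $1 * $2 * 2 / 8 ))
}

cores_needed 800                # prints 8
memcpy_mbytes_per_sec 1000 12   # prints 3000
```

So an 800Mbit link wants 8 cores (the dual socket, quad core config), and 1Gbit full duplex with a dozen copies each way works out to about 3000 MByte/s of memory traffic, which is the number to compare against the chart.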
== System Tweakscript Wibbles and Config Splatters ==

First, you want to ensure that you run no more than 2 Tor instances per IP. Any more than this and clients will ignore them.

Next, paste the following smattering into the shell (or just read it and make your own script):

# Set the hard limit of open file descriptors really high.
# Tor will also potentially run out of ports.
ulimit -SHn 65000

# Set the txqueuelen high, to prevent premature drops
ifconfig eth0 txqueuelen 20000

# Tell our ethernet card (interrupt found from /proc/interrupts)
# to balance its IRQs across one whole CPU socket (4 cpus, mask 0f).
# You only want one socket for optimal ISR and buffer caching.
#
# Note that e1000e does NOT seem to obey this, but RTL-8169 will.
echo 0f > /proc/irq/17/smp_affinity

# Make sure you have auxiliary nameservers. I've seen many ISP
# nameservers fall over under load from fast tor nodes, both on our
# nodes and from scans. Or run caching named and closely monitor it.
echo "nameserver 8.8.8.8" >> /etc/resolv.conf
echo "nameserver 4.2.2.2" >> /etc/resolv.conf

# Load an amalgam of gigabit-tuning sysctls from:
# http://datatag.web.cern.ch/datatag/howto/tcp.html
# http://fasterdata.es.net/TCP-tuning/linux.html
# http://www.acc.umu.se/~maswan/linux-netperf.txt
# http://www.psc.edu/networking/projects/tcptune/#Linux
# and elsewhere...
# We have no idea which of these are needed yet for our actual use
# case, but they do help (especially the nf_conntrack ones):
sysctl -p /dev/stdin << EOF
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.core.netdev_max_backlog = 2500
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.rmem_max = 1048575
net.core.wmem_max = 1048575
net.ipv4.ip_local_port_range = 1025 61000
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.netfilter.nf_conntrack_tcp_timeout_established=7200
net.netfilter.nf_conntrack_checksum=0
net.netfilter.nf_conntrack_max=131072
net.netfilter.nf_conntrack_tcp_timeout_syn_sent=15
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 3
net.ipv4.ip_local_port_range = 1025 65530
net.core.netdev_max_backlog=300000
net.core.somaxconn=20480
net.ipv4.tcp_max_tw_buckets=2000000
net.ipv4.tcp_timestamps=0
vm.min_free_kbytes = 65536
net.ipv4.ip_forward=1
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.conf.default.forwarding=1
net.ipv4.conf.default.proxy_arp = 1
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.send_redirects = 1
net.ipv4.conf.all.send_redirects = 0
EOF

# XXX: ethtool wibbles
# You may also have to tweak some parameters with ethtool, possibly
# also enabling some checksum offloading or irq coalescing options to
# spare CPU, but for us this hasn't been needed yet.
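Because the sysctl list is an amalgam, several keys show up more than once, and since the settings are applied in order, the last occurrence is the one that sticks. A small sketch for seeing what you actually end up with is below. The helper name effective_sysctls is hypothetical, and it only parses text; it does not touch the running kernel.

```shell
#!/bin/sh
# effective_sysctls (hypothetical helper): read "key = value" lines on
# stdin and print one "key = value" line per key, keeping the LAST value
# seen for each key. Keys are printed in order of first appearance.
# Comment lines and blank lines are skipped.
effective_sysctls() {
    awk '
        /^[ \t]*(#|$)/ { next }
        {
            eq = index($0, "=")
            if (eq == 0) next
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (!(key in last)) order[++n] = key
            last[key] = val
        }
        END {
            for (i = 1; i <= n; i++)
                printf "%s = %s\n", order[i], last[order[i]]
        }
    '
}

# Usage (tweaks.conf is a placeholder for wherever you saved the list):
#   effective_sysctls < tweaks.conf
```

Fed the list above, this reports net.ipv4.tcp_keepalive_time as 60 (not 1200), net.ipv4.ip_local_port_range as 1025 65530, net.core.netdev_max_backlog as 300000, and net.ipv4.tcp_synack_retries as 2.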
== Setting up the Torrc ==

Basically you can just read through the stock example torrc, but there are some as-yet undocumented magic options, and options that need new defaults.

# NumCPUs doesn't provide any benefit beyond 2, and setting it higher
# may cause cache misses.
NumCPUs 2

# These options have archaic maximums of 2-5Mbyte
BandwidthRate 100 MB
BandwidthBurst 200 MB
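Note that the torrc values are in megabytes while link speeds are quoted in megabits, so 100 MB corresponds to the 800Mbit link from the Machine Specs section. A minimal sketch of the conversion follows. The function name is hypothetical, and the "burst is twice the rate" choice is an assumption that simply mirrors the 100/200 pair above.

```shell
#!/bin/sh
# torrc_bandwidth LINK_MBIT (hypothetical helper)
# Prints BandwidthRate/BandwidthBurst lines for a link of LINK_MBIT
# megabits per second. Assumptions: 8 bits per byte, and a burst of
# twice the rate, as in the example values in this post.
torrc_bandwidth() {
    rate=$(( $1 / 8 ))
    echo "BandwidthRate $rate MB"
    echo "BandwidthBurst $(( rate * 2 )) MB"
}

# Usage:
#   torrc_bandwidth 800
```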
== Waiting for the Bootstrap and Measurement Process ==

Perhaps the most frustrating part of this setup is how long it takes for you to acquire traffic. If you are starting new at an ISP, I would consider only 200-400Mbit for your first month. Hitting that by the end of the month may be a challenge, mostly because there may be dips and total setbacks along the way.

The slow rampup is primarily due to limitations in Tor's ability to rapidly publish descriptor updates, and to measure relays. It ends up taking about 2-3 days to hit an observed bandwidth of 2Mbyte/sec per relay, but it can take well over a week or more (Moritz, do you have a better number?) to reach 8-9Mbyte/sec per relay. This is for an Exit node. A middle node will likely gather traffic slower. Also, once you crash, you lose it. This bug is about that issue:
https://trac.torproject.org/projects/tor/ticket/1863

There is also a potential dip when you get the Guard flag, as our load balancing formulas try to avoid you, but no clients have chosen you yet. Changes to the authority voting on Guards in Tor 0.2.2.15 should make this less drastic. It is also possible that your observed bandwidth will be greater than without it. However, it will still take up to 2 months for clients to choose you as their new Guard.
== Running temporary auxiliary nodes ==

One way to shortcut this process and avoid paying for bandwidth you don't use is to spin up a bunch of temporary nodes to utilize the CPU and quickly gather that easy first 2MB/sec of observed bandwidth. But you need the spare IPs to do this.
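To get a feel for how many spare IPs that means, here is a rough sketch combining two figures from this post: the easy first 2MByte/sec (16Mbit) per node, and the limit of 2 Tor instances per IP. The function names are hypothetical and the result is only an estimate, since real nodes will not all ramp up evenly.

```shell
#!/bin/sh
# Hypothetical helpers for estimating temporary node and IP counts.

# aux_nodes_needed TARGET_MBIT
# Number of nodes needed if each one reaches 2MByte/sec = 16Mbit,
# rounded up.
aux_nodes_needed() {
    echo $(( ($1 + 15) / 16 ))
}

# ips_needed NODES
# At most 2 Tor instances per IP, rounded up.
ips_needed() {
    echo $(( ($1 + 1) / 2 ))
}

# Usage: nodes and IPs to soak up 200Mbit early on
#   n=$(aux_nodes_needed 200); echo "$n nodes, $(ips_needed "$n") IPs"
```

For a 200Mbit target this comes out to 13 nodes on 7 IPs.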
== Monitoring ==

Personally, I prefer console-based options like nload, top, and Damian's arm (http://www.atagar.com/arm/) because I don't like the idea of running extra services to publish my monitoring data to the world.

Other people have web-based monitoring using things like munin and mrtg. It would be nice to get a script/howto for that too.
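If you just want a throughput number on the console without installing anything, a minimal sketch along these lines may do. It assumes a Linux system that exposes per-interface byte counters under /sys/class/net; the function names are made up, and it is no substitute for nload or arm.

```shell
#!/bin/sh
# rate_mbit BYTES_BEFORE BYTES_AFTER SECONDS (hypothetical helper)
# Average rate in whole Mbit/s between two byte-counter readings.
rate_mbit() {
    echo $(( ($2 - $1) * 8 / $3 / 1000000 ))
}

# watch_iface IFACE [INTERVAL_SECONDS] (hypothetical helper)
# Prints rx/tx rates for IFACE every INTERVAL_SECONDS (default 5).
# Assumes the Linux sysfs counters rx_bytes and tx_bytes exist.
watch_iface() {
    iface=$1
    interval=${2:-5}
    stats=/sys/class/net/$iface/statistics
    while :; do
        rx0=$(cat "$stats/rx_bytes"); tx0=$(cat "$stats/tx_bytes")
        sleep "$interval"
        rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
        echo "$iface rx $(rate_mbit "$rx0" "$rx1" "$interval") Mbit/s," \
             "tx $(rate_mbit "$tx0" "$tx1" "$interval") Mbit/s"
    done
}

# Usage:
#   watch_iface eth0 10
```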
== Current 1-Box Capacity Record ==

Our setup has topped at 450Mbit, but averages between 300-400Mbit. We are currently having uptime issues due to heat (melting, poorly ventilated hard drives). It is likely that once we resolve this, we will continually increase to our CPU ceiling.

I believe Moritz and Olaf also push this much capacity, possibly a bit more, but with fewer nodes (4 as opposed to our 8). I hear Jake is also ramping up some Guard nodes (or maybe I didn't? Did I just betray you again Jake?)
== Did I leave anything out? ==

Well, did I?
-- 
Mike Perry
Mad Computer Scientist
fscked.org evil labs