Can someone give me hints on what hardware would be best suited to run big fat Tor exit nodes connected with multiple 1 Gbps or 10 Gbps links? We are considering putting some fat boxes near major internet exchanges of the world.
sent from iPhone
On Thu, Jul 11, 2013 at 08:46:20PM +0200, Andreas Fink wrote:
Can someone give me hints on what hardware would be best suited to run big fat Tor exit nodes connected with multiple 1 Gbps or 10 Gbps links? We are considering putting some fat boxes near major internet exchanges of the world.
Modern Xeon. AES-NI is helpful; HT is not very helpful (but not hurtful either); a higher clock rate is more helpful than more cores. Figure on 4GB of RAM per core; you can probably get away with 2GB/core, but why skimp. Noisetor uses most of a 4-core X3350 2.6 GHz to push ~500 Mbps symmetric. That's without AES-NI, so I'd expect a quad-core 2.5 GHz part with AES-NI to be able to fill a 1 Gbps pipe.
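If you want a quick sanity check on a candidate box (assuming Linux and OpenSSL; this only bounds the crypto, it is not a Tor throughput number):

  grep -m1 -o aes /proc/cpuinfo      # prints "aes" if the CPU advertises AES-NI
  openssl speed aes-128-cbc          # software AES path
  openssl speed -evp aes-128-cbc     # EVP path, which uses AES-NI when available

Comparing the last two gives a rough idea of how much AES-NI buys you on that particular machine.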
AMD doesn't seem to make any server CPUs that are useful for this application, unfortunately.
-andy
Andy Isaacson:
On Thu, Jul 11, 2013 at 08:46:20PM +0200, Andreas Fink wrote:
Can someone give me hints on what hardware would be best suited to run big fat Tor exit nodes connected with multiple 1 Gbps or 10 Gbps links? We are considering putting some fat boxes near major internet exchanges of the world.
Modern Xeon. AES-NI is helpful; HT is not very helpful (but not hurtful either); a higher clock rate is more helpful than more cores. Figure on 4GB of RAM per core; you can probably get away with 2GB/core, but why skimp. Noisetor uses most of a 4-core X3350 2.6 GHz to push ~500 Mbps symmetric. That's without AES-NI, so I'd expect a quad-core 2.5 GHz part with AES-NI to be able to fill a 1 Gbps pipe.
This sounds right (~100Mbit per CPU core without AES-NI), but it would be good to hear Moritz weigh in here with some additional datapoints for AES-NI. Last I heard, AES-NI gets you ~300Mbit per core, but I have no direct experience myself.
The key thing to know is that Tor is still not great at multithreading. In fact, the torrc option 'NumCPUs' is mostly useless for relays at this scale.
For this reason, you want to run one tor daemon per CPU core, with a max of two per IP, and something like 2-4GB of RAM per daemon, as Andy said (see the rough torrc sketch below). That's why we have noiseexit01a-d, Amunet1-8, manning1-2, etc.
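To make the multiple-daemons point concrete, here is a minimal sketch of two tor instances sharing one IP; the paths, nicknames, and ports are made up, so adjust to your setup and add ContactInfo and your exit policy as appropriate:

  # /etc/tor/torrc.1 -- first instance
  DataDirectory /var/lib/tor1
  Nickname MyExit1
  ORPort 443
  DirPort 80
  SocksPort 0

  # /etc/tor/torrc.2 -- second instance on the same IP
  DataDirectory /var/lib/tor2
  Nickname MyExit2
  ORPort 9001
  DirPort 9030
  SocksPort 0

Then start each daemon against its own config, e.g. tor -f /etc/tor/torrc.1.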
You probably also shouldn't run too many relays of this size by yourself, either. It is generally considered poor form to run too much of the Tor network yourself until other people can catch up and balance your efforts. For this reason, I would look for ways to decentralize/delegate once you get beyond a couple of gigabits. Please feel free to ask the list for suggestions on legal and admin structure for accomplishing this.
On 12.07.2013 08:02, Mike Perry wrote:
This sounds right (~100Mbit per CPU core without AES-NI), but it would be good to hear Moritz weigh in here with some additional datapoints for AES-NI. Last I heard, AES-NI gets you ~300Mbit per core, but I have no direct experience myself.
Yes, 300Mbit/s is my estimate as well. We publish some amount of statistics at https://torservers.net/munin/ (axigy1, axigy2 and voxility1 are on Gbit).
You probably also shouldn't run too many relays of this size by yourself, either. It is generally considered poor form to run too much of the Tor network yourself until other people can catch up and balance your efforts. For this reason, I would look for ways to decentralize/delegate once you get beyond a couple of gigabits. Please feel free to ask the list for suggestions on legal and admin structure for accomplishing this.
Happy to help! If you have any questions you can also reach me via Jabber (same address as this email address).
AMD doesn't seem to make any server CPUs that are useful for this application, unfortunately.
Really, how so? Many AMD CPUs have AES-NI. Even the A10-6800K (4 x 4.1GHz) would be decent. That plus an A85X mainboard (1 Gbit NIC) and 8GB of DDR3-2133 is about $300. Add a case and power supply.
https://en.wikipedia.org/wiki/List_of_AMD_Accelerated_Processing_Unit_microp... https://en.wikipedia.org/wiki/List_of_AMD_FX_microprocessors#.22Vishera.22_.... https://en.wikipedia.org/wiki/List_of_AMD_Opteron_microprocessors#Piledriver...
Be careful: Intel likes to promote HT instead of full cores. There are likely some reviews of HT vs. OS schedulers out there. Intel and Opteron server parts get pricey very quickly; check price vs. performance vs. the price delta of bringing up a second AMD node at 2 x n Mbps. A lot of the price may be in die/power and cache, which may not be of concern.
On Fri, Jul 12, 2013 at 06:45:58PM -0400, grarpamp wrote:
AMD doesn't seem to make any server CPUs that are useful for this application, unfortunately.
Really, how so? Many AMD CPUs have AES-NI. Even the A10-6800K (4 x 4.1GHz) would be decent.
That's not a server CPU. It doesn't seem to support ECC, and it doesn't go in boards that are well designed for server applications (with things like serial console BIOS support and a 1U form factor).
That plus an A85X mainboard (1 Gbit NIC)
That cheap desktop board has a Realtek NIC. Realtek NICs are spectacularly bad for server use cases. We were not able to push 400 Mbps of Tor traffic through a Realtek, possibly because the r8139 (iirc) chip/driver lacks interrupt coalescing features. Upgrading to an Intel e1000e fixed the problem. Broadcom tg3 also works fine. The newer Broadcom (nx or something?) chips should also be fine.
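If you're debugging this kind of thing, it's worth checking what the driver actually supports (eth0 is a placeholder, and not every driver exposes these knobs):

  ethtool -i eth0                 # which driver is bound (e1000e, tg3, r8169, ...)
  ethtool -c eth0                 # current interrupt coalescing settings, if any
  ethtool -C eth0 rx-usecs 100    # example tweak; valid values vary by driver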
and 8GB of DDR3-2133 is about $300.
It's very silly to not specify ECC RAM for a server.
Add a case and power supply.
The kind of ISPs that offer competitive pricing on bandwidth tend to prefer commercially integrated servers, preferably sourced from vendors they're familiar with. That way when your server crashes and needs a reboot at 2AM, the tech in the data center doesn't have to puzzle out the buttons and connectors on some utterly random box that you found on a street corner.
Be careful, Intel likes to promote HT instead of full cores.
That's a really funny claim, since exactly the opposite seems true from my point of view. Intel clearly specifies how many cores and how many HTs are provided on each CPU, and a single thread can use nearly all of the resources on a core. HT is useful on Sandy Bridge for providing fine-grained parallelism that lets the CPU get useful work done during cache-miss stalls and the like, but HT is not necessary to get full ALU utilization for in-cache code. AMD Bulldozer, OTOH, claims to have 8 cores, but they come in "bundles" of 2, and the "8 core" Bulldozer has approximately the same number of ALUs and other CPU resources as the 4-core Intel chips. As a result, each individual Bulldozer "core" (really more like a HT on Sandy Bridge) is fairly slow in terms of operations per clock, and AMD's resource scheduler doesn't seem to be very good at dynamic resource allocation.
The end result of this threading nonsense is that on an Intel CPU you can get 90% of the CPU throughput doing useful work with 4 threads, while on a Bulldozer you need 8 threads to get 90% throughput. For Tor, that means 8 daemons rather than 4, which is significantly more annoying.
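A quick way to see what you're actually getting on a Linux box is to compare real cores to hardware threads:

  lscpu | egrep 'Socket|Core|Thread|^CPU\(s\)'

On a 4-core/8-thread Intel part you should see 4 cores per socket and 2 threads per core.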
Making matters worse, Bulldozer has at least a 20%, if not more like 30%, power penalty versus Sandy Bridge, measuring actual work done per watt on CPU-intensive workloads.
That's why I said AMD unfortunately doesn't seem to have a competitive server CPU these days. It's possible that Piledriver improves the situation, but the analysis I saw did not make me optimistic that it would be competitive with Ivy Bridge.
-andy
A10-6800K (4 x 4.1GHz) would be decent.
It doesn't seem to support ECC
It doesn't. And for those who recognize its importance, that's been something of a weakness of AMD's for some time. Actually, for both AMD and Intel, it's treated as a price premium instead of just 8+n extra gates and logic.
It's very silly to not specify ECC RAM for a server.
It should be in all systems IMO too, even with crypto over circuits and ZFS on disk. But I'm not sure the CPU registers, or the rest of the gates on the die, have any such hardware protection beyond the lid on top and the fiberglass on the bottom... I haven't looked into it. Nor am I sure that the Internet and its services as a whole use ECC. Tor seems more weighted toward pushing reproducible bits than toward the initial production and storage of your own valuable work product.
like serial console BIOS support and
It's nice if your model involves needing to get into the BIOS, such as being tied to hardware RAID configured there.
1U form factor
The port stacks can be cut down or removed if need be. Another problem is finding low-profile RAM to go in non-angled slots. The 'enthusiast' influx into the RAM market is annoying, but the unneeded 'heatsinks' (a.k.a. bling) can also be removed.
the r8139 (iirc) chip/driver lacking interrupt coalescing features. Upgrading to an Intel e1000e fixed the problem.
Yes, Intel has always made very good NICs and supplies docs/code. I think A85X boards commonly have the Realtek RTL8111.
The kind of ISPs that offer competitive pricing on bandwidth tend to prefer commercially integrated servers, preferably sourced from vendors they're familiar with.
That means Intel, Dell, HP and the like. I'd hesitate to tie bandwidth price to what some techs want. It's more about how much bandwidth the ISP buys, whether you're a small retail customer using remote hands or a large self-serve cage, and whether it's in a carrier-neutral DC.
That way when your server crashes and needs a reboot at 2AM, the tech in the data center doesn't
These concerns are a bit crossed. A true server board should have a hardware watchdog that will automatically reboot a hung OS. Similarly, with a good server, pulling the plug should be all a tech needs to do to clear any stuck case short of hardware failure. You need a good OS/FS for that.
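For reference, a minimal sketch of wiring that up on a Debian-ish Linux box (module and package names are assumptions; a real server's BMC/IPMI watchdog is preferable to the software fallback):

  modprobe iTCO_wdt || modprobe softdog   # hardware watchdog driver if present, else software fallback
  apt-get install watchdog                # userspace daemon that feeds /dev/watchdog
  # /etc/watchdog.conf (minimal):
  #   watchdog-device = /dev/watchdog
  #   interval        = 10

If the daemon stops feeding the device, the watchdog resets the box without anyone needing remote hands.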
Be careful, Intel likes to promote HT instead of full cores.
That's a really funny claim
Some of the retail marketing I've seen is not so clear, along the lines of 'here are 8 virtual things', with an asterisk and fine print saying they're really backed by 4 real things. Their server-side material online is better.
a single thread can use nearly all of the resources on a core. HT is useful
Sure, as long as your workload isn't weighted toward single threads/processes that you can itemize and keep under the number of real cores.
AMD Bulldozer, OTOH, claims to have [unit pairing/starvation]
I do often forget that overall when considering some loads :(
actual work done per watt on CPU intensive workloads.
Intel often beats AMD on power, even if only due to die process size. Datacenters tend not to line-item 1RU boxes for actual power used unless you're at a 1/4 rack or more and they have metered, managed outlets, the costs of which are passed on to you as well. As is wasted power. In the end, for small customers, the DC averages everything into a single price, into which you fit whatever box you want. A home or corporate DC is different because you can mind your power there for direct, immediate savings.
AMD unfortunately doesn't seem to have a competitive server CPU these days. It's possible that Piledriver improves the situation, but the analysis I saw did not make me optimistic that it would be competitive with Ivy Bridge.
I didn't mean to imply that such a system was true physical server class, but that it could serve the logical function of a decent server/host for the Tor application, particularly considering the cost per node per megabit as opposed to maximum yield regardless of cost. Real server hardware tends to start well above $1k USD. Factoring in the performance differences and passing on some features, deploying a lower-cost box may be a win there. As a counter to that idea, it might also be useful to consider initial cost vs. megabits delivered over time... if you do not expect to be kicked from DC to DC, eating multiple shipping costs as add-on capital.