-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
On 12/13/2011 10:34 PM, Mike Perry wrote:
Thus spake Tim Wilde (twilde@cymru.com):
We're not seeing source port exhaustion, but we are seeing two warns, one of which I haven't been able to nail down:
2011 Dec 13 20:22:07.000|[notice] We stalled too much while trying to write 8542 bytes to address "[scrubbed]". If this happens a lot, either something is wrong with your network connection, or something is wrong with theirs. (fd 409, type Directory, state 2, marked at main.c:990).
Hrm.. Haven't seen this one before...
I've seen LOTS of them. :) When I turned off SafeLogging briefly to see what the scrubbed addresses were it turns out they seem to be the dir auths, if that helps:
2011 Dec 14 16:53:17.000|[notice] We stalled too much while trying to write 273 bytes to address "193.23.244.244". If this happens a lot, either something is wrong with your network connection, or something is wrong with theirs. (fd 4108, type Directory, state 2, marked at main.c:990).
I haven't seen any examples of that error message that were not to a dir auth.
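For anyone who wants to repeat that check on their own relay, here is a minimal sketch that tallies the stalled-write notices by destination address. It assumes the notice wording matches the lines quoted above; the function name `count_stalls` is made up for illustration, and with SafeLogging on every hit will just be counted under "[scrubbed]".

```python
import re
from collections import Counter

# Matches the notice text quoted above and captures the byte count
# and whatever appears between the quotes as the address.
STALL_RE = re.compile(
    r'We stalled too much while trying to write (\d+) bytes to address "([^"]+)"'
)


def count_stalls(lines):
    """Return a Counter mapping address -> number of stalled-write notices."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Usage sketch (the log path is an assumption, adjust for your setup):
#   with open("/var/log/tor/notices.log") as log:
#       for address, n in count_stalls(log).most_common():
#           print(n, address)
```

If the resulting addresses are all directory authorities, that matches what I saw here.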
2011 Dec 13 22:26:45.000|[warn] Your computer is too slow to handle this many circuit creation requests! Please consider using the MaxAdvertisedBandwidth config option or choosing a more restricted exit policy. [18 similar message(s) suppressed in last 60 seconds]
Ah, we should be handling this issue with the fix for #1984: https://trac.torproject.org/projects/tor/ticket/1984
The second warn I figure I should be tuning myself with MaxAdvertisedBandwidth, and it's happening on BigBoy, the relay on this box that's doing the majority of its bandwidth. So I'm not sure if it's anything that your feedback loop should be involved in or not.
It's a shame this log message makes such a crazy recommendation wrt MaxAdvertisedBandwidth. But I guess some tweak is better than no tweak. Hopefully we can make this go away without you needing to lower it, though. Can you ping me on IRC if you keep getting these warns after leaving MaxAdvertisedBandwidth alone?
Will do.
This sounds incredibly familiar. What ethernet card + driver version do you have? Some combos are pretty abysmal about IRQ load balancing and interrupt optimizations, or at least they were on old kernels (which may still apply if you are on CentOS).
Yeah, some further investigation today indicates that may be the case. :| Running Intel PRO/1000s with the latest E1000E driver (1.6.3-NAPI), I do in fact see what looks like potential interrupt load issues. I've split the relays across onto another NIC to see if that helps at all in the relatively short term. Long term, it looks like a migration away from CentOS is called for here (it's good for some things, but not this :)). I also rediscovered that the receive packet/flow steering in 2.6.35+ kernels is one of the torservers.net optimizations I haven't done (and can't do on CentOS without a manual kernel installation, due to the 2.6.32 kernel it ships with even in 6.0). So it looks like Debian is in this box's future (though I'll try to remember to keep my keys this time :)).
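One rough way to look for the interrupt imbalance described above is to sum the per-CPU counters in /proc/interrupts for a given NIC. This is only a sketch under assumptions: the helper name `irq_counts_by_cpu` is made up, the interface name `eth0` is a placeholder, and it expects the usual layout of a CPU header row followed by one row per IRQ.

```python
def irq_counts_by_cpu(interrupts_text, device):
    """Sum per-CPU interrupt counts over every IRQ row that mentions `device`."""
    lines = interrupts_text.splitlines()
    if not lines:
        return []
    num_cpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * num_cpus
    for line in lines[1:]:
        fields = line.split()
        # IRQ rows: "<irq>:", one count per CPU, controller type, device name(s).
        if len(fields) < num_cpus + 2 or not fields[0].endswith(":"):
            continue
        if not any(device in name for name in fields[num_cpus + 1:]):
            continue
        per_cpu = fields[1:num_cpus + 1]
        if not all(count.isdigit() for count in per_cpu):
            continue
        for cpu, count in enumerate(per_cpu):
            totals[cpu] += int(count)
    return totals


# Usage sketch on a Linux box ("eth0" is a placeholder interface name):
#   with open("/proc/interrupts") as f:
#       print(irq_counts_by_cpu(f.read(), "eth0"))
```

If nearly all of the counts for a NIC land on one CPU, that would be consistent with the interrupt load problem above, though it is not proof on its own.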
Thanks for the help and suggestions. If I can provide any more info on the stalled writes to the dir auths, let me know.
Thanks, Tim
- -- Tim Wilde, Senior Software Engineer, Team Cymru, Inc. twilde@cymru.com | +1-847-378-3333 | http://www.team-cymru.org/