URGENT: patch needed ASAP for authority bug

Sebastian Hahn mail at sebastianhahn.net
Thu Apr 15 12:59:41 UTC 2010


Hi Scott,

no reason to panic currently. I've cc'ed Mike and Nick here, in case they
can better explain what is going on.

On Apr 15, 2010, at 2:42 PM, Scott Bennett wrote:
>     I believe I spotted an authority bug with pretty severe consequences
> this a.m.  It is having seriously bad effect on the star heavyweight node
> of the tor network, Olaf Selke's blutmagie.  I can't submit a PR for it
> due to the flyspray web page's problems with letting me log in, and Olaf
> wrote me that he's at work at the moment and can't submit a PR until he
> gets home after work.  So please read on, and if someone would please
> submit an urgent PR for this, we (and probably others) would appreciate it.
> If you do, please shoot a note off to Olaf <olaf.selke at blutmagie.de> to
> let him know about it, so he won't submit a duplicate PR.  I don't think
> a fix for this one should wait for the next release.  Instead, patches for
> both "stable" and "alpha" branches should be made available to authority
> operators as soon as someone can come up with them.  (Only the authorities
> need to be fixed right away because the bug is somewhere in the authority
> code for generating consensus entries.)
>     Here's what I found.  blutmagie's torrc is set up for a target
> throughput rate of 18000 KB/s and a maximum burst rate of 24000 KB/s.
> Olaf noticed that blutmagie was being swamped by a horrendous load of
> incoming connections nearly all the time, so he tried using
> MaxAdvertisedBandwidth to reduce the frequency of inbound connections.
> He repeatedly lowered the maximum advertised rate, and blutmagie's
> descriptor correctly reflects that, now showing a target rate of 2000 KB/s,
> but the connection rate showed no apparent change.  He recently began
> reporting this trouble on OR-TALK, IIRC, but no one seemed to know why the
> limit on the advertised target rate, even when set so low compared to the
> actual rate and also compared to the rates published by other heavyweight
> nodes, didn't reduce the load.
>     The problem lies in the consensus document, where it shows (or did
> an hour or so ago),
>
> w Bandwidth=27900
>
> Note that 27900 KB/s is considerably higher than the maximum burst rate
> in the descriptor and is 13.95 times the supposed maximum advertised rate.
> That means that, while old client versions that use the values in the
> descriptors in their route selection process will probably honor the maximum
> advertised rate of 2000 KB/s, newer clients use the rate in the consensus,
> 27900 KB/s, in theirs, thus continuing to drown blutmagie in an ongoing
> flood of incoming connections.
>     The authorities are currently disregarding the limit published in every
> node's descriptor and instead are conjuring up their own numbers.  This needs
> to stop and right away.

The value in the consensus is not an actual bandwidth; it is a bandwidth
weight that clients use for load balancing. The directory authorities
determine it automatically from active measurements of each node's
capacity, so that load is distributed more evenly. Blutmagie has huge
capacity and a lot of unused bandwidth, so it gets a big share of the
network. I have warned that this might lead to sad consequences, because
available bandwidth is not the only factor that determines how much
traffic a node can handle. Other things need to be taken into account
too (the number of circuits that must be established, the higher memory
requirements of servicing lots of connections compared to the single
connection the bandwidth scanner uses, and the higher overhead of
handling more connections).
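
To illustrate what "bandwidth weight" means for load balancing, here is a
rough Python sketch of weight-proportional relay selection. It is not Tor's
actual path selection code, which also takes flags and position into
account. Only the 27900 value comes from this thread; the other relay names
and weights are made up for the example.

```python
import random

# Consensus weights ("w Bandwidth=" values). Only blutmagie's 27900 is
# taken from the thread; the other relays and values are hypothetical.
consensus_weights = {
    "blutmagie": 27900,
    "relay_a": 5000,
    "relay_b": 1200,
    "relay_c": 300,
}


def selection_probabilities(weights):
    """Return each relay's chance of being picked, proportional to weight."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def pick_relay(weights, rng):
    """Pick one relay at random, weighted by its consensus weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


probs = selection_probabilities(consensus_weights)

# Simulate many independent client choices with a fixed seed.
rng = random.Random(0)
picks = [pick_relay(consensus_weights, rng) for _ in range(10000)]

for name in consensus_weights:
    print(f"{name}: expected {probs[name]:.3f}, "
          f"observed {picks.count(name) / len(picks):.3f}")
```

In this toy network the heavily weighted relay draws most of the client
choices, no matter what its own descriptor advertises.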

Another side-effect is that limiting your bandwidth via MaxAdvertised*
options is no longer viable, because the active measurements are
affecting circuit building, not the passive advertised values. This has
bad consequences for everyone who tries to attract few clients, but
has lots of bandwidth (we're seeing the problem on a few vservers as
well).
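
As a back-of-the-envelope illustration of why lowering the advertised value
no longer helps, the following Python sketch compares the share a relay
would get if clients weighted by the descriptor value (2000) with the share
it gets from the consensus weight (27900). The total for the rest of the
network is a hypothetical number, and for simplicity it is kept the same in
both cases, which would not hold exactly in practice.

```python
def share(relay_value, rest_of_network):
    """Fraction of weighted selections that land on one relay."""
    return relay_value / (relay_value + rest_of_network)


# Hypothetical combined weight of all other relays (an assumption, not a
# figure from this thread).
REST_OF_NETWORK = 500_000

ADVERTISED = 2000   # value in blutmagie's descriptor, per the thread
CONSENSUS = 27900   # "w Bandwidth=" value in the consensus, per the thread

share_from_descriptor = share(ADVERTISED, REST_OF_NETWORK)
share_from_consensus = share(CONSENSUS, REST_OF_NETWORK)
ratio = CONSENSUS / ADVERTISED

print(f"share using descriptor value: {share_from_descriptor:.4f}")
print(f"share using consensus weight: {share_from_consensus:.4f}")
print(f"consensus weight / advertised value: {ratio:.2f}")
```

Clients that read the consensus weight never see the lower descriptor value,
so changing MaxAdvertisedBandwidth does not move the second number.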

I'm not sure what can be done about this, because measuring
bandwidth is easy and has led to dramatic speed increases in the
network for people running the 0.2.2.x versions (only those use the
bandwidth weights currently, afaik); whereas measuring a node's
capacity to deal with massive amounts of connections is not trivial.

Something that might or might not figure into this is that newly started
Tor clients do active speed tests, building test circuits for the first
~hour and a half to find a good value for timing out slow circuits. These
additional circuits might explain a generally higher load on the relays,
but I'm not sure about this here.

So, to summarize: There is currently no bug in the authority code; the
authorities are working as intended. I'm waiting for Mike's further input
here to see if we need to, or can, do something about the trouble it seems
to create for blutmagie.

Sebastian

More information about the tor-relays mailing list