URGENT: patch needed ASAP for authority bug

Scott Bennett bennett at cs.niu.edu
Fri Apr 16 05:45:44 UTC 2010


     On Thu, 15 Apr 2010 14:59:41 +0200 Sebastian Hahn <mail at sebastianhahn.net>
wrote:
>no reason to panic currently. I've cc'ed Mike and Nick here, in case
>they can better explain what is going on.
>
>On Apr 15, 2010, at 2:42 PM, Scott Bennett wrote:
>>     I believe I spotted an authority bug with pretty severe
>> consequences this a.m.  It is having a seriously bad effect on the
>> star heavyweight node of the tor network, Olaf Selke's blutmagie.  I
>> can't submit a PR for it due to the flyspray web page's problems with
>> letting me log in, and Olaf wrote me that he's at work at the moment
>> and can't submit a PR until he gets home after work.  So please read
>> on, and if someone would please submit an urgent PR for this, we (and
>> probably others) would appreciate it.  If you do, please shoot a note
>> off to Olaf <olaf.selke at blutmagie.de> to let him know about it, so
>> he won't submit a duplicate PR.  I don't think a fix for this one
>> should wait for the next release.  Instead, patches for both "stable"
>> and "alpha" branches should be made available to authority operators
>> as soon as someone can come up with them.  (Only the authorities need
>> to be fixed right away because the bug is somewhere in the authority
>> code for generating consensus entries.)
>>     Here's what I found.  blutmagie's torrc is set up for a target
>> throughput rate of 18000 KB/s and a maximum burst rate of 24000 KB/s.
>> Olaf noticed that blutmagie was being swamped by a horrendous load of
>> incoming connections nearly all the time, so he tried using
>> MaxAdvertisedBandwidth to reduce the frequency of inbound connections.
>> He repeatedly lowered the maximum advertised rate, and blutmagie's
>> descriptor correctly reflects that, now showing a target rate of
>> 2000 KB/s, but the connection rate showed no apparent change.  He
>> recently began reporting this trouble on OR-TALK, IIRC, but no one
>> seemed to know why the advertised target rate, even when set so low
>> compared to the actual rate and to the rates published by other
>> heavyweight nodes, didn't reduce the load.
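     For reference, the setup described above corresponds to torrc lines
like the following.  The values are the ones reported here; the exact
settings on blutmagie may of course differ.

```
# Sketch based on the rates mentioned in this message;
# blutmagie's actual torrc may differ.
BandwidthRate 18000 KB
BandwidthBurst 24000 KB
# Lowered repeatedly in an attempt to attract fewer clients:
MaxAdvertisedBandwidth 2000 KB
```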
>>     The problem lies in the consensus document, where it shows (or did
>> an hour or so ago),
>>
>> w Bandwidth=27900
>>
>> Note that 27900 KB/s is considerably higher than the maximum burst
>> rate in the descriptor and is 13.95 times the supposed maximum
>> advertised rate.  That means that, while old client versions that use
>> the values in the descriptors in their route selection process will
>> probably honor the maximum advertised rate of 2000 KB/s, newer
>> clients use the rate in the consensus, 27900 KB/s, in theirs, thus
>> continuing to drown blutmagie in an ongoing flood of incoming
>> connections.
>>     The authorities are currently disregarding the limit published in
>> every node's descriptor and instead are conjuring up their own
>> numbers.  This needs to stop, and right away.
>
>The value in the consensus is not an actual bandwidth, but rather it is
>a bandwidth weight, used by clients to do load balancing. This value is

     Then it is very misleadingly labeled in the consensus, isn't it?

>automatically determined by directory authorities doing active
>measurements of nodes' capacity, to more evenly distribute the load.
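     To illustrate what that weighting does to blutmagie: a client that
picks relays in proportion to the consensus weight keeps sending it the
27900 share, no matter what the descriptor advertises.  A simplified
sketch (my own illustration, not Tor's actual path-selection code; the
other relay names and values are made up):

```python
import re

# Simplified excerpt of a consensus document: each relay has a
# "w Bandwidth=" line holding its consensus weight.  Only blutmagie's
# 27900 comes from this message; the other entries are hypothetical.
consensus = """\
r blutmagie
w Bandwidth=27900
r relay2
w Bandwidth=5000
r relay3
w Bandwidth=3000
"""

def parse_weights(doc):
    """Map each relay nickname to the weight on its following 'w' line."""
    weights = {}
    nickname = None
    for line in doc.splitlines():
        if line.startswith("r "):
            nickname = line.split()[1]
        m = re.match(r"w Bandwidth=(\d+)", line)
        if m and nickname is not None:
            weights[nickname] = int(m.group(1))
    return weights

weights = parse_weights(consensus)
total = sum(weights.values())

# Selection probability is proportional to the consensus weight, so
# blutmagie draws 27900/35900 of the circuits here, regardless of the
# 2000 KB/s advertised rate in its descriptor.
share = weights["blutmagie"] / total
print(round(share * 100))  # roughly 78 (percent)
```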

     Except that the directory authorities are using invalid sources
of information as input to those calculations.  As I keep pointing out,
there is only one vantage point from which those measurements can be
made for any given node, and that is the tor process running on that
node itself.  Neither the directory authorities nor any remote,
undisclosed measurers can possibly gather valid information for the
purpose mentioned.  The whole design of it is provably invalid and
should be dropped ASAP.

>Blutmagie, due to having huge capacity, gets a big share of the network
>by having a lot of unused bandwidth. I have warned that this might lead
>to sad consequences, as available bandwidth is not the only factor to
>determine how much traffic a node can handle, but rather there are other
>things to take into account (number of circuits you need to establish,
>higher memory requirements to service lots of connections compared to
>only one connection that the bandwidth scanner uses, higher overhead
>when more connections need to be handled).

     Exactly.  What I didn't know before this--is it documented anywhere?--
is that the limiting field in the descriptors published by relays is now
ignored by the authorities as an upper limit for the individual relays, as
you describe next.
>
>Another side-effect is that limiting your bandwidth via MaxAdvertised*
>options is no longer viable, because the active measurements are
>affecting circuit building, not the passive advertised values. This has

     "affecting" = "distorting" in this case.

>bad consequences for everyone who tries to attract only a few clients
>but has lots of bandwidth (we're seeing the problem on a few vservers
>as well).
>
>I'm not sure what can be done about this, because measuring
>bandwidth is easy and has led to dramatic speed increases in the
>network for people running the 0.2.2.x versions (only those use the
>bandwidth weights currently, afaik); whereas measuring a node's

     On what evidence do you make that second claim?  I've been using
the -alpha series for the last few years.  During that time, I have
noticed "sudden" improvements in latency and/or speed in only two
kinds of situations.  One was the obvious kind that sometimes happened
in a matter of days:  sudden, if perhaps temporary, growth of 20% - 35%
in the population count of relays.  Those performance boosts were often
really nice experiences.  The other usually involved a somewhat slower
definition of "sudden" and occurred during the times that a major ISP
network (e.g., Comcast in the U.S.) made big upgrades in their physical
speeds and capacities (e.g., upgrading a cable network from DOCSIS 2.0
to 3.0).  I've noticed *no* improvement that correlates with the
changeover to throughput ratings magically conjured up by the authorities,
although it did seem for a while that it might have made things very
slightly worse than before.

>capacity to deal with massive amounts of connections is not trivial.

     And it can *only* be done by that node itself, not by any other means.
>
>Something that might or might not figure into this is that newly started
>Tor clients do active speed tests, building test circuits for the first
>~hour and a half to find a good value for timing out slow circuits. These
>additional circuits might explain a generally higher load on the relays,
>but I'm not sure about this here.
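     The mechanism described there amounts to learning a timeout from
observed build times.  A minimal quantile-based sketch of the idea (my
own illustration with made-up sample values, not Tor's actual code,
which fits a Pareto model to the samples):

```python
# Hypothetical circuit build times in milliseconds, for illustration.
build_times_ms = [400, 550, 620, 700, 800, 950, 1200, 1500, 2100, 6000]

def learned_timeout(samples, quantile=0.8):
    """Pick a timeout so that roughly `quantile` of circuits finish in time."""
    ordered = sorted(samples)
    idx = int(quantile * (len(ordered) - 1))
    return ordered[idx]

# Circuits slower than this cutoff would be abandoned as too slow.
timeout_ms = learned_timeout(build_times_ms)
print(timeout_ms)  # 1500 with these sample values
```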
>
>So, to summarize: There is currently no bug in the authority code, they
>are working as intended. I'm waiting for Mike's further input here to

     They may be working as intended, but that does not negate the fact
of a massive design bug (or collection of bugs) in the authority code.

>see if we need or can do something about the trouble it seems to
>create for blutmagie.
>
     I wish you success, of course, but you'll have to abandon your new
method to be able to get things really right.


                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:       bennett at cs.niu.edu                              *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************


