[tor-bugs] #28328 [Metrics/Website]: include "total consensus" in vote totals graph

Wed Nov 7 05:46:34 UTC 2018

#28328: include "total consensus" in vote totals graph
-----------------------------+------------------------------
 Reporter:  starlight        |          Owner:  metrics-team
     Type:  enhancement      |         Status:  new
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------

Comment (by starlight):

 Replying to [comment:6 teor]:
 > I'm not sure I understand the problem, or its likely cause. I am cc'ing
 Mike, because he has more experience with bandwidth weighting.
 >
 > I'm going to ask some questions to work out what is happening. I find
 big blocks of text confusing, so it would help me if you'd answer after
 each question.
 >
 > Replying to [ticket:28328 starlight]:
 > > Totals of consensus weights shift erratically due to some aspect of
 vote median behavior in the consensus.  E.g. (Exit,Exit+Guard) moved 12.5%
 in 12 hours on 09-Jul-18 12:00 to 23:59 UTC while the votes held steady.
 >
 > The consensus is created deterministically from the votes. If the votes
 are identical, the consensus will be identical. In particular, the
 consensus weights are the low-median of the votes for each relay: they
 can't change unless the votes change.
 >
 > What is changing in the votes to change the consensus weights?

 The problem I see is that, in aggregate, the median vote values selected
 by the consensus will, in a short span, shift around such that the _total_
 consensus value moves significantly.  This would not matter if individual
 votes were updated as quickly as these shifts in totals, but in practice
 the votes for an individual relay often go without an update for two or
 even three days.  Individual relays see their consensus selection
 probability change by 5% or even 10% (because the denominator changes)
 while the absolute median for the relay (numerator) does not move at all.
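
 A minimal sketch of the arithmetic, assuming selection probability within
 a class is simply the relay's consensus weight divided by the class total
 (this ignores the position weights clients also apply).  The function name
 and all numbers are made up for illustration; only the 12.5% figure comes
 from the ticket description.

```python
def selection_probability(relay_weight, total_weight):
    """Share of the class total held by one relay (simplified model)."""
    return relay_weight / total_weight


# Hypothetical values: the relay's own median (numerator) never changes.
relay_weight = 10_000
total_before = 1_000_000
total_after = 1_125_000  # class total grows by 12.5%

p_before = selection_probability(relay_weight, total_before)
p_after = selection_probability(relay_weight, total_after)

# Relative change in selection probability caused by the denominator alone.
relative_change = (p_after - p_before) / p_before
print(f"before={p_before:.6f} after={p_after:.6f} change={relative_change:.1%}")
```

 With these numbers a 12.5% rise in the total costs the relay about 11% of
 its selection probability even though its own weight is untouched.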

 In a word:  anachronism

 >
 > Are some authorities not voting?

 Voting continues, but not consistently across the entire set of relays.
 SBWS likely does not suffer from this behavior.

 > Are the Bandwidth= figures in the votes actually different?

 Per the above, some change and some do not.

 An easy way to think about this is cases where one of the bwauths drops
 out for a few hours or a day or two.  The consensus total will experience
 a huge jump in one hour but many relay median votes do not move at all.
 This is the extreme case, but it happens all the time without a bwauth
 withdrawal or join event.
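
 A toy illustration of the bwauth drop-out case, assuming the consensus
 weight of each relay is the low-median of its votes as teor describes.
 The relay names, authority names ("bw1".."bw5") and vote values are all
 invented; real consensus generation has more rules than this.

```python
def low_median(values):
    """Low-median: the lower of the two middle values for an even count."""
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]


def consensus_weights(votes, exclude=()):
    """Per-relay low-median over the authorities that are still voting."""
    return {
        relay: low_median([bw for auth, bw in per_auth.items() if auth not in exclude])
        for relay, per_auth in votes.items()
    }


# Hypothetical votes: relay -> {bwauth: measured bandwidth}
votes = {
    "relayA": {"bw1": 900, "bw2": 1000, "bw3": 1100, "bw4": 2000, "bw5": 2100},
    "relayB": {"bw1": 500, "bw2": 500, "bw3": 500, "bw4": 510, "bw5": 490},
    "relayC": {"bw1": 3000, "bw2": 4000, "bw3": 5000, "bw4": 8000, "bw5": 9000},
}

all_voting = consensus_weights(votes)
one_missing = consensus_weights(votes, exclude={"bw5"})

total_all = sum(all_voting.values())
total_missing = sum(one_missing.values())

# relayB's median is identical in both cases, but its share of the total is not.
share_all = all_voting["relayB"] / total_all
share_missing = one_missing["relayB"] / total_missing
print(total_all, total_missing, round(share_all, 4), round(share_missing, 4))
```

 Here the total moves when "bw5" stops voting while relayB's own median
 stays put, so relayB's share changes purely through the denominator.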

 > Or, are you talking about overall relay selection probability, which
 depends on the total consensus weight?

 This is all about the totals moving and shifting while the votes are not
 refreshing as quickly.  Each relay class operates independently as a
 practical matter, and exits have the worst time of it.

 > Do other relays start Running or stop Running?

 Relays are generally stable.  It seems to me that occasionally a big
 operator will take down or start up a block of a dozen or so high-
 bandwidth nodes and this can trigger a shift, but it's not the principal
 cause.  The "rc" columns and percentages in the CSV can be used to look
 for these.

 > Do some relays start or stop being Guard or Exit?

 Possibly, but again these events are not a big problem AFAICT.

 > > Twenty percent in 56 hours with votes shifting.  The behavior results
 in significant adjustment to the selection probability of relays with
 unchanged consensus weights.
 >
 > The goal of the bandwidth weighting system is to provide a set of
 weights that give clients equal performance, regardless of the particular
 relays they choose.
 >
 > Maybe the load on the relay changes erratically, so its selection
 probability should also change?

 Again, in this situation I'm focused on consensus totals.  Something about
 the way Torflow votes from different authorities interact results in the
 medians shifting wholesale while the individual vote sets appear mostly
 stable.  I did not try to analyze the exact nature of it, figuring it
 would be worth the trouble only if the new system experiences this.

 > Maybe other available relays change their performance, so this relay
 should get used more (or less)?
 >
 > Do these erratic changes affect client performance?

 Clients use selection probability, so yes for sure.  If a node's
 probability changes because the denominator moved, the number is still
 different.

 > Would clients perform better or worse without these erratic changes?

 I believe this contributes to misrating, especially for faster relays
 where the offset ratios are high, +1 and above (i.e. 2x the average), and
 could be a factor in relays overloading and seizing up, as often happens.
 I notice this when using SSH frequently--a good session will abruptly
 become terrible or just freeze.

 >
 > > Please add to
 > >
 > > https://metrics.torproject.org/totalcw.html
 >
 > I think a separate graph would be better: having 6 authorities * 5
 categories = 30 lines on a graph will be unmanageable.

 sure, works for me

 >
 > Replying to [comment:5 starlight]:
 > > I thought more about weighting the values (as in Relay Search), but it
 makes no difference for the purpose which is to see if the totals of
 medians continue jumping about with SBWS as presently happens with
 Torflow.  Simply graphing the total consensus for each selection class,
 Exit, Guard, Middle is sufficient.
 >
 > I agree we should monitor the behavior of each class of relays.
 >
 > > (Exit,Exit+Guard) is the total of Exit-Only and Exit+Guard flagged
 relays as this is the set used for choosing exits
 >
 > No, this is the set that is *currently* used for choosing exits. If tor
 gets more exits in future, then Exit+Guard may be used as Guard.

 yes, the weights. . .haven't fully wrapped my mind around how it all works

 >
 > So we shouldn't hard-code the assumption that Exit+Guard is only used as
 an Exit.
 >
 > Instead, I suggest that we match the sets in
 https://metrics.torproject.org/bwhist-flags.html
 >
 > Guard & Exit
 > Guard only
 > Exit only
 > Middle only
 >
 > I noticed some other things while reviewing this ticket, I'll create
 child tickets for them.

 will watch with interest
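
 For reference, totaling consensus weight over the four sets listed above
 could look roughly like the sketch below.  It assumes each relay is
 available as a (flags, consensus weight) pair; the function names and
 sample data are hypothetical, and flags other than Guard and Exit are
 ignored.

```python
CLASSES = ("Guard & Exit", "Guard only", "Exit only", "Middle only")


def flag_class(flags):
    """Map a relay's flag set onto one of the four graph categories."""
    is_guard = "Guard" in flags
    is_exit = "Exit" in flags
    if is_guard and is_exit:
        return "Guard & Exit"
    if is_guard:
        return "Guard only"
    if is_exit:
        return "Exit only"
    return "Middle only"


def totals_by_class(relays):
    """Sum consensus weight per category for one consensus."""
    totals = {name: 0 for name in CLASSES}
    for flags, weight in relays:
        totals[flag_class(flags)] += weight
    return totals


# Hypothetical relays: (flags, consensus weight)
sample = [
    ({"Guard", "Exit", "Running"}, 4000),
    ({"Guard", "Running"}, 2500),
    ({"Exit", "Running"}, 3000),
    ({"Running"}, 700),
    ({"Guard", "Running"}, 1500),
]
class_totals = totals_by_class(sample)
print(class_totals)
```

 Computing these four totals for every consensus would give four lines per
 graph and avoids hard-coding how Exit+Guard relays are used.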

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28328#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online