[metrics-bugs] #28305 [Metrics/Statistics]: Include client numbers even if we think we got reports from more than 100% of all relays

Wed Nov 28 22:39:55 UTC 2018

#28305: Include client numbers even if we think we got reports from more than 100%
of all relays
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  karsten
     Type:  defect              |         Status:  accepted
 Priority:  High                |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:  SponsorV-can
--------------------------------+------------------------------

Comment (by teor):

 Replying to [comment:5 karsten]:
 > I think I now know what's going on: some relays report written directory
 byte statistics for times when they were hardly listed in consensuses.
 >
 > Here's a graph ...
 >
 > Note the red arrow. At this point `n(H)` grows larger than `n(N)`.
 That's an issue. By definition, a relay cannot report written directory
 bytes statistics for a longer time than it's online.

 But relays that aren't listed in the consensus can still be acting as
 relays.

 Here are a few scenarios where that happens:
 * the relay's IPv4 address is unreachable from a majority of directory
 authorities, but some clients (with old consensuses) can still reach it
 * the relay's IPv4 address has changed, and the authorities haven't
 checked the new address, but the relay is still reachable on the old
 address cached at some clients
 * the same scenarios with IPv6, but only 6 of the 9 authorities check
 and vote on IPv6
 * the relay is configured as a bridge by some clients, but it publishes
 descriptors as a relay

 If a relay drops in and out of the consensus every few hours, there will
 always be some clients with a consensus containing that relay.

 > I also looked at random relay `002B024E24A30F113982FCB17DFE05B6F38C0C79`
 that had a larger `n(H)` value than `n(N)` value on 2018-10-28:
 >
 >  - This relay was listed in 3 out of 24 consensuses on 2018-10-28
 (19:00, 20:00, and 21:00). As a result, we count this relay with `n(N) =
 10800` (we're using seconds internally, not hours).
 >  - The same relay published an extra-info descriptor on 2018-10-31 at
 09:28:04 with the following line: `dirreq-write-history 2018-10-30
 08:04:04 (86400 s) 0,0`. We count this as `n(H) = 57356` on 2018-10-28.
 >
 > A possible mitigation (other than the one I suggested above) could be to
 replace `n(H)` with `n(N^H)` in the `frac` formula. This would mean that
 we'd cap the amount of time for which a relay reported written directory
 bytes to the amount of time it was listed in the consensus.

 This seems like a reasonable approach: if the relay is listed in the
 consensus for `n(N^H)` seconds, then we should weight its bandwidth using
 that number of seconds.
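
 As a rough check of the numbers in the quoted example, here is a minimal
 Python sketch. It assumes that each consensus counts for the hour starting
 at its valid-after time, and that the two values in the history line
 describe two consecutive 86400-second intervals ending at the given
 timestamp. The names are made up for illustration and are not taken from
 the metrics code.

```python
from datetime import datetime, timedelta

# The day we are computing statistics for.
DAY_START = datetime(2018, 10, 28)
DAY_END = DAY_START + timedelta(days=1)


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length in seconds of the overlap of two intervals (0 if disjoint)."""
    start, end = max(a_start, b_start), min(a_end, b_end)
    return max(0, int((end - start).total_seconds()))


# Assumption: a consensus at hour h covers the interval [h, h + 1).
consensus_hours = [19, 20, 21]
listed = [(DAY_START + timedelta(hours=h), DAY_START + timedelta(hours=h + 1))
          for h in consensus_hours]

# dirreq-write-history 2018-10-30 08:04:04 (86400 s) 0,0
# Assumption: two values mean two consecutive intervals ending at the timestamp.
history_end = datetime(2018, 10, 30, 8, 4, 4)
interval = timedelta(seconds=86400)
values = [0, 0]
reported = [(history_end - interval * (len(values) - i),
             history_end - interval * (len(values) - i - 1))
            for i in range(len(values))]

# n(N): seconds listed in consensuses on that day.
n_N = sum(overlap_seconds(s, e, DAY_START, DAY_END) for s, e in listed)

# n(H): seconds covered by reported written directory bytes on that day.
n_H = sum(overlap_seconds(s, e, DAY_START, DAY_END) for s, e in reported)

# n(N^H): seconds that are both listed and covered by a report on that day.
n_NH = sum(overlap_seconds(max(l_start, DAY_START), min(l_end, DAY_END),
                           r_start, r_end)
           for l_start, l_end in listed
           for r_start, r_end in reported)

print(n_N, n_H, n_NH)
```

 Under these assumptions the sketch gives `n(N) = 10800`, `n(H) = 57356`
 and `n(N^H) = 10800`, so the cap would reduce this relay's reported time
 on 2018-10-28 from 57356 to 10800 seconds.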

 > I'm currently dumping and downloading the database to try this out at
 home. However, I'm afraid that deploying this fix is going to be much more
 expensive than making the simple fix suggested above. I'll report here
 what I find out.

 I'm not sure if it will make much of a difference long-term: relays that
 drop out of the consensus should have low bandwidth weights, and therefore
 low bandwidths. (Except when the network is unstable, or there are fewer
 than 3 bandwidth authorities.)

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/28305#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online