Hi,
Here are some detailed diagnostics.
My overall conclusion is: there isn't much bandwidth left on that exit.
On Sun, Jun 02, 2019 at 01:30:18PM +1000, teor wrote:
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
The one in question is in Sweden: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9...
It has votes of:
w Bandwidth=10000 Measured=65200
w Bandwidth=10000 Measured=70000
w Bandwidth=10000 Measured=74200
w Bandwidth=10000 Measured=77000
w Bandwidth=10000 Measured=99400
w Bandwidth=10000 Measured=102000
and it currently reports a self-measured peak at 56MBytes/s.
So one could interpret the current bwauths as saying that it is a bit above average compared to other 56 MByte/s relays. Maybe that's because the other 56 MByte/s relays got better lately, or maybe there's less overall traffic on the network. But my guess is that it's stuck in a rut: the bwauths are not good at realizing it could go a lot faster.
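For reference, here's a sketch of how the authorities turn those six Measured= votes into one number. If I remember the dirvote code correctly, it takes a "low median" (for an even number of votes, the lower of the two middle values, not their average); treat that tie-breaking detail as an assumption on my part:

```python
# Measured= values from the six bandwidth authority votes above (scaled kB/s)
measured = [65200, 70000, 74200, 77000, 99400, 102000]

# Low median: with an even count, pick the lower of the two middle values.
def low_median(values):
    ordered = sorted(values)
    return ordered[(len(ordered) - 1) // 2]

print(low_median(measured))  # 74200
```

That lands close to the consensus weight of 75000 we actually see, though the votes in any given consensus won't match this list exactly.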
Well, it's not a simple geographical bias. That's the most common measurement issue we see. The closest bwauth has the median measurement, and the North American bwauths are evenly distributed above and below the median.
Interestingly, sbws measures just slightly above the median, so this also isn't an instance of torflow's "stuck in a partition" bug.
It would be nice to have some evidence that the relay is stuck, rather than just slow, poorly connected, or variable.
The Relays Search bandwidth history shows that both relays on that machine vary a lot: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9... https://metrics.torproject.org/rs.html#details/6B37261F1248DA6E6BB924161F8D7...
But it doesn't tell us *why* they vary.
Are the relays' observed bandwidths limiting their consensus weight?
bandwidth 89600000 102400000 55999620
So it looks like no.
I'm sorry, my question was poorly phrased.
The observed bandwidth is part of the torflow/sbws scaling algorithm, so it's always limiting the consensus weight.
In this case, if the relay observed more bandwidth, it would get about 1.3x that bandwidth as its consensus weight.
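The 1.3x figure is just the ratio between the numbers quoted in this thread (the actual torflow/sbws scaling algorithm is more involved than a single ratio):

```python
observed_bw = 55_999_620   # relay's self-observed bandwidth, bytes/s
consensus_bw = 75_000_000  # consensus weight in scaled bytes (75000 scaled kB)

# Ratio of consensus weight to observed bandwidth:
ratio = consensus_bw / observed_bw
print(f"{ratio:.2f}")  # ~1.34
```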
If the relays are being measured by longclaw's sbws instance, we should also look at their detailed measurement diagnostics.
Looks like yes, it is measured:
w Bandwidth=10000 Measured=78000
I look forward to hearing about these detailed measurement diagnostics. :)
We wrote a spec to answer all^ your questions: https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt
^ except for these undocumented fields: https://trac.torproject.org/projects/tor/ticket/30726
Here are some of the diagnostics from the latest bandwidth file:
1559468088 version=1.4.0 earliest_bandwidth=2019-05-28T09:35:16 file_created=2019-06-02T09:35:04 generator_started=2019-05-19T14:04:34 latest_bandwidth=2019-06-02T09:34:48
sbws has been running for a few weeks, and it's still measuring.
number_consensus_relays=6552 number_eligible_relays=6302 percent_eligible_relays=96
It's measuring 96% of Running relays.
recent_measurement_attempt_count=329137 recent_measurement_failure_count=301111
It has a 90% measurement failure rate, which is way too high: https://trac.torproject.org/projects/tor/ticket/30719
But it's still measuring 96% of Running relays, so this bug might not be as much of a blocker as we thought.
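The failure rate comes straight from the two counters above ("90%" is my rounding; the exact figure is closer to 91%):

```python
attempts = 329_137  # recent_measurement_attempt_count
failures = 301_111  # recent_measurement_failure_count

failure_rate = failures / attempts
print(f"{failure_rate:.0%}")  # ~91%
```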
recent_measurements_excluded_error_count=892 recent_measurements_excluded_few_count=647 recent_measurements_excluded_near_count=232 recent_measurements_excluded_old_count=0
1-4% of measurements are excluded for various reasons. We think that's normal. But it's hard to check, because torflow has limited diagnostics.
software=sbws software_version=1.1.0 time_to_report_half_network=224554
2.6 days is quite a long time to measure half the network. Probably due to #30719.
And here are the diagnostics for that relay, split over a few lines:
bw=7700
This is the vote measured bandwidth.
bw_mean=803269 bw_median=805104
This is the raw measured bandwidth, 784 KBytes/s. This is a *lot* lower than the observed bandwidth of 56 MBytes/s.
The most likely explanation is that the relay doesn't have much bandwidth left over.
But maybe this sbws instance needs more bandwidth. If we fixed #30719, there might be a lot more sbws bandwidth for successful measurements.
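To put numbers on that gap, using the figures above (bw_mean is in bytes/s per the spec):

```python
bw_mean = 803_269      # sbws raw measured mean, bytes/s
observed = 55_999_620  # relay's self-observed bandwidth, bytes/s

print(f"{bw_mean / 1024:.0f} KBytes/s measured")    # ~784
print(f"{observed / bw_mean:.0f}x gap to observed")  # ~70x
```

A 70x gap is too large to be noise, so it's either congestion at the relay or a starved sbws instance.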
consensus_bandwidth=75000000 consensus_bandwidth_is_unmeasured=False
This is the consensus measured bandwidth in the sbws client's consensus, converted from scaled-kilobytes to scaled-bytes.
desc_bw_avg=89600000 desc_bw_bur=102400000
This relay is rate-limited to about 85 MBytes/s.
Maybe it would have more bandwidth if it wasn't rate-limited.
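(The 85 MBytes/s figure is desc_bw_avg converted from bytes/s, using binary megabytes:)

```python
desc_bw_avg = 89_600_000  # advertised average rate limit, bytes/s
print(f"{desc_bw_avg / (1024 * 1024):.1f} MBytes/s")  # ~85.4
```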
desc_bw_obs_last=54690734 desc_bw_obs_mean=54690734
sbws is operating off a slightly older descriptor, in which the observed bandwidth was: 54690734
But the relay is now reporting: 55999620
So we might see the consensus weight increase a little bit in the next day or so.
error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0
This relay has no measurement errors.
master_key_ed25519=Q2Ft/AsNiru+HEx4KRdRxhnuohOs3ByA0t816gUG+Kk nick=che node_id=$D5F2C65F4131A1468D5B67A8838A9B7ED8C049E2
Yes, I am analysing the right relay.
relay_in_recent_consensus_count=310
It has been running for a while. This consensus count is surprising, but there's no spec for it, so I don't know what it's meant to be: https://trac.torproject.org/projects/tor/ticket/30724 https://trac.torproject.org/projects/tor/ticket/30726
relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1
1 measurement in the last 5 days is very low. Probably due to #30719.
success=4
4 successful measurements is good, but it's weird that there is only 1 recent measurement attempt. These figures should be similar: https://trac.torproject.org/projects/tor/ticket/30725
time=2019-06-01T14:56:32
It was last measured about 18 hours ago.
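("About 18 hours" is just the gap between the relay's time= field and the file_created= timestamp in the bandwidth file header:)

```python
from datetime import datetime, timezone

# time= field for the relay vs. file_created= from the bandwidth file header
measured = datetime(2019, 6, 1, 14, 56, 32, tzinfo=timezone.utc)
file_created = datetime(2019, 6, 2, 9, 35, 4, tzinfo=timezone.utc)

hours = (file_created - measured).total_seconds() / 3600
print(f"{hours:.1f} hours")  # ~18.6
```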
T