[tor-dev] Detailed relay diagnostics for Re: Raising AuthDirMaxServersPerAddr to 4?

teor teor at riseup.net
Sun Jun 2 12:27:20 UTC 2019


Here are some detailed diagnostics.

My overall conclusion is: there isn't much bandwidth left on that exit.

> On Sun, Jun 02, 2019 at 01:30:18PM +1000, teor wrote:
>> Which bandwidth authorities are limiting the consensus weight of these
>> relays? Where are they located?
> The one in question is in Sweden:
> https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9B7ED8C049E2
> It has votes of:
> w Bandwidth=10000 Measured=65200
> w Bandwidth=10000 Measured=70000
> w Bandwidth=10000 Measured=74200
> w Bandwidth=10000 Measured=77000
> w Bandwidth=10000 Measured=99400
> w Bandwidth=10000 Measured=102000
> and it currently reports a self-measured peak at 56MBytes/s.
> So one could interpret the current bwauths as saying that it is
> a bit above average compared to other 56MByte/s relays. Maybe that's
> because the other 56MByte/s relays got better lately, or maybe that's
> because there's less overall traffic on the network, but my guess is
> it's because it's stuck in that rut because the bwauths are not good
> at realizing it could go a lot faster.

Well, it's not a simple geographical bias. That's the most common
measurement issue we see. The closest bwauth has the median measurement,
and the North American bwauths are evenly distributed above and below
the median.
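For the record, here's how those six votes sit around the median (a quick sketch; for an even number of votes I believe Tor takes the low value, so treat the exact figure as approximate):

```python
import statistics

# Measured= values from the six bwauth votes quoted above
votes = [65200, 70000, 74200, 77000, 99400, 102000]

print(statistics.median(votes))             # arithmetic median: 75600.0
print(sorted(votes)[len(votes) // 2 - 1])   # "low median" for an even count: 74200
```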

Interestingly, sbws measures just slightly above the median, so this
also isn't an instance of torflow's "stuck in a partition" bug.

It would be nice to have some evidence that the relay is stuck, rather
than just slow, poorly connected, or variable.

The Relay Search bandwidth history shows that both relays on that
machine vary a lot.

But it doesn't tell us *why* they vary.

>> Are the relays' observed bandwidths limiting their consensus weight?
> bandwidth 89600000 102400000 55999620
> So it looks like no.

I'm sorry, my question was poorly phrased.

The observed bandwidth is part of the torflow/sbws scaling algorithm,
so it's always limiting the consensus weight.

In this case, if the relay observed more bandwidth, it would get about
1.3x that bandwidth as its consensus weight.
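As a rough sanity check on that 1.3x figure, using the self-measured observed bandwidth (55999620 bytes/s, the third value of the relay's bandwidth line quoted above) and a ~75600 median of the quoted Measured= votes (treating Measured= as kilobytes, consistent with the scaled-kilobytes convention further down in this mail):

```python
observed_bytes = 55_999_620   # third value of the relay's "bandwidth" line
median_measured_kb = 75_600   # rough median of the quoted Measured= votes

# Ratio of consensus weight (in bytes) to observed bandwidth
ratio = median_measured_kb * 1000 / observed_bytes
print(round(ratio, 2))        # ~1.35, i.e. "about 1.3x"
```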

>> If the relays are being measured by longclaw's sbws instance, we should
>> also look at their detailed measurement diagnostics.
> Looks like yes, it is measured:
> w Bandwidth=10000 Measured=78000
> I look forward to hearing about these detailed measurement diagnostics. :)

We wrote a spec to answer all^ your questions.

^ except for a few undocumented fields, noted below.

Here are some of the diagnostics from the latest bandwidth file:

> 1559468088
> version=1.4.0
> earliest_bandwidth=2019-05-28T09:35:16
> file_created=2019-06-02T09:35:04
> generator_started=2019-05-19T14:04:34
> latest_bandwidth=2019-06-02T09:34:48

sbws has been running for a few weeks, and it's still measuring.

> number_consensus_relays=6552
> number_eligible_relays=6302
> percent_eligible_relays=96

It's measuring 96% of Running relays.
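For anyone following along, the bandwidth file header is just a Unix timestamp followed by key=value lines, so a few lines of Python are enough to pull these fields out. A minimal sketch (the real format, including the "=====" header terminator, is in the bandwidth file spec):

```python
def parse_bwfile_header(text):
    """Parse the key=value header of a bandwidth file into a dict.

    The first line is a Unix timestamp; the header ends at the
    "=====" terminator line.
    """
    lines = text.splitlines()
    header = {"timestamp": int(lines[0])}
    for line in lines[1:]:
        if line.startswith("====="):
            break
        key, _, value = line.partition("=")
        header[key] = value
    return header

sample = """1559468088
version=1.4.0
number_eligible_relays=6302
percent_eligible_relays=96
=====
"""
hdr = parse_bwfile_header(sample)
print(hdr["version"], hdr["percent_eligible_relays"])
```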

> recent_measurement_attempt_count=329137
> recent_measurement_failure_count=301111

It has a roughly 90% measurement failure rate, which is way too high.

But it's still measuring 96% of Running relays, so this bug might
not be as much of a blocker as we thought.
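Those two counters actually work out to a bit over 91%, with only ~28k attempts producing a measurement:

```python
attempts = 329_137   # recent_measurement_attempt_count
failures = 301_111   # recent_measurement_failure_count

failure_rate = failures / attempts
print(f"{failure_rate:.1%}")   # 91.5%
print(attempts - failures)     # 28026 attempts produced a measurement
```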

> recent_measurements_excluded_error_count=892
> recent_measurements_excluded_few_count=647
> recent_measurements_excluded_near_count=232
> recent_measurements_excluded_old_count=0

1-4% of measurements are excluded for various reasons. We think
that's normal. But it's hard to check, because torflow has
limited diagnostics.
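The 1-4% range checks out if the denominator is the ~28k attempts that actually produced a measurement; that denominator is my assumption, since the spec doesn't pin it down:

```python
produced = 329_137 - 301_111   # attempts minus failures, from the header above

# recent_measurements_excluded_*_count values quoted above
excluded = {"error": 892, "few": 647, "near": 232, "old": 0}
for reason, count in excluded.items():
    print(f"{reason}: {count / produced:.1%}")
```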

> software=sbws
> software_version=1.1.0
> time_to_report_half_network=224554

time_to_report_half_network is in seconds: 224554 seconds is roughly
2.6 days, which is quite a long time to measure half the network.
Probably due to #30719.

And here are the diagnostics for that relay, split over a few lines:

> bw=7700

This is the measured bandwidth that goes into the vote.

> bw_mean=803269 bw_median=805104

This is the raw measured bandwidth, 784 KBytes/s.
This is a *lot* lower than the observed bandwidth of 56 MBytes/s.
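To put the gap in one place (assuming 1 KByte = 1024 bytes; unit conventions in these files are mixed):

```python
bw_mean = 803_269        # bytes/s, from the sbws diagnostics above
observed = 55_999_620    # bytes/s, from the relay's descriptor

print(f"measured: {bw_mean / 1024:.0f} KBytes/s")                 # ~784 KBytes/s
print(f"observed is {observed / bw_mean:.0f}x the sbws measurement")  # ~70x
```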

The most likely explanation is that the relay doesn't have much
bandwidth left over.

But maybe this sbws instance needs more bandwidth. If we fixed #30719,
there might be a lot more sbws bandwidth for successful measurements.

> consensus_bandwidth=75000000 consensus_bandwidth_is_unmeasured=False

This is the consensus measured bandwidth in the sbws client's consensus,
converted from scaled-kilobytes to scaled-bytes.

> desc_bw_avg=89600000 desc_bw_bur=102400000

This relay is rate-limited to 85 Mbytes/s.

Maybe it would have more bandwidth if it wasn't rate-limited.

> desc_bw_obs_last=54690734 desc_bw_obs_mean=54690734

sbws is operating off a slightly older descriptor, where the observed
bandwidth is about 54.7 MBytes/s (desc_bw_obs_last above).

But the relay is now reporting an observed bandwidth of about
56 MBytes/s (the bandwidth line quoted earlier).

So we might see the consensus weight increase a little bit in the next
day or so.

> error_circ=0 error_destination=0 error_misc=0
> error_second_relay=0 error_stream=0

This relay has no measurement errors.

> master_key_ed25519=Q2Ft/AsNiru+HEx4KRdRxhnuohOs3ByA0t816gUG+Kk
> nick=che node_id=$D5F2C65F4131A1468D5B67A8838A9B7ED8C049E2

Yes, I am analysing the right relay.

> relay_in_recent_consensus_count=310

It has been running for a while. This consensus count is surprising,
but there's no spec for this field, so I don't know what it's meant to be.

> relay_recent_measurement_attempt_count=1
> relay_recent_priority_list_count=1

1 measurement in the last 5 days is very low.
Probably due to #30719.

> success=4

4 successful measurements is good, but it's weird that there is only
1 recent measurement attempt. These figures should be similar.

> time=2019-06-01T14:56:32

It was last measured about 18 hours ago.
