Hi folks,
I've been probing which relays can't extend to my relay: https://metrics.torproject.org/rs.html#details/7B35DB92BA72BA0BBFD51B35B11A4...
I'm simply making two-hop paths from my client through every relay to my destination relay.
Here's a snapshot of some relays that fail most/all of the extend attempts.
The first data set is Feb 27 through Mar 1, and I've attached it as feb27-mar1.txt. The second data set is Mar 30, and it's attached as mar30.txt.
Each line gives a ratio, read as: number of failed extend attempts / (number of failed + number of succeeded).
(There are some edge cases where I failed to reach the relay as the first hop, for example because it has gone down, or it's in the new consensus but my Tor client doesn't have the descriptor for it yet. I've omitted those edge cases from the ratio calculation, since they're not failures and they're not successes so they just muddy the analysis.)
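As a minimal sketch of the ratio described above (not the actual script used for the attached files), the calculation could look like this in Python. The outcome labels and the (fingerprint, outcome) input format are assumptions made for illustration; the scanner's real output format is not shown in this thread.

```python
from collections import Counter

# Hypothetical outcome labels, invented for this sketch.
FIRST_HOP_FAILED = "first_hop_failed"  # never reached the test relay
EXTEND_FAILED = "extend_failed"        # reached the test relay, extend failed
EXTEND_OK = "extend_ok"                # two-hop circuit was built


def failure_ratios(results):
    """Map fingerprint -> (failed, failed + succeeded).

    `results` is an iterable of (fingerprint, outcome) pairs.
    First-hop failures are skipped: they are neither extend failures
    nor extend successes, so they stay out of the ratio entirely.
    """
    failed = Counter()
    total = Counter()
    for fingerprint, outcome in results:
        if outcome == FIRST_HOP_FAILED:
            continue
        if outcome not in (EXTEND_FAILED, EXTEND_OK):
            raise ValueError(f"unknown outcome: {outcome!r}")
        total[fingerprint] += 1
        if outcome == EXTEND_FAILED:
            failed[fingerprint] += 1
    return {fp: (failed[fp], total[fp]) for fp in total}
```

A relay that only ever failed as the first hop does not appear in the result at all, which matches leaving those edge cases out of the analysis.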
Then I combined them with "cat feb27-mar1.txt mar30.txt | sort > combined.txt" to make it easy to see which relays had problems in both sets.
I've started mailing some of the relay operators to ask them to investigate. (Many of them alas have no workable contactinfo.) The most likely explanation is that they're out of file descriptors (in which case their Tor logs should be complaining constantly). A backup explanation might be that their relay is censored from reaching my relay, perhaps by destination port or address.
Eventually we'll want to use these results to compare against the self-reported "overload" metrics from Proposal 328 ("Make Relays Report When They Are Overloaded"). I'm not sure what the step after that is, but maybe it will be reducing the consensus weights for relays that aren't performing well.
--Roger
On Tue, Mar 30, 2021 at 10:09:19PM -0400, Roger Dingledine wrote:
> The first data set is Feb 27 through Mar 1, and I've attached it as feb27-mar1.txt. The second data set is Mar 30, and it's attached as mar30.txt.
Whoops. I reversed the names of those files. So the one that starts with 18 failures is the more recent one, and the one that starts with 42 failures is the older one. Other than that, everything is still right. :)
--Roger
Hi Roger,
On 31/03/2021 04:09, Roger Dingledine wrote:
> (There are some edge cases where I failed to reach the relay as the first hop, for example because it has gone down, or it's in the new consensus but my Tor client doesn't have the descriptor for it yet. I've omitted those edge cases from the ratio calculation, since they're not failures and they're not successes so they just muddy the analysis.)
Just to check I understand correctly - The attempt is considered a failure if (and only if) you can connect to the test relay correctly, but can't extend from the test relay to your own relay?
As many of these relays have weights, presumably they can successfully extend to at least some relays in order to have their bandwidth measured. I wonder how the probabilities would look if you tested with some other (highly weighted) relays in the 2nd-hop position?
> I'm simply making two-hop paths from my client through every relay to my destination relay.
> Here's a snapshot of some relays that fail most/all of the extend attempts.
Would you be comfortable sharing the unfiltered dataset? It would be interesting to approximate the probability a client circuit is impacted by this kind of failure.
I wrote a little python script (attached) which uses your output and Onionoo's provided probabilities:
    python3 weights.py
    Loaded 180 relay extension failure probabilities from Roger's dataset
    Loaded 7103 relay circuit probabilities from Onionoo
    Over-estimate of probability circuit is impacted by connectivity issue 6.0070908572010426e-05
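The attached script is not reproduced in this thread, so the following is only a sketch of one way such an over-estimate could be formed, not necessarily what weights.py does. It assumes a union bound: each relay's observed extend-failure ratio is weighted by the chance that the relay appears in a circuit, taken here as the sum of the guard_probability, middle_probability and exit_probability fields of its Onionoo details entry. The function names and input shapes are made up for illustration.

```python
PROBABILITY_FIELDS = (
    "guard_probability",
    "middle_probability",
    "exit_probability",
)


def circuit_probability(details):
    """Upper bound on the chance a relay sits somewhere in a circuit.

    `details` is one relay entry from an Onionoo details document.
    Missing or null probability fields count as 0.
    """
    return sum(details.get(field) or 0.0 for field in PROBABILITY_FIELDS)


def impact_upper_bound(failure_ratio, onionoo_relays):
    """Union-bound estimate of the share of circuits that could be affected.

    failure_ratio:  {fingerprint: failed / (failed + succeeded)}
    onionoo_relays: list of Onionoo details entries (dicts).
    Relays without a measured failure ratio contribute nothing.
    """
    bound = 0.0
    for relay in onionoo_relays:
        p_fail = failure_ratio.get(relay["fingerprint"], 0.0)
        bound += p_fail * circuit_probability(relay)
    return min(bound, 1.0)
```

This over-counts in two ways: it sums position probabilities even though a relay fills only one position per circuit, and it treats a failure to extend to one small relay as if it applied to every possible next hop.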
Best,
Dennis
On Wed, Mar 31, 2021 at 11:14:11AM +0200, Dennis Jackson wrote:
> Just to check I understand correctly - The attempt is considered a failure if (and only if) you can connect to the test relay correctly, but can't extend from the test relay to your own relay?
Right.
> As many of these relays have weights, presumably they can successfully extend to at least some relays in order to have their bandwidth measured. I wonder how the probabilities would look if you tested with some other (highly weighted) relays in the 2nd-hop position?
Right. I intentionally picked my own tiny relay for the second hop, first so that I know it's working (to remove that extra variable), but second because there probably won't already be a TLS connection open (because I wanted to test both the TCP/TLS connectivity and also the circuit extend part).
You're right that a good follow-up test would be to compare these results to one with a hugely popular second hop relay, because then there's more chance of an existing long-term conn in place (though that's complicated by flags -- e.g. exit relays probably won't connect to guard relays, but guard relays might be used in the middle hop to connect to exit relays, and circuit extends don't care about which way the existing orconn got originally created).
> Would you be comfortable sharing the unfiltered dataset? It would be interesting to approximate the probability a client circuit is impacted by this kind of failure.
Yeah, I'll publish all these things once I figure out the right place for them (currently they're inside my bermuda repo, which is one of the bad-relays repos).
But in the meantime, in case you are excited for some more scripting, here are the full output results for the past day, plus the little perl script I use for turning the results into that ratio format you saw earlier. This is more data than the mail from 8 hours ago because the scans are still going.
I retest successes after 12 hours, and failures on the first hop after 2 hours, and failures on the second hop after 1 hour, so that's why you'll see more attempts to fingerprints that are more flaky. I should be able to reconstruct approximate timestamps for each test if we find a use for them, and I have the full set of circ controller events too.
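The retest schedule above can be sketched in Python as a simple lookup. This is an illustration only, not the scanner's actual code, and the outcome labels are hypothetical names invented for the sketch.

```python
from datetime import datetime, timedelta

# Delays taken from the description above; the keys are made-up labels.
RETEST_DELAY = {
    "extend_ok": timedelta(hours=12),        # success
    "first_hop_failed": timedelta(hours=2),  # could not reach the test relay
    "extend_failed": timedelta(hours=1),     # second-hop extend failed
}


def next_test_time(last_tested, outcome):
    """Return when a relay should be tested again, given its last outcome."""
    return last_tested + RETEST_DELAY[outcome]
```

With these delays, a relay that fails every extend is retested roughly every hour while a healthy relay is retested twice a day, which is why flaky fingerprints accumulate more attempts.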
--Roger
network-health@lists.torproject.org