Re: [tor-relays] Overloaded state indicator on relay-search

18 Oct 2021

On 17 Oct (13:54:22), Arlen Yaroslav via tor-relays wrote:
> Hi,

Hi Arlen!

> I've done some further analysis on this. The reason my relay is being marked
> as overloaded is because of DNS timeout errors. I had to dive into the
> source code to figure this out.
>
> In dns.c, a libevent DNS_ERR_TIMEOUT is being recorded as an
> OVERLOAD_GENERAL error. Am I correct in saying that a single DNS timeout
> error within a 72-hour period will result in an overloaded state? If so, it
> seems overly-stringent given that there are no options available to tune the
> DNS timeout, max retry etc. parameters. Some lower-specced servers with less
> than optimal access to DNS resolvers will suffer because of this.

Correct, a single DNS timeout will trigger the general overload flag. There
was discussion about reporting it only once N% of all requests had timed out,
with N being around 1%, but unfortunately it was never implemented that way.
And so, at the moment, one timeout is enough to trigger the problem.

And I think you are right, we would benefit from raising that threshold big
time.
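
For illustration, the N% idea could look roughly like the sketch below. This
is not code from tor: the struct, the function names and the 1% constant are
all hypothetical, and the window over which requests are counted is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters for one measurement window; not a tor structure. */
typedef struct sketch_dns_tracker_t {
  uint64_t n_requests; /* DNS requests seen in the window */
  uint64_t n_timeouts; /* how many of those timed out */
} sketch_dns_tracker_t;

/* Assumed threshold: report overload once more than 1% of requests time out. */
#define SKETCH_TIMEOUT_THRESHOLD_PERCENT 1

/* Record the outcome of one DNS request. */
static void
sketch_note_request(sketch_dns_tracker_t *t, bool timed_out)
{
  assert(t);
  t->n_requests++;
  if (timed_out)
    t->n_timeouts++;
}

/* Return true if the timeout ratio is strictly above the threshold. */
static bool
sketch_is_overloaded(const sketch_dns_tracker_t *t)
{
  assert(t);
  if (t->n_requests == 0)
    return false;
  /* Integer form of: n_timeouts / n_requests > PERCENT / 100. */
  return t->n_timeouts * 100 >
         t->n_requests * SKETCH_TIMEOUT_THRESHOLD_PERCENT;
}
```

With this shape, one timeout out of a thousand requests would not flag the
relay, while a sustained rate above 1% would.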

> Also, I was wondering why these timeouts were not being recorded in the
> Metrics output. I've done some digging and I believe there is a bug in the
> evdns_callback() function. The rep_hist_note_dns_error() is being called as
> follows:
>
>   rep_hist_note_dns_error(type, result);
>
> but I've noticed the 'type' being set to zero whenever libevent returns a
> DNS error which means the correct dns_stats_t structure is never found, as
> zero is outside the expected range of values (DNS_IPv4_A, DNS_PTR,
> DNS_IPv6_AAAA). Adding the BUG assertion confirms this.
>
> Please let me know if I should raise this in the bug tracker or if you need
> anything else.

This is an _excellent_ find!

I have opened:

https://gitlab.torproject.org/tpo/core/tor/-/issues/40490

We'll likely attempt to submit a patch to libevent and then fix that in Tor.
Until this is fixed in libevent and the entire network can migrate (which can
take years...), we'll have to live with DNS errors _not_ being per-type on the
MetricsPort, likely going from:

tor_relay_exit_dns_error_total{record="A",reason="timeout"} 0
...

to a line without a "record" label, because we can't tell the record type:

tor_relay_exit_dns_error_total{reason="timeout"} 0

Note that for a successful request, that is reason="success", we can still
tell the record type. It is only for errors that we can't, because of this bug.
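
As a rough sketch of that behaviour (not the actual tor code), the formatter
below drops the "record" label whenever the record type is unknown, for
example when the type comes back as zero. All sketch_* names and the
SKETCH_DNS_* constants are hypothetical stand-ins, and the "PTR" and "AAAA"
label values are assumptions; only "A" appears in the example above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for the three record types; deliberately non-zero so
 * that a type of 0 falls outside the known range. */
enum { SKETCH_DNS_A = 1, SKETCH_DNS_PTR = 2, SKETCH_DNS_AAAA = 3 };

/* Map a record type to its label, or NULL if the type is unknown. */
static const char *
sketch_record_name(int type)
{
  switch (type) {
    case SKETCH_DNS_A:    return "A";
    case SKETCH_DNS_PTR:  return "PTR";   /* assumed label */
    case SKETCH_DNS_AAAA: return "AAAA";  /* assumed label */
    default:              return NULL;
  }
}

/* Format one metrics line into buf. When record is NULL the "record"
 * label is omitted. Returns whatever snprintf() returns. */
static int
sketch_format_dns_error(char *buf, size_t len, const char *record,
                        const char *reason, uint64_t count)
{
  assert(buf && reason);
  if (record)
    return snprintf(buf, len,
        "tor_relay_exit_dns_error_total{record=\"%s\",reason=\"%s\"} %llu",
        record, reason, (unsigned long long)count);
  return snprintf(buf, len,
      "tor_relay_exit_dns_error_total{reason=\"%s\"} %llu",
      reason, (unsigned long long)count);
}
```

Calling sketch_format_dns_error() with sketch_record_name(0) as the record
yields the second, label-free form shown above.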

To everyone: expect this API breakage on the MetricsPort in the next 0.4.7.x
version and, evidently, when the stable comes out.

Big thanks for this find!

Cheers!
David

-- 
ntAC7gj16wctf1lTaBQoW+wcUkFG+MROtH5KheSa698=

David Goulet