[tor-relays] Overloaded state indicator on relay-search

David Goulet dgoulet at torproject.org
Mon Oct 18 12:43:24 UTC 2021


On 17 Oct (13:54:22), Arlen Yaroslav via tor-relays wrote:
> Hi,

Hi Arlen!

> 
> I've done some further analysis on this. The reason my relay is being marked
> as overloaded is because of DNS timeout errors. I had to dive into the
> source code to figure this out.
> 
> In dns.c, a libevent DNS_ERR_TIMEOUT is being recorded as an
> OVERLOAD_GENERAL error. Am I correct in saying that a single DNS timeout
> error within a 72-hour period will result in an overloaded state? If so, it
> seems overly-stringent given that there are no options available to tune the
> DNS timeout, max retry etc. parameters. Some lower-specced servers with less
> than optimal access to DNS resolvers will suffer because of this.

Correct, a single DNS timeout will trigger the general overload flag. There
was discussion about requiring N% of all requests to time out before we would
report it, with N being around 1%, but unfortunately it was never implemented
that way. And so, at the moment, one timeout is enough to trigger the problem.
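
To make the N% idea concrete, here is a minimal sketch of what a
percentage-based check could look like. The struct and function names are
hypothetical and are not Tor's actual code; the 1% figure is only the value
mentioned in the discussion.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-window counters; names are illustrative, not Tor's. */
typedef struct {
  uint64_t total_requests; /* all DNS requests seen in the window */
  uint64_t timeouts;       /* requests that ended in a timeout */
} dns_window_stats_t;

/* Return true when timeouts make up at least threshold_pct percent of all
 * requests in the window. With no requests at all, report no overload. */
static bool
dns_timeouts_exceed_threshold(const dns_window_stats_t *stats,
                              double threshold_pct)
{
  if (stats->total_requests == 0)
    return false;
  double pct = 100.0 * (double)stats->timeouts /
               (double)stats->total_requests;
  return pct >= threshold_pct;
}
```

With a threshold of 1.0, a relay with 1 timeout out of 1000 requests would
not be flagged, while 10 out of 1000 would.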

And I think you are right, we would benefit from raising that threshold big
time.

> 
> Also, I was wondering why these timeouts were not being recorded in the
> Metrics output. I've done some digging and I believe there is a bug in the
> evdns_callback() function. The rep_hist_note_dns_error() is being called as
> follows:
> 
> rep_hist_note_dns_error(type, result);
> 
> but I've noticed the 'type' being set to zero whenever libevent returns a
> DNS error which means the correct dns_stats_t structure is never found, as
> zero is outside the expected range of values (DNS_IPv4_A, DNS_PTR,
> DNS_IPv6_AAAA). Adding the BUG assertion confirms this.
> 
> Please let me know if I should raise this in the bug tracker or if you need
> anything else.

This is an _excellent_ find!
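
For readers following along, the failure mode you describe can be sketched as
below. This is a simplified stand-in, not the code in Tor: the sketch_* names
are hypothetical, and the numeric values of the record type constants are
assumed for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the record type constants; the values are assumptions. */
enum {
  SKETCH_DNS_IPv4_A = 1,
  SKETCH_DNS_PTR = 2,
  SKETCH_DNS_IPv6_AAAA = 3
};

typedef struct {
  uint64_t timeouts;
} sketch_dns_stats_t;

static sketch_dns_stats_t stats_a, stats_ptr, stats_aaaa;

/* Map a record type to its stats structure; NULL for an unknown type. */
static sketch_dns_stats_t *
sketch_stats_by_type(int type)
{
  switch (type) {
    case SKETCH_DNS_IPv4_A:    return &stats_a;
    case SKETCH_DNS_PTR:       return &stats_ptr;
    case SKETCH_DNS_IPv6_AAAA: return &stats_aaaa;
    default:                   return NULL;
  }
}

/* Record a timeout for the given type. Returns 1 if it was counted and 0 if
 * it was dropped because the type is outside the expected range. */
static int
sketch_note_dns_timeout(int type)
{
  sketch_dns_stats_t *stats = sketch_stats_by_type(type);
  if (!stats)
    return 0; /* type == 0 lands here, so the error is never recorded */
  stats->timeouts++;
  return 1;
}
```

Calling sketch_note_dns_timeout(0) returns 0 and leaves every counter
untouched, which is the same shape as the errors never showing up in the
metrics output.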

I have opened:

https://gitlab.torproject.org/tpo/core/tor/-/issues/40490

We'll likely attempt to submit a patch to libevent and then fix that in Tor.
Until this is fixed in libevent and the entire network can migrate (which can
take years...), we'll have to live with DNS errors _not_ being per-type on the
MetricsPort, likely going from:

tor_relay_exit_dns_error_total{record="A",reason="timeout"} 0
...

to a line without a "record" because we can't tell:

tor_relay_exit_dns_error_total{reason="timeout"} 0

Note that for a successful request, that is reason="success", we can tell the
record type, but not for errors, because libevent reports the type as zero in
that case.
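
As a rough illustration of the two output shapes, here is a small formatter
that emits the "record" label only when the record type is known. The
function name and signature are hypothetical; this is not how Tor builds its
MetricsPort output.

```c
#include <stdio.h>

/* Write one tor_relay_exit_dns_error_total line into buf. Pass record as
 * NULL when the record type is unknown, in which case the "record" label is
 * omitted. Returns the value of snprintf(). */
static int
format_dns_error_metric(char *buf, size_t len, const char *record,
                        const char *reason, unsigned long count)
{
  if (record) {
    return snprintf(buf, len,
                    "tor_relay_exit_dns_error_total"
                    "{record=\"%s\",reason=\"%s\"} %lu",
                    record, reason, count);
  }
  return snprintf(buf, len,
                  "tor_relay_exit_dns_error_total{reason=\"%s\"} %lu",
                  reason, count);
}
```

Anything scraping the MetricsPort that matches on the "record" label for
errors would need to handle the second form.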

To everyone: expect that API breakage on the MetricsPort in the next 0.4.7.x
version and, evidently, when the stable comes out.

Big thanks for this find!

Cheers!
David

-- 
ntAC7gj16wctf1lTaBQoW+wcUkFG+MROtH5KheSa698=