On 17 Oct (13:54:22), Arlen Yaroslav via tor-relays wrote:
Hi,
Hi Arlen!
I've done some further analysis on this. The reason my relay is being marked as overloaded is because of DNS timeout errors. I had to dive into the source code to figure this out.
In dns.c, a libevent DNS_ERR_TIMEOUT is being recorded as an OVERLOAD_GENERAL error. Am I correct in saying that a single DNS timeout error within a 72-hour period will result in an overloaded state? If so, it seems overly-stringent given that there are no options available to tune the DNS timeout, max retry etc. parameters. Some lower-specced servers with less than optimal access to DNS resolvers will suffer because of this.
Correct, a single DNS timeout will trigger the general overload flag. There was discussion about requiring N% of all requests to time out before we would report it, with N being around 1%, but unfortunately it was never implemented that way. And so, at the moment, one timeout is enough to trigger the problem.
And I think you are right, we would benefit big time from raising that threshold.
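To make the N% idea concrete, here is a minimal sketch of a ratio-based check. The struct and function names are hypothetical and not taken from the Tor source; it only illustrates how a percentage threshold would differ from "one timeout trips the flag".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters for one reporting window; these names are
 * NOT from the Tor source tree. */
typedef struct {
  uint64_t n_requests;  /* all DNS requests seen in the window */
  uint64_t n_timeouts;  /* requests that ended in a timeout */
} dns_window_stats_t;

/* Return true when timeouts make up at least `pct` percent of all
 * requests in the window. An empty window never counts as overloaded. */
bool
dns_timeouts_exceed_pct(const dns_window_stats_t *stats, unsigned pct)
{
  if (stats->n_requests == 0)
    return false;
  /* Integer form of n_timeouts / n_requests >= pct / 100, which avoids
   * floating point. */
  return stats->n_timeouts * 100 >= stats->n_requests * (uint64_t)pct;
}
```

With pct set to 1, a relay that saw 1 timeout in 1000 requests would not be flagged, while 10 timeouts in 1000 would.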
Also, I was wondering why these timeouts were not being recorded in the Metrics output. I've done some digging and I believe there is a bug in the evdns_callback() function. The rep_hist_note_dns_error() is being called as follows:
rep_hist_note_dns_error(type, result);
but I've noticed the 'type' being set to zero whenever libevent returns a DNS error which means the correct dns_stats_t structure is never found, as zero is outside the expected range of values (DNS_IPv4_A, DNS_PTR, DNS_IPv6_AAAA). Adding the BUG assertion confirms this.
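The effect described above can be sketched in isolation. This is not the actual Tor code: the struct and the two helper functions are hypothetical stand-ins, and the record-type constants are redeclared locally with the values libevent's <event2/dns.h> is believed to use, so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>

/* Record-type constants, redeclared here for a standalone sketch. The
 * numeric values are an assumption about libevent's <event2/dns.h>. */
enum { DNS_IPv4_A = 1, DNS_PTR = 2, DNS_IPv6_AAAA = 3 };

/* Hypothetical stand-in for the per-record-type stats structure. */
typedef struct {
  uint64_t n_timeout;
} dns_stats_sketch_t;

static dns_stats_sketch_t stats_a, stats_ptr, stats_aaaa;

/* Map a record type to its counters; NULL for any unrecognized type. */
dns_stats_sketch_t *
get_stats_for_type(int type)
{
  switch (type) {
    case DNS_IPv4_A:    return &stats_a;
    case DNS_PTR:       return &stats_ptr;
    case DNS_IPv6_AAAA: return &stats_aaaa;
    default:            return NULL;
  }
}

/* Record a timeout for `type`. Returns 0 on success, or -1 when the
 * type matched no structure and the error was silently dropped. */
int
note_timeout(int type)
{
  dns_stats_sketch_t *stats = get_stats_for_type(type);
  if (!stats)
    return -1;
  stats->n_timeout++;
  return 0;
}
```

Calling note_timeout(0) returns -1 and leaves every counter untouched, which matches the symptom of timeouts never showing up in the metrics output.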
Please let me know if I should raise this in the bug tracker or if you need anything else.
This is an _excellent_ find!
I have opened:
https://gitlab.torproject.org/tpo/core/tor/-/issues/40490
We'll likely attempt to submit a patch to libevent and then fix that in Tor. Until this is fixed in libevent and the entire network can migrate (which can take years...), we'll have to live with DNS errors _not_ being per-type on the MetricsPort, likely going from:
tor_relay_exit_dns_error_total{record="A",reason="timeout"} 0 ...
to a line without a "record" because we can't tell:
tor_relay_exit_dns_error_total{reason="timeout"} 0
Note that for a successful request, that is reason="success", we can still tell the record type; for errors we can't, because of this bug.
To everyone: expect this API breakage on the MetricsPort in the next 0.4.7.x version and, evidently, when the stable comes out.
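For anyone parsing the MetricsPort output, the difference boils down to an optional "record" label. Here is a rough sketch of how such a line could be formatted either way. The function name is hypothetical and this is not how Tor builds its metrics internally; only the metric name and labels come from the lines above.

```c
#include <stdint.h>
#include <stdio.h>

/* Format one DNS error metric line into `buf`. Pass record == NULL when
 * the record type is unknown, and the "record" label is left out.
 * Returns what snprintf() returns. Hypothetical helper, not Tor code. */
int
format_dns_error_metric(char *buf, size_t len, const char *record,
                        const char *reason, uint64_t count)
{
  if (record)
    return snprintf(buf, len,
                    "tor_relay_exit_dns_error_total"
                    "{record=\"%s\",reason=\"%s\"} %llu",
                    record, reason, (unsigned long long)count);
  return snprintf(buf, len,
                  "tor_relay_exit_dns_error_total{reason=\"%s\"} %llu",
                  reason, (unsigned long long)count);
}
```

Scrapers that key on the "record" label for error lines will need to handle its absence.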
Big thanks for this find!
Cheers! David