Hi,
I've done some further analysis on this. My relay is being marked as overloaded because of DNS timeout errors. I had to dive into the source code to figure this out.
In dns.c, a libevent DNS_ERR_TIMEOUT is being recorded as an OVERLOAD_GENERAL error. Am I correct in saying that a single DNS timeout error within a 72-hour period will result in an overloaded state? If so, it seems overly stringent, given that there are no options available to tune the DNS timeout, maximum retries, and similar parameters. Lower-specced servers with less-than-optimal access to DNS resolvers will suffer because of this.
Also, I was wondering why these timeouts were not being recorded in the Metrics output. I've done some digging and I believe there is a bug in the evdns_callback() function. rep_hist_note_dns_error() is called as follows:
rep_hist_note_dns_error(type, result);
However, I've noticed that 'type' is set to zero whenever libevent returns a DNS error. As a result, the correct dns_stats_t structure is never found, because zero is outside the expected range of values (DNS_IPv4_A, DNS_PTR, DNS_IPv6_AAAA). Adding the BUG assertion confirms this.
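To illustrate what I think is happening, here is a minimal standalone sketch. It is not the actual Tor code: every sketch_* name, the struct layout, and the numeric type values are made up for illustration. It only shows how a per-type stats lookup silently drops an error when it is handed a type of zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the three record types. The numeric values
 * are assumptions for this sketch, not copied from libevent or Tor. */
enum { SKETCH_DNS_A = 1, SKETCH_DNS_PTR = 2, SKETCH_DNS_AAAA = 3 };

/* Hypothetical per-type counter, loosely modelled on the idea of a
 * dns_stats_t structure. */
typedef struct {
  uint64_t n_error_timeout;
} sketch_dns_stats_t;

static sketch_dns_stats_t sketch_stats_a;
static sketch_dns_stats_t sketch_stats_ptr;
static sketch_dns_stats_t sketch_stats_aaaa;

/* Return the stats object for a record type, or NULL if the type is not
 * one of the three known values. */
sketch_dns_stats_t *sketch_get_stats(int type) {
  switch (type) {
    case SKETCH_DNS_A:    return &sketch_stats_a;
    case SKETCH_DNS_PTR:  return &sketch_stats_ptr;
    case SKETCH_DNS_AAAA: return &sketch_stats_aaaa;
    default:              return NULL;
  }
}

/* Record a timeout for the given type. Returns 1 if a counter was
 * incremented, 0 if the error was dropped because no stats object
 * matched the type. */
int sketch_note_timeout(int type) {
  sketch_dns_stats_t *stats = sketch_get_stats(type);
  if (stats == NULL) {
    return 0; /* a type of 0 ends up here, so nothing is counted */
  }
  stats->n_error_timeout++;
  return 1;
}
```

In this sketch, sketch_note_timeout(0) returns 0 and leaves every counter untouched, while sketch_note_timeout(SKETCH_DNS_A) increments the A counter. If the real lookup behaves in a similar way, that would explain why the timeouts never show up in the Metrics output.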
Please let me know if I should raise this in the bug tracker or if you need anything else.
Thanks,
Arlen