On 18 Nov (10:01:09), Arlen Yaroslav via tor-relays wrote:
Some folks might consider switching to non-exit nodes to just get rid of
the overload message. Please bear with us while we are debugging the
problem and don't do that. :) We'll keep this list in the loop.
The undocumented configuration option 'OverloadStatistics' can be used to disable the reporting of an overloaded state. E.g. place the following in your torrc:
OverloadStatistics 0
May be worth considering until the reporting feature becomes a bit more mature and the issues around DNS resolution become a bit clearer.
Greetings everyone!
We wanted to follow up with all of you on this. It has been a while, but we finally got to the bottom of the problem.
We made this ticket public which is where we pulled together the information we had from Exit operators helping us in private:
https://gitlab.torproject.org/tpo/network-health/team/-/issues/139
You can find a summary of the problem here: https://gitlab.torproject.org/tpo/network-health/team/-/issues/139#note_2764...
The gist is that tor imposes a 5-second timeout, basically telling libevent to give up on the DNS resolve after 5 seconds. Libevent will do that 3 times before an error is returned to tor.
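To put a number on that, here is a minimal sketch of the arithmetic. It assumes the 3 attempts run one after another and each one waits out the full timeout; `worst_case_wait` is a hypothetical helper for illustration, not anything from tor or libevent.

```python
# Hypothetical helper, not part of tor or libevent.
# Assumption: the attempts are sequential and each one runs to the full timeout.
def worst_case_wait(timeout_s, attempts):
    """Seconds a client may wait before the resolver reports a timeout."""
    return timeout_s * attempts

# Values from the description above: 5-second timeout, 3 attempts.
current = worst_case_wait(5, 3)
print(f"worst case before a DNS timeout error: {current} seconds")
```

Under those assumptions a slow resolve can hold a client for up to 15 seconds before the error is recorded.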
That very error is a "DNS TIMEOUT" which is what we expose on the MetricsPort and also use for the overload general indicator.
The problem lies with that very error. It is in fact _not_ a "real" DNS timeout but rather just "took too long for the parameters I have". So these timeouts should be seen more as a "UX issue" than a "network issue".
For that reason, we will remove the DNS timeout from the overload general indicator, and we will also rename the "dns timeout" metric on the MetricsPort to something more meaningful.
Operators can still use the DNS metrics to monitor DNS health by looking at all the other possible errors, especially "serverfailed".
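As a rough sketch of that kind of monitoring, the snippet below sums DNS error counters per reason from MetricsPort-style text. The metric name `tor_relay_exit_dns_error_total`, the `record` and `reason` labels, and the sample values are assumptions for illustration only; check what your own MetricsPort actually exports, since the names may differ between tor versions.

```python
import re
from collections import defaultdict

# Illustrative sample only, NOT real relay output. The metric name and
# label names are assumptions; adjust them to what your relay exports.
SAMPLE = """\
# Illustrative values only.
tor_relay_exit_dns_error_total{record="A",reason="serverfailed"} 4
tor_relay_exit_dns_error_total{record="A",reason="timeout"} 120
tor_relay_exit_dns_error_total{record="AAAA",reason="serverfailed"} 1
"""

# Matches lines of the form: name{label="value",...} number
LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')

def dns_errors_by_reason(text, metric="tor_relay_exit_dns_error_total"):
    """Sum the counter values of `metric`, grouped by the `reason` label."""
    totals = defaultdict(float)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if not m or m.group("name") != metric:
            continue  # skips comments, blank lines and unrelated metrics
        labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels")))
        totals[labels.get("reason", "unknown")] += float(m.group("value"))
    return dict(totals)

errors = dns_errors_by_reason(SAMPLE)
print("serverfailed:", errors.get("serverfailed", 0.0))
```

In practice you would feed the function the text fetched from your MetricsPort instead of `SAMPLE`, and watch how the "serverfailed" count changes over time rather than its absolute value.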
Finally, we will most likely also bring down the Tor DNS timeout from 5 seconds to 1 second in order to improve UX:
https://gitlab.torproject.org/tpo/core/tor/-/issues/40312
We will likely fix this in the current 0.4.7.x development version and backport it into 0.4.6 stable. The release timeline is still to come, but we hope it will be as soon as possible.
Thanks everyone for your help, feedback and patience with this problem! In particular, thanks a lot to Anders Trier for their help and providing us with an Exit relay we could experiment with and toralf for providing so much useful information from their relays.
Cheers! David