David,
This is exactly the type of information I was hoping for. You should make this an article and link it to the overloaded support page.
I guess I assumed that Tor performed external timeout monitoring as opposed to relay-reported resource monitoring. It's interesting that you mention load balancing Tor, as that is precisely what my recent efforts have been geared toward. I'm fairly confident that my last overloaded state was due to migrating one of my Tor relay nodes onto a previously provisioned BotFarm node and forgetting to kill the existing bot processes, which left them competing for resources.
I can confirm that when load balancing Tor relay nodes, the whole is only as good as the weakest link; thus, it's important to have identical Tor relay nodes to evenly distribute circuits and maintain consensus. In this paradigm, I was hoping to be able to define a timeout value associated with the overloaded state and tune the load balancer to redistribute to different upstream nodes should a Tor relay node reach such a value. However, after reading your summary of the reporting process, it seems this is a moot point.
At present, I have the upstream load-balancing timeout values disabled and let the Tor nodes build or tear down circuits based on available resources per node. I do see spikes alternate through various nodes throughout the day. It would be nice to find an upstream timeout value to better manage those spikes. Any recommendations would be greatly appreciated.
Respectfully,
Gary
P.S. This is all being done on ASUSWRT-Merlin using AiMesh nodes, but isn't limited to that architecture. I hope to publish a tutorial after ironing out all the kinks.
On Tuesday, September 28, 2021, 7:01:04 AM MDT, David Goulet dgoulet@torproject.org wrote:
On 27 Sep (14:23:34), Gary C. New via tor-relays wrote:
George,
The referenced support article provides recommendations as to what might be causing the overloaded state, but it doesn't provide the metric(s) for how Tor decides whether a relay is overloaded. I'm trying to ascertain the latter. I would assume the overloaded state metric(s) is/are a maximum timeout value and/or reoccurrence value, etc. By knowing what the overloaded state metric is, I can tune my Tor relay just short of it. Thank you for your reply.
Respectfully,
Hi Gary!
I'll try to answer the best I can from what we have worked on for these overload metrics.
Essentially, there are a few places within a Tor relay where we can easily notice an "overloaded" state. I'll list them and tell you how we decide:
1. Out-Of-Memory invocation
Tor has its own OOM handler, and it is invoked when 75% of the total memory tor thinks it can use is reached. Thus, let's say tor thinks it can use 2GB in total; then at 1.5GB of memory usage, it will start freeing memory. That is considered an overload state.
Now the real question here is what is the memory "tor thinks" it has. Unfortunately, it is not the greatest estimation but that is what it is. When tor starts, it will use MaxMemInQueues for that value or will look at the total RAM available on the system and apply this algorithm:
    if RAM >= 8GB {
      memory = RAM * 40%
    } else {
      memory = RAM * 75%
    }
    /* Capped: 8GB on 64bit and 2GB on 32bit. */
    memory = min(memory, 8GB)
    /* Minimum value. */
    memory = max(250MB, memory)
I can't tell you why we picked those numbers; they come from the very early days of the tor software.
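As an illustration only, the estimate and the 75% trigger described above can be sketched in Python. This is not tor's code: the function names are hypothetical, and treating 1GB as 1024^3 bytes and rounding down with integer division are assumptions made for the sketch.

```python
GB = 1024 ** 3  # assumption: binary gigabytes
MB = 1024 ** 2


def estimate_max_mem(total_ram: int, is_64bit: bool = True) -> int:
    """Estimate the memory tor thinks it can use, following the steps above."""
    if total_ram >= 8 * GB:
        memory = total_ram * 40 // 100
    else:
        memory = total_ram * 75 // 100
    # Cap: 8GB on 64bit, 2GB on 32bit.
    cap = 8 * GB if is_64bit else 2 * GB
    memory = min(memory, cap)
    # Never go below the 250MB minimum.
    return max(250 * MB, memory)


def oom_threshold(max_mem: int) -> int:
    """Memory usage at which tor's own OOM handling would kick in (75%)."""
    return max_mem * 75 // 100


if __name__ == "__main__":
    for ram_gb in (1, 2, 4, 16, 32):
        est = estimate_max_mem(ram_gb * GB)
        print(f"{ram_gb}GB RAM: estimate {est / GB:.2f}GB, "
              f"OOM at {oom_threshold(est) / GB:.2f}GB")
```

Under these assumptions, a 64bit machine with 4GB of RAM gives an estimate of 3GB, so tor's OOM handling would start at 2.25GB of usage.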
And so, to avoid such an overload state, running a relay with more than 2GB of RAM on 64bit should clearly be the bare minimum in my opinion. 4GB would be much, much better. In DDoS circumstances, there is a whole lot of memory pressure.
A keen observer will notice that this approach also has the problem that it doesn't shield tor from being killed by the OS OOM itself. The reason is that we take the total memory on the system when tor starts, so if the overall system has many other applications using RAM, we end up eating too much memory and the OS could OOM tor without tor even noticing memory pressure. Fortunately, this is not a problem affecting the overload status situation.
2. Onionskins processing
Tor is sadly single threaded _except_ when the "onion skins" are processed, that is, the cryptographic work that needs to be done on the famous "onion layers" in every circuit.
For that we have a thread pool and outsource all of that work to that pool. It can happen that this pool starts dropping work due to back pressure and that in turn is an overload state.
Why can this happen? Essentially, CPU pressure. If your server is running at capacity and it is not only running your tor, then this is likely to trigger.
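To make the back-pressure idea concrete, here is a minimal Python sketch of a worker backlog that drops work once it is full and counts the drops. It is a generic illustration, not how tor decides what to drop: the class name, the fixed backlog size, and the "any drop means overloaded" rule are all assumptions.

```python
import queue


class OnionskinPool:
    """Hypothetical stand-in for a worker pool with a bounded backlog."""

    def __init__(self, max_pending: int):
        # Work waiting for a free worker thread; bounded on purpose.
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0

    def submit(self, onionskin) -> bool:
        """Queue one unit of work, or drop it if the backlog is full."""
        try:
            self.pending.put_nowait(onionskin)
            return True
        except queue.Full:
            # The workers cannot keep up: this is the back pressure.
            self.dropped += 1
            return False

    def overloaded(self) -> bool:
        """Assumed rule for this sketch: any dropped work means overload."""
        return self.dropped > 0


if __name__ == "__main__":
    pool = OnionskinPool(max_pending=2)
    accepted = [pool.submit(n) for n in range(3)]
    print(accepted, pool.dropped, pool.overloaded())
```

With a backlog of two and nothing draining the queue, the third submission is dropped, which is the kind of event that would count toward the overload state.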
3. DNS Timeout
This applies only to Exits. If tor starts noticing DNS timeouts, you'll get the overload flag. This might not be because your relay is overloaded in terms of resources but it signals a problem on the network.
And DNS timeouts at the Exits are a _huge_ UX problem for tor users, so Exit operators really need to be on top of those to help. The overload line does not make the cause clear, but at least an operator who notices it can then investigate DNS timeouts when there is no resource pressure.
4. TCP port exhaustion
This should be extremely rare though. The idea behind this one is that you run out of TCP ports, a range that on Linux is usually 32768-60999, and so having that many connections would lead to the overload state.
However, I think (I might be wrong though) that nowadays this range is per source IP and not process wide so it would likely have to be deliberate from someone to put your relay in that state.
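For a sense of scale, the size of that range can be computed with a few lines of Python. This is a sketch: the function names are hypothetical, and it falls back to the 32768-60999 range quoted above when the Linux setting `/proc/sys/net/ipv4/ip_local_port_range` cannot be read.

```python
from pathlib import Path

DEFAULT_RANGE = "32768 60999"  # the usual Linux range mentioned above


def read_local_port_range(
    path: str = "/proc/sys/net/ipv4/ip_local_port_range",
) -> str:
    """Return the local port range as 'low high', or the default if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return DEFAULT_RANGE


def ephemeral_port_count(port_range: str = DEFAULT_RANGE) -> int:
    """Number of ports in an inclusive 'low high' range."""
    low, high = (int(p) for p in port_range.split())
    return high - low + 1


if __name__ == "__main__":
    current = read_local_port_range()
    print(current, ephemeral_port_count(current))
```

With the default range this gives 28232 ports, which is the number of connections it would take to exhaust them if the limit were process wide.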
There are two other overload lines that tor relays report: "overload-ratelimits" and "overload-fd-exhausted", but they are not used yet for the overload status on Metrics. You can find them in your relay descriptor[0] if you are curious.
They are about when your relay reaches its connection global limit too often and when your relay runs out of file descriptors.
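If you want to check a descriptor for these two lines, a small Python sketch like the following would do. It only looks for lines starting with the two keywords named above and returns whatever fields follow them; the function name is hypothetical, and the sample text in the usage is made up, so its field layout should not be read as the real format.

```python
OVERLOAD_KEYWORDS = ("overload-ratelimits", "overload-fd-exhausted")


def find_overload_lines(descriptor_text: str) -> dict:
    """Map each overload keyword found in the text to its remaining fields."""
    found = {}
    for line in descriptor_text.splitlines():
        parts = line.split()
        if parts and parts[0] in OVERLOAD_KEYWORDS:
            found[parts[0]] = parts[1:]
    return found


if __name__ == "__main__":
    # Made-up sample for illustration; not copied from a real descriptor.
    sample = "\n".join([
        "some-other-line value",
        "overload-fd-exhausted 1 2021-09-28 07:00:00",
    ])
    print(find_overload_lines(sample))
```

An empty result simply means neither line is present in the text you passed in.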
Hope this helps, but overall, as you can see, a lot of factors can influence these metrics, and so the ideal ideal ideal situation for a tor relay is that it runs alone on a fairly good machine. Any kind of pullback from a tor relay, like being overloaded, has cascading effects on the network, both in terms of UX and in terms of load balancing, which tor is not yet very good at (but we are working hard on making it much better!!).
Cheers! David
[0] https://collector.torproject.org/recent/relay-descriptors/server-descriptors...