Greetings,
Earlier this month, many relay operators started noticing huge loads on their relays, in terms of both traffic and memory consumption, leading to relays malfunctioning or even dying in some cases.
We've started looking at this in depth in the last few days. It turns out that many relays (not all) are under a distributed denial of service (DDoS) attack that makes them use a lot of memory, ultimately causing the operating system to kill the process or the relay to become unreliable because of the resource pressure.
This has led some relays to restart, be shut down, or become so unstable that they fall in and out of the network. You can see the consequences of this ongoing attack on the Metrics portal:
https://metrics.torproject.org/relayflags.html?start=2017-09-20&end=2017...
Among other things, it is badly affecting relays with the HSDir flag because, once they restart, it takes 96 hours before they get the flag back. This affects the reachability of hidden services and thus the user experience of .onion sites.
We've been analyzing some relays being flooded to understand what is going on and how to fix it. The good news is that we are fairly confident that we know what is happening and we are currently testing some fixes to address the situation.
In the meantime, if your relay is under heavy memory pressure, that is, tor is taking so much RAM that your machine fails to operate properly, you can set the MaxMemInQueues option in your torrc file to a reasonable upper limit, which caps the amount of memory tor uses for its queues. For a fast relay, at least 2GB (if you can spare it) is usually a good value that lets tor operate properly without degrading performance too much.
With this, if the memory usage reaches that limit, tor's OOM (out-of-memory) handler will kick in and clean up what it can. It is still possible for your relay to go above the limit; this is one of the things we are currently investigating. However, memory usage should not grow indefinitely.
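As a rough illustration of the setting described above, the torrc entry could look like the following. The 2 GB figure is only the suggested starting point for a fast relay; the right value for your machine is an assumption you will need to adjust to the RAM you actually have.

```
# Upper limit on the memory tor may use for its queues before the OOM handler frees what it can.
# 2 GB is the suggested value for a fast relay; lower it on smaller machines.
MaxMemInQueues 2 GB
```

Since a restart costs a relay its HSDir flag for a while, it is worth trying to apply the change by reloading the configuration (for example by sending tor a SIGHUP) rather than restarting. Whether a reload is enough on your version is something to verify in your logs.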
Thanks, everyone, and we'll hopefully resolve the situation soon!
David