On Thu, Feb 23, 2023 at 07:27:56PM +0100, meskio wrote:
bridgestrap is recovering, now it is claiming that 85% of bridges are functional. I don't know what the source of the problem was, maybe we had some network issues in polyanthum?
There were overload issues around that time on the metal that runs various VMs like check.tpo and bridges.tpo.
So, it would seem that bridgestrap has some bugs where if the network goes away, or if the disk or CPU becomes too loaded and things stall, it marks a lot of bridges as down.
How to make things more robust? Hm. One answer might be running two bridgestraps in different places and ignoring one if it says a lot of bridges went down but the other doesn't agree.
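To make the two-instance idea concrete, here is a rough sketch in Python. Everything in it is hypothetical: the function names, the dict-of-booleans result format, and the 0.2 disagreement threshold are assumptions for illustration, not anything bridgestrap actually exposes.

```python
from typing import Dict


def functional_fraction(results: Dict[str, bool]) -> float:
    """Share of tested bridges that one instance reports as functional."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def reconcile(a: Dict[str, bool], b: Dict[str, bool],
              max_gap: float = 0.2) -> Dict[str, bool]:
    """Combine the results of two hypothetical bridgestrap instances.

    If the instances disagree sharply about how many bridges are up,
    assume the pessimistic one is overloaded or cut off and ignore it.
    The max_gap threshold is an arbitrary placeholder.
    """
    frac_a = functional_fraction(a)
    frac_b = functional_fraction(b)
    if abs(frac_a - frac_b) > max_gap:
        # Big disagreement: believe the instance that sees more bridges up.
        return dict(a) if frac_a > frac_b else dict(b)
    # Rough agreement: a bridge counts as functional if either instance
    # managed to reach it.
    merged = {}
    for bridge in a.keys() | b.keys():
        merged[bridge] = a.get(bridge, False) or b.get(bridge, False)
    return merged
```

Treating a bridge as functional when either instance reached it is one possible choice for the agreement case; requiring both to agree would be stricter.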
I was originally thinking to have a handful of bridges that we *know* are usually mostly up, like the built-in bridges, and if all of those are suddenly down, we stop believing bridgestrap's answer. But then we end up in the situation where all we know is that we don't know.
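For comparison, the known-good-bridges check could look something like the sketch below. Again this is only an illustration: the names and the result format are made up, and returning None is just one way to represent "we don't know".

```python
from typing import Dict, Iterable, Optional


def trusted_results(results: Dict[str, bool],
                    canaries: Iterable[str]) -> Optional[Dict[str, bool]]:
    """Return results only if at least one known-good bridge looks up.

    `canaries` is a hypothetical list of bridges we expect to be up
    most of the time, such as the built-in bridges. If every one of
    them is reported down, the whole run is treated as unreliable.
    """
    canary_list = list(canaries)
    if canary_list and not any(results.get(c, False) for c in canary_list):
        # All canaries down at once: more likely our problem than theirs.
        return None
    return results
```

A caller that gets None back would still have to decide what to do, for example keep the previous run's answers, which is exactly the limitation of this approach.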
I guess a third idea would be to ignore it since it doesn't happen *that* often (though it seems to happen more often in our current age of DDoS attacks).
Hopefully there are better ideas out there for how to best handle or tolerate or work around an overload on the underlying bridgestrap server. :)
--Roger