Quoting Roger Dingledine (2023-11-12 00:42:54)
On Sat, Nov 11, 2023 at 11:13:00PM +0000, alertmanager@hetzner-nbg1-02.torproject.org wrote:
## Firing Alerts
Time: 2023-11-11 23:12:29.934 +0000 UTC Summary: Too many bridges are dysfuntional Description: The fraction of functional bridges is too low for rdsys
I went to look at bridgestrap right after this alert, and bridgestrap seems to be doing fine. So I am wondering how to debug it on the rdsys side -- to understand which bridges it is considering, and which ones it thinks are down and why -- but I don't know how to. I added a comment to https://gitlab.torproject.org/tpo/anti-censorship/rdsys/-/issues/177 as a poor substitute. :)
Yes, I think adding that information would be useful for debugging. I'm planning to work on bridgestrap this week, and I hope to get around to it then.
This problem usually shows up only for a short time, about 30 minutes, which is the interval between rdsys scans of the bridge descriptors. It happens when either rdsys or bridgestrap is restarted, but also sometimes in other situations that I haven't identified yet.
I propose modifying the alert so that it is only triggered if the problem persists for at least 1h. I think it is fine to ignore the problem if it only lasts 30 minutes: https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/merge_requests/38
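
To make the intent concrete, here is a rough sketch in Python of the "only fire after 1h" behaviour. It is not the actual Prometheus rule from the merge request; the function name, the sample format, and the threshold value are all made up for illustration.

```python
from datetime import datetime, timedelta


def should_fire(samples, threshold, hold=timedelta(hours=1)):
    """Return True if the fraction of functional bridges has been below
    `threshold` continuously for at least `hold`, up to the latest sample.

    `samples` is a list of (timestamp, fraction) pairs sorted by time.
    All names here are hypothetical, not taken from rdsys or the alert rules.
    """
    below_since = None
    for ts, fraction in samples:
        if fraction < threshold:
            # Remember when the current low streak started.
            if below_since is None:
                below_since = ts
        else:
            # Any healthy sample resets the streak.
            below_since = None
    if below_since is None:
        return False
    return samples[-1][0] - below_since >= hold


start = datetime(2023, 11, 11, 23, 0)
# Low for only 30 minutes, e.g. after a restart (threshold 0.5 is an assumed value).
blip = [(start, 0.9),
        (start + timedelta(minutes=15), 0.2),
        (start + timedelta(minutes=45), 0.2)]
# Low for a full hour.
outage = [(start, 0.2),
          (start + timedelta(minutes=30), 0.2),
          (start + timedelta(minutes=60), 0.2)]
```

With these inputs, `should_fire(blip, 0.5)` is false and `should_fire(outage, 0.5)` is true, so a single bad scan interval would no longer page anyone while a longer outage still would.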