[FIRING:1] ignoring bridges by functionality

Total firing alerts: 1
Total resolved alerts: 0

## Firing Alerts
-----
Time: 2023-11-11 23:12:29.934 +0000 UTC
Summary: Too many bridges are dysfuntional
Description: The fraction of functional bridges is too low for rdsys
-----

## Resolved Alerts

On Sat, Nov 11, 2023 at 11:13:00PM +0000, alertmanager@hetzner-nbg1-02.torproject.org wrote:
## Firing Alerts
-----
Time: 2023-11-11 23:12:29.934 +0000 UTC
Summary: Too many bridges are dysfuntional
Description: The fraction of functional bridges is too low for rdsys
I went to look at bridgestrap right after this alert, and bridgestrap seems to be doing fine.

So I am wondering how to debug it on the rdsys side -- to understand which bridges it is considering, and which ones it thinks are down and why -- but I don't know how to.

I added a comment to https://gitlab.torproject.org/tpo/anti-censorship/rdsys/-/issues/177 as a poor substitute. :)

--Roger

Quoting Roger Dingledine (2023-11-12 00:42:54)
On Sat, Nov 11, 2023 at 11:13:00PM +0000, alertmanager@hetzner-nbg1-02.torproject.org wrote:
## Firing Alerts
-----
Time: 2023-11-11 23:12:29.934 +0000 UTC
Summary: Too many bridges are dysfuntional
Description: The fraction of functional bridges is too low for rdsys
I went to look at bridgestrap right after this alert, and bridgestrap seems to be doing fine. So I am wondering how to debug it on the rdsys side -- to understand which bridges it is considering, and which ones it thinks are down and why -- but I don't know how to. I added a comment to https://gitlab.torproject.org/tpo/anti-censorship/rdsys/-/issues/177 as a poor substitute. :)
Yes, I think adding that information will be useful for debugging it. I'm planning to work on bridgestrap this week, and I hope to get around to doing it then.

This problem usually appears for a short period, ~30min, which is the interval between rdsys scans of the bridge descriptors. It happens when either rdsys or bridgestrap is restarted, but also sometimes in other situations that I haven't identified.

I propose modifying the alert so that it is only triggered if the problem lasts at least 1h. I think it is fine to ignore this problem if it only lasts 30min:
https://gitlab.torproject.org/tpo/tpa/prometheus-alerts/-/merge_requests/38

--
meskio | https://meskio.net/
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 My contact info: https://meskio.net/crypto.txt
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Nos vamos a Croatan.
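
To illustrate the proposal above, here is a minimal Python sketch of "only alert if the condition has held for at least 1h". This is the behaviour a Prometheus alerting rule gets from its `for` clause. It is not the content of the merge request, which is not reproduced in this thread. The function name, the 0.5 threshold, the 10-minute sampling interval, and the sample values are all assumptions made up for the example.

```python
from datetime import datetime, timedelta


def first_firing_time(samples, threshold, hold):
    """Return the timestamp at which the alert would fire, or None.

    samples: chronological list of (timestamp, fraction_of_functional_bridges).
    The condition is "fraction below threshold". The alert fires only once
    that condition has held without interruption for at least `hold`.
    """
    pending_since = None
    for ts, fraction in samples:
        if fraction < threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            if ts - pending_since >= hold:
                return ts  # held long enough: fire
        else:
            pending_since = None  # condition cleared: reset the timer
    return None


# Hypothetical data: one sample every 10 minutes, with a short dip
# like the ones described in the message above.
start = datetime(2023, 11, 11, 23, 0)
fractions = [0.9, 0.2, 0.2, 0.2, 0.9, 0.9]
samples = [(start + timedelta(minutes=10 * i), f) for i, f in enumerate(fractions)]

# With no hold time the dip fires immediately; with a 1h hold it is ignored.
print(first_firing_time(samples, 0.5, timedelta(0)))
print(first_firing_time(samples, 0.5, timedelta(hours=1)))
```

With a hold of one hour, a dip that clears within a single scan interval never leaves the pending state, while an outage that persists for a full hour would still be reported.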
participants (3)
- alertmanager@hetzner-nbg1-02.torproject.org
- meskio
- Roger Dingledine