On Wed, Aug 30, 2023 at 02:31:09PM -0400, Cecylia Bocovich wrote:
On 2023-08-30 14:28, Cecylia Bocovich wrote:
I was alerted by trinity-1686a on irc that Snowflake standalone proxy operators were reporting on #tor-relays about increased OOM errors from increased load as of 2023-08-28.
After looking at the Snowfake broker metrics[0], there's a huge jump in client polls (seen by summing the client-denied-count and client-snowflake-match-count).
I've attached a graph of the collected prometheus metrics that shows this spike happening at exactly 17:40 UTC on 2023-08-27. It looks like way too sharp an increase to me to be a censorship event, perhaps it is a DoS?
It's still too early to see the bridge metrics from the metrics page, but we should start to see the effects there tomorrow.
[0] https://metrics.torproject.org/collector.html#snowflake-stats
I should add that it looks to have stopped around 12:25 UTC earlier today (2023-08-30).
I was having a look at the graphs and I have an explanation. The snowflake-02 bridge was, for some reason, not processing any connections during this time. The times in the logs match up exactly with the client polls graph. It stopped working at 2023-08-27 17:45 and began working again at 2023-08-30 12:30.
2023/08/27 17:45:33 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out (766 more "connection timed out" lines) 2023/08/27 17:47:28 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out 2023/08/28 11:03:26 in the past 86400 s, 105514/105990 connections had client_ip 2023/08/29 11:03:26 in the past 86400 s, 0/0 connections had client_ip 2023/08/30 11:03:26 in the past 86400 s, 0/0 connections had client_ip 2023/08/30 12:30:14 reading token: websocket: close 1006 (abnormal closure): unexpected EOF (working again)
What would have happened is clients that randomly selected snowflake-02 as their bridge would have timed out and had to re-rendezvous until they happened to randomly select snowflake-01. Meanwhile snowflake-01 likely was overloaded because it alone is not able to handle all existing clients.
I don't know what went wrong with snowflake-02. The server did not reboot, and as far as I can tell all the processes kept running.