[anti-censorship-team] Spike in client polls from Snowflake broker metrics -- caused by outage of snowflake-02

David Fifield david at bamsoftware.com
Mon Sep 4 07:30:22 UTC 2023


On Wed, Aug 30, 2023 at 02:31:09PM -0400, Cecylia Bocovich wrote:
> On 2023-08-30 14:28, Cecylia Bocovich wrote:
> > I was alerted by trinity-1686a on irc that Snowflake standalone proxy
> > operators were reporting on #tor-relays about increased OOM errors from
> > increased load as of 2023-08-28.
> > 
> > After looking at the Snowfake broker metrics[0], there's a huge jump in
> > client polls (seen by summing the client-denied-count and
> > client-snowflake-match-count).
> > 
> > I've attached a graph of the collected prometheus metrics that shows
> > this spike happening at exactly 17:40 UTC on 2023-08-27. It looks like
> > way too sharp an increase to me to be a censorship event, perhaps it is
> > a DoS?
> > 
> > It's still too early to see the bridge metrics from the metrics page,
> > but we should start to see the effects there tomorrow.
> > 
> > [0] https://metrics.torproject.org/collector.html#snowflake-stats
> 
> I should add that it looks to have stopped around 12:25 UTC earlier today
> (2023-08-30).

I was having a look at the graphs and I have an explanation. The
snowflake-02 bridge was, for some reason, not processing any connections
during this time. The times in the logs match up exactly with the client
polls graph. It stopped working at 2023-08-27 17:45 and began working
again at 2023-08-30 12:30.

2023/08/27 17:45:33 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out
(766 more "connection timed out" lines)
2023/08/27 17:47:28 reading token: read tcp [scrubbed]->[scrubbed]: read: connection timed out
2023/08/28 11:03:26 in the past 86400 s, 105514/105990 connections had client_ip
2023/08/29 11:03:26 in the past 86400 s, 0/0 connections had client_ip
2023/08/30 11:03:26 in the past 86400 s, 0/0 connections had client_ip
2023/08/30 12:30:14 reading token: websocket: close 1006 (abnormal closure): unexpected EOF
(working again)

What would have happened is clients that randomly selected snowflake-02
as their bridge would have timed out and had to re-rendezvous until they
happened to randomly select snowflake-01. Meanwhile snowflake-01 likely
was overloaded because it alone is not able to handle all existing
clients.

I don't know what went wrong with snowflake-02. The server did not
reboot, and as far as I can tell all the processes kept running.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snowflake-userstats-bridge-transport-202308.png
Type: image/png
Size: 83337 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/anti-censorship-team/attachments/20230904/b945c0a2/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snowflake-top-countries-202308.png
Type: image/png
Size: 212533 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/anti-censorship-team/attachments/20230904/b945c0a2/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: snowflake-bandwidth-202308.png
Type: image/png
Size: 79822 bytes
Desc: not available
URL: <http://lists.torproject.org/pipermail/anti-censorship-team/attachments/20230904/b945c0a2/attachment-0005.png>


More information about the anti-censorship-team mailing list