[anti-censorship-team] Re: Snowflake bridges using SQS

26 Mar 2025

      On 2025-02-12 12:26, Cecylia Bocovich wrote:
...
On 2025-02-11 09:26, Michael Rogers via anti-censorship-team wrote:
...
On 11/02/2025 12:47, Michael Rogers via anti-censorship-team wrote:
...
On 07/02/2025 21:22, Cecylia Bocovich via anti-censorship-team wrote:
...
On 2025-02-07 12:22, Michael Rogers via anti-censorship-team wrote:
...
Hi all,
After updating Briar's bridge config to use the current settings 
from Moat, we're seeing two Snowflake bridges consistently failing 
in our CI tests. They're the two bridges that use SQS. Here's a 
snippet from the log:
```
    INFO: NOTICE Managed proxy "/builds/briar/onionwrapper/onionwrapper-java/test.tmp/35/lyrebird": offer created
    Feb 04, 2025 1:12:34 PM org.briarproject.onionwrapper.AbstractTorWrapper message
    INFO: NOTICE Managed proxy "/builds/briar/onionwrapper/onionwrapper-java/test.tmp/35/lyrebird": broker failure operation error SQS: GetQueueUrl, https response error StatusCode: 400, RequestID: 60e91cfa-a2a0-55db-beb0-7ce6b621d324, AWS.SimpleQueueService.NonExistentQueue: The specified queue does not exist.
```
Does the queue really not exist, or does this point to some other 
issue, like the bridges being geoIP restricted or the app needing 
to pass some extra information to the transport?
[snip]
Well, I should've waited before sending that message, because starting 
at 12:59 UTC the attempts to bootstrap via SQS bridges succeeded, with 
only one queue-related error being printed per bootstrapping attempt.
Did anything change in the bridge config around that time, or could 
the queue errors be load-dependent?
Cheers,
Michael
Nothing changed as far as I'm aware. It seems likely to me that there is 
some external factor (like load) that is causing a lot of variation in 
how long it takes to create these queues.
I just remembered this open issue:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/40363
I'm still not sure of the cause but it seems to be more likely to happen 
if two bridges are configured at the same time.
It turns out there were two main bugs with the SQS queue implementation 
that explain the queue creation errors you were seeing. These have both 
been fixed.

The first bug was a bottleneck that was preventing us from receiving 
messages from the broker queue quickly enough[0]. This explained why the 
failure rate was sometimes higher at times when there was more load on 
the system.
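
As a rough illustration of this kind of bottleneck (a generic sketch, not the actual Snowflake patch), one common way to keep a receive loop from falling behind is to hand messages to a small pool of workers instead of handling them one at a time. The names `handleConcurrently` and `countHandled`, the in-memory channel standing in for the broker queue, and the worker count are all hypothetical:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// handleConcurrently drains msgs with a fixed number of workers, so that a
// slow handler only occupies one worker instead of stalling all receiving.
// The channel is a hypothetical stand-in for the broker's queue.
func handleConcurrently(msgs <-chan string, workers int, handle func(string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for m := range msgs {
				handle(m)
			}
		}()
	}
	wg.Wait() // returns once msgs is closed and every message is handled
}

// countHandled queues n fake offers and reports how many were handled.
func countHandled(n, workers int) int64 {
	msgs := make(chan string, n)
	for i := 0; i < n; i++ {
		msgs <- fmt.Sprintf("offer-%d", i)
	}
	close(msgs)

	var handled int64
	handleConcurrently(msgs, workers, func(string) {
		atomic.AddInt64(&handled, 1)
	})
	return handled
}

func main() {
	fmt.Println(countHandled(100, 8))
}
```

With a single worker this degenerates to the serial case, which is where load-dependent delays would be expected to show up.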

The second bug was a pointer reuse error that caused multiple 
simultaneous polls to be overwritten[1]. This was also likely to occur 
at times of increased load but could also be triggered by having more 
than one SQS bridge line.
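
For readers unfamiliar with this class of bug, here is a minimal sketch of how reusing one pointer for several pending items lets later writes clobber earlier ones. It is only an illustration of the general pattern, not the Snowflake code; the `poll` type and both `collect` functions are hypothetical:

```go
package main

import "fmt"

// poll is a hypothetical stand-in for one client's pending poll.
type poll struct {
	clientID string
}

// collectBuggy stores the address of a single reused variable, so every
// entry in the result points at the same poll and sees the last write.
func collectBuggy(ids []string) []*poll {
	var out []*poll
	var p poll // one variable shared by every iteration
	for _, id := range ids {
		p.clientID = id
		out = append(out, &p)
	}
	return out
}

// collectFixed allocates a fresh poll per iteration, so entries are independent.
func collectFixed(ids []string) []*poll {
	var out []*poll
	for _, id := range ids {
		out = append(out, &poll{clientID: id})
	}
	return out
}

// clientIDs extracts the IDs so the two results can be compared.
func clientIDs(ps []*poll) []string {
	ids := make([]string, 0, len(ps))
	for _, p := range ps {
		ids = append(ids, p.clientID)
	}
	return ids
}

func main() {
	in := []string{"a", "b", "c"}
	fmt.Println("buggy:", clientIDs(collectBuggy(in))) // buggy: [c c c]
	fmt.Println("fixed:", clientIDs(collectFixed(in))) // fixed: [a b c]
}
```

In the buggy version a single pending item hides the problem, which is consistent with the error only appearing when several polls are in flight at once.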

We haven't had any issue with our budget limits lately. With these 
fixes, SQS is now perhaps the most reliable rendezvous channel.

[0] 
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...

[1] 
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...