[anti-censorship-team] snowflake-01 bridge performance tuning and optimization
david at bamsoftware.com
Tue Oct 4 07:17:48 UTC 2022
Linus and I have been putting in a lot of time on the Snowflake bridge
over the past week or so to improve performance. The increase in users
after the blocking of Tor in Russia last December (the one that led us
to the multi-tor architecture) was large, but this recent increase is
many times larger. We've cleared out the major bottlenecks and since two
days ago the bridge is finally meeting the additional demand. But
it's close: during the busiest times of day the CPU and RAM resources
are nearly 100% consumed, and the need for horizontal scaling still
There have been a lot of changes, both major and minor. You can find
The most important optimizations have been:
- Reduce allocations in WebSocket reads and writes.
- Use more than one KCP state machine.
- Conserve ephemeral ports in localhost communication.
- Disable nftables connection tracking.
I attached a graph of bandwidth on the bridge. You can see that between
24 Sep and 02 Oct the daily peaks were unnaturally flattened. The daily
shutdowns in Iran caused a paradoxical *increase* in bandwidth, probably
because it relieved congestion in the system for the fewer remaining
users. In the past two days the shape of the graph looks more natural,
and a shutdown decreases bandwidth as you would expect. At its peak,
bandwidth is above 4 Gbps. The daily lows are higher than the highest
highs of two weeks ago.
## Reduce allocations in WebSocket reads and writes
The code for reading and writing encapsulated packets from WebSocket
connections was doing unnecessary memory allocations. Some implicit,
like a 32 KB buffer being created by io.Copy, and some explicit, like
the intentional packet copies being made in the QueuePacketConn type.
Reducing allocations makes the garbage collector run less often, and
there's also a small benefit from reduced buffer copies.
## Use more than one KCP state machine
Though most parts of snowflake-server are multi-threaded and can scale
across many CPUs, the central KCP packet scheduler was limited to one
CPU. Because we have a session identity (the client ID) separate from
any KCP-specific identity, it's not hard to partition client packets
across separate KCP instances by a hash of the client ID. Expanding the
number of KCPs from 1 to 2 was enough to relieve this bottleneck.
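The partitioning idea can be sketched as follows. This is only an illustration under assumptions: the clientID type, the shardFor function, and the choice of FNV-1a as the hash are hypothetical, and the real server may use a different hash and a different ID representation. The point is that the same client ID always maps to the same KCP instance, so each instance sees a complete, consistent set of sessions.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// clientID stands in for the session identifier that is independent of
// KCP (hypothetical type; 8 bytes chosen for illustration).
type clientID [8]byte

// shardFor maps a client ID to one of numShards KCP instances. The mapping
// is deterministic, so all packets of a session reach the same instance.
func shardFor(id clientID, numShards int) int {
	h := fnv.New32a()
	h.Write(id[:]) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % uint32(numShards))
}

func main() {
	const numKCP = 2 // the post reports that going from 1 to 2 was enough
	ids := []clientID{
		{0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08},
		{0xde, 0xad, 0xbe, 0xef, 0x00, 0x11, 0x22, 0x33},
		{0xff, 0xee, 0xdd, 0xcc, 0xbb, 0xaa, 0x99, 0x88},
	}
	for _, id := range ids {
		fmt.Printf("client %x -> KCP instance %d\n", id[:], shardFor(id, numKCP))
	}
}
```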
## Conserve ephemeral ports in localhost communication
The pluggable transports model relies heavily on localhost TCP sockets.
The number of users had increased enough that it was sometimes
exhausting the range of port numbers usable for distinct localhost
4-tuples. The kernel's errno for this situation is EADDRNOTAVAIL when
you try to connect; it manifests variously in different programs as
"cannot assign requested address" or "no free ports," generally leading
to a terminated connection. We mitigated this problem by having
different programs use different localhost source IP addresses (e.g.
127.0.1.0/24, 127.0.2.0/24, etc.) to expand the space of distinct
4-tuples. In a couple of cases this was done in a hacky way that will
need to be revisited, by hardcoding a source address range in the source
## Disable nftables connection tracking
We took care of it before it became a problem, but we found it necessary
to disable connection tracking in the firewall. The number of tracked
connections was getting close to the limit, and past the limit, packets
just get dropped.