On 2021-06-30 8:43 p.m., David Fifield wrote:
While experimenting with another tunnel built on KCP and smux, I discovered that performance could be greatly increased by increasing the size of smux buffers. It's likely that doing the same can also improve performance in Snowflake.
There are two relevant parameters, MaxReceiveBuffer and MaxStreamBuffer. MaxStreamBuffer seems to be the most important one to increase.
https://pkg.go.dev/github.com/xtaci/smux#Config

    // MaxReceiveBuffer is used to control the maximum
    // number of data in the buffer pool
    MaxReceiveBuffer int

    // MaxStreamBuffer is used to control the maximum
    // number of data per stream
    MaxStreamBuffer int
The default values are 4 MB and 64 KB.
https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/...

    MaxReceiveBuffer: 4194304,
    MaxStreamBuffer:  65536,
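For orientation, overriding the two defaults looks roughly like this (a minimal sketch using the xtaci/smux API; the override values are placeholders, not recommendations):

    package main

    import (
        "fmt"

        "github.com/xtaci/smux"
    )

    func main() {
        // Start from the library defaults (MaxReceiveBuffer 4 MB,
        // MaxStreamBuffer 64 KB)...
        config := smux.DefaultConfig()
        fmt.Println(config.MaxReceiveBuffer, config.MaxStreamBuffer) // 4194304 65536

        // ...then override the two buffer sizes before handing the config
        // to smux.Client or smux.Server. These numbers are placeholders.
        config.MaxReceiveBuffer = 4 * 1024 * 1024
        config.MaxStreamBuffer = 1 * 1024 * 1024
        if err := smux.VerifyConfig(config); err != nil {
            panic(err)
        }
    }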
kcptun, a prominent KCP/smux tunnel, has defaults of 4 MB (--smuxbuf) and 2 MB (--streambuf) in both client and server: https://github.com/xtaci/kcptun/blob/9a5b31b4706aba4c67bcb6ebfe108fdb564a905...
In my experiment, I changed the values to 4 MB / 1 MB on the client and 16 MB / 1 MB on the server. This change increased download speed by about a factor of 3:

    default buffers    477.4 KB/s
    enlarged buffers  1388.3 KB/s

Values of MaxStreamBuffer higher than 1 MB didn't seem to have much of an effect. 256 KB did not help as much.
My guess, based on intuition, is that on the server we should set a large value of MaxReceiveBuffer, as it is a global limit shared among all clients, and a relatively smaller value of MaxStreamBuffer, because there are expected to be many simultaneous streams. On the client, don't set MaxReceiveBuffer too high, because it's on an end-user device, but go ahead and set MaxStreamBuffer high, because there's expected to be only one or two streams at a time.
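Sketched in code, that split might look like this (a sketch only; the helper names are made up and the numbers are just the values from the experiment above, not settled recommendations):

    package tuning

    import (
        "net"

        "github.com/xtaci/smux"
    )

    // newClientSession: only one or two streams are expected at a time,
    // so a large per-stream buffer is cheap. conn is the underlying KCP
    // connection.
    func newClientSession(conn net.Conn) (*smux.Session, error) {
        config := smux.DefaultConfig()
        config.MaxReceiveBuffer = 4 * 1024 * 1024 // 4 MB
        config.MaxStreamBuffer = 1 * 1024 * 1024  // 1 MB
        return smux.Client(conn, config)
    }

    // newServerSession: many clients share the server, so a larger
    // session buffer with the same per-stream cap.
    func newServerSession(conn net.Conn) (*smux.Session, error) {
        config := smux.DefaultConfig()
        config.MaxReceiveBuffer = 16 * 1024 * 1024 // 16 MB
        config.MaxStreamBuffer = 1 * 1024 * 1024   // 1 MB
        return smux.Server(conn, config)
    }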
I discovered this initially by temporarily setting smux to protocol v1 instead of v2. My understanding is that v1 lacks some kind of receive window mechanism that v2 has, and by default is more willing to expend memory receiving data. See "Per-stream sliding window to control congestion.(protocol version 2+)": https://pkg.go.dev/github.com/xtaci/smux#readme-features
Past performance ticket: "Reduce KCP bottlenecks for Snowflake" https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
This is a great find!
I dug into the code a little bit to see how these values are used, and here's a summary of what I found:
MaxReceiveBuffer limits the amount of data read into a buffer for each smux.Session.

Here's the relevant library code:
https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/...

Relevant Snowflake server code:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...

Relevant Snowflake client code:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
This value is not advertised to the other endpoint in any way and therefore does not directly affect the amount of in-flight traffic. It's used to limit the size of a session's buffer, which holds data read in from the underlying connection (in this case a KCP connection) while it waits for Read to be called on any of its streams.
I think there is a 1:1 relationship between smux.Sessions and KCP connections, making this also a per-client value and not a global limit. My intuition is that raising it will improve performance if we're running into CPU limits and can't Read data out of the smux.Streams quickly enough; in that case, in-flight data from the other endpoint waits too long to be read in by the smux.Session, which leads to dropped packets and retransmissions. So changing it at the client might indeed help, but increasing the processing power (CPU) of the server might also help address the same underlying issue. We recently doubled the number of CPU cores for the Snowflake server: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
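To make the 1:1 relationship concrete, here is an illustrative accept loop (not the actual Snowflake server code; the function name and structure are made up, assuming the kcp-go v5 and smux APIs). Each accepted KCP connection gets its own smux.Session, so MaxReceiveBuffer is budgeted per client rather than shared globally:

    package tuning

    import (
        kcp "github.com/xtaci/kcp-go/v5"
        "github.com/xtaci/smux"
    )

    // acceptLoop is illustrative only: each accepted KCP connection gets
    // its own smux.Session, and therefore its own MaxReceiveBuffer's
    // worth of session buffer.
    func acceptLoop(ln *kcp.Listener, config *smux.Config) error {
        for {
            conn, err := ln.AcceptKCP()
            if err != nil {
                return err
            }
            go func() {
                sess, err := smux.Server(conn, config)
                if err != nil {
                    return
                }
                defer sess.Close()
                for {
                    stream, err := sess.AcceptStream()
                    if err != nil {
                        return
                    }
                    // ... hand the stream off to a handler (elided) ...
                    _ = stream
                }
            }()
        }
    }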
MaxStreamBuffer on the other hand *does* directly limit the amount of in-flight traffic, because it is sent to the other endpoint in a window update: https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/... The other endpoint will not send additional data if its calculation of the amount of inflight data is greater than or equal to this value.
Since client-server connections have an unusually long round trip (the smux.Stream/Session data traverses the distance between the client and the proxy plus the distance between the proxy and the server), it makes sense that increasing the MaxStreamBuffer will improve performance.
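(Back-of-the-envelope, with a purely hypothetical round-trip time: a stream's steady-state throughput is bounded by roughly window / RTT. With the default 64 KB per-stream window and a 100 ms client-proxy-server round trip, that's about 64 KB / 0.1 s ≈ 640 KB/s; a 1 MB window over the same round trip raises the bound to about 10 MB/s. The longer the path, the larger the window needed to keep the link full.)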
It also occurs to me that the kcp.MaxWindowSize has to be at least as big as the MaxStreamBuffer size to notice any improvements, otherwise that will be the limiting factor on the amount of inflight data. Right now this is set to 64KB for both the client and the server.
I started doing a few quick performance tests by just modifying the client. This should be enough to check the impact of tuning MaxStreamBuffer and MaxReceiveBuffer on download speeds. But, because the KCP MaxWindowSize is both a send and a receive window, as expected I didn't see any difference in performance without first increasing this value at both the client and the server (my results for each of the test cases I ran were download speeds of 200-500 KB/s).
My proposal is to set the KCP MaxWindowSize to 4 MB and smux MaxStreamBuffer to 1 MB at both the client and the server and deploy these changes. Then, we can try tuning these values at the client side to test the impact of MaxStreamBuffer sizes up to 4 MB for download speeds.
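For concreteness, a rough sketch of that change in one place (assumptions: an already-established kcp.UDPSession, the kcp-go v5 and smux APIs, and a made-up helper name; if I'm reading kcp-go correctly, its SetWindowSize counts packets rather than bytes, so converting a 4 MB byte target into a window value is left to the caller):

    package tuning

    import (
        kcp "github.com/xtaci/kcp-go/v5"
        "github.com/xtaci/smux"
    )

    // applyProposedTuning is a sketch meant to be applied identically at
    // the client and the server. kcpWindow is passed straight through to
    // kcp-go's SetWindowSize, which (as far as I can tell) counts packets
    // rather than bytes, so the caller picks a value equivalent to
    // roughly 4 MB of in-flight data for the connection's MTU.
    func applyProposedTuning(conn *kcp.UDPSession, kcpWindow int) *smux.Config {
        conn.SetWindowSize(kcpWindow, kcpWindow)

        config := smux.DefaultConfig()
        config.MaxStreamBuffer = 1 * 1024 * 1024 // 1 MB, as proposed above
        return config
    }

The returned config would then be passed to smux.Client or smux.Server as in the sketches above.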
Cecylia