While experimenting with another tunnel built on KCP and smux, I discovered that performance could be greatly increased by increasing the size of smux buffers. It's likely that doing the same can also improve performance in Snowflake.
There are two relevant parameters, MaxReceiveBuffer and MaxStreamBuffer. MaxStreamBuffer seems to be the most important one to increase.
https://pkg.go.dev/github.com/xtaci/smux#Config

	// MaxReceiveBuffer is used to control the maximum
	// number of data in the buffer pool
	MaxReceiveBuffer int

	// MaxStreamBuffer is used to control the maximum
	// number of data per stream
	MaxStreamBuffer int

The default values are 4 MB and 64 KB.
https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/...

	MaxReceiveBuffer: 4194304,
	MaxStreamBuffer:  65536,
kcptun, a prominent KCP/smux tunnel, has defaults of 4 MB (--smuxbuf) and 2 MB (--streambuf) in both client and server: https://github.com/xtaci/kcptun/blob/9a5b31b4706aba4c67bcb6ebfe108fdb564a905...
In my experiment, I changed the values to 4 MB / 1 MB on the client and 16 MB / 1 MB on the server. This change increased download speed by about a factor of 3:

	default buffers   477.4 KB/s
	enlarged buffers  1388.3 KB/s

Values of MaxStreamBuffer higher than 1 MB didn't seem to have much of an effect. 256 KB did not help as much.
My guess, based on intuition, is that on the server we should set a large value of MaxReceiveBuffer, as it is a global limit shared among all clients, and a relatively smaller value of MaxStreamBuffer, because there are expected to be many simultaneous streams. On the client, don't set MaxReceiveBuffer too high, because it's on an end-user device, but go ahead and set MaxStreamBuffer high, because there's expected to be only one or two streams at a time.
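For concreteness, here is a minimal sketch of how these values could be plugged into a smux config, starting from smux.DefaultConfig. The helper names are mine, the numbers are only the ones from the experiment above (not tested recommendations), and the surrounding Snowflake/KCP plumbing is omitted.

package tuning

import "github.com/xtaci/smux"

// clientSmuxConfig is for the client side: a moderate MaxReceiveBuffer
// (it runs on an end-user device) and a large MaxStreamBuffer (only one
// or two streams are expected at a time).
func clientSmuxConfig() *smux.Config {
	conf := smux.DefaultConfig()
	conf.Version = 2                        // per-stream windows need protocol v2
	conf.MaxReceiveBuffer = 4 * 1024 * 1024 // 4 MB
	conf.MaxStreamBuffer = 1 * 1024 * 1024  // 1 MB
	return conf
}

// serverSmuxConfig is for the bridge side: a larger MaxReceiveBuffer and
// the same per-stream buffer, since many streams are expected at once.
func serverSmuxConfig() *smux.Config {
	conf := smux.DefaultConfig()
	conf.Version = 2
	conf.MaxReceiveBuffer = 16 * 1024 * 1024 // 16 MB
	conf.MaxStreamBuffer = 1 * 1024 * 1024   // 1 MB
	return conf
}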
I discovered this initially by temporarily setting smux to protocol v1 instead of v2. My understanding is that v1 lacks a receive window mechanism that v2 has, and by default is more willing to expend memory receiving data. See "Per-stream sliding window to control congestion.(protocol version 2+)": https://pkg.go.dev/github.com/xtaci/smux#readme-features
Past performance ticket: "Reduce KCP bottlenecks for Snowflake" https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
On 2021-06-30 8:43 p.m., David Fifield wrote:
While experimenting with another tunnel built on KCP and smux, I discovered that performance could be greatly increased by increasing the size of smux buffers. It's likely that doing the same can also improve performance in Snowflake.
This is a great find!
I dug into the code a little bit to see how these values are used, and here's a summary of what I found:
MaxReceiveBuffer limits the amount of data read into a buffer for each smux.Session. Here's the relevant library code:
https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/...
Relevant Snowflake server code:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
Relevant Snowflake client code:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
This value is not advertised to the other endpoint in any way and therefore does not directly affect the amount of in-flight traffic. It limits the size of a session-level buffer that holds data read in from the underlying connection (in this case a KCP connection) while that data waits for Read to be called on one of the session's streams.
I think there is a 1:1 relationship between smux.Sessions and KCP connections, making this a per-client value rather than a global limit. My intuition is that changing it will improve performance only if we're running into CPU limits and can't Read data out of the smux.Streams quickly enough, so that in-flight data from the other endpoint waits too long to be read in by the smux.Session, leading to dropped packets and retransmissions. So changing it at the client might indeed help, but increasing the processing power (CPU) of the server might also address the same underlying issue. We recently doubled the number of CPU cores for the Snowflake server:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
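To make the per-session scope concrete, here is a rough sketch of the usual KCP/smux server accept-loop shape (not the actual Snowflake code; names are illustrative). Each accepted KCP connection gets its own smux.Session, and with it its own MaxReceiveBuffer-sized pool:

package sketch

import (
	"log"

	"github.com/xtaci/kcp-go/v5"
	"github.com/xtaci/smux"
)

// acceptLoop wraps every incoming KCP connection in its own smux.Session,
// so MaxReceiveBuffer is a per-client (per-session) limit, not a global one.
func acceptLoop(ln *kcp.Listener, conf *smux.Config) {
	for {
		conn, err := ln.AcceptKCP() // one *kcp.UDPSession per client
		if err != nil {
			log.Print(err)
			return
		}
		go func() {
			sess, err := smux.Server(conn, conf) // one smux.Session per KCP connection
			if err != nil {
				log.Print(err)
				return
			}
			defer sess.Close()
			for {
				stream, err := sess.AcceptStream() // many streams per session
				if err != nil {
					return
				}
				go handleStream(stream)
			}
		}()
	}
}

func handleStream(stream *smux.Stream) {
	// ... proxy data between the stream and its destination ...
	stream.Close()
}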
MaxStreamBuffer on the other hand *does* directly limit the amount of in-flight traffic, because it is sent to the other endpoint in a window update: https://github.com/xtaci/smux/blob/eba6ee1d2a14eb7f86f43f7b7cb3e44234e13c66/... The other endpoint will not send additional data if its calculation of the amount of inflight data is greater than or equal to this value.
Since client-server connections are abnormally long (the smux.Stream/Session data is traversing the distance between the client and the proxy + the distance between the proxy and the server), it makes sense that increasing the MaxStreamBuffer will improve performance.
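For a rough sense of scale (the ~300 ms round-trip time here is only an assumed illustration; the real value depends on the proxy), the advertised window bounds throughput at window / RTT:

	 64 KB window:   65536 B / 0.3 s ≈ 213 KB/s
	  1 MB window: 1048576 B / 0.3 s ≈ 3.3 MB/s

so on a long client -> proxy -> bridge path, a 64 KB per-stream window quickly becomes the bottleneck.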
It also occurs to me that the kcp.MaxWindowSize has to be at least as big as the MaxStreamBuffer size to notice any improvements, otherwise that will be the limiting factor on the amount of inflight data. Right now this is set to 64KB for both the client and the server.
I started doing a few quick performance tests by just modifying the client. This should be enough to check the impact of tuning MaxStreamBuffer and MaxReceiveBuffer on download speeds. But because the KCP MaxWindowSize is both a send and a receive window, as expected I didn't see any difference in performance without first increasing this value at both the client and the server (my results for each of the test cases I ran were download speeds of 200-500 KB/s).
My proposal is to set the KCP MaxWindowSize to 4MB and smux MaxStreamBuffer to 1MB at both the client and the server and deploy these changes. Then, we can try tuning these values at the client side to test the impact of MaxStreamBuffer sizes up to 4MB for download speeds.
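Roughly, the change I'm proposing would look something like this (a simplified sketch, not the actual patch; note that kcp-go's SetWindowSize counts packets rather than bytes, so the 4 MB target has to be divided by the MTU):

package sketch

import (
	"github.com/xtaci/kcp-go/v5"
	"github.com/xtaci/smux"
)

const (
	mtu             = 1400            // kcp-go's default MTU, in bytes
	maxWindowBytes  = 4 * 1024 * 1024 // proposed 4 MB KCP window
	maxStreamBuffer = 1 * 1024 * 1024 // proposed 1 MB smux per-stream buffer
)

// tune applies the proposed window and buffer sizes to an existing KCP
// connection and returns the smux config to wrap it with.
func tune(conn *kcp.UDPSession) *smux.Config {
	segs := maxWindowBytes / mtu   // ≈ 2995 packets
	conn.SetWindowSize(segs, segs) // KCP send and receive windows

	conf := smux.DefaultConfig()
	conf.Version = 2
	conf.MaxStreamBuffer = maxStreamBuffer
	return conf
}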
Cecylia
On Wed, Jul 14, 2021 at 03:20:31PM -0400, Cecylia Bocovich wrote:
I started doing a few quick performance tests by just modifying the client. This should be enough to check the impact of tuning MaxStreamBuffer and MaxReceiveBuffer on download speeds.
Last weekend Jacobo and I were playing around a bit with tweaking the buffer sizes on the client. I kind of observed some patterns, but after a few runs I suspected that I was catching transient network variation more than any buffer effects.
https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/Rplots.pdf https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/maxstreambu...
I also wanted to measure the number of snowflakes needed to complete a single download, to try to isolate the high variability across different runs. I might be wrong, but in a quick exploration the time-to-bootstrap seemed to be significant as a predictor of the download rate, together with the buffer size.
Before continuing with these tests: is it advisable to run this kind of test against the live infrastructure, or would it be better for me to set up snowbox in order to parametrize any changes (ideally, simulating real-world latencies)?
On 2021-07-14 4:01 p.m., Kali Kaneko wrote:
On Wed, Jul 14, 2021 at 03:20:31PM -0400, Cecylia Bocovich wrote:
I started doing a few quick performance tests by just modifying the client. This should be enough to check the impact of tuning MaxStreamBuffer and MaxReceiveBuffer on download speeds.
Last weekend Jacobo and I were playing around a bit with tweaking the buffer sizes on the client. I kind of observed some patterns, but after a few runs I suspected that I was catching transient network variation more than any buffer effects.
https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/Rplots.pdf https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/maxstreambu...
Hi!
This is interesting, thanks for sharing! Yes, I suspect most of the variations you're seeing in these tests are due to network effects, and the different throughput and latency of the snowflake proxy you happen to be connecting to for that run. This matches the results of my own tests.
I think the lack of a clear pattern in the results is due to another bottleneck in KCP that will limit any performance improvements that we might get in the smux configuration. Once this patch is merged, and both the client and server's KCP window sizes are increased, we might have more luck: https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
I also wanted to measure the number of snowflakes needed to complete a single download, to try to isolate the high variability across different runs. I might be wrong, but in a quick exploration the time-to-bootstrap seemed to be significant as a predictor of the download rate, together with the buffer size.
Can you elaborate a bit on what you mean by the number of snowflakes needed?
The time to bootstrap a Tor connection to 100% will indeed be a good predictor of the download speed, because all of the network messages needed to bootstrap a full Tor circuit are sent over the same Snowflake connection that you are using to download files or browse the web. So, if you received a slow/low bandwidth/high latency snowflake proxy, both the bootstrap and the download will be slow.
Before continuing with these tests: is it advisable to run this kind of test against the live infrastructure, or would it be better for me to set up snowbox in order to parametrize any changes (ideally, simulating real-world latencies)?
Thanks for asking! In general it's okay to run client-side tests on the real-world Snowflake network as long as you're not overwhelming it. At this point we don't have a lot of clients connecting to Snowflake, and both the bridge and the broker are handling the current load well, so you're probably fine. It's a good idea to space out your tests anyway, for scientific validity. If, for example, you want to make 100 snowflake connections, you could make 10 connections per hour or 25 connections every 6 hours, rather than making them all at once, back to back or in parallel.
If you're unsure about the impact of your tests on the system feel free to reach out here or on IRC as well :) it's awesome that you're doing research on the snowflake network and we'd like to help if we can!
Cecylia
On Wed, Jul 14, 2021 at 10:01:28PM +0200, Kali Kaneko wrote:
On Wed, Jul 14, 2021 at 03:20:31PM -0400, Cecylia Bocovich wrote:
I started doing a few quick performance tests by just modifying the client. This should be enough to check the impact of tuning MaxStreamBuffer and MaxReceiveBuffer on download speeds.
Last weekend Jacobo and I were playing around a bit with tweaking the buffer sizes on the client. I kind of observed some patterns, but after a few runs I suspected that I was catching transient network variation more than any buffer effects.
https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/Rplots.pdf https://0xacab.org/kali/snowflake-metrics/-/blob/no-masters/data/maxstreambu...
Thanks for testing this. Indeed, the download speed looks uncorrelated with MaxStreamBuffer.
One thing you might also try is increasing the turbotunnel queueSize (try a value of ≈1024):
https://gitweb.torproject.org/pluggable-transports/snowflake.git/tree/common...
for the reasons explained here:
https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
At high transfer rates, packets may be getting dropped internally, which requires retransmission and slows everything down. Increasing queueSize is probably more important on the bridge side, but try it on the client side as well.
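The kind of drop I mean comes from the usual non-blocking send on a fixed-size channel; roughly this pattern (an illustrative sketch, not the actual turbotunnel code, and the size is only an example):

package sketch

// queueSize bounds how many packets can sit between the network-facing
// goroutine and KCP. The value here is purely illustrative.
const queueSize = 1024

// newQueue creates the bounded packet queue.
func newQueue() chan []byte {
	return make(chan []byte, queueSize)
}

// enqueue hands a packet to the consumer without blocking. If the
// buffered channel is full (the consumer is falling behind), the packet
// is silently dropped and has to be retransmitted later, which is why a
// larger queueSize can help at high transfer rates.
func enqueue(queue chan []byte, p []byte) {
	select {
	case queue <- p:
		// queued
	default:
		// queue full: drop
	}
}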
On Wed, Jun 30, 2021 at 06:43:03PM -0600, David Fifield wrote:
While experimenting with another tunnel built on KCP and smux, I discovered that performance could be greatly increased by increasing the size of smux buffers. It's likely that doing the same can also improve performance in Snowflake.
I spent a good deal of time trying to see if adjusting the size of smux buffers could also improve the performance of dnstt. The full information is here:
https://www.bamsoftware.com/software/dnstt/performance.html#download-2021080...
The results were mixed. I initially saw promising results when connecting directly to the dnstt server using plaintext UDP DNS, without a recursive resolver, but the configurations with a recursive resolver proved difficult to optimize.
These are the parameters I settled on:
 * smuxConfig.MaxStreamBuffer = 1 * 1024 * 1024 (both client and server)
 * KCP conn.SetWindowSize(64, 64)
 * turbotunnel.queueSize = 128
You can see numbers from my brief parameter search in the commit message:
https://repo.or.cz/dnstt.git/patch/de15c5a51291cae19dfad26149f00b2b836edfb3