On Mon, Sep 26, 2022 at 10:39:42AM +0200, Linus Nordberg via anti-censorship-team wrote:
> It seems likely that we're hitting a limit of some sort and next thing is to figure out if it's a soft limit that we can influence through system configuration or if it's a hardware resource limit.
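One easy class of soft limit to rule out first is per-process file descriptor limits on tor, haproxy, and snowflake-server. Something along these lines would compare each daemon's open-files limit with its current usage (a sketch; the process-name patterns are assumptions and it needs root):

    # compare each daemon's open-files limit with its current fd usage
    for pid in $(pgrep -f 'snowflake-server|haproxy|/usr/bin/tor'); do
        echo "== pid $pid: $(tr '\0' ' ' < /proc/$pid/cmdline)"
        grep 'Max open files' /proc/$pid/limits
        echo "open fds: $(ls /proc/$pid/fd | wc -l)"
    done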
tor has a default bandwidth limit, but we should be nowhere close to it, especially distributed across 12 instances:
BandwidthRate N bytes|KBytes|MBytes|GBytes|TBytes|KBits|MBits|GBits|TBits
    A token bucket limits the average incoming bandwidth usage on this node
    to the specified number of bytes per second, and the average outgoing
    bandwidth usage to that same value. If you want to run a relay in the
    public network, this needs to be at the very least 75 KBytes for a relay
    (that is, 600 kbits) or 50 KBytes for a bridge (400 kbits) -- but of
    course, more is better; we recommend at least 250 KBytes (2 mbits) if
    possible. (Default: 1 GByte)
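Still, it costs nothing to confirm that none of the 12 instances sets an explicit limit in its torrc (a sketch; the path assumes the standard tor-instance-create layout and may need adjusting):

    # any explicit rate limiting configured per instance?
    grep -Ein 'bandwidthrate|relaybandwidthrate|maxadvertisedbandwidth' /etc/tor/instances/*/torrc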
I do not see any rate limit enabled in /etc/haproxy/haproxy.cfg.
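The directives that would impose such a limit are things like maxconn and "rate-limit sessions"; and even without an explicit maxconn in the config, haproxy computes an effective one at startup from its file descriptor limit. Two quick checks (a sketch; the stats socket path is a guess and such a socket may not even be configured here):

    # any explicit connection or session-rate limits in the config?
    grep -REin 'maxconn|rate-limit' /etc/haproxy/
    # effective global maxconn haproxy computed at startup
    echo "show info" | socat stdio UNIX-CONNECT:/var/run/haproxy/admin.sock | grep -i maxconn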
I checked the number of sockets connected to the haproxy frontend port, thinking that we may be running out of localhost 4-tuples. It's still in bounds (but we may have to figure something out for that eventually).
# ss -n | grep -c '127.0.0.1:10000\s*$'
27314
# sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 15000 64000
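If we do start to run low on 4-tuples, a couple of knobs could buy headroom (a sketch of options, not something we need yet):

    # widen the ephemeral port range (~49k -> ~64k usable source ports;
    # make sure no listening service uses a port in the widened range)
    sysctl -w net.ipv4.ip_local_port_range="1024 65000"
    # let outgoing connections reuse sockets lingering in TIME_WAIT
    sysctl -w net.ipv4.tcp_tw_reuse=1

Another option would be having haproxy listen on several loopback addresses (127.0.0.2, 127.0.0.3, ...) so that each destination address gets its own pool of source ports.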
According to https://stackoverflow.com/a/3923785, some other parameters that may be important are:

# sysctl net.ipv4.tcp_fin_timeout
net.ipv4.tcp_fin_timeout = 60
# cat /proc/sys/net/netfilter/nf_conntrack_max
262144
# sysctl net.core.netdev_max_backlog
net.core.netdev_max_backlog = 1000

Ethernet txqueuelen (1000)
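Of these, nf_conntrack_max is the one that fails hard: when the table fills up, the kernel drops new connections. Comparing the current entry count against the max should rule that out (assuming the conntrack module is loaded):

    # current conntrack entries vs. the configured maximum
    cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
    # the kernel logs this when the table overflows
    dmesg | grep -i 'table full, dropping packet'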
net.core.netdev_max_backlog is the "maximum number of packets, queued on the INPUT side, when the interface receives packets faster than kernel can process them." https://www.kernel.org/doc/html/latest/admin-guide/sysctl/net.html#netdev-ma... But if we were having trouble with backlog buffer sizes, I would expect to see lots of dropped packets, and I don't:
# ethtool -S eno1 | grep dropped
     rx_dropped: 0
     tx_dropped: 0
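ethtool -S mostly reflects drops at the NIC itself; drops caused by net.core.netdev_max_backlog specifically show up in the second column of /proc/net/softnet_stat (hex, one row per CPU), so a complementary check is:

    # second hex column = packets dropped because a CPU's input backlog was full
    awk '{ printf "cpu%d dropped=0x%s\n", NR-1, $2 }' /proc/net/softnet_stat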
It may be something inside snowflake-server, for example some central scheduling algorithm that cannot run any faster. (Though if that were the case, I'd expect to see one CPU core at 100%, which I do not.) I suggest doing another round of profiling now that we have taken care of the more obvious hotspots in https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowf...
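A sketch of what that could look like, assuming snowflake-server exposes (or we temporarily add) a net/http/pprof listener on localhost:6060 (the address and port are assumptions):

    # 30-second CPU profile, summarized by the functions using the most time
    go tool pprof -top 'http://localhost:6060/debug/pprof/profile?seconds=30'
    # full goroutine stack dump, to spot everything serialized behind one goroutine or lock
    curl -s 'http://localhost:6060/debug/pprof/goroutine?debug=2' | less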