[tor-relays] Guard flag flapping

Sat Aug 8 00:05:33 UTC 2015

First, I am assuming you are running bare-metal on
a system and not in a virtualized server--everything
below is premised on that.  Do not expect a virtual
server or Linux container to perform well as a high-
capacity Tor relay.  It's possible to configure a
high-performance VM, but this is an esoteric art
and one is better off renting a small dedicated
physical server than going that route.

Your story of a relay setup that should measure
fast by all apparent metrics but is given terrible
rankings by BWauths is common this year.

The BWauth scripts are known to be buggy, though
they have supposedly been improved very recently.
'longclaw' just came back online with the "latest"
code, but after starting out with a failure to
measure 2000 relays two days ago, it's still
running 1000 shy of the full population:

https://consensus-health.torproject.org/#bwauthstatus

Scroll down a little and you will see 'longclaw'
is unique in voting 976 relays not-guard and 1709
relays not-fast.  That seems a more serious issue
than cold start glitching IMO, and is not
impressive if that is what it really is.

A fifth BWauth is said to be arriving soon, and
reportedly it will help.

Your relays currently are measured thusly:

greendream848
longclaw-w Bandwidth=1694 Measured=986
gabelmoo-w Bandwidth=1694 Measured=347
maatuska-w Bandwidth=1694 Measured=874
moria1  -w Bandwidth=1694 Measured=1550

spacequeen974
longclaw-w Bandwidth=1698 Measured=493
gabelmoo-w Bandwidth=1698 Measured=970
maatuska-w Bandwidth=1698 Measured=1930
moria1  -w Bandwidth=1698 Measured=2130

You can see future and past reports of these in

https://collector.torproject.org/recent/relay-descriptors/votes/
https://collector.torproject.org/archive/relay-descriptors/votes/

where

longclaw is 23D15D9. . .
gabelmoo is ED03BB6. . .
maatuska is 49015F7. . .
moria1   is D586D18. . .
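
To pull just the bandwidth line for one relay out
of a downloaded vote file, something like this
should do.  It is only a sketch: it assumes the
usual vote layout, where each relay entry starts
with an 'r' line whose second field is the
nickname and the 'w' line follows inside that
entry.  The function name relay_w_line is made up
for illustration.

```shell
# relay_w_line NICKNAME < vote-file
# Prints the "w" line belonging to the relay entry for NICKNAME.
relay_w_line() {
    awk -v nick="$1" '
        $1 == "r"        { hit = ($2 == nick) }   # a new relay entry begins
        hit && $1 == "w" { print; hit = 0 }       # its bandwidth line
    '
}
```

Run it once per authority vote, e.g.
"relay_w_line greendream848 < vote-file", where
vote-file stands for whichever vote you fetched.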

That the measurements are all in the same ballpark
does indicate that some subtle issue with the
network and/or equipment may be at work and the
BWauths may not be at fault.  But many have
complained that nothing they do seems to work.

If the firewall is performing stateful packet
inspection or any kind of DPI (deep packet
inspection), disable that for all incoming and
outgoing Tor traffic.  It's all encrypted anyway so there's
no point, and DPI can drag down performance
big-time.  The directory traffic is unencrypted
but I've never heard of a firewall with
stateful rules for the Tor directory protocol.

If you can put the system directly on the public
IP address with no firewall or local-rack router I
recommend doing this.  Just make sure iptables rules
are set to protect login and other non-Tor access.
Either that or disable iptables and strip the
server down so that nothing but the 'tor' process
and 'ssh' are running, and configure 'ssh' to
accept only key-based authentication (be sure to
set up and test key auth before applying the
setting).  Check that listeners are minimized with

   lsof -Pn | fgrep LISTEN

The email daemon should stay up to handle alarms;
just be sure it listens only on 127.0.0.1.  The
same goes for anything else that is absolutely
necessary.  Use the *Port and *Policy settings in
torrc to lock down control and SOCKS access to the
daemon.
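
To make the non-loopback listeners stand out in
that lsof output, a small filter can help.  This
is a sketch that assumes the usual lsof NAME
column, e.g. "TCP 127.0.0.1:25 (LISTEN)" or
"TCP *:22 (LISTEN)"; the function name
exposed_listeners is hypothetical.

```shell
# lsof -Pn | exposed_listeners
# Keeps LISTEN lines whose address is not 127.x.x.x or [::1].
exposed_listeners() {
    grep -F '(LISTEN)' | grep -v -E 'TCP (127\.[0-9.]+|\[::1\]):'
}
```

Ideally only 'tor' and 'ssh' show up in the
result.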

One notable sysctl that matters for high-capacity
relays is

   net.netfilter.nf_conntrack_checksum = 0

though having this enabled would not cause the
current poor measurements.

You should change this setting:

   net.ipv4.tcp_no_metrics_save = 1

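Setting it to 1 (which turns off the kernel's
saving of per-destination TCP metrics) was a
workaround for a very-long-ago kernel bug that is
fixed everywhere.  Setting it back to 0, so that
metrics are saved again, improves performance.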

You might try

   net.ipv4.tcp_wmem = 4096  250000  4194304
   net.ipv4.tcp_rmem = 4096  375000  4194304

which will cause the congestion window to
get to full size a bit quicker, and these

   net.core.somaxconn = 1024
   net.core.netdev_max_backlog = 524288
   net.ipv4.tcp_slow_start_after_idle = 0
   net.ipv4.tcp_keepalive_time = 600

which adjust various limits and timers for fast
networks with lots of connections.
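
One way to try the buffer and limit settings
above without touching /etc/sysctl.conf is to put
them in a scratch file and load it with
"sysctl -p FILE".  A sketch, with a made-up
function name (write_relay_sysctls) and the file
path left to the caller:

```shell
# write_relay_sysctls FILE
# Writes the tuning values suggested above into FILE.
write_relay_sysctls() {
    cat > "$1" <<'EOF'
net.ipv4.tcp_wmem = 4096 250000 4194304
net.ipv4.tcp_rmem = 4096 375000 4194304
net.core.somaxconn = 1024
net.core.netdev_max_backlog = 524288
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_keepalive_time = 600
EOF
}
```

Then, as root, "sysctl -p FILE" applies them to
the running kernel; they do not survive a reboot
unless also added to /etc/sysctl.conf.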

Make sure these default values are active and
have not been changed to non-default values by
/etc/sysctl.conf:

   net.ipv4.tcp_moderate_rcvbuf = 1
   net.ipv4.tcp_timestamps = 1
   net.ipv4.tcp_window_scaling = 1
   net.ipv4.tcp_sack = 1
   net.ipv4.tcp_syncookies = 1
   net.ipv4.tcp_congestion_control = cubic
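
A quick way to check those six values on the
running kernel is to compare each against
"sysctl -n".  This is a rough sketch; the function
name and the SYSCTL_GET override (handy for a dry
run) are my own invention, not anything standard.

```shell
# check_sysctl_defaults
# Prints one line per setting whose live value differs from the list.
# SYSCTL_GET may name another command to read a key (default: sysctl -n).
check_sysctl_defaults() {
    while read -r key expected; do
        actual=$(${SYSCTL_GET:-sysctl -n} "$key" 2>/dev/null)
        [ "$actual" = "$expected" ] ||
            echo "$key: expected $expected, got ${actual:-nothing}"
    done <<'EOF'
net.ipv4.tcp_moderate_rcvbuf 1
net.ipv4.tcp_timestamps 1
net.ipv4.tcp_window_scaling 1
net.ipv4.tcp_sack 1
net.ipv4.tcp_syncookies 1
net.ipv4.tcp_congestion_control cubic
EOF
}
```

No output means everything matches.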

And try adding

   TXQUEUELEN=100000

to the

   /etc/sysconfig/network-scripts/ifcfg-ethX

for the interface(s) where tor runs.  It can be
applied manually and checked with

   ip link set qlen 100000 dev ethX
   ip link show dev ethX
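
To read the queue length back without eyeballing
the whole line, the "qlen" field can be picked out
of the 'ip link show' output.  A sketch with a
hypothetical helper name (link_qlen); it assumes
the output contains a "qlen N" pair, which 'ip'
prints for ordinary Ethernet interfaces.

```shell
# ip link show dev ethX | link_qlen
# Prints the number that follows "qlen" in the ip-link output.
link_qlen() {
    awk '{
        for (i = 1; i < NF; i++)
            if ($i == "qlen") { print $(i + 1); exit }
    }'
}
```

After the change it should print 100000.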

Finally make sure the kernel is of a vintage with
the Google-advocated connection-start
congestion-window increase:

https://lwn.net/Articles/427104/

http://samsaffron.com/archive/2012/03/01/why-upgrading-your-linux-kernel-will-make-your-customers-much-happier

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=442b9635c569fef038d5367a7acd906db4677ae1
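
As far as I know that change went into mainline
with 2.6.39, so a rough version check looks like
the sketch below.  Treat both the 2.6.39 cutoff
and the helper name kernel_at_least as my
assumptions, and note that distribution kernels
sometimes backport such changes, so an older
version number does not prove the change is
missing.  It relies on GNU 'sort -V'.

```shell
# kernel_at_least MIN [VERSION]
# Succeeds when VERSION (default: the running kernel) is >= MIN.
kernel_at_least() {
    min=$1
    cur=${2:-$(uname -r)}
    cur=${cur%%-*}    # drop a distro suffix such as "-573.el6.x86_64"
    [ "$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)" = "$min" ]
}
```

For example: kernel_at_least 2.6.39 && echo "new
enough".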

If you end up implementing any of the above and it
works, please describe the results in a tor-relays
post.