Analysis of the problems many relay operators are currently facing
mail at sebastianhahn.net
Wed Apr 21 16:04:24 UTC 2010
I'll try to summarize here what I've learned in the past weeks about the
problems we are currently having with the Tor network as a whole, and
issues that individual relay operators have; as well as describing the
causes we have identified (some of which have been addressed already).
As the information comes from #tor-dev on OFTC, bug reports and mailing
lists, but no overview exists, it seems worthwhile to collect what we
know.
For the past months, quite a few relays have been sporadically missing
from the consensus; either until they published the next descriptor
again or for longer periods of time. For this we have identified the
following causes:
Some vendors have backported openssl features to older versions, making
those relays either completely useless as they are unable to establish
connections so won't even bootstrap; or useless as relays to peers
running certain openssl versions. Thus, the directory authorities
couldn't establish connections to them, meaning they marked them
offline.
We believe this is now fixed as of 0.2.2.11-alpha. A fix for the
series of Tor has not been released yet.
Authorities only downloaded descriptors for relays from V2
authorities if they didn't have them available themselves. As
V2 auths remain, one of which probably disallows most relays from
publishing descriptors, this led to authorities knowing only
about a part of the network. Some relays were thus unreachable by the
majority of dir authorities, meaning they dropped out of the consensus.
We believe this is now fixed as of 0.2.2.12-alpha. Not all authorities
have upgraded yet.
Relays (and authorities) running 0.2.2.11-alpha crash after 24 hours
if they have the statistics gathering functionality enabled.
We believe this is now fixed as of 0.2.2.12-alpha. A workaround is to
disable statistics gathering.
Another issue exists that has not been identified yet, where a relay is
only reachable from outside sporadically, even though there is no
This issue is rare and has not been reproduced reliably.
Another class of problems exists which affects some/many relays: The
relay attracts a huge number of connections, affecting stability of
network and operating system. These problems might occur:
The Tor process runs out of memory, because it has too many open
connections. Tor is then killed by the OS's OOM-killer.
Tor exhausts the ulimit -n that is affecting it, meaning random
operations like opening logfiles, establishing new connections or
gathering entropy fail, often creating many warnings in Tor's logfile.
In some cases it appears that Tor is spinning until a file descriptor
becomes available, burning all cpu.
Tor makes a home router/DSL modem/kernel lock up, because it cannot
handle the load. Symptoms include that internet access is
nonfunctional even after the relay is stopped, or that it is
slow. These symptoms might last until the relevant piece of
equipment is restarted.
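The file descriptor exhaustion described above can be watched for from
the outside. What follows is only a minimal sketch in Python, not
anything Tor itself does: it reads the RLIMIT_NOFILE soft limit (the
value ulimit -n reports) and counts the descriptors the current process
has open. The function names and the reserve of 32 descriptors are
assumptions made up for illustration.

```python
import os
import resource


def fd_usage():
    """Return (soft_limit, hard_limit, open_count) for this process.

    open_count is None when no per-process fd directory is available.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_count = None
    # /proc/self/fd exists on Linux, /dev/fd on BSD-like systems.
    # Listing the directory briefly uses one descriptor itself, so the
    # count can be off by one.
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        try:
            open_count = len(os.listdir(fd_dir))
            break
        except OSError:
            continue
    return soft, hard, open_count


def should_refuse_new_connections(soft, open_count, reserve=32):
    """True when fewer than `reserve` descriptors remain below the limit.

    The reserve (a hypothetical value) keeps descriptors free for
    logfiles and entropy sources, the operations reported as failing.
    """
    if soft == resource.RLIM_INFINITY or open_count is None:
        return False
    return open_count + reserve >= soft


if __name__ == "__main__":
    soft, hard, used = fd_usage()
    print("soft limit:", soft, "hard limit:", hard, "open:", used)
    print("near limit:", should_refuse_new_connections(soft, used))
```

Refusing new connections a little before the limit is reached, rather
than at it, is one way to avoid the failure mode where unrelated
operations start to fail.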
All these share the same underlying problem: Tor is getting more
connections than it can handle. One way to help would be to make sure
unused connections are closed more quickly, so that relays don't have
to maintain as many active connections concurrently as they need to
now. A Tor patch that logs what state current connections have has
shown that on some systems, around 10% of all connections were used for
a begindir operation before, but now don't have a circuit attached
anymore. Generally, the fraction of connections used exclusively for
begindir operations appears to be high, so it might be worthwhile to
close the circuits on them more quickly and not keep them around for
possible reuse.
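The idea of closing unused connections sooner can be sketched as
follows. This is a hypothetical illustration in Python, not Tor's
actual connection handling: the Conn record, its field names and both
timeout values are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    """Hypothetical bookkeeping record for one open connection."""
    conn_id: int
    circuits: int = 0            # circuits currently attached
    last_active: float = 0.0     # timestamp of the last activity, in seconds
    begindir_only: bool = False  # only ever used for begindir requests


def select_idle(conns, now, idle_timeout=180.0, begindir_timeout=30.0):
    """Return the ids of connections that could be closed.

    A connection qualifies when it has no circuit attached and has been
    inactive for at least its timeout. Connections used only for
    begindir get a shorter timeout (both values are illustrative).
    """
    to_close = []
    for c in conns:
        if c.circuits > 0:
            continue  # still in use, keep it
        limit = begindir_timeout if c.begindir_only else idle_timeout
        if now - c.last_active >= limit:
            to_close.append(c.conn_id)
    return to_close


if __name__ == "__main__":
    conns = [
        Conn(1, circuits=0, last_active=0.0, begindir_only=True),
        Conn(2, circuits=0, last_active=0.0, begindir_only=False),
        Conn(3, circuits=1, last_active=0.0, begindir_only=True),
    ]
    print(select_idle(conns, now=60.0))
```

With a separate, shorter timeout for begindir-only connections, the
roughly 10% of connections mentioned above would be released first
while other idle connections stay available for longer.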
Another theory is that the fastest relays (by consensus weights) are
used by a large proportion of users. This means that almost every Tor
user will make a connection to those few relays, massively increasing
the number of connections the relay has to handle at the same time.
Some evidence supporting this is that even after the bw authorities
voted a relay's weight down a lot after the operator lowered the
BandwidthRate, it was still seeing many concurrent connections, while
the number of new connections/s was dropping a lot.
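A rough way to reason about this theory is sketched below in Python.
It assumes, as a deliberate simplification, that each client makes a
number of independent picks and that every pick lands on a given relay
with probability equal to that relay's share of the total weight. Real
path selection is more involved, and the function names and numbers
are purely illustrative.

```python
def touch_probability(weight_fraction, picks):
    """Probability that at least one of `picks` independent choices
    lands on a relay holding `weight_fraction` of the total weight."""
    if not 0.0 <= weight_fraction <= 1.0:
        raise ValueError("weight_fraction must be between 0 and 1")
    if picks < 0:
        raise ValueError("picks must be non-negative")
    return 1.0 - (1.0 - weight_fraction) ** picks


def expected_clients_connected(clients, weight_fraction, picks):
    """Expected number of clients that picked the relay at least once."""
    return clients * touch_probability(weight_fraction, picks)


if __name__ == "__main__":
    for share in (0.01, 0.005):
        print(share, round(touch_probability(share, 100), 3))
```

Under these assumptions, a relay with 1% of the weight is picked at
least once by about 63% of clients that each make 100 picks, and
halving its weight only lowers that to about 39%. In this toy model
the number of distinct clients connected shrinks more slowly than the
weight does.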
As many relay operators are forced to turn off their relay because they
don't have the resources to keep their relay up anymore, the problem
gets worse for the other operators, who need to deal with an unchanged
number of clients.
One last concern is that we're seeing scalability problems with our
design. Lots of Chinese users are back on the network, as many relays
have been unblocked by the gfw. Some relays are seeing more than 40k
active connections, while being far away from reaching their bw limits.
If this continues to grow and a clear bug cannot be identified that
causes the massive number of connections and it can be determined that
this is due to Tor's popularity growing, alternative designs that don't
require tcp connections might become a necessity very quickly.
I hope I didn't forget any problem/solution/analysis here; if so,
please add it so we can all track this down as quickly as possible.