# A Brief Study on Circuit Construction Speed and Reliability

Mike Perry mikepery at fscked.org
Sun Dec 17 02:31:17 UTC 2006

While testing the latest release of my Tor scanner, I decided to do a
study on circuit reliability, on how long it takes to construct a
circuit and then fetch the HTML of http://tor.eff.org, and on how
long it takes to fetch http://tor.eff.org over that same,
already-built circuit.

Using tor-0.1.2.4 (actually SVN r9067), I sorted the routers by their
bandwidth capacity and divided them into 15% segments of the network
from 0 to 90. For each segment I timed 250 circuits and used the new
failure-tracking abilities of my scanner to record the failure rates
of nodes as well as the failure reasons for circuits and streams.
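The segmentation step can be sketched in Python roughly as follows.
This is only an illustration and not the scanner's actual code: the
function name percentile_segments, the (name, bandwidth) tuples, and
the made-up router list are all assumptions.

```python
def percentile_segments(routers, step=15, stop=90):
    """Split (name, bandwidth) pairs into percentile slices, fastest first."""
    ranked = sorted(routers, key=lambda r: r[1], reverse=True)
    n = len(ranked)
    segments = {}
    for lo in range(0, stop, step):
        hi = lo + step
        # Integer arithmetic keeps the slices contiguous and non-overlapping.
        segments[(lo, hi)] = ranked[n * lo // 100 : n * hi // 100]
    return segments


# Hypothetical data: 20 routers with invented bandwidth figures.
routers = [(f"router{i}", 1000 - 40 * i) for i in range(20)]
segments = percentile_segments(routers)
for (lo, hi), members in segments.items():
    print(f"RANGE {lo}-{hi}: {[name for name, _ in members]}")
```

With this layout the 0-15 slice holds the highest-bandwidth routers,
and the slowest 10% of the network is never tested.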

Times are seconds:

RANGE 0-15 250 build+fetches: avg=20.89, dev=31.23
RANGE 0-15 250 fetches: avg=3.66, dev=2.69

RANGE 15-30 250 build+fetches: avg=33.44, dev=47.01
RANGE 15-30 250 fetches: avg=7.28, dev=12.86

RANGE 30-45 250 build+fetches: avg=81.47, dev=79.55
RANGE 30-45 250 fetches: avg=12.66, dev=38.63

RANGE 45-60 250 build+fetches: avg=63.56, dev=67.56
RANGE 45-60 250 fetches: avg=7.51, dev=12.80

RANGE 60-75 250 build+fetches: avg=40.85, dev=42.76
RANGE 60-75 250 fetches: avg=10.13, dev=11.28

RANGE 75-90 250 build+fetches: avg=48.87, dev=56.11
RANGE 75-90 250 fetches: avg=6.82, dev=7.48

As you can see, the high bandwidth nodes in 0-15% are much quicker
than the rest both at using existing circuits and at building new
ones. My guess is that the circuit build speed increase is likely due
to the fact that running a fast node requires a fast machine to be
able to do all the crypto, and thus crypto-intensive circuit builds
execute faster on these nodes.

The rest of the results for circuit construction and speed seem only
loosely tied to bandwidth, however. Probably other factors like
network connection and stability come into play there. A few bad nodes
can slow those averages down a lot, as is hinted at by the large std
deviation in some of the classes.
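To make the spread easier to compare across classes, here is a small
Python sketch that parses the RANGE lines above and prints the ratio
of deviation to average. The regex and the names parse_results and
spread are my own assumptions about the line format, not part of the
scanner. Since times cannot be negative, a ratio well above 1
suggests a few very slow circuits are dragging the average up.

```python
import re

RESULTS = """\
RANGE 0-15 250 build+fetches: avg=20.89, dev=31.23
RANGE 0-15 250 fetches: avg=3.66, dev=2.69
RANGE 15-30 250 build+fetches: avg=33.44, dev=47.01
RANGE 15-30 250 fetches: avg=7.28, dev=12.86
RANGE 30-45 250 build+fetches: avg=81.47, dev=79.55
RANGE 30-45 250 fetches: avg=12.66, dev=38.63
RANGE 45-60 250 build+fetches: avg=63.56, dev=67.56
RANGE 45-60 250 fetches: avg=7.51, dev=12.80
RANGE 60-75 250 build+fetches: avg=40.85, dev=42.76
RANGE 60-75 250 fetches: avg=10.13, dev=11.28
RANGE 75-90 250 build+fetches: avg=48.87, dev=56.11
RANGE 75-90 250 fetches: avg=6.82, dev=7.48
"""

# Assumed line layout: RANGE lo-hi count kind: avg=X, dev=Y
LINE_RE = re.compile(
    r"RANGE (\d+-\d+) (\d+) (build\+fetches|fetches): avg=([\d.]+), dev=([\d.]+)"
)


def parse_results(text):
    """Turn each RANGE line into a dict of typed fields."""
    rows = []
    for cls, count, kind, avg, dev in LINE_RE.findall(text):
        rows.append({"range": cls, "n": int(count), "kind": kind,
                     "avg": float(avg), "dev": float(dev)})
    return rows


def spread(row):
    """Deviation relative to the average (coefficient of variation)."""
    return row["dev"] / row["avg"]


rows = parse_results(RESULTS)
for row in rows:
    print(f'{row["range"]:>6} {row["kind"]:<13} dev/avg = {spread(row):.2f}')
noisiest = max(rows, key=spread)
```

Here noisiest ends up being the plain fetches in the 30-45 class,
where the deviation is about three times the average.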

So what of the failure rates and reasons then? Let's have a look at
the FAILTOTALS line from each class:

0-15.naive_fail_rates:  FAILTOTALS 131/473 54+6/603 OK
30-45.naive_fail_rates: FAILTOTALS 559/1221 130+29/737 OK
45-60.naive_fail_rates: FAILTOTALS 273/845 138+22/752 OK
60-75.naive_fail_rates: FAILTOTALS 140/592 85+33/678 OK
75-90.naive_fail_rates: FAILTOTALS 187/637 76+18/656 OK

By looking at the README for the scanner, we see the format of these
lines is:

250 FAILTOTALS CIRCUIT_FAILURES/TOTAL_CIRCUITS DETACHED+FAILED/TOTAL_STREAMS

So it looks like nodes in the 30-45% range had a good deal higher
rate of circuit failure than the rest (if you're wondering, the
overall circuit failure rate is 33%).
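The per-class circuit failure rates can be worked out from those
lines with a few lines of Python. This is a sketch based only on the
README format quoted above; the regex, the dictionary keys, and the
function names are my own assumptions, and only the classes listed
above are included.

```python
import re

# Format per the scanner README:
# 250 FAILTOTALS CIRCUIT_FAILURES/TOTAL_CIRCUITS DETACHED+FAILED/TOTAL_STREAMS
FAILTOTALS_RE = re.compile(r"FAILTOTALS (\d+)/(\d+) (\d+)\+(\d+)/(\d+)")

KEYS = ("circ_failed", "circ_total", "detached", "stream_failed", "stream_total")


def parse_failtotals(line):
    """Return the five counters from one FAILTOTALS line as a dict."""
    match = FAILTOTALS_RE.search(line)
    if match is None:
        raise ValueError(f"not a FAILTOTALS line: {line!r}")
    return dict(zip(KEYS, map(int, match.groups())))


def circuit_fail_rate(counters):
    """Fraction of attempted circuits that failed."""
    return counters["circ_failed"] / counters["circ_total"]


LINES = {
    "0-15": "FAILTOTALS 131/473 54+6/603 OK",
    "30-45": "FAILTOTALS 559/1221 130+29/737 OK",
    "45-60": "FAILTOTALS 273/845 138+22/752 OK",
    "60-75": "FAILTOTALS 140/592 85+33/678 OK",
    "75-90": "FAILTOTALS 187/637 76+18/656 OK",
}

rates = {cls: circuit_fail_rate(parse_failtotals(line))
         for cls, line in LINES.items()}
for cls, rate in rates.items():
    print(f"{cls}: {rate:.1%} of circuits failed")
worst = max(rates, key=rates.get)
```

For the 30-45 class this gives 559/1221, or roughly 46% of circuits
failing, which is the highest rate among the listed classes.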

Looking at the top of the 30-45.naive_fail_rates file shows us a
handful of nodes with slightly higher failure rates than normal, but
several of the other classes have a few bad nodes also. So why was
this class so much slower?

It turns out that if you look at the naive_fail_reasons file, the
largest portion of failures comes from the CIRCUITFAILED:TIMEOUT
reason:

250 REASONTOTAL 522/1277

or 522 timeout failures out of all the node failures. Note that
reason-based failure counting and reason totals are node-based,
whereas the FAILTOTALS lines just count circuits and streams, hence
the large number there.

In general, the most common failure reasons were circuit timeouts,
stream timeouts, and OR connection closed (TCP connections between
nodes mysteriously dying or failing to open).

Here's the top failure reasons by class. When there are 3 reason terms
paired together, the reason was reported from an upstream node and not
deduced locally.

0-15:
1. CIRCUITFAILED:OR_CONN_CLOSED (174/322 node failures)
2. CIRCUITFAILED:TIMEOUT (72/322 node failures)
3. STREAMDETACHED:TIMEOUT (41/322 node failures)

15-30:
1. CIRCUITFAILED:OR_CONN_CLOSED (182/623)
2. CIRCUITFAILED:TIMEOUT (124/623)
3. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (116/623)

30-45:
1. CIRCUITFAILED:TIMEOUT (522/1277)
2. CIRCUITFAILED:OR_CONN_CLOSED (396/1277)
3. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (192/1277)

45-60:
1. CIRCUITFAILED:TIMEOUT (306/706)
2. CIRCUITFAILED:OR_CONN_CLOSED (164/706)
3. STREAMDETACHED:TIMEOUT (138/706)
4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (72/706)

60-75:
1. CIRCUITFAILED:TIMEOUT (112/398)
2. CIRCUITFAILED:OR_CONN_CLOSED (110/398)
3. STREAMDETACHED:TIMEOUT (85/398)
4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (56/398)

75-90:
1. CIRCUITFAILED:TIMEOUT (216/468)
2. CIRCUITFAILED:OR_CONN_CLOSED (96/468)
3. STREAMDETACHED:TIMEOUT (76/468)
4. CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED (56/468)

So if you total the two OR_CONN_CLOSED reasons (local and remote),
you see that for some reason node-to-node TCP connections are fairly
unreliable and prone to being closed (or are difficult to
open/establish?). This is strange...
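That totaling is simple arithmetic, sketched here in Python with the
counts copied from the lists above. The table layout and the name
combined_conn_closed are my own; where a reason did not make a
class's top list its count is unknown and is treated as 0, so that
class's figure is only a lower bound.

```python
# Per class: (total node failures,
#             local CIRCUITFAILED:OR_CONN_CLOSED,
#             remote CIRCUITFAILED:DESTROYED:OR_CONN_CLOSED)
# None means the reason was not in that class's top list.
COUNTS = {
    "0-15": (322, 174, None),
    "15-30": (623, 182, 116),
    "30-45": (1277, 396, 192),
    "45-60": (706, 164, 72),
    "60-75": (398, 110, 56),
    "75-90": (468, 96, 56),
}


def combined_conn_closed(total, local, remote):
    """Sum local and remote closes; return (count, share of node failures)."""
    closed = (local or 0) + (remote or 0)
    return closed, closed / total


combined = {cls: combined_conn_closed(*row) for cls, row in COUNTS.items()}
for cls, (closed, share) in combined.items():
    print(f"{cls}: {closed} OR_CONN_CLOSED node failures ({share:.0%})")
```

In every class the combined figure accounts for roughly a third to a
half of the counted node failures.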

I should also note that stream failure reasons are only counted for
the exit node, whereas circuit failure reasons are counted for 2
nodes: the last successful hop and the first unsuccessful one. So in
effect, the STREAMDETACHED reason is really 2x more common relative
to the circuit failure reasons than those lists suggest. On the other
hand, it is mostly alleviated by making compute_socks_timeout()
always return 15 (this was not done for this study, however).

Well that's about all the detail I have time to go into right now. The
complete results are up at
http://fscked.org/proj/minihax/SnakesOnATor/speedrace.zip

As soon as I finish polishing up my README and change log, I will put
up the new release of SoaT itself. It should be up sometime today.

--
Mike Perry