[tor-scaling] Exploratory Analysis of Latency Tor data
desnacked at riseup.net
Wed Jul 10 15:21:12 UTC 2019
Dennis Jackson <djackson at mozilla.com> writes:
> Hi all,
> I've spent a week or so digging into some latency measurements of the Tor
> network. I've put together some graphs and my observations in a PDF here
> which is created from a Google Slides Presentation (comments enabled) here
> My cleaned up data sets and the source code for the graphs are also linked
> at the end of the PDF in case anyone wants to play with it.
This is really great and well-thought-out material! Thanks!
> The takeaways:
> - Lots of graphs.
> - The Jan 2015 inflection point in the metrics data is due to 'siv'
> changing ISPs. Tor still has a bad phase followed by a good phase, but the
> change is more gradual and begins earlier.
Good catch. That seems about right!
I also liked the token bucket refill explanation for the per-second horizontal
bands. Seems like the fix there was proposal 183 (ticket #3630), which got
merged upstream in September 2011, but the effects only started showing in mid
> - There are still significant deviations between measurement servers in
> recent torperf data which are greater than can be explained by random variation.
> - There are some exit nodes running behind a VPN which doubles or
> triples round trip time and worsens UX. However, current client/consensus
> code does not (directly) punish this.
How do we know that these exit nodes are running behind a VPN?
> - *A non-negligible fraction of relays get into a 'broken' state where
> round trip time is either normal for the relay, or delayed for 6 seconds. I
> can't find any explanation for this behavior. It seems to be consistent
> across Tor versions, host OSes and exit weighting. *
> - This is just an exploratory analysis. The dataset and analysis should
> be carefully examined before using it in any decision making.
> If anyone can shed any light on the '6 second mystery', I'd be quite
> interested! It also impacts nearly 1% of the requests in one dataset,
> suggesting it might be having a real impact on UX.
Yes this '6 second mystery' seems really interesting and potentially
buggylicious! It's particularly fun that latencies seem to cluster up into
distinct horizontal bands, which might hint at some sort of inefficient
callback behavior that runs every second or every N milliseconds (like the new
token bucket refill behavior) either on the client-side or the relay-side.
I took a look at some of our per-second callbacks like
second_elapsed_callback() but I could not find anything particularly related
(apart from pseudo-interesting things like accounting_run_housekeeping() where
relays apply bandwidth rate limiting, etc.).
Is this '6 second' behavior exhibited at all by onionperf nowadays? Or is this
mainly seen in Arthur's experiment?
I originally suspected that this is some sort of issue on the client-side
(i.e. Arthur's latency-collecting tor client) but the graph on page 59 seems to
imply that it's specific relays that demonstrate this 6 second behavior. I was also
thinking that perhaps this was due to KIST, but it seems like there is a
reasonable amount of 0.2.x relays exhibiting this issue (page 62), and KIST got
introduced in 0.3.2.
I haven't had the time to look at Arthur's script to see how latency gets
calculated and how it handles errors, for example, in cases where a circuit or
a stream expires (see circuit_expire_building() or
connection_ap_expire_beginning() which are again called on a per-second basis).
If the issue is not on the client-side, then according to the graphs this has
to be some pretty ancient inefficient Tor behavior which runs even before
0.2.9. Seems worth digging into more, while keeping in mind the bigger picture.
Thanks for the great analysis! That was a fun presentation.