[tor-scaling] Exploratory Analysis of Tor Latency Data

Dennis Jackson djackson at mozilla.com
Wed Jul 10 23:15:20 UTC 2019


Hi George,

On Wed, Jul 10, 2019 at 8:21 AM George Kadianakis <desnacked at riseup.net>
wrote:

> Dennis Jackson <djackson at mozilla.com> writes:
>
> > Hi all,
> >
> > I've spent a week or so digging into some latency measurements of the
> > Tor network. I've put together some graphs and my observations in a PDF
> > here
> > <https://drive.google.com/file/d/1e7yngIW9JkZiO8uwt6W6QrzOdg9nohGj/view?usp=sharing>,
> > which is created from a Google Slides Presentation (comments enabled)
> > here
> > <https://docs.google.com/presentation/d/1vUqx7-fkNfy2xwtJXS9PCwQvTuVUwI7JwodNc-tJLTw/edit?usp=sharing>.
> > My cleaned up data sets and the source code for the graphs are also
> > linked at the end of the PDF in case anyone wants to play with it.
> >
>
> Hello Dennis,
>
> this is really great and well-thought-out material! Thanks!
>
>
Thanks! :)


> > The takeaways:
> >
> >    - Lots of graphs.
> >    - The Jan 2015 inflection point in the metrics data is due to 'siv'
> >    changing ISPs. Tor still has a bad phase followed by a good phase,
> >    but the change is more gradual and begins earlier.
>
> Good catch. That seems about right!
>
> I also liked the token bucket refill explanation for the per-second
> horizontal
> bands. Seems like the fix there was proposal 183 (ticket #3630), which got
> merged upstream in September 2011, but the effects only started showing in
> mid
> 2013 (...)
>
> >    - There are still significant deviations between measurement servers
> >    in recent torperf data which are greater than can be explained by
> >    random chance.
> >    - There are some exit nodes running behind a VPN which doubles or
> >    triples round trip time and worsens UX. However, current
> >    client/consensus code does not (directly) punish this.
>
> How do we know that these exit nodes are running behind a VPN?
>
>
Sorry, this was speculation and is presented as speculation in the slides,
but it reads like a claim here. Running behind a VPN is my candidate
explanation, because these nodes have solid bandwidth and yet their fastest
latency measurement is still >1 second. As a relay's fastest measurements
should coincide with minimal delay on the relay itself, the only other
source of delay is the link into the relay. A VPN would add one extra round
trip, between the true relay and its VPN endpoint, to every measurement.
However, this isn't certain. It might be that the routes between the fixed
guard and some exits are particularly bad, or that there is some other
strange behavior.

I guess we could gather more evidence by pulling out these relays and
checking whether their IP
addresses correspond to commercial VPN providers or by reaching out to the
operators directly.
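
For what it's worth, the first check is easy to script. Here is a rough
sketch of the kind of filter I have in mind, run against the cleaned-up
per-measurement data linked in the slides (the file and column names below
are placeholders, not necessarily what the real files use):

    import pandas as pd

    # Placeholder file/column names; adjust to the actual cleaned-up dataset.
    df = pd.read_csv("latency_measurements.csv")

    # Per-exit summary: the fastest successful round trip ever observed,
    # plus a bandwidth figure so we only flag well-provisioned relays.
    per_exit = df.groupby("exit_fingerprint").agg(
        min_rtt=("rtt_seconds", "min"),
        bandwidth=("consensus_bandwidth", "max"),
        samples=("rtt_seconds", "size"),
    )

    # Candidate 'behind a VPN / bad link' exits: solid bandwidth, yet even
    # their best measurement never drops below one second. The bandwidth
    # cutoff is arbitrary.
    candidates = per_exit[(per_exit["min_rtt"] > 1.0)
                          & (per_exit["bandwidth"] > 10_000)]
    print(candidates.sort_values("min_rtt", ascending=False))

The resulting fingerprints could then be checked against the address ranges
of commercial VPN providers, or used to contact the operators.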




> >    - *A non-negligible fraction of relays get into a 'broken' state
> >    where round trip time is either normal for the relay, or delayed by
> >    6 seconds. I can't find any explanation for this behavior. It seems
> >    to be consistent across Tor versions, host OSes, and exit weighting.*
> >    - This is just an exploratory analysis. The dataset and analysis
> >    should be carefully examined before using it in any decision making.
> >
> > If anyone can shed any light on the '6 second mystery', I'd be quite
> > interested! It also impacts nearly 1% of the requests in one dataset,
> > suggesting it might be having a real impact on UX.
> >
>
> Yes this '6 second mystery' seems really interesting and potentially
> buggylicious! It's particularly fun that latencies seem to cluster up into
> distinct horizontal bands, which might hint at some sort of inefficient
> callback behavior that runs every second or every N milliseconds (like the
> new token bucket refill behavior) either on the client-side or the
> relay-side.
>
> I took a look at some of our per-second callbacks like
> second_elapsed_callback() but I could not find anything particularly
> related
> (apart from pseudo-interesting things like accounting_run_housekeeping()
> where
> relays apply bandwidth rate limiting, etc.).
>
> Is this '6 second' behavior exhibited at all by onionperf nowadays? Or is
> this
> mainly seen in Arthur's experiment?
>
>
It's funny you ask! Because I did the analysis on Arthur's dataset first,
then torperf, I never actually looked at a histogram of torperf's
measurements. I've just generated the histograms and you can find them as
pngs here
<https://drive.google.com/drive/folders/1yUxyGGHXLcGKCFCP91MrGSY7N-VoUZ8q?usp=sharing>.

Each histogram corresponds to all successful measurements (10 seconds or
less) for each measurement server. I've split them into pre-2014 ('early')
and 2014 or later ('late'). Unfortunately, there aren't enough samples to
confirm or refute the 6 second peak. There is a suggestion of a peak around
5 seconds in some of the 'late' histograms. That could be noise (remember,
the samples are very sparse), or it could be that the 6 second peak varies
significantly with RTT. Remember that Arthur's dataset used a fixed guard,
so the peak may shift depending on the selected guard node. More
measurements are needed!

However, there are still some interesting observations to be made. The
early-late distinction is huge. In particular, we see the peaks
corresponding to the 1-second token buckets in the early histograms, and
they disappear in the later ones. There is also a clear peak at 100 ms
intervals, which I think is the new refill interval for the buckets and
probably not a problem.
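
In case anyone wants to regenerate or tweak the histograms, they come from
something close to the following (a sketch only; the file and column names
are again placeholders for whatever the cleaned-up torperf data actually
uses):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Placeholder file/column names for the cleaned-up torperf measurements.
    df = pd.read_csv("torperf_measurements.csv", parse_dates=["start"])

    # Keep only successful measurements that completed within 10 seconds,
    # matching the slide deck.
    ok = df[df["success"] & (df["rtt_seconds"] <= 10.0)]

    # One plot per measurement server, split into pre-2014 and 2014+.
    for source, group in ok.groupby("source"):
        for label, subset in (("early", group[group["start"].dt.year < 2014]),
                              ("late", group[group["start"].dt.year >= 2014])):
            if subset.empty:
                continue
            plt.figure()
            # 50 ms bins keep both the 1-second bands (early) and the
            # ~100 ms refill peak (late) visible.
            subset["rtt_seconds"].hist(bins=[0.05 * i for i in range(201)])
            plt.title(f"{source} ({label})")
            plt.xlabel("round trip time (s)")
            plt.ylabel("measurements")
            plt.savefig(f"{source}_{label}.png")
            plt.close()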


> I originally suspected that this is some sort of issue on the client-side
> (i.e. Arthur's latency collecting tor) but the graph on page 59 seems to
> imply
> that it's specific relays that demonstrate this 6 second behavior. I was
> also
> thinking that perhaps this was due to KIST, but it seems like there is a
> reasonable amount of 0.2.x relays exhibiting this issue (page 62), and
> KIST got
> introduced in 0.3.2.
>

This was my thought as well initially. However, we only see it on a subset
of relays (rather than all of them), and specific relays change state over
time in ways that seem uncorrelated with each other. For example, pages 77,
79, and 82 all show some relays switching between 'bad' and 'good' states.
That seems hard to explain as a client-side issue, but it's also hard to
rule out without checking further.
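
One way to make that concrete would be to score each relay, month by month,
on how much of its traffic sits in the delayed mode, so that flips between
'good' and 'bad' show up directly. A rough sketch, with the same caveat
about made-up file and column names:

    import pandas as pd

    # Placeholder file/column names again.
    df = pd.read_csv("latency_measurements.csv", parse_dates=["start"])
    ok = df[df["rtt_seconds"] <= 10.0]

    def delayed_fraction(group, delay=6.0, window=0.5):
        # Fraction of measurements sitting roughly `delay` seconds above
        # the relay's own baseline (taken here as its fastest measurement).
        baseline = group["rtt_seconds"].min()
        delayed = group["rtt_seconds"].between(baseline + delay - window,
                                               baseline + delay + window)
        return delayed.mean()

    # Group per relay and per month so state changes over time are visible.
    monthly = (ok.groupby(["exit_fingerprint", ok["start"].dt.to_period("M")])
                 .apply(delayed_fraction))
    print(monthly[monthly > 0.05].sort_values(ascending=False))

If the same relay flips between roughly zero and a large delayed fraction
from one month to the next while the measuring client stays unchanged, that
would be further evidence against a purely client-side explanation.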


> I haven't had the time to look at Arthur's script to see how latency gets
> calculated and how it handles errors, for example, in cases where a
> circuit or
> a stream expires (see circuit_expire_building() or
> connection_ap_expire_beginning() which are again called on a per-second
> basis).
>

I threw away all failed measurements. Everything in that slide deck only
considers successful measurements that finished within 10 seconds. Looking
at the failures might be even more interesting, of course!
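
If anyone wants to poke at the failures, the same monthly grouping trick
works for failure rates too (again a sketch with made-up column names,
where 'success' marks whether the fetch completed):

    import pandas as pd

    df = pd.read_csv("latency_measurements.csv", parse_dates=["start"])

    # Failure rate per relay and per month.
    failure_rate = (df.groupby(["exit_fingerprint",
                                df["start"].dt.to_period("M")])["success"]
                      .apply(lambda s: 1.0 - s.mean()))
    print(failure_rate.sort_values(ascending=False).head(20))

It would be interesting to see whether relays with a high delayed fraction
also fail more often.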

> If the issue is not on the client-side, then according to the graphs this
> has
> to be some pretty ancient inefficient Tor behavior which runs even before
> 0.2.9. Seems worth digging into more, while keeping in mind the
> bigger picture.
>
> ---
>
> Thanks for the great analysis! That was a fun presentation.
>

Great to hear! Hope you all have a good week! :)

Best,
Dennis