[tor-scaling] High level scaling goals (and metrics categories)

Mike Perry mikeperry at torproject.org
Sat Jun 8 23:30:00 UTC 2019


George Kadianakis:
> Mike Perry <mikeperry at torproject.org> writes:
> 
>> Notes are on the pad (feel free to update; beware any rando on the
>> Internet also can do so): https://nc.riseup.net/s/AEnQ4CRH2kH3fLe
>>
>> In the meantime, with input from folks on this list and on the wiki
>> page, I would like to add the EWMA re-tuning experiment, fill out the
>> KIST tuning experiment, and flesh out the metrics section to highlight
>> metrics that need new data collection. (I will start separate threads
>> for this on-list as I run into questions -- I have several already).
>>
> 
> Hello list,
> 
> during the recent call, I really related to Arthur's comment about the
> need to clarify our planned metrics/experiments in a high-level way and
> also fit them into a high-level strategy.
>
> So I went to the wiki page [0] and tried to make some high-level
> categories of our scaling goals and also fit our metrics into them.

Thank you! Making this categorization explicit is a good idea, and we
should put one on the wiki, but your categories have some issues (as
you suspected).

I have been using the categories "latency, throughput, capacity,
reliability", as these are the categories of things Mozilla wanted to
measure (as well as asking for best/worst case for many of these).

Variance, on the other hand, is a property of all metrics, and we want
to ensure we capture it somehow in each category. Representing variance
(and also best/worst case) is a data visualization problem, not a
metrics record keeping problem, though.

For latency and throughput, a CDF graph is a very rigorous way to
capture variance as well as best and worst case, and everything in between.

A quantile plot (or rainbow plot) is another way to represent variance.
Most of our graphs at https://metrics.torproject.org/torperf.html
currently display the "25% quantile aka 1st quartile" and "75% quantile
aka 3rd quartile" around the average (mean). Mozilla asked to see the
best vs worst case. So, while we're at it, we might as well try to find
a way to see the overall shape of that distribution (to ensure its
density is narrow and not lumpy).
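
To make the visualization side concrete, here is a rough Python sketch
of an empirical CDF with its quartile band and median overlaid, plus
the best/worst span printed out. The samples are randomly generated
stand-ins for TTFB measurements, purely for illustration:

    import numpy as np
    import matplotlib.pyplot as plt

    # Fake time-to-first-byte samples (seconds), for illustration only.
    samples = np.random.lognormal(mean=0.0, sigma=0.5, size=1000)

    # Empirical CDF: sort the samples; the y-axis is the fraction of
    # observations at or below each x value.
    xs = np.sort(samples)
    ys = np.arange(1, len(xs) + 1) / len(xs)

    q1, median, q3 = np.percentile(samples, [25, 50, 75])

    fig, ax = plt.subplots()
    ax.plot(xs, ys, label="empirical CDF")
    ax.axvspan(q1, q3, alpha=0.2, label="1st-3rd quartile")
    ax.axvline(median, linestyle="--", label="median")
    ax.set_xlabel("time to first byte (s)")
    ax.set_ylabel("fraction of requests")
    ax.legend()
    plt.show()

    print("best=%.3fs q1=%.3fs median=%.3fs q3=%.3fs worst=%.3fs"
          % (xs[0], q1, median, q3, xs[-1]))

The shape of the curve is what shows density: a narrow, steep CDF means
low variance, while a long tail or a bump near the top right means the
worst-case users are having a much worse time than the median user.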

(I hope to have these kinds of data visualization discussions with
Karsten and the metrics team at the Mozilla all hands, during our prelim
meetings. We don't need to bore everyone to death with them here, except
for the people who want to subject themselves to that... but welcome to
a cross-disciplinary, cross-org, cross-everything mailing list!) ;)

> ====================================================================================
> 
> == High level scaling areas:
> 
> ==== Latency
>      (or, how fast data flows?)
>
>      Metrics: CDF-TTFB, CDF-TTLB, CDF-DL

CDF-RTT belongs here. CDF-RTT is the echo server metric discussed for
EWMA; I prefer the echo server version to SENDME counting, since we will
be changing Tor's flow control as part of this work.

CDF-DL does not belong here. CDF-DL measures steady-state throughput,
completely independent from latency (unless our flow control cannot fill
the bandwidth-delay product of the path).

CDF-TTLB does not belong here. It is a measure of throughput for shorter
downloads, including the protocol handshakes (to capture throughput for
common use cases of ~5MB HTTP objects).
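
As an aside, a CDF-RTT probe can be as simple as bouncing a tiny
payload off an echo server through a tor client's SOCKS port and timing
the round trip. Here is a rough Python sketch; it assumes the PySocks
library, a local tor SOCKSPort at 127.0.0.1:9050, and a hypothetical
echo endpoint that is only a placeholder:

    import time
    import socks  # PySocks

    ECHO_HOST = "echo.example.org"   # placeholder, not a real endpoint
    ECHO_PORT = 7

    def measure_rtts(n_probes=20):
        rtts = []
        s = socks.socksocket()
        s.set_proxy(socks.SOCKS5, "127.0.0.1", 9050)
        s.settimeout(60)
        s.connect((ECHO_HOST, ECHO_PORT))  # tor attaches a circuit here
        for _ in range(n_probes):
            payload = b"ping"
            start = time.monotonic()
            s.sendall(payload)
            echoed = b""
            while len(echoed) < len(payload):
                chunk = s.recv(4096)
                if not chunk:
                    raise IOError("echo connection closed")
                echoed += chunk
            rtts.append(time.monotonic() - start)
            time.sleep(1)  # sample RTT on the same circuit over time
        s.close()
        return rtts

    if __name__ == "__main__":
        for rtt in measure_rtts():
            print("%.1f ms" % (rtt * 1000))

Each run yields a stream of per-circuit RTT samples over time, which is
what we would aggregate into CDF-RTT.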

>      Notes: Is this the same as "throughput"? Mike mentioned them as
>             separate areas in the meeting.

Latency is independent from long-term (aka steady-state) throughput, in
well-designed end-to-end protocols.

Latency can affect throughput only when a throughput metric also
measures latency as a side-effect (like CDF-TTLB does, by including
protocol handshakes in its throughput measurement, because protocol
handshakes are dominated by latency).
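
The bandwidth-delay product caveat above is easy to see with napkin
math: when only a fixed amount of data can be in flight on a circuit,
throughput is capped at window/RTT, and latency leaks into throughput.
A tiny Python sketch, using Tor's fixed circuit-level SENDME window of
1000 cells at roughly 500 bytes of payload each as illustrative numbers:

    def window_limited_throughput(window_bytes, rtt_s):
        """Max throughput when only window_bytes may be in flight."""
        return window_bytes / rtt_s

    # Roughly Tor's fixed circuit window: 1000 cells * ~500 payload bytes.
    window = 1000 * 500
    for rtt_ms in (100, 500, 1000):
        bps = window_limited_throughput(window, rtt_ms / 1000.0)
        print("RTT %4d ms -> at most %.2f MB/s per circuit"
              % (rtt_ms, bps / 1e6))

So on an otherwise fast path with a 500ms circuit RTT, the window alone
caps the circuit at about 1 MB/s. That is the "unless" case above.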

> ==== Consistency / Performance variance
>      (or, how surprised you might be at Tor's overall speed)

I call this throughput. Throughput is measured short-term and long-term.

>      Metrics: CDF-TTFB, CDF-TTLB, CDF-DL

Only CDF-TTLB and CDF-DL belong here.

> ==== Network capacity
>      (or, how many more clients can this network fit?)
> 
>      Metrics: Per-Flag Spare Network Capacity, Per-Relay Spare Network Capacity

Yep.

Aside: A related "capacity balance" (or "throughput variance") metric is
"Per-relay spare stream capacity". "Per-relay spare stream capacity" is
what Torflow and sbws bandwidth authorities use as their load balancing
target. We can derive the bwauth measurements of this value from the
consensus, by dividing a relay's consensus "measured" value by its
descriptor "observed bandwidth" value. We can also take it directly from
sbws or torflow bandwidth files. A plot of these values will show us how
individual relays vary in their ability to carry an additional stream.
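
For reference, here is a rough sketch of pulling that ratio straight
from the network documents, assuming stem's remote descriptor fetching.
The two values use different units (consensus weights are in KB/s,
observed bandwidth in bytes/s), so the ratio is only meaningful for
comparing relays against each other:

    import stem.descriptor.remote

    def stream_capacity_ratios():
        observed = {}
        # Downloads every relay descriptor; slow, but simple for a sketch.
        for desc in stem.descriptor.remote.get_server_descriptors():
            if desc.observed_bandwidth:
                observed[desc.fingerprint] = desc.observed_bandwidth

        ratios = {}
        for router in stem.descriptor.remote.get_consensus():
            if router.fingerprint in observed:
                ratios[router.nickname] = (
                    router.bandwidth / observed[router.fingerprint])
        return ratios

    if __name__ == "__main__":
        for nick, ratio in sorted(stream_capacity_ratios().items(),
                                  key=lambda kv: kv[1]):
            print("%-19s %.4f" % (nick, ratio))

Plotting those ratios, or the equivalent values from sbws/torflow
bandwidth files, gives the capacity balance picture described above.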

> ==== Reliability/Failures
>      (or, how frequently connections fail and have to retry)
> 
>      Metrics: Failure rainbow, Circuit timeout

Yes.

>      Notes: This might not be user visible, but impacts "latency" and "consistency"

Yeah it depends on the type of failure. CDF-TTFB will capture the effect
of retry-able failures; the CDF will have a large bump or longer hill
slope on the top right.


> ====================================================================================
> 
> Some thoughts:
> 
> a) We are trying to fit metrics into high-level areas, not experiments
>    into areas, right? Mike seems to have done the opposite in the latest
>    meeting, so I'm not quite sure how to do this.

We're doing all of the above; we're doing a few things concurrently;
this is intentional. The reason to do it this way is to ensure that we
have a complete span of metrics that measure everything we need, and to
ensure that these metrics measure these things without model error.

Once we start tuning things, it will be much more costly and confusing
to add/remove/change metrics, because those metrics will not be
comparable to historical data, prior simulation runs, or the research
literature. Since a lot of our performance features work
together and have different emergent effects under different network
conditions, it is incredibly important to establish this baseline, and
understand its limitations.

So, specifically, I am doing this:

1. I am driving the selection of a set of metrics that measure latency,
throughput, capacity, and reliability, and ensuring that our
visualization of these metrics captures variance, density, and the
best/worst span.

2. I am enumerating our planned improvements (performance tuning,
development, research, and other pipeline stages).

3. I am ensuring that all of our planned improvements/ideas can be
measured by our metrics, and that those metrics span latency,
throughput, capacity, and reliability in expected and well-understood ways.

4. I am looking hard for sources of model error at every step, to ensure
that we are measuring what we need to, and that our measurements
represent reality as closely as possible.


In case it is not clear, by doing this, I/we have already learned that
some of our metrics from Torperf do not model users as accurately as we
need them to.

In particular, to measure the effects of CBT tuning, we need to use
Guard nodes with Torperf. For EWMA, we want to use the CDF-RTT echo
server, to measure circuit latency over time. For predictive circuit
building, we need a user activity model. For sbws/torflow, we want
"capacity balance" or "per-relay throughput" metrics. For browser-based
changes, we probably want to use Mozilla's A/B testing infra and
browser-based perf metrics.

Short term, the most important part of all this
review/analysis/discussion is figuring out which of these metrics we can
deploy easily right now vs which metrics need development effort to
start capturing, so this development can progress in parallel to the
tuning experiments that do not require those metrics.

> b) I think one of the most important things to learn here is how these
>    different areas interact with each other. In particular, I think
>    "Network capacity" is a super important area since we are looking at
>    a huge influx of users, but we can't really look into it isolated,
>    since we are interested in seeing how "Latency" and "Consistency"
>    changes as more clients come in the game.

Yes, I suspect that Mozilla will be most interested in how we measure
network capacity, or at least in how we will be able to tell whether we
can accept additional users or not.

This brings up another set of experiments that has come up in passing in
this thread already: Mozilla A/B tests. Mozilla has the infrastructure
to turn features on and off for subsets of *their* userbase, and then
collect browser-specific perf metrics from populations with and without
the change.

I think we should take a hard look at using Mozilla's infra to test some
of our features in this way, as well as to gradually introduce
quantities of users to the network and measure that effect on our spare
capacity metrics (while also watching other metrics for signs of strain).

>    How do we model the way these areas interact with each other? And
>    maybe the fact that they are not disjoint means that I have not
>    modeled the areas correctly.

Yes, it does mean that your categories were off. Good intuition! And no
worries, we're still in the brainstorming phase here. I'm realizing
stuff is missing, too.

> c) Have I missed any important areas? Are we missing metrics for any
>    important areas? Is this helpful? At some point we should scribe
>    these on the wiki, but I'd like some more thinking to happen first.

Yes, "one more thing": I want to do a pass through our metrics and flag
things that require new measurements/data collection (like CDF-RTT,
and/or Guard-enabled Torperf runs). I also want to figure out if we want
to use any of Mozilla's browser metrics, and figure out how closely we
model browser activity, if Mozilla's metrics are not an option.

> [0]: https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/PerformanceExperiments

FYI: I plan to split the above page into just a metrics page, and a
separate experiments page.

Also, the things on that page are actually "Performance Tuning". I would
have freaked people out a lot less if I had used accurate terminology
and called this page "tuning" instead of "experiments", but you're all
paying attention now, aren't you? Muahahaha, nerd sniped! ;)

Actual experiments (ie: low-hanging-fruit dev tasks, reproducing
published research papers, and conducting new research) should be their
own pages/sections, for each phase of the R&D pipeline.

But every phase of the pipeline must use the same metrics. That's the
key thing that will make the pipeline efficient. It will be easiest to
move stuff from research idea to live network quickly if we have
confidence in our metrics, and have confidence that our simulators will
produce metrics that match what happens on the live network, and
finally, have confidence that this is actually what our users will
experience (aka no model error).

-- 
Mike Perry
