[tor-scaling] notes & next meeting

Mike Perry mikeperry at torproject.org
Mon Apr 29 21:59:15 UTC 2019


Roger Dingledine:
> On Fri, Apr 26, 2019 at 11:54:50AM -0700, Gaba wrote:
> > Roger points out four things to watch out for during these experiments:
> 
> Here were my four things:
> 
> (A) 72-hour (or 48-hour) cycle is not as good as a 7-day cycle: we will
> be sad if we don't account for cyclic behavior on the real network. Like,
> comparing a Tuesday to a Saturday will result in surprises. Better to
> compare a Tuesday to a Tuesday. We learned this lesson before, in the
> user count anomaly detector: see e.g. the weekly pattern in
> https://metrics.torproject.org/userstats-relay-country.html?start=2017-01-27&end=2018-01-01&country=ae&events=on
>
> (B) Even if we do 7-day cycles (turn on the feature for 7 days,
> turn it off for 7 days, compare), some state will still bleed over
> between cycles. In particular, clients will have already picked their
> guards before the experiment. Or the inverse, they pick them during
> the experiment, and still have them afterwards. That latter issue is a
> great example of why we need to think through anonymity impact of each
> experiment -- the effect of an experiment can last for months after we
> turn it off.

At a couple of points during the meeting, people made comments like
the two above, raising concerns about things I had already made an
effort to account for in the performance experiment plan wiki page and
in my long-form mail. Hearing those concerns repeated without any
reference to the work I did on the plan is frustrating; I don't know
whether what I wrote was insufficient, not read, or not understood
(should I write *more* long emails? ;).

I know that I only sent out the list of performance experiments the day
before the call, and that the strategy document was long enough that it
was easy to miss the points I made about the above. But that is exactly
why I want to keep us focused on the experiment and strategy plan
documents, rather than making broad comments without grounding them in
the specifics of the plan.

Going forward, I want comments like these to take the form of proposed
changes to the experiment plan wiki page, the kanban, and/or the
strategy document, so that they persist.

> (C) There are other metrics to look at when assessing whether an experiment
> is working, e.g. total network bandwidth used. Maybe that's just the dual
> of the existing user-side performance metrics ("when user throughput
> goes up, total network load will go up too because everybody will be
> getting more of it"), or maybe they're different and need to be assessed
> separately. More broadly, once we start one of these experiments,
> and we're trying to look at everything we can to see if it's working,
> we should watch what we do (what we look at) and fold that into the plan
> for later experiments.

As per the strategy document, I think minimalism is key here. I really
want to avoid measuring everything and the kitchen sink for every
experiment, because doing so is both costly and confusing.

By the time a performance feature is being tested on the live network,
we had better be able to predict which metrics we expect it to impact;
otherwise, something went badly wrong with our research and development
effort up to that point. I think Nick's suggested fields for each
experiment -- one for "what do we need to verify that the parameter is
actually applying?" and another for "Abort Criteria" -- are a good way
to capture the few extra things we might want to keep an eye on in each
case, to guard against either catastrophic failure or just measuring
noise. What those extra things are will also be specific to each
experiment.

Again, rather than arguing this point in the abstract, let's focus on
the experiments we're actually planning to run, see what is needed, and
record it on the wiki page. And when what is on the wiki page proves
insufficient, we should add more fields to the experiment template.

> (D) Spare network capacity (advertised bandwidth minus load) is tricky
> as a metric, because our advertised bandwidth is a function of relay
> load. So it's easy to end up with circular analysis, where e.g. we add
> more load onto a relay which causes it to "discover" that its capacity
> is higher -- which counterintuitively means that increasing the load on
> the Tor network could result in more spare capacity. So we need to be
> really careful drawing conclusions about this spare bandwidth metric --
> and we might be best served by finding some other metric, or finding a
> novel way to measure what we think is the real capacity of a relay.

This metric is not actually used in our experiments yet. We currently
use the "Per-Relay Spare Network Capacity CDF" only as a way of
checking that each relay comes as close to its own peak as all of the
other relays do, which is a way of measuring load balancing for the
Fast Relay Cutoff experiment.

In other words, we use this only as a relative metric -- any
discrepancies caused by relays hitting their peak only some of the time
will show up in this CDF, and indicate load balancing issues of exactly
the kind the metric was designed to discover.
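
To make that relative use concrete, here is a minimal sketch of how I
think of the computation, in Python. The data layout is hypothetical
(it is not the actual metrics pipeline output); the point is only that
each relay is compared against its own observed peak:

    # Sketch: each relay's spare capacity as a fraction of its *own*
    # observed peak, plus the sorted values that form the per-relay CDF.
    # "relays" is a hypothetical list of dicts with keys "peak_bw" and
    # "current_load", both in bytes/sec.

    def spare_fraction(relay):
        # Fraction of this relay's own peak that is currently unused.
        if relay["peak_bw"] <= 0:
            return 0.0
        spare = max(relay["peak_bw"] - relay["current_load"], 0)
        return spare / relay["peak_bw"]

    def spare_capacity_cdf(relays):
        # Sorted spare fractions; plotting i/N against these values
        # gives the CDF we compare across experiment phases.
        return sorted(spare_fraction(r) for r in relays)

Normalizing by each relay's own peak is what keeps the metric relative:
what matters is whether a relay reaches its peak as consistently as the
others do, not the absolute "advertised bandwidth" number that (D) is
worried about.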

But, as an example of capturing concerns like this on the experiments
wiki page, I have added the following note to the "Per-Flag Spare
Network Capacity" metric there:

  XXX: Note that "advertised bandwidth" is not an accurate reflection
  of a node's peak capacity -- we might want to extract the highest
  advertised bandwidth value over a longer period of time (e.g. 1
  month) for each node to get a better reflection of its peak capacity
  for use in deriving this metric.
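
As a sketch of the kind of extraction I have in mind there (the data
structure is hypothetical; real descriptor parsing would go through
something like Stem or the CollecTor archives, which I am not showing):

    # Sketch: estimate each relay's peak capacity as the highest
    # advertised bandwidth seen over a longer window (e.g. 30 days),
    # rather than the instantaneous advertised bandwidth value.
    # "history" is a hypothetical dict mapping relay fingerprint to a
    # list of (timestamp, advertised_bw) samples.
    import datetime

    def peak_capacity(history, window_days=30, now=None):
        now = now or datetime.datetime.utcnow()
        cutoff = now - datetime.timedelta(days=window_days)
        peaks = {}
        for fp, samples in history.items():
            recent = [bw for (ts, bw) in samples if ts >= cutoff]
            if recent:
                peaks[fp] = max(recent)
        return peaks

Taking the per-relay maximum over a longer window smooths out the load
dependence of any single advertised bandwidth reading, which is the
circular-analysis concern (D) raises.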

 
-- 
Mike Perry