[tor-scaling] Tor Scalability Strategy: The Scientific Pipeline (+Doodle Poll)

Mike Perry mikeperry at torproject.org
Sat Apr 6 00:31:32 UTC 2019


I've been speaking with Isabela, Roger, and Nick, and we've decided that
we need a clear strategy to organize and build our scalability plan on
top of. I'm going to lead the creation of this strategy, and manage its
execution.

The TL;DR description of what we plan to do in the immediate term is
written in Section 0: Executive Summary. Everything proposed here is
subject to us obtaining funding and resources to do it, which will
involve discussion in subsequent scalability meetings with the network
team, the metrics team, and everyone else on this list. At the end of
this mail is a doodle poll for our next IRC meeting. Please fill it out!

Because there are many different types of people subscribed to this
list, and because this strategy will be the basis for several funding
proposals, and because the process that I am describing will also apply
to traffic analysis, anonymity metrics, and other research areas, and
because it will also improve our network development speed and
engineering confidence, I want to spend the time to describe the "Why"
behind the "Executive Summary".

This extended strategy explanation is done in five more sections, after
the executive summary. Section I explains a plan to create an assembly
line out of Tor's existing waterfall R&D model. Section II explains how
we will roll out this assembly line process in stages, through
experimentation. Section III explains how we will do prioritization and
resource allocation while using the process. Section IV specifies where
we will start and gives specific experiment examples. Section V goes
over what we should discuss in our next meeting, and has a doodle poll
for that meeting.


0. Executive Summary

Tor Project, Inc wants to improve the performance and scalability of the
Tor network in measurable ways. 

There are a large number of potential performance enhancements that
range from parameter tuning on already-deployed features, to full scale
research and development tasks. Evaluating which changes are worthwhile
requires experimentation that is trustworthy and reproducible.

In particular, Tor Project, Inc has some features that are already built
that need parameter tuning, right now.

Tor Project, Inc does not have standardized, calibrated testing
network(s) to evaluate tuning these features. We have Rob Jansen's
Shadow simulator models from over the years, and we have Dave Goulet's
real-world testing network, and we have Chutney. Other research
organizations also have their own custom, opaque testing methodologies.

Proposal: Because we already have some parameter tuning of existing
features to do, we will kill two birds with one stone by performing
these A/B parameter tuning experiments on our live network, and then
seeing which testing network models are able to reproduce the changes
we saw in our performance metrics on the live network, for the least
expense.

Expected Result: In this way, we not only tune our live systems that
need tuning, but we also discover, through experimentalism, which
testing network models are able to match our parameter tuning results to
what we saw on the live network, and which testing network models were
not accurate enough to do this.

Long Term Plan: We then use those testing network models for future
experiments, *before* running those experiments *again* on the live
network, to verify their results in practice.

Long Term Maintenance: Whenever the live network fails to reproduce
what our testing network predicted, that indicates model error in the
testing network, and we will have to perform more experiments on both
to determine why their results disagree, just like we did at the
beginning.

Philosophy of Scope: We will favor a strategy of metrics minimalism to
avoid scope creep. For this project, we only care about issues that
cause our testing networks to predict different results in a few key
user-facing performance metrics that model typical Tor user experience.
Later projects can repeat this process for other problem domains
studying different metrics (anonymity, traffic analysis, more accurate
Tor Browser usage models, etc). High resolution "full take" data will
only be used to investigate and debug sources of model error between
testing and production (and of course during the development phase of
new performance features).


I. The Scientific Pipeline

The motivation for conducting the project in this way is to streamline
and parallelize Tor's long R&D cycle. We will do this by creating a
pipeline of experimentalism; an assembly line of science.

So, let's start with the scientific method as applied to software
research and development. The basic flow is:

  Problem -> guess a solution -> predict results -> control/model solution
           -> experiment/test -> measure -> reproduce/repeat

In research scenarios, this process is just called the scientific
method. In software engineering scenarios, this process is often called
"test-driven development", and also "A/B feature testing".

Those of us who have been around Tor long enough know that we also have
an established flow of science-based R&D:

                Ideas -> Prototype -> Research Paper
        -> Funding&Tor Proposals -> Develop -> Test -> Production

We tend not to control the first line (the research phases) very much.
We put research ideas out there, but the form of the prototype and the
nature of the experiments that are published in the research paper are
often not very useful to us, or not complete. See
https://blog.torproject.org/how-do-effective-and-impactful-tor-research

For the second line, there has been a trend over the past 10 years or
so in software development to run the three phases in a very tight loop
(Dev->Test->Prod), and to rely on test-driven development. This is now
accepted as best practice in many very large and successful companies.

Right now, Tor has some CI, but it is not able to execute this second
line in a tight loop. This is due to incomplete code coverage, lack of
comprehensive integration tests, and low confidence that when our tests
pass, everything is really working as intended. With funding, we would
build infrastructure that supports a modern Dev->Test->Prod flow,
because that is the established best practice for rapid software
development.

For performance- and scalability-related work, this means we would have
integration tests that run in CI and check baseline performance and
anonymity metrics under load. We would have a dedicated testing network
or large-scale simulator that produces performance and anonymity
metrics that consistently match our live network, for testing large
features. We would have the ability to run A/B tests on the live
network (executed as global consensus feature On/Off tests, to avoid
fragmenting the anonymity set), to study the effects of changes on our
metrics as we roll features out live.
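
To make the CI piece concrete, here is a minimal sketch of the kind of
gate I have in mind (the baseline numbers and metric names are
invented, and nothing here reflects our actual CI setup): compare a
run's key percentiles against a recorded baseline, and fail the job if
any metric regresses past a tolerance.

  import sys

  # Hypothetical baselines from a previous known-good run, in seconds.
  BASELINE = {
      "time_to_first_byte_p50": 0.90,
      "time_to_first_byte_p90": 1.60,
      "download_5mib_p50":      9.50,
  }
  TOLERANCE = 0.10  # fail if a metric regresses by more than 10%

  def percentile(samples, pct):
      # Nearest-rank percentile; good enough for a CI gate.
      ordered = sorted(samples)
      return ordered[max(0, int(round(pct / 100.0 * len(ordered))) - 1)]

  def regressions(current):
      return [(name, base, current[name])
              for name, base in BASELINE.items()
              if current[name] > base * (1.0 + TOLERANCE)]

  if __name__ == "__main__":
      # In a real job these samples would come from a Chutney or Shadow
      # run under load; here they are invented for illustration.
      ttfb = [0.85, 0.91, 0.88, 1.02, 0.95, 1.40, 0.89, 0.93]
      dl = [9.1, 9.4, 9.2, 9.8, 9.3, 9.6]
      current = {
          "time_to_first_byte_p50": percentile(ttfb, 50),
          "time_to_first_byte_p90": percentile(ttfb, 90),
          "download_5mib_p50":      percentile(dl, 50),
      }
      bad = regressions(current)
      for name, base, cur in bad:
          print("REGRESSION: %s %.2fs -> %.2fs" % (name, base, cur))
      sys.exit(1 if bad else 0)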

In this world, our performance-related Tor proposals would be
experiment-oriented, and would predict the kinds of results they expect
to see on our performance metrics, which would help us have confidence
that our implementations are working as expected, and thus increase
development velocity.

As we use this loop, we will naturally learn common sources of
model/testing error from each stage. This is because many
changes/experiments may improve performance metrics in earlier stages,
but fail to behave as predicted at later stages. When these failures are
significant, they will necessitate updates to earlier testing
mechanisms.

This means our testing infrastructure will need maintenance and
continual improvement, and therefore its own dedicated developers to
improve and maintain it. Elsewhere, this type of team is called DevOps,
QA, or Availability Engineering. We might call it Network Health.

Once we have this Dev->Test->Prod modeling capability for ourselves, we
will share it with the research community (parts of it might actually
come from them, such as a metrics-calibrated Shadow simulator). This
will help ensure that their results are standardized in the same way
that ours are.

For cases where research results are not properly standardized, and/or
for our own reproducibility, we would ideally have our own research
programmers, who can adapt research code into experiments that run on
our infrastructure and against our metrics.

If those results are promising on our testing network, the Network Team
will take the feature and make it into production-quality code, ready
for On/Off A/B tests on the live network.

These are three groups (external researchers, internal research
programmers, and the Network Team), operating in parallel, with
infrastructure support from the Network Health/DevOps team. In this way,
we have converted Tor's long R&D waterfall into an assembly line, or
pipeline, of science.

Note that these groups are not strict, either. They are more like roles
or work stations. Individuals may switch from role to role, possibly
following the lifespan of their proposed feature or research project.
This will not only ensure sufficient domain knowledge to quickly deal
with any unexpected issues, but it will also avoid ossification and help
reduce model error at each stage.

Managing the pipeline (Section III) will include ensuring that each
stage has people working on something in that stage.


II. How to Build a Scientific Pipeline

The above system will be expensive to keep running, and we don't even
have the resources to build any of it right now. But we can make use of
the initial pieces we already have, and add the rest incrementally as
we build it.

Because we have to do them anyway, we should start with A/B On/Off
experiments on the live network, tuning our existing performance
features to evaluate their effects on our existing metrics (and metrics
easily derived from our existing data collection).

As we perform these tuning experiments, we should record the state of
the live network, and record the effects on a few key performance
metrics. The change in these metrics (positive, negative, or zero)
should eventually be confirmed on a permanent testing network, or on a
large-scale Shadow simulator model that is reproducible in nature.
Networks and simulator models that cannot confirm the results from the
live network will be ruled out for future testing, until they can be
adjusted such that they actually reproduce results from the live
network. We will favor the least expensive model that can reproduce live
results.

In that way, we know that our testing system is at least accurate
enough to tell us whether these changes were a good idea, a bad idea,
or had no effect at all, and we know this at the lowest cost.
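
In code, that selection step could look roughly like the following (all
of the deltas, model names, and costs below are invented; the real
comparison would run over our actual metrics archives): keep only the
candidate models whose per-experiment metric deltas agree with the live
network in sign and rough magnitude, then pick the cheapest one.

  # Hypothetical change in median time-to-first-byte (seconds) that each
  # live A/B experiment produced, and what each candidate testing
  # network reproduced for the same experiments, with a rough cost.
  LIVE = {"exp_a": -0.15, "exp_b": -0.05, "exp_c": 0.00}

  CANDIDATES = [
      {"name": "small-sim", "cost": 1,
       "deltas": {"exp_a": -0.02, "exp_b": -0.04, "exp_c": 0.00}},
      {"name": "large-sim", "cost": 10,
       "deltas": {"exp_a": -0.13, "exp_b": -0.06, "exp_c": 0.01}},
      {"name": "testbed", "cost": 25,
       "deltas": {"exp_a": -0.16, "exp_b": -0.05, "exp_c": 0.00}},
  ]

  def reproduces(live, model, tol=0.05):
      # A model "reproduces" live results if every experiment's delta
      # has the same sign (or both are near zero) and is within tol.
      for exp, live_d in live.items():
          model_d = model[exp]
          same_sign = (live_d * model_d > 0) or \
                      (abs(live_d) < tol and abs(model_d) < tol)
          if not same_sign or abs(live_d - model_d) > tol:
              return False
      return True

  def cheapest_valid(live, candidates):
      valid = [c for c in candidates if reproduces(live, c["deltas"])]
      return min(valid, key=lambda c: c["cost"]) if valid else None

  print(cheapest_valid(LIVE, CANDIDATES))  # -> large-sim on this toy data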

There will naturally be some experiments that are too risky to perform
on the live production network, unless we are absolutely sure that they
would actually help. Once the Test network has been verified to be a
good enough model to reproduce results from Production, we can then test
the forward direction of the pipeline, by performing those more risky
tuning experiments on Test, and verifying if any of the desirable
outcomes still carry forward to Prod.

In this way, we test the pipeline as we build it. This mail is already
very long, but you can imagine similar stages in reverse for CI, and
for comparing our testing infra to research infra, based on the
reproducibility of research results.

Eventually, as we grow to trust this Testing network for performance
experimentation, we should plan to extend it to other problem domains,
such as anonymity metrics, traffic analysis resistance metrics, better
Tor Browser user models than torperf, and so on.


III. How to Use a Scientific Pipeline

Once we have the full pipeline, or even parts of it, we're going to
have to make decisions about which features we spend our time
implementing, which get testing network time, and which we toggle
On/Off in Prod.

Tracking what is in each stage of this pipeline, and ensuring that
we're picking the right tasks to fill each stage without causing
stalls, is part of the job of managing and building this pipeline. (I
imagine a kanban/Trello model of the system where cards move from left
to right through each of the stages of the R&D pipeline from Section I,
and the total points of the cards in each stage gives us an indication
of where the bottlenecks are; see the sketch below).
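
A trivially small sketch of the card-and-points bookkeeping I mean
(stage names follow Section I; the cards and point values are obviously
invented):

  STAGES = ["Ideas", "Prototype", "Paper", "Proposal",
            "Develop", "Test", "Production"]

  # Hypothetical cards; in practice this would live in Trello or a wiki.
  cards = [
      {"title": "Scheduler interval tuning",  "stage": "Production", "points": 3},
      {"title": "CBT quantile sweep",         "stage": "Test",       "points": 5},
      {"title": "New balancing prototype",    "stage": "Prototype",  "points": 13},
      {"title": "Guard parameter experiment", "stage": "Develop",    "points": 8},
  ]

  def points_per_stage(cards):
      totals = {stage: 0 for stage in STAGES}
      for card in cards:
          totals[card["stage"]] += card["points"]
      return totals

  totals = points_per_stage(cards)
  for stage in STAGES:
      print("%-10s %3d" % (stage, totals[stage]))
  print("likely bottleneck:", max(totals, key=totals.get))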

For prioritization, we will use a Lowest-Hanging-Fruit-For-Best-Return-First
model for choosing which items to move forward in the pipeline wherever
our time is involved. We should weigh the predicted gain against the
development cost and the testing/evaluation time. We should also
discount for any known anonymity risks (noting these as things we would
also like to both model and measure, eventually).
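
As a strawman for the kind of weighing I mean (all of the numbers below
are placeholders; the only point is that predicted gain, cost, and
anonymity risk all enter the score):

  def priority(expected_gain, dev_weeks, eval_weeks, anonymity_risk):
      # expected_gain: predicted % improvement in a key performance metric.
      # dev_weeks, eval_weeks: rough effort estimates.
      # anonymity_risk: 0.0 (none) .. 1.0 (severe), used as a discount.
      return expected_gain * (1.0 - anonymity_risk) / (dev_weeks + eval_weeks)

  candidates = [
      ("Scheduler interval tuning", 10.0, 1, 2, 0.0),
      ("CBT quantile change",        5.0, 1, 2, 0.3),
      ("Guard count change",         8.0, 2, 4, 0.5),
  ]

  for name, gain, dev, ev, risk in sorted(
          candidates, key=lambda c: priority(*c[1:]), reverse=True):
      print("%-26s score %.2f" % (name, priority(gain, dev, ev, risk)))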

I don't intend this decision-making process to be fully algorithmic,
but I want to be able to refer to some kinds of criteria and evaluation
at every stage to make sure a particular feature/project is worth
continuing to later stages.

I also plan on iterating on this process: before we have resources,
we're limited to wall-clock On/Off time and will-to-test on Prod. As we
add testing network resources, we might bottleneck on simulation time.
Other stages will most likely bottleneck on developer time, or
researcher interest, or the level of quality of research produced in the
wild.

It will be a fine art to keep this thing moving (especially while we're
severely resource-constrained), but it won't be black magic.


IV. Where to start

We have plenty of items in each stage of the pipeline already, and we
have external researchers around the world who will contribute new "top
of funnel" research papers as well.

At the end of the pipeline, we have several ideas in deployment that are
in need of tuning, specifically: EWMA, KIST, CBT, 1-vs-more Guards,
preemptive circuit building, load balancing/torflow, and relay cutoffs.
Each of these systems has parameters that can be tuned that will affect
performance on the live Tor network. Most of them can also be switched
On/Off.

Several of these also have inherent anonymity metrics built right into
them, which we will have to trade off against the gains they provide
(CBT and relay cutoffs are squarely in this category: their performance
parameter *is also* the anonymity metric; 1-vs-more Guards and load
balancing affect anonymity more subtly).

Other features, like preemptive circuit building and KIST, merely need
tuning and do not substantially impact anonymity. Still other features
might not even be testable with torperf, or will require more realistic
user-perceived performance models. For example, measuring 1-vs-more
guard effects requires repeated torperf experiments with guards
enabled. Preemptive circuit building will require an accurate model of
the typical user's usage to test properly.

To formalize this, we'll be creating a wiki page of planned experiments
on the live network: what each experiment involves, what we predict
will happen, what metrics we need in order to measure what actually
happens, and what anonymity trade-offs are involved. Then we will go to
the dirauth operators with each experiment, ask them to flip the
parameters as needed according to a timing pattern, and measure the
results on our live metrics.
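
One way to keep those wiki entries reproducible is to also keep them in
a machine-readable form, so that the same record can drive both the
dirauth parameter schedule and the later testing network replays. A
possible shape (the field names, the parameter name, and the values
below are only illustrative placeholders):

  from dataclasses import dataclass
  from typing import Dict, List, Tuple

  @dataclass
  class Experiment:
      name: str                     # short human-readable label
      consensus_params: Dict[str, Tuple[int, int]]  # param -> (Off, On)
      prediction: str               # what we expect to happen
      metrics: List[str]            # live metrics to watch
      anonymity_tradeoffs: str      # known risks, for prioritization

  example = Experiment(
      name="Scheduler interval tuning",
      consensus_params={"ExampleSchedParam": (10, 2)},
      prediction="lower p90 time-to-first-byte under load",
      metrics=["time_to_first_byte_p50", "time_to_first_byte_p90"],
      anonymity_tradeoffs="none expected",
  )
  print(example)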

For each experiment, to control for the effects of time on the live
network, we will reproduce the same On/Off parameter change at many
different times of day and on many different days of the week, over a
long period of time. We will also watch for any trends during the "Off"
stages, and repeat any data points that seem to be affected by external
trends.
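
A sketch of the kind of toggle schedule I have in mind (the durations
and counts are arbitrary): alternate On and Off phases at random hours
spread over several weeks, so that time-of-day and day-of-week effects
average out.

  import random
  from datetime import datetime, timedelta

  def build_schedule(start, weeks=4, toggles_per_day=2, seed=42):
      # Returns a list of (timestamp, "On"/"Off") toggle requests that
      # cover many different hours of the day and days of the week.
      rng = random.Random(seed)
      schedule = []
      for day in range(weeks * 7):
          date = start + timedelta(days=day)
          for hour in sorted(rng.sample(range(24), toggles_per_day)):
              phase = "On" if len(schedule) % 2 == 0 else "Off"
              schedule.append((date.replace(hour=hour), phase))
      return schedule

  for when, phase in build_schedule(datetime(2019, 5, 1))[:5]:
      print(when.isoformat(), phase)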

The record of our metrics data for these time periods will be preserved
in the highest level of extrainfo descriptor recording detail, as
separate archives, for us to easily reproduce these results on any
existing and future simulation and testing networks that we have access
to.

Each testing network will be tested to determine the lowest amount of
resources it needs to accurately reproduce the results from the live
network, on our chosen metrics. If any models have differences that
can't be accounted for by simple scaling effects (like # of relays,
total bandwidth, etc), they will either be rejected, or inspected and
fixed.

Additionally, for any historical instances of network overload, such as
the Snowden-era botnet and the recent onion service DoS attacks, we
will favor testing network models that show similar effects under
similar load conditions to what our performance data showed in the live
network during that time.


V. Next Meeting Time and Agenda

In our next meeting, we should discuss this strategy, our potential
experiments, what metrics we would need for them, and what it would
take to collect, derive, and record them in a way that is easy to
digest and easy to reproduce on our candidate networks.

From there, we can start to build a budget for the first bootstrapping
stages of this project, which would allow us to conduct these
experiments and choose a good network simulator or testing network
model. After that, we will have an idea of the expenses involved in the
long-term strategy.

Here is a Doodle poll covering the next two weeks of potential meeting
times. Please fill it out as soon as you can:
https://doodle.com/poll/n5xcc93fvtgtth57

-- 
Mike Perry