[or-cvs] r19162: {projects} start making a 2009 todo list out of the performance ideas. (projects/performance)

Thu Apr 30 06:38:15 UTC 2009

On Sat, Apr 11, 2009 at 01:54:33AM +0200, Karsten Loesing wrote:
> >   - 1.2, new circuit window sizes
> >     - Conclude whether the transition period will hurt as much as it
> >       seems like it will.
> >     * Pick a lower number, and switch to it.
> > Metrics: gather queue sizes from relays so we have a better sense of
> > what's actually going on.
> 
> Can you tell more details? Are there queues for circuits (1000 cells
> max) and for streams (500 cells max)? Is relay.c a good place to start
> looking for that code? Are we interested in average numbers of cells by
> streams, by circuits, by both, or only in the sum of all cells waiting?
> Should we write the number of cells waiting in those queues to disk
> every 1 or 10 seconds and use those data for evaluation?

There are two phases here. The first is to instrument the heck out
of a few relays that we run, so we can get more details and better
intuition about what we should be trying to learn. To that end, for
every OR connection, what is its current outbuffer size? How many active
circuits are there on that conn? ('active' means they have a cell queued
for writing onto the outbuffer.) For the active circuits, how many cells
are queued at each?

The outbuffer lengths can be found via various places that call
buf_datalen() in buffers.c. See also dumpstats() in main.c. For the
circuit queues, check out conn->active_circuits in various places of
relay.c. For the interaction between active circuits and outbuffers,
see connection_or_flushed_some() in connection_or.c.

It's probably also worth instrumenting some of the things you suggest
above, in terms of how many cells are traversing each circuit. Do most
circuits stay inactive, but a few become active, send their cells, become
inactive, get new cells and become active, and keep oscillating? Or
are there active circuits that just stay active for seconds at a time
because they can't clear their queue?

So we should collect these numbers first, and look at them, and figure
out if it changes our opinions about what else we want to know.

Then we should summarize the info in a way that we can collect and
log 60-second snapshots to watch for trends: median queue size, 90%,
10%; mean outbuf size, 90%, 10%. Mean number of active circuits,
90%, 10%. Whatever else looks interesting. And based on our "why does
everything happen at the beginning of each minute" questions, we should
probably try a period of something different than 60 seconds too, so we
can see whether we get way different answers. :)

While I'm at it, for relays that rate limit, I want to know what fraction
of each second they spend with empty write buckets. My guess is that
for most relays that rate limit, they have a full second's worth of
data queued up already, and at the top of each second, they pull off
one second's worth of bytes, send them, and then go dormant again until
the next second. That behavior sucks, and we have some ideas of how to
fix it (step one: lower the circuit window sizes, so there's less data
in flight on the network; step two, reduce the granularity of the token
bucket refills, so it sends bytes more regularly throughout the second),
but step zero is to confirm that this behavior actually happens.

What's the average time that a given cell spends waiting at our
relay? If we add a struct timeval to the packed_cell_t, we can learn
how long it takes until that cell gets flushed to the outbuf. Once we
know max/min/mean/median for that, we can decide if there's something we
can improve. Those numbers are also metrics that we should automate and
track over time, to give us a sense of intra-relay slowdown and whether
we're fixing it.

Also in there: do some relays refuse to read from us for multiple seconds
at a time? That is, above I speculated that stuff is waiting for the
next second before it goes out. But if there are several seconds of
stuff buffered, then the wait will be even longer. So even if the relay
that we instrument is fast enough and behaving well, this should give
us a sense of how loaded the other relays are.

The main reason to lower the circuit window is to reduce the clutter of
bytes in the middle of the network. By reducing the queues, we should
be able to decrease latency. So another item we need to measure is the
actual latency that Tor clients should be seeing. We should run a Tor
client somewhere that makes "typical" Tor circuits (i.e. just let Tor
choose its own paths, but set UseEntryGuards to 0, and only make requests
to port 80 so it builds circuits which exit there), and send pings every
so often, and track how long they take. One easy way to ping is to make
a request to an IP address that we know is refused by the last hop's
exit policy. Say, 127.0.0.1:80. Then measure the time between sending
the connect cell, and receiving the end cell. We expect this latency
to go down over time, a) because we lower the circuit window, and b)
because Tor has on-average-better circuits based on Mike Perry's plans.

Once we've done this instrumenting for circuits on ORConns, we should
think about whether we need to do the same things for instrumenting edge
streams and how they add cells to their circuits, and how they flush
them on the socks side. But I hope that we get most of what we need
out of looking at how relays behave inside the network.

So that was phase one. Phase two is to roll out a few of these changes
to every relay, and then have them record summary stats and publish them
in their extra-info documents. That way we'll get a better cross-section
of the network rather than just seeing the relays that we set up. This
phase can come later once we know more about what we want.

[High priority, since I really want to get the baseline measurements
in place so we can start rolling out our changes.]

> >   - 2.5, Default exit policies
> >     * Change Vidalia's default exit policy to not click "other protocols".
> >     D let exit relays specify some destination networks/ports that get
> >       rate limited further.
> > Metrics: At what fraction of exit relays allowing a given port out
> > do connections to that port start to suffer? That is, if even 5%
> > of the relays (by bandwidth) allowing a port to exit are enough for
> > most connections to that port to work fine, then we're going to have
> > a tough time pushing unwanted traffic off the network just by changing
> > some exit policies. (Alas, this question is messy because it pretends
> > that the amount of traffic generated for port x is independent of x.
> > How to phrase it so it's more useful?)
> 
> Okay, I'm not sure if I understand that question. What exactly do you
> want to have measured here?

Let's say port 6881 is the default bittorrent port. It's refused in the
default exit policy, but there are a few relays out there who allow
it anyway. If I'm doing a connection to port 6881, that means I have
to exit from one of those few that allow it. We theorize that the few
relays who allow it are massively overloaded (e.g. have higher latencies
and give lower throughput) compared to other relays that advertise the
same bandwidth, because they're attracting all the default bittorrent
streams. So the first question is: is this really true? We could look at
Mike's torflow stats and look at the exit policies and see if open exit
policies are correlated to some of the slower relays. The next question
is: if you try to fetch a large file from port 6881 vs from port 6880,
how big a performance difference do you see?

Then there are other ports, like 119 or 25, that might not attract as
much traffic. Are the relays that allow those as overloaded as the ones
that allow 6881? This question may be hard to answer, since I imagine
there aren't many relays in either of these categories, and the few that
do exist probably overlap.

The ultimate goal is to answer the question: If we reject more ports in
the default exit policy, a) would we reduce the performance for some
classes of users (theoretically the answer is "of course yes", but
the question is "how much in practice?"), and b) would we thus improve
performance for the folks who use the remaining ports? I'm thinking in
particular of the folks who use port 80.

I'm not sure how to measure this though, because the "reduce performance
for some classes" step is designed to make them stop trying to use Tor,
and we'd have to have some model for how successful we'll be, and it
gets messy very quickly. Hm. Suggestions? :)

In any case, we can still try to answer the preliminary question, which
is "does restricting a port in the default exit policy actually have a
big impact on the performance you get if you want to use that port?"

[Medium priority, because I don't expect it'll be a huge change, so
the relay-buffer and circuit-window stuff should come first]

> >   - 3.6, incentives to relay
> >     - Sort out how to do circuit priority in practice. I think the only
> >       answer here is to make different TLS connections for different
> >       priorities. (Otherwise other people can free-ride on your
> >       high-priority conns.)
> > Metrics: what period of time should the gold star status last? That is,
> > What period of time, taken as a rolling snapshot of which relays are
> > present in the network, guarantees a sufficiently large anonymity set
> > for high-priority relays?
> 
> This goes in the direction of the churn measurements for my thesis. But
> I'm unsure what exactly the question is. You want to know how many of
> the relays at time X are still running at time Y? Or, maybe only a
> subset of relays (which criteria)?

If I know that my target was a relay within the past X hours, what is the
expected size of his anonymity set for various values of X? Plotting the
"number of relays that could have been my target" vs X will give us a
sense of the churn of the network.

But let's say my target has linkable behavior. He logs into wikileaks
with a distribution centered around every L hours, and each time he
does that, I know he was a relay within the past X hours. Now we can
do an intersection attack. For various values of L (a day, a few days,
a week, a month), what do the above entropy-vs-X graphs look like?

I would guess that a small X is going to shrink the entropy quickly. But
if we use an X of a week or two, does it still shrink quickly, or is
there a large stable core of the network that still provides pretty good
cover? And if there is, is the stable core stable enough that the value
of X doesn't much matter?

[Low priority, since there's a good chance the conclusion will be "don't
do that"]

> >   - 4.2, getting better bandwidth estimates
> > Metrics: how accurate are the ten-second-bandwidth-burst advertised
> > numbers anyway, in terms of guessing capacity? Steven says we're at 50%
> > load, but is that just because our advertised bandwidth is a function
> > of our recent load?
> 
> How would accuracy be measured? How do I learn how much 100% of the
> capacity of a relay are?
> 
> >     - What is "true" capacity anyway?
> > Metrics: What other algorithms can we use to produce a more accurate
> > advertised bandwidth?
> 
> Is this a question that can be answered by metrics? Can you give more hints?

Hell if I know. I guess that means this one should wait. ;)

Mike Perry has a plan to use torflow to measure the capacity he actually
receives from each relay, and compare it to the capacity he receives
from its peers (that is, relays that advertise similar bandwidths). Then
we will modify the bandwidth values in the networkstatus consensus based
on the measurements, with the hopes of making peers more equal. At that
point there will be a feedback loop, where advertising a higher number
attracts more users, thus slowing it down; and advertising a lower
number sheds users, thus speeding it up. So if we can get the tuning
right, the advertised bandwidth numbers will more accurately reflect
reality.

Then the new set of metrics questions will come in: how come some numbers
are so different from what the relays themselves are claiming? Are there
ways we can help the relays come up with a number that's closer to what
torflow decided they should use, so torflow doesn't have to correct it
as much?

Another metrics question is to reproduce the Snader-Borisov results, by
running an instrumented relay that records how fast it thinks the other
relays are, and see how closely the measurements match a) the bandwidth
that the relay advertises, and b) the bandwidth in the consensus (well,
the load balancing ratio in the consensus, really). Then get a bunch of
people to run the instrumented version, and see if the median or mean of
their opinions is more useful. I bet Robin would be happy to help with
this -- I almost rolled out a version of their patch in 2008, but then
I had second thoughts because I was unsure how much anonymity could be
broken by an adversary who sees votes by every relay on the bandwidth
they've seen from every other relay, and also has some network traces
of his own. There's another research question for somebody to tackle. :)

[Low priority, since Mike should do his part first]

> >   - 4.5, Older entry guards are overloaded
> > Metrics: compare "how fast each relay should be based on its advertised
> > capacity" with "how long the relay has had the guard flag", to see how
> > big an issue this is really.
> 
> Okay, that means finding a possible correlation between advertised
> capacity and time since getting the Guard flag for the first time.
> Should be possible, but I need to think harder about doing this
> efficiently with the current database.

Yep. It actually isn't "time since getting the Guard flag for the first
time" -- it's "time spent in the network with the Guard flag", since
the users you accumulate as a guard are, well, cumulative.

[High priority, since I changed the code in 0.2.1.14-rc, so our period
of being able to measure the old situation is disappearing as people
upgrade.]

> > Metrics: How many relays does a client touch over time x, given that they
> > drop old guards y seconds after choosing them? Even if y is infinite,
> > we have some number based on guards going away. How does x grow as we
> > reduce y?
> 
> Hey, x has two meanings here. ;) The question should be "How many relays
> z does a client touch over time x, given that they drop old guards y
> seconds after choosing them?" Maybe we want to fix the time x to, say, 1
> month?

Sounds good. I actually picked 4-8 weeks for y in 0.2.1.14-rc, so I'd be
curious to know the answer for several values of x: 1 month, 2 months,
6 months, 12 months, etc.

Basically, my hope is that a value of 1 month for y is approximately
the same as a value of infinity for y, even for large values of x.

On the other hand, if y=1mo and y=\inf actually are the same, then
shouldn't the network be rebalancing over time, and the old Guards
shouldn't be so overloaded? I don't actually know what to expect here. I
look forward to some actual data. :)

[Medium priority, since we should learn a good value for y and use it,
rather than continually picking new bad values and splintering the
anonymity set too much.]

> >     * Pick a conservative y like six months, and implement.
> >     D Reduce y based on the results of the metrics. (don't partition
> >       clients too far by tor version though.)
> > Metrics: if we were more flexible in our Guard stability criteria, how
> > many more relays would get the Guard flag? How would that influence the
> > above numbers? I'd like to become very flexible so more than half of
> > the relays get to be guards. Are there cutoffs that are reasonable and
> > fit naturally into the data from the past few years?
> 
> I have started playing with the selection criteria to see how many
> Fast/Stable/Guard nodes we'd have ended up with in the past. If the
> stupid database finishes anytime soon, we'll have results tonight or
> tomorrow. The question how that would have influenced the above numbers
> would be next.

Great. Let me know how that goes. :)

[Medium priority, since right now there are far too few guards.]

We'll still want to tackle the edge-case where a user picks three
low-bandwidth guards and keeps them for a month. Having a high variance
of outcomes from guard selection means that some users will find Tor fast
and some will find it slow, and that leads to hack recommendations like
"if Tor is slow then delete your state file and restart".

I'm not sure how to resolve this. The fragile kludge is to have some
smart heuristics, e.g. Alice tries again if she chooses three guards whose
bandwidths sum to less than 100KB. The overkill way is to make RealGuard
and AdequateGuard flags, and make sure Alice picks at least one RealGuard.

> > Metrics: if we're more flexible in our Guard speed criteria, how does
> > that impact the speed that clients should expect? Originally we avoided
> > 20KB/s relays as guards, because "then clients can't ever get more than
> > 20KB/s". But they can't get that now anyway.
> 
> What metric would answer this question? The distribution of advertised
> bandwidth for different Guard selection criteria?

I'm not sure. I was going to suggest to make paths through the first
set and then through the second set, and compare the performance you
see. But the situation is so broken right now wrt guard load that an
experiment like that wouldn't really be useful.

Another option would be to do all our changes, let everything settle down,
and then do the two sets of measurements and compare. I'd be tempted to
include "be more flexible about who can be a guard" in those changes,
and then ask the question in reverse: if we were more strict, how much
faster would Tor get for the client?

> >   - 5.2, better timeouts for giving up on circuits/streams
> >     * clients gather data about circuit timeouts, and then abandon
> >       circuits that take more than a std dev above that.
> > Metrics: Right now we abandon the circuit after 10 seconds for the first
> > try. What are the download stats if we don't abandon it? What "try a
> > new one" timeouts will minimize the number of circuits we go through,
> > while also minimizing the time-until-user-gets-website?
> 
> Hmm. We might completely remove the 10-seconds timeout and see how long
> it takes until the stream is attached (or if it fails). From these data
> we could derive a better timeout. Is that what you have conceived?

Yes, that sounds like a great start. Once we know how long they take in
practice, then we can see what we think the 10s timeout is doing to the
situation (whether it's helping or hurting).

[Medium priority, since I think we can get a big win here]

Thanks!
--Roger