Hi folks,
I've been talking to a longtime exit relay operator, who is in the odd position of having a good 1gbit network connection, but only one IP address.
He used to push an average of 500mbit on his exit relay, but then the HSDir DoS flatlined his relay for a while (!), and now, perhaps due to the bwauth variability, his exit relay only recovered to maybe 200mbit. He is running a second exit relay on that IP address, but also perhaps due to the bwauth variability, it hasn't attracted much attention either.
The real answer is to fix the bandwidth measurement infrastructure. But while we're patiently waiting for progress there, I've been thinking of raising moria1's AuthDirMaxServersPerAddr to 4, i.e. allowing 4 relays per IP address onto the network.
I don't think it would significantly increase our risk from Sybil attacks, whereas there is a clear benefit of several hundred more megabits of good exit relay capacity.
I will propose this change to the dir-auth list in a bit, but here is your chance to point out surprising impacts that I haven't thought of.
Thanks! --Roger
Hi all,
On 2 Jun 2019, at 05:22, Roger Dingledine arma@torproject.org wrote:
I've been talking to a longtime exit relay operator, who is in the odd position of having a good 1gbit network connection, but only one IP address.
He used to push an average of 500mbit on his exit relay, but then the HSDir DoS flatlined his relay for a while (!), and now, perhaps due to the bwauth variability, his exit relay only recovered to maybe 200mbit. He is running a second exit relay on that IP address, but also perhaps due to the bwauth variability, it hasn't attracted much attention either.
I'd like to confirm the problem before we make major network changes. (And I'd like to know how widespread it is.)
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
Are the relays' observed bandwidths limiting their consensus weight?
Here's how the operator can find out: https://trac.torproject.org/projects/tor/wiki/doc/MyRelayIsSlow#TorNetworkLi...
If the relays are being measured by longclaw's sbws instance, we should also look at their detailed measurement diagnostics.
longclaw's bandwidth file is available at: http://199.58.81.140/tor/status-vote/next/bandwidth
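If you want to poke at that file yourself, here's a minimal sketch for pulling it and skimming the header (plain Python, nothing sbws-specific assumed):

    # Sketch: fetch longclaw's bandwidth file and print the first few
    # lines, which contain the header diagnostics.
    import urllib.request

    URL = "http://199.58.81.140/tor/status-vote/next/bandwidth"

    with urllib.request.urlopen(URL, timeout=30) as resp:
        lines = resp.read().decode("utf-8", errors="replace").splitlines()

    for line in lines[:20]:
        print(line)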
The real answer is to fix the bandwidth measurement infrastructure.
Do we have funding to continue to improve the bandwidth measurement infrastructure? Or to maintain it?
If we don't have any grants in the pipeline, now would be a good time to start some.
But while we're patiently waiting for progress there, I've been thinking of raising moria1's AuthDirMaxServersPerAddr to 4, i.e. allowing 4 relays per IP address onto the network.
I don't think it would significantly increase our risk from Sybil attacks, whereas there is a clear benefit of several hundred more megabits of good exit relay capacity.
I will propose this change to the dir-auth list in a bit, but here is your chance to point out surprising impacts that I haven't thought of.
Splitting bandwidth between multiple relays has privacy implications, because traffic is easier to track between instances.
It also increases the size of the consensus.
So we should choose a value for AuthDirMaxServersPerAddr that is a compromise between these competing goals.
Why is 4 better than 3 or 5?
T
Hi,
On 2 Jun 2019, at 13:30, teor teor@riseup.net wrote:
On 2 Jun 2019, at 05:22, Roger Dingledine arma@torproject.org wrote:
I've been talking to a longtime exit relay operator, who is in the odd position of having a good 1gbit network connection, but only one IP address.
He used to push an average of 500mbit on his exit relay, but then the HSDir DoS flatlined his relay for a while (!), and now, perhaps due to the bwauth variability, his exit relay only recovered to maybe 200mbit. He is running a second exit relay on that IP address, but also perhaps due to the bwauth variability, it hasn't attracted much attention either.
I'd like to confirm the problem before we make major network changes. (And I'd like to know how widespread it is.)
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
Are the relays' observed bandwidths limiting their consensus weight?
Here's how the operator can find out: https://trac.torproject.org/projects/tor/wiki/doc/MyRelayIsSlow#TorNetworkLi...
If the relays are being measured by longclaw's sbws instance, we should also look at their detailed measurement diagnostics.
longclaw's bandwidth file is available at: http://199.58.81.140/tor/status-vote/next/bandwidth
For example, this relay is limited by Comcast's poor peering to MIT and Europe. We've spoken to a few Comcast relay operators with similar issues.
https://lists.torproject.org/pipermail/tor-relays/2019-June/017376.html
Adding more tor instances on networks like Comcast would only slow down Tor.
The real answer is to fix the bandwidth measurement infrastructure.
Do we have funding to continue to improve the bandwidth measurement infrastructure? Or to maintain it?
If we don't have any grants in the pipeline, now would be a good time to start some.
I wrote to the grants team about bandwidth authority funding.
T
On Sun, Jun 02, 2019 at 01:30:18PM +1000, teor wrote:
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
The one in question is in Sweden: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9...
It has votes of:
w Bandwidth=10000 Measured=65200
w Bandwidth=10000 Measured=70000
w Bandwidth=10000 Measured=74200
w Bandwidth=10000 Measured=77000
w Bandwidth=10000 Measured=99400
w Bandwidth=10000 Measured=102000
and it currently reports a self-measured peak at 56MBytes/s.
So one could interpret the current bwauths as saying that it is a bit above average compared to other 56MByte/s relays. Maybe that's because the other 56MByte/s relays got better lately, or maybe there's less overall traffic on the network, but my guess is that it's stuck in that rut because the bwauths are not good at realizing it could go a lot faster.
Are the relays' observed bandwidths limiting their consensus weight?
bandwidth 89600000 102400000 55999620
So it looks like no.
If the relays are being measured by longclaw's sbws instance, we should also look at their detailed measurement diagnostics.
Looks like yes, it is measured:
w Bandwidth=10000 Measured=78000
I look forward to hearing about these detailed measurement diagnostics. :)
Do we have funding to continue to improve the bandwidth measurement infrastructure? Or to maintain it?
If we don't have any grants in the pipeline, now would be a good time to start some.
Agreed.
sbws was always intended (as far as I recall) to be a bandaid to make the torflow approach more maintainable, while we continue to await research on better-but-still-workable approaches. I hear the NRL folks have another design they've been working on that sounds promising.
I will propose this change to the dir-auth list in a bit, but here is your chance to point out surprising impacts that I haven't thought of.
Splitting bandwidth between multiple relays has privacy implications, because traffic is easier to track between instances.
Right. It also causes more TCP connections to be used overall than would be needed if we could make individual relays work better.
It also increases the size of the consensus.
So we should choose a value for AuthDirMaxServersPerAddr that is a compromise between these competing goals.
Why is 4 better than 3 or 5?
I figured doubling 2 would make security intuitions simpler.
(4 is also the value we used to use for AuthDirMaxServersPerAuthAddr.)
--Roger
Hi,
On 2 Jun 2019, at 18:21, Roger Dingledine arma@torproject.org wrote:
On Sun, Jun 02, 2019 at 01:30:18PM +1000, teor wrote:
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
The one in question is in Sweden: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9...
In your first email, you said:
He used to push an average of 500mbit on his exit relay, but then the HSDir DoS flatlined his relay for a while (!), and now, perhaps due to the bwauth variability, his exit relay only recovered to maybe 200mbit. He is running a second exit relay on that IP address, but also perhaps due to the bwauth variability, it hasn't attracted much attention either.
The relay's recent history is a bit more complicated. This fingerprint has only been around since October 2018. It pushed 500 mbit from November 2018 to January 2019, then failed from January to March 2019. Its other bandwidths are about the same as they are now, at around 250 mbit.
And there are two exits on this machine. Here's the other one:
https://metrics.torproject.org/rs.html#details/6B37261F1248DA6E6BB924161F8D7...
It's RelayBandwidthRate-limited to 15 MBytes/s, so the operator's first step should be to remove this limit, and wait a week for the bandwidths to stabilise.
The first exit is also rate-limited to 85 MBytes/s. It might be a good idea to remove both limits at the same time.
Tor is only using about 50% of advertised exit bandwidth right now. These particular exits are using 35% and 60% of their bandwidth limits.
So I don't see anything unusual happening here.
Can you hold off on your proposed changes until we see what happens after the bandwidth limits are removed?
I'll send a separate detailed email with the sbws diagnostics.
T
On Sun, Jun 02, 2019 at 10:11:46PM +1000, teor wrote:
The relay's recent history is a bit more complicated. This fingerprint has only been around since October 2018.
Actually, no, it's been around since something like 2006. But it looks like it was a small relay in recent years, until it became huge in 2018.
And there are two exits on this machine. Here's the other one:
https://metrics.torproject.org/rs.html#details/6B37261F1248DA6E6BB924161F8D7...
It's RelayBandwidthRate-limited to 15 MBytes/s, so the operator's first step should be to remove this limit, and wait a week for the bandwidths to stabilise.
I asked about this, and this other exit is not on the same machine. It's on the same IP address, yes, but it's running on a different computer, and that computer is lacking AESNI and other things that would let it scale well with Tor's current multithreading situation.
I encouraged the operator to raise the Burst on che1, since it seems pretty much maxed out (which means every second some users are suffering by having their traffic rate limited). And also to consider moving che1 onto the same hardware as che, because maybe it can scale better even though it would be sharing the same cpu with che.
The first exit is also rate-limited to 85 MBytes/s. It might be a good idea to remove both limits at the same time.
Well, closer to 90mbytes, or about 717 mbits/s, with a burst up to 819 mbits/s (depending on your definition of m). I can't imagine that raising those rate limits will make a big difference. But you're right that he should raise them anyway, just to rule it out as another variable.
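For reference, here's the arithmetic behind those figures, using the descriptor values quoted earlier (the descriptor line is "bandwidth <rate> <burst> <observed>", in bytes/s):

    rate, burst = 89600000, 102400000
    print(rate / 10**6, rate / 2**20)           # 89.6 decimal MBytes/s vs ~85.4 binary MBytes/s
    print(rate * 8 / 10**6, burst * 8 / 10**6)  # ~716.8 and ~819.2 Mbit/s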
Can you hold off on your proposed changes until we see what happens after the bandwidth limits are removed?
Yes. But it seems like a poor tradeoff to me, to continue delaying exit relay operators from contributing as much as they want to contribute, when we could apply this bandaid while continuing to work on the better fixes.
That said, for this particular instance, I am beginning to think that raising AuthDirMaxServersPerAddr to 3 (rather than 4) would be a good next step for seeing how that goes.
Thanks! --Roger
Hi,
Here are some detailed diagnostics.
My overall conclusion is: there isn't much bandwidth left on that exit.
On Sun, Jun 02, 2019 at 01:30:18PM +1000, teor wrote:
Which bandwidth authorities are limiting the consensus weight of these relays? Where are they located?
The one in question is in Sweden: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9...
It has votes of:
w Bandwidth=10000 Measured=65200
w Bandwidth=10000 Measured=70000
w Bandwidth=10000 Measured=74200
w Bandwidth=10000 Measured=77000
w Bandwidth=10000 Measured=99400
w Bandwidth=10000 Measured=102000
and it currently reports a self-measured peak at 56MBytes/s.
So one could interpret the current bwauths as saying that it is a bit above average compared to other 56MByte/s relays. Maybe that's because the other 56MByte/s relays got better lately, or maybe there's less overall traffic on the network, but my guess is that it's stuck in that rut because the bwauths are not good at realizing it could go a lot faster.
Well, it's not a simple geographical bias. That's the most common measurement issue we see. The closest bwauth has the median measurement, and the North American bwauths are evenly distributed above and below the median.
Interestingly, sbws measures just slightly above the median, so this also isn't an instance of torflow's "stuck in a partition" bug.
It would be nice to have some evidence that the relay is stuck, rather than just slow, poorly connected, or variable.
The Relay Search bandwidth history shows that both relays on that machine vary a lot: https://metrics.torproject.org/rs.html#details/D5F2C65F4131A1468D5B67A8838A9... https://metrics.torproject.org/rs.html#details/6B37261F1248DA6E6BB924161F8D7...
But it doesn't tell us *why* they vary.
Are the relays' observed bandwidths limiting their consensus weight?
bandwidth 89600000 102400000 55999620
So it looks like no.
I'm sorry, my question was poorly phrased.
The observed bandwidth is part of the torflow/sbws scaling algorithm, so it's always limiting the consensus weight.
In this case, if the relay observed more bandwidth, it would get about 1.3x that bandwidth as its consensus weight.
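Roughly where that 1.3x comes from (a sketch; it assumes the vote Measured= values are in scaled kilobytes and the descriptor observed bandwidth is in bytes/s):

    import statistics

    measured = [65200, 70000, 74200, 77000, 99400, 102000]  # Measured= values from the votes
    observed = 55999620                                      # observed bandwidth from the descriptor

    ratio = statistics.median(measured) * 1000 / observed
    print(f"consensus weight is roughly {ratio:.2f}x the observed bandwidth")  # ~1.35x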
If the relays are being measured by longclaw's sbws instance, we should also look at their detailed measurement diagnostics.
Looks like yes, it is measured:
w Bandwidth=10000 Measured=78000
I look forward to hearing about these detailed measurement diagnostics. :)
We wrote a spec to answer all^ your questions: https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt
^ except for these undocumented fields: https://trac.torproject.org/projects/tor/ticket/30726
Here are some of the diagnostics from the latest bandwidth file:
1559468088 version=1.4.0 earliest_bandwidth=2019-05-28T09:35:16 file_created=2019-06-02T09:35:04 generator_started=2019-05-19T14:04:34 latest_bandwidth=2019-06-02T09:34:48
sbws has been running for a few weeks, and it's still measuring.
number_consensus_relays=6552 number_eligible_relays=6302 percent_eligible_relays=96
It's measuring 96% of Running relays.
recent_measurement_attempt_count=329137 recent_measurement_failure_count=301111
It has a 90% measurement failure rate, which is way too high: https://trac.torproject.org/projects/tor/ticket/30719
But it's still measuring 96% of Running relays, so this bug might not be as much of a blocker as we thought.
recent_measurements_excluded_error_count=892 recent_measurements_excluded_few_count=647 recent_measurements_excluded_near_count=232 recent_measurements_excluded_old_count=0
1-4% of measurements are excluded for various reasons. We think that's normal. But it's hard to check, because torflow has limited diagnostics.
software=sbws software_version=1.1.0 time_to_report_half_network=224554
2.6 days is quite a long time to measure half the network. Probably due to #30719.
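For the record, both of those figures come straight from the header values quoted above:

    attempts, failures = 329137, 301111
    seconds_to_half = 224554

    print(f"failure rate: {failures / attempts:.0%}")                # prints 91%
    print(f"half-network time: {seconds_to_half / 86400:.1f} days")  # ~2.6 days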
And here are the diagnostics for that relay, split over a few lines:
bw=7700
This is the vote measured bandwidth.
bw_mean=803269 bw_median=805104
This is the raw measured bandwidth, 784 KBytes/s. This is a *lot* lower than the observed bandwidth of 56 MBytes/s.
The most likely explanation is that the relay doesn't have much bandwidth left over.
But maybe this sbws instance needs more bandwidth. If we fixed #30719, there might be a lot more sbws bandwidth for successful measurements.
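Putting those two numbers side by side (bw_mean is in bytes/s, like the observed bandwidth):

    bw_mean = 803269     # sbws raw measurement, bytes/s
    observed = 55999620  # relay's self-observed peak, bytes/s

    print(f"sbws measured ~{bw_mean / 1024:.0f} KBytes/s")                 # ~784 KBytes/s
    print(f"observed peak is ~{observed / bw_mean:.0f}x the measurement")  # ~70x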
consensus_bandwidth=75000000 consensus_bandwidth_is_unmeasured=False
This is the consensus measured bandwidth in the sbws client's consensus, converted from scaled-kilobytes to scaled-bytes.
desc_bw_avg=89600000 desc_bw_bur=102400000
This relay is rate-limited to 85 Mbytes/s.
Maybe it would have more bandwidth if it wasn't rate-limited.
desc_bw_obs_last=54690734 desc_bw_obs_mean=54690734
sbws is operating off a descriptor, where the observed bandwidth was: 54690734
But the relay is now reporting: 55999620
So we might see the consensus weight increase a little bit in the next day or so.
error_circ=0 error_destination=0 error_misc=0 error_second_relay=0 error_stream=0
This relay has no measurement errors.
master_key_ed25519=Q2Ft/AsNiru+HEx4KRdRxhnuohOs3ByA0t816gUG+Kk nick=che node_id=$D5F2C65F4131A1468D5B67A8838A9B7ED8C049E2
Yes, I am analysing the right relay.
relay_in_recent_consensus_count=310
It has been running for a while. This consensus count is surprising, but there's no spec for it, so I don't know what it's meant to be: https://trac.torproject.org/projects/tor/ticket/30724 https://trac.torproject.org/projects/tor/ticket/30726
relay_recent_measurement_attempt_count=1 relay_recent_priority_list_count=1
1 measurement in the last 5 days is very low. Probably due to #30719.
success=4
4 successful measurements is good, but it's weird that there is only 1 recent measurement attempt. These figures should be similar: https://trac.torproject.org/projects/tor/ticket/30725
time=2019-06-01T14:56:32
It was last measured about 18 hours ago.
T
Hi all,
This is an email about an alternative proposal:
Let's deploy sbws to some more bandwidth authorities.
Do we have funding to continue to improve the bandwidth measurement infrastructure? Or to maintain it?
If we don't have any grants in the pipeline, now would be a good time to start some.
Agreed.
sbws was always intended (as far as I recall) to be a bandaid to make the torflow approach more maintainable, while we continue to await research on better-but-still-workable approaches. I hear the NRL folks have another design they've been working on that sounds promising.
There were a bunch of bugs in sbws that seemed to be excluding some relays. So we stopped deploying sbws to any more bandwidth authorities.
In March and April, I said that we should block further deployments. But I did some more analysis today, and I don't think those bugs are actually blockers.
In #29710, it looked like sbws was missing about 1000 relays. But it turns out that those relays aren't actually Running: https://trac.torproject.org/projects/tor/ticket/29710#comment:13
In #30719, 90% of sbws measurement attempts fail. But these are internal errors, not network errors. So it looks like it's a relay selection bug in sbws: https://trac.torproject.org/projects/tor/ticket/30719#comment:2
So I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
If we do decide to change AuthDirMaxServersPerAddr, let's work out how many new relays would be added to the consensus straight away. There shouldn't be too many, but let's double-check.
T
On Sun, Jun 02, 2019 at 10:43:14PM +1000, teor wrote:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
Rolling out more sbws measurers sounds good to me.
But maybe I haven't been following: isn't the first plan for sbws to replace torflow with identical behavior? And then we can work on changing it to have better behavior?
I ask because in that case switching to more sbws measurers should not cause the exit bandwidths to improve, until we then change the measurers to measure better.
If we do decide to change AuthDirMaxServersPerAddr, let's work out how many new relays would be added to the consensus straight away. There shouldn't be too many, but let's double-check.
$ grep "^r " moria1-vote | cut -d' ' -f7 | sort | uniq -c | sort -n
yields these IP address counts that have more than 2 relays on them:
  3 163.172.132.167 [only 2 actually Running]
  3 80.210.238.199 [it's a snap package, 0 Running]
  4 78.146.180.236 [only 1 actually Running]
  5 93.202.254.196 [0 Running]
  6 218.221.205.161 [they're all on the same port, 0 Running]
  7 212.24.106.116 [at least 4 Running]
  8 79.137.70.81 [0 Running]
  9 159.89.4.187 [0 Running]
 10 212.24.110.13 [at least 4 Running]
So I believe that if we change it to 4 relays per IP address, we would get 4 more relays in the consensus currently.
And if we change it to 3 relays per IP address, we would get only 2 more relays currently.
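For anyone who wants to repeat this without the manual Running checks, here's a sketch that does the same count from a vote file, split by the Running flag (it assumes the usual vote format, where each relay's "r " line is followed by an "s " flags line):

    from collections import defaultdict

    per_ip = defaultdict(lambda: [0, 0])  # ip -> [total relays, Running relays]

    with open("moria1-vote") as f:
        current_ip = None
        for line in f:
            if line.startswith("r "):
                current_ip = line.split()[6]   # same field as the cut -f7 above
                per_ip[current_ip][0] += 1
            elif line.startswith("s ") and current_ip is not None:
                if "Running" in line.split():
                    per_ip[current_ip][1] += 1
                current_ip = None

    for ip, (total, running) in sorted(per_ip.items(), key=lambda kv: kv[1][0]):
        if total > 2:
            print(f"{total:3d} relays ({running} Running) on {ip}")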
Of course, once we make it clearer that big relays can run more instances per IP address, some people might choose to simplify their set-ups.
--Roger
Hi,
On 3 Jun 2019, at 17:48, Roger Dingledine arma@torproject.org wrote:
On Sun, Jun 02, 2019 at 10:43:14PM +1000, teor wrote:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
Rolling out more sbws measurers sounds good to me.
But maybe I haven't been following: isn't the first plan for sbws to replace torflow with identical behavior? And then we can work on changing it to have better behavior?
No, we fixed some obvious torflow bugs and design flaws.
Here are some details:
Let's talk engineering tradeoffs.
sbws had a few conflicting goals:
* create a modern bandwidth scanner implementation
* produce results that are similar to torflow
* be ready to deploy in 2019
Here's how we resolved those tradeoffs:
* use modern designs, libraries, and protocols when building sbws
* compare sbws results against torflow, and identify any issues:
  * when torflow is obviously wrong, do something better in sbws
  * when sbws is obviously wrong, log a bug against sbws, and triage it
  * when the results differ by a small amount, accept that difference
See these tickets for more details: https://trac.torproject.org/projects/tor/ticket/27339 https://trac.torproject.org/projects/tor/ticket/27107
Here are some network health checks we are doing as we deploy sbws: https://sbws.readthedocs.io/en/latest/monitoring_bandwidth.html
Here are some FAQs about the design, and the bandwidth file spec: https://sbws.readthedocs.io/en/latest/faq.html https://gitweb.torproject.org/torspec.git/tree/bandwidth-file-spec.txt
It would be great to have more design documentation, but keeping that documentation up to date is a lot of work. And we needed to deliver working code, too.
I ask because in that case switching to more sbws measurers should not cause the exit bandwidths to improve, until we then change the measurers to measure better.
One of the design flaws that we fixed was torflow's "scanner partitions".
Relays can get stuck in a slow torflow scanner partition, and never improve their measurements.
But in sbws, each relay is measured against a random faster relay. sbws tries to choose relays that are at least 2x faster than the target.
So some stuck relay bandwidths should improve under sbws, as long as we have enough sbws instances (about half, I think).
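As a toy illustration of that selection rule (this is just the idea, not the actual sbws code):

    import random

    def choose_helper(target_weight, candidates):
        """Pick a relay to pair with the target for a measurement.

        candidates: list of (fingerprint, consensus_weight) tuples.
        Prefer relays at least 2x faster than the target; otherwise
        fall back to the fastest relay available.
        """
        fast_enough = [c for c in candidates if c[1] >= 2 * target_weight]
        if fast_enough:
            return random.choice(fast_enough)
        return max(candidates, key=lambda c: c[1])

Because the helper is re-chosen at random for each measurement, a slow-but-capable relay isn't pinned to the same slow partition the way it can be under torflow.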
That said, there are still some bugs in sbws. Some of those bugs were copied from torflow. Others are new bugs. sbws has detailed diagnostics that will help us chase down and fix these bugs.
And we can also make design changes. But let's stabilise sbws first, and fix any high-impact bugs.
T
teor:
I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
I like this plan. To tightly control for emergent effects of all-sbws vs all-torflow, ideally we'd switch back and forth between all-sbws and all-torflow on a synchronized schedule, but this requires getting enough measurement instances of sbws and torflow for authorities to choose either the sbws file or the torflow file on some schedule. It may be tricky to coordinate, but it would be the most rigorous way to do this.
We could do a version of this based on votes/bwfiles alone, without making dirauths toggle back and forth. However, this would not capture emergent effects (such as quicker bw adjustments in sbws due to decisions to pair relays with faster ones during measurement). Still, even comparing just votes would be better than nothing.
For this experiment, my metric of choice would be "Per-Relay Spare Network Capacity CDF" (see https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/Performan...), for both the overall consensus, and every authority's vote. It would also be useful to generate separate flag breakdowns of this CDF (ie produce separate CDFs for Guard-only, Middle-only, Exit-only, and Guard+Exit-only relays).
In this way, we have graphs of how the difference between self-reported and measured values is distributed across the network, in both the votes and the consensus. We should be able to pinpoint any major disagreements in how relays are measured compared to their self-reported values with these metrics. (In the past, karsten produced very similar sets of CDFs of just the measured values per vote when we were updating bwauths, and we compared the shape of the measured CDF, but I think graphing the difference is more comprehensive).
We should also keep an eye on CDF-DL and the failure rainbow metrics, as they may be indirectly affected by improvements/regressions in load balancing, but I think the distribution of "spare capacity" is the first order metric we want.
Do you like these metrics? Do you think we should be using different ones? Should we try a few different metrics and see what makes sense based on the results?
If we do decide to change AuthDirMaxServersPerAddr, let's work out how many new relays would be added to the consensus straight away. There shouldn't be too many, but let's double-check.
Hrmm. This may be hard to determine, and it would only make an immediate difference if many relay operators already have more than 2 relay instances actively trying to run on a single IP, such that the additional ones are still running but currently being rejected. I'm guessing this is not common, and relay operators will have to manually decide to start more instances.
I also don't think that these approaches need to be either/or. I think there are many independent reasons to allow more relays per IP (tor is single-threaded and caps out somewhere between 100-300Mbit per instance depending on CPU and AES acceleration, so many fast relay operators do the multi-instance thing already, if they have the spare IPs).
I also think that if I'm right about most relay operators needing to make this decision manually, the effect of allowing 4 nodes per IP will mostly blend in with normal network churn over time.
So, as long as we tightly control switching sbws vs torflow and have result files from each for the duration of the experiment, I think that we can do both of these things at once. There's going to be capacity and load churn like this over time naturally, anyway. This switching-back-and-forth methodology is meant to control for that.
Mike Perry:
teor:
I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
I like this plan. To tightly control for emergent effects of all-sbws vs all-torflow, ideally we'd switch back and forth between all-sbws and all-torflow on a synchronized schedule, but this requires getting enough measurement instances of sbws and torflow for authorities to choose either the sbws file or the torflow file on some schedule. It may be tricky to coordinate, but it would be the most rigorous way to do this.
We could do a version of this based on votes/bwfiles alone, without making dirauths toggle back and forth. However, this would not capture emergent effects (such as quicker bw adjustments in sbws due to decisions to pair relays with faster ones during measurement). Still, even comparing just votes would be better than nothing.
For this experiment, my metric of choice would be "Per-Relay Spare Network Capacity CDF" (see https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/Performan...), for both the overall consensus, and every authority's vote. It would also be useful to generate separate flag breakdowns of this CDF (ie produce separate CDFs for Guard-only, Middle-only, Exit-only, and Guard+Exit-only relays).
In this way, we have graphs of how the difference between self-reported and measured values is distributed across the network, in both the votes and the consensus.
Arg, I misspoke here. The metric from that performance experiment page is the difference between peak observed bandwidth and bw history. This will still be interesting for measuring load balancing effects, but it does not directly involve the measured values. We may also want a metric that directly compares properties of the measured vs advertised values. See below.
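Here's a rough sketch of that corrected metric, with hypothetical per-relay inputs (real values would come from the descriptor and extra-info archives):

    def spare_capacity_cdf(relays):
        """relays: iterable of dicts with 'observed_peak' and 'history' in bytes/s.

        Spare capacity = peak observed bandwidth - recent bandwidth history.
        Returns (spare, fraction of relays at or below that spare) pairs,
        i.e. the CDF we'd graph per vote and for the consensus.
        """
        spares = sorted(max(r["observed_peak"] - r["history"], 0) for r in relays)
        n = len(spares)
        return [(s, (i + 1) / n) for i, s in enumerate(spares)]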
We should be able to pinpoint any major disagreements in how relays are measured compared to their self-reported values with these metrics. (In the past, karsten produced very similar sets of CDFs of just the measured values per vote when we were updating bwauths, and we compared the shape of the measured CDF, but I think graphing the difference is more comprehensive).
We should also keep an eye on CDF-DL and the failure rainbow metrics, as they may be indirectly affected by improvements/regressions in load balancing, but I think the distribution of "spare capacity" is the first order metric we want.
Do you like these metrics? Do you think we should be using different ones? Should we try a few different metrics and see what makes sense based on the results?
As additional metrics, we could do the CDFs of the ratio of measured bw to advertised bw, and/or the metrics Karsten produced using just measured bw. (I still can't find the ticket where those were graphed during previous torflow updates, though.)
These metrics would be pretty unique to torflow/sbws experiments, but if we have enough of those in the pipeline (such as changes to the scaling factor), they may be worth tracking over time.
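A matching sketch for the ratio metric, again with hypothetical inputs:

    def measured_to_advertised_ratios(relays):
        """relays: iterable of dicts with 'measured' (vote or consensus weight,
        scaled to bytes/s) and 'advertised' (descriptor bandwidth, bytes/s)."""
        return sorted(r["measured"] / r["advertised"]
                      for r in relays if r["advertised"] > 0)

Feeding the sorted ratios into the same CDF plotting as above would make the two metrics directly comparable.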
Hi Mike,
On 4 Jun 2019, at 06:20, Mike Perry mikeperry@torproject.org wrote:
Mike Perry:
teor:
I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
I like this plan. To tightly control for emergent effects of all-sbws vs all-torflow, ideally we'd switch back and forth between all-sbws and all-torflow on a synchronized schedule, but this requires getting enough measurement instances of sbws and torflow for authorities to choose either the sbws file or the torflow file on some schedule. It may be tricky to coordinate, but it would be the most rigorous way to do this.
We could do a version of this based on votes/bwfiles alone, without making dirauths toggle back and forth. However, this would not capture emergent effects (such as quicker bw adjustments in sbws due to decisions to pair relays with faster ones during measurement). Still, even comparing just votes would be better than nothing.
I don't know how possible this is: we would need two independent network connections per bandwidth scanner, one for sbws, and one for torflow.
(Running two scanners on the same connection means that they compete for bandwidth. Perhaps we could use Tor's BandwidthRate to share the bandwidth.)
I also don't know how many authority operators are able to run sbws: Roger might be stuck on Python 2.
And I don't know how often they will be able to switch configs.
Let's make some detailed plans with the dirauth list.
For this experiment, my metric of choice would be "Per-Relay Spare Network Capacity CDF" (see https://trac.torproject.org/projects/tor/wiki/org/roadmaps/CoreTor/Performan...), for both the overall consensus, and every authority's vote. It would also be useful to generate separate flag breakdowns of this CDF (ie produce separate CDFs for Guard-only, Middle-only, Exit-only, and Guard+Exit-only relays).
In this way, we have graphs of how the difference between self-reported and measured values is distributed across the network, in both the votes and the consensus.
Arg, I misspoke here. The metric from that performance experiment page is the difference between peak observed bandwidth and bw history. This will still be interesting for measuring load balancing effects, but it does not directly involve the measured values. We may also want a metric that directly compares properties of the measured vs advertised values. See below.
We should be able to pinpoint any major disagreements in how relays are measured compared to their self-reported values with these metrics. (In the past, karsten produced very similar sets of CDFs of just the measured values per vote when we were updating bwauths, and we compared the shape of the measured CDF, but I think graphing the difference is more comprehensive).
We should also keep an eye on CDF-DL and the failure rainbow metrics, as they may be indirectly affected by improvements/regressions in load balancing, but I think the distribution of "spare capacity" is the first order metric we want.
Yes, I agree: some idea of client bandwidth and latency is important.
Do you like these metrics? Do you think we should be using different ones? Should we try a few different metrics and see what makes sense based on the results?
As additional metrics, we could do the CDFs of the ratio of measured bw to advertised bw, and/or the metrics Karsten produced using just measured bw. (I still can't find the ticket where those were graphed during previous torflow updates, though.)
These metrics would be pretty unique to torflow/sbws experiments, but if we have enough of those in the pipeline (such as changes to the scaling factor), they may be worth tracking over time.
If we get funding for sbws experiments, we can definitely tweak the sbws scaling parameters, and do some experiments.
At the moment, I'd like to focus on fixing critical sbws issues, deploying sbws, and making sure it works at least as well as torflow.
T
teor:
Hi Mike,
On 4 Jun 2019, at 06:20, Mike Perry mikeperry@torproject.org wrote:
Mike Perry:
teor:
I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
We should measure the impact of this change using the tor-scaling measurement criteria. (And we should make sure it doesn't conflict with any other tor-scaling changes.)
I like this plan. To tightly control for emergent effects of all-sbws vs all-torflow, ideally we'd switch back and forth between all-sbws and all-torflow on a synchronized schedule, but this requires getting enough measurement instances of sbws and torflow for authorities to choose either the sbws file or the torflow file on some schedule. It may be tricky to coordinate, but it would be the most rigorous way to do this.
We could do a version of this based on votes/bwfiles alone, without making dirauths toggle back and forth. However, this would not capture emergent effects (such as quicker bw adjustments in sbws due to decisions to pair relays with faster ones during measurement). Still, even comparing just votes would be better than nothing.
I don't know how possible this is: we would need two independent network connections per bandwidth scanner, one for sbws, and one for torflow.
(Running two scanners on the same connection means that they compete for bandwidth. Perhaps we could use Tor's BandwidthRate to share the bandwidth.)
I also don't know how many authority operators are able to run sbws: Roger might be stuck on Python 2.
And I don't know how often they will be able to switch configs.
Let's make some detailed plans with the dirauth list.
Ok. It looks like I am still on the dirauth list. Perhaps we can come up with some way to use the dirauth-conf repo to switch things, but if we lack the machines for separate sbws and torflow, I agree that we should not try to have the same connections/machines running both.
In that case, we should just focus on tracking the metrics that are important to us as we continue to add sbws and remove torflow instances.
Do you like these metrics? Do you think we should be using different ones? Should we try a few different metrics and see what makes sense based on the results?
As additional metrics, we could do the CDFs of the ratio of measured bw to advertised bw, and/or the metrics Karsten produced using just measured bw. (I still can't find the ticket where those were graphed during previous torflow updates, though.)
These metrics would be pretty unique to torflow/sbws experiments, but if we have enough of those in the pipeline (such as changes to the scaling factor), they may be worth tracking over time.
If we get funding for sbws experiments, we can definitely tweak the sbws scaling parameters, and do some experiments.
At the moment, I'd like to focus on fixing critical sbws issues, deploying sbws, and making sure it works at least as well as torflow.
Yes, that makes sense. A minimal version of this could be: don't do the swapping back and forth, just add sbws and replace torflow scanners one by one. As we do this, we could just keep a record of the metrics over the votes and consensus during this time, and compare how the metrics look for the sbws vs torflow votes vs the consensus, over time.
I'll work on precise formulae for the "Per Relay Spare Capacity" metric and the "Measured to Observed Ratio" metric, and think more about how we want to graph them so they are easier to compare over time. I feel like my previous mails were a little hand-wavy. Depending on how this works out, I will either post that to tor-scaling with a complete list of specific metric equations, or write a separate post to tor-dev with them just for sbws.
We won't finalize all of the performance experiment metrics until after the Mozilla All Hands meeting (i.e. ~3 weeks from now), but the two above can be retroactively computed using router descriptor and extra-info archives.
What were you thinking for the timeframe for the complete transition to sbws?
Hi,
On 4 Jun 2019, at 12:54, Mike Perry mikeperry@torproject.org wrote:
teor:
On 4 Jun 2019, at 06:20, Mike Perry mikeperry@torproject.org wrote:
Mike Perry:
teor:
I have an alternative proposal:
Let's deploy sbws to half the bandwidth authorities, wait 2 weeks, and see if exit bandwidths improve.
Yes, that makes sense. A minimal version of this could be: don't do the swapping back and forth, just add sbws and replace torflow scanners one by one. As we do this, we could just keep a record of the metrics over the votes and consensus during this time, and compare how the metrics look for the sbws vs torflow votes vs the consensus, over time.
What were you thinking for the timeframe for the complete transition to sbws?
longclaw has been running sbws for a while. bastet started running it mid-May. We can transition a third directory authority any time we like.
We need to keep 3 torflow instances until we fix these 4 critical sbws bugs: https://lists.torproject.org/pipermail/tor-dev/2019-June/013867.html
After those bugs are fixed, we could transition one per month.
moria1 will need to install Python 3 to run sbws; I don't know how long that will take.
Maybe September to December 2019?
T