On 16 May 2020, at 16:05, Mike Perry mikeperry@torproject.org wrote:
On 4/23/20 1:48 PM, Matt Traudt wrote:
5.4 Other Changes/Investigations/Ideas
- How can FlashFlow data be used in a way that doesn't lead to poor
load balancing given the following items that lead to non-uniform client behavior:
- Guards that high-traffic HSs choose (for 3 months at a time)
- Guard vs middle flag allocation issues
- New Guard nodes (Guardfraction)
- Exit policies other than default/all
- Directory activity
- Total onion service activity
- Super long-lived circuits
- What is the explanation for dennis.jackson's scary graphs in this [2]
ticket? Was it because of the speed test? Why? Will FlashFlow produce the same behavior?
It will also be wise to provide a way for relays to signify that they are on the same machine. I bet concurrent machine deployments are one of the top contributors to the long tail of bad perf we saw caused by the FlashFlow experiment[2]. If FlashFlow measures each such relay as having the full link capacity instead of a shared fraction, this is obviously going to result in overload on those relays, leading to a long tail of bad perf whenever they are chosen. It is unlikely that we can deploy a FlashFlow that has this long tail perf problem without fixing this and related balancing issues (though hopefully most will be smoothed over by sbws).
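To make the overcounting concrete, here is a rough sketch of the arithmetic. It assumes co-located relays split one link equally, which real relays may not do. The function name and the numbers are made up for illustration and are not part of FlashFlow or sbws.

```python
def assigned_capacity(link_capacity, relays_on_link, aware_of_sharing):
    """Return the capacity credited to each relay behind one link.

    Assumption (not from FlashFlow): relays share the link equally.
    """
    if relays_on_link < 1:
        raise ValueError("need at least one relay")
    if aware_of_sharing:
        # Each relay is credited only its fraction of the shared link.
        return link_capacity / relays_on_link
    # Each relay is credited the full link, as if it were alone.
    return link_capacity


link = 1000  # Mbit/s, hypothetical link capacity
relays = 4   # hypothetical number of relays on that machine

naive_total = relays * assigned_capacity(link, relays, aware_of_sharing=False)
shared_total = relays * assigned_capacity(link, relays, aware_of_sharing=True)
overcount = naive_total / link  # how many times the real link is counted
```

With these made-up numbers, measuring each relay separately credits the machine with four times the capacity its link can carry.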
This is a little tricky, because we might not want rogue relays joining each other's "machines" (similar to the Family problem), but for testing, something as simple as how MyFamily works would be great. Ideally, though, relays would ask or detect that they are concurrently running in nearby IP space and either warn the operator to set the flag, or set it automatically.
We actually have this work included in a future performance funding proposal, but the timeline on that getting approved (or even rejected) is so far out that we should figure out a way to do this before that, especially if FlashFlow development is going to begin soon.
We could assume that relays on the same IPv4 /24 or IPv6 /48 share a network link, and re-do the experiment.
Then we could tweak the network size based on those results. We'd need to compromise between "false sharing" and "missed sharing".
Then individual operators could fine-tune that initial heuristic using the "same network link" config.
(This is similar to how MyFamily works: Tor assumes that relays in the same IPv4 /16 and IPv6 /32 have the same network operator. Then individual relay operators can declare extra families using MyFamily.)
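A minimal sketch of that initial heuristic, grouping relays by IPv4 /24 or IPv6 /48 using Python's standard ipaddress module. The relay list, nicknames, and function names are hypothetical. The prefix lengths are parameters so they can be tweaked after re-doing the experiment.

```python
import ipaddress
from collections import defaultdict

# Starting prefix lengths from the heuristic above; meant to be tuned.
IPV4_PREFIX = 24
IPV6_PREFIX = 48


def link_group_key(address, v4_prefix=IPV4_PREFIX, v6_prefix=IPV6_PREFIX):
    """Return the network that is assumed to share one link."""
    ip = ipaddress.ip_address(address)
    prefix = v4_prefix if ip.version == 4 else v6_prefix
    # strict=False masks off the host bits instead of raising an error.
    return ipaddress.ip_network(f"{ip}/{prefix}", strict=False)


def group_relays_by_link(relays, v4_prefix=IPV4_PREFIX, v6_prefix=IPV6_PREFIX):
    """Map each assumed shared link to the relay nicknames behind it."""
    groups = defaultdict(list)
    for nickname, address in relays:
        key = link_group_key(address, v4_prefix, v6_prefix)
        groups[key].append(nickname)
    return dict(groups)


# Hypothetical relays using documentation-only address ranges.
relays = [
    ("a", "192.0.2.10"),
    ("b", "192.0.2.200"),
    ("c", "198.51.100.5"),
    ("d", "2001:db8:1:2::1"),
    ("e", "2001:db8:1:ffff::9"),
    ("f", "2001:db8:2::1"),
]
groups = group_relays_by_link(relays)
```

A shorter prefix merges more relays into one group (more "false sharing"), and a longer prefix splits them apart (more "missed sharing"). Any operator-declared "same network link" setting would then override these groups.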
T