On 09 Dec (11:36:09), Jim Newsome wrote:
On 12/7/20 14:06, David Goulet wrote:
Greetings,
Attached is a proposal from Mike Perry and me. Merge request is here:
https://gitlab.torproject.org/tpo/core/torspec/-/merge_requests/22
Disclaimer - As someone not very familiar with how tor load balancing works today, I might not be the target audience for this proposal :)
Maybe it's putting the cart before the horse, but it might be helpful to have a more concrete proposal for how this data will be used, which in turn will help evaluate whether this is the right data to collect.
e.g. naively I might assume the idea is to have some kind of exponential backoff for overloaded relays; but since the proposal is for the overload events to be recorded at hour-granularity, would that result in a relay getting overloaded at the top of every hour, and then under-utilized for the rest of the hour?
Right, so there are currently ideas circulating on how to use that data properly.
The likely short-term proposal is that sbws (the bw scanner) will use it as a simple signal to back off on the amount of bw given, as you stated.
So your question hits the nail on the head: "why do we have this proposal without a concrete proposal on how to use it" :).
The answer I can give you is that we've thought about how a relay can tell the world, in a safe way, that it is suffocating. There are only a few places in tor where we can actually notice (at the moment) performance problems.
And so we took them all (more might come over time) and mashed them into a single line, "overload reached". We did that before anything else because, for the network to migrate to support that feature, we are talking a good 2-4 years minimum once the feature is stable. Thus we have to get this out soon if we hope for it to be useful in the foreseeable future.
Onto your next question about the hour problem. So yes, you are correct: in the timeframe between a relay informing the world that it is not overloaded anymore and the world noticing, the relay might be under-utilized, but it might also be "just utilized enough".
All in all, we are stuck with a network that "morphs" every hour (new consensus), but bandwidth scanners actually take much longer to scan the entire network (on the order of days), so it is actually much more than an hour of being under-utilized :S.
So there will always be that gap where we back off from a relay, and we might have backed off too much until the scanner notices it and then gives the relay a bit more. But over time, as the backoff happens and the bw scanner makes corrections, they should reach an equilibrium where the scanner finds the value that is just enough for the relay to not advertise overload anymore, or in other words, finds the sweet spot.
That is likely to require time, and the relay to be very stable, as in 99% uptime and not too many CPU/network fluctuations.
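To make the equilibrium idea a bit more concrete, here is a toy sketch of that feedback loop. It is not how sbws works and nothing in the proposal specifies it: the additive-increase/multiplicative-decrease rule, the function name `simulate_backoff`, and every parameter value are assumptions made up for illustration. Each round stands for one full scanner pass, and the relay is assumed to report overload whenever its assigned weight exceeds a fixed capacity.

```python
def simulate_backoff(capacity, initial_weight, rounds, backoff=0.7, step=0.05):
    """Toy model of a scanner reacting to a relay's overload signal.

    All names and numbers here are hypothetical. Each round:
      - the relay "advertises overload" if its weight exceeds its capacity;
      - the scanner cuts the weight multiplicatively if overloaded,
        otherwise raises it by a small fraction of the capacity.
    Returns a list of (weight, overloaded) tuples, one per round.
    """
    weight = initial_weight
    history = []
    for _ in range(rounds):
        overloaded = weight > capacity
        history.append((weight, overloaded))
        if overloaded:
            weight *= backoff           # back off from an overloaded relay
        else:
            weight += step * capacity   # cautiously give a bit more
    return history


if __name__ == "__main__":
    # Start the relay at twice what it can handle and watch it settle.
    for rnd, (w, over) in enumerate(simulate_backoff(100.0, 200.0, 20)):
        print(f"round {rnd:2d}: weight={w:6.1f} overloaded={over}")
```

In this toy model the weight never settles on a single value; it keeps oscillating in a band just around the capacity, with only occasional overload reports. A fixed capacity is the "very stable relay" case; if the capacity moved around from round to round, the loop would keep chasing it.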
But also, as we back off from overloaded relays, we will send traffic to potentially under-utilized relays. So we hope that, yes, it will be a bumpy road at first, but after some days/weeks the network should stabilize and we should actually see very few "overload-reached" after that point (except for operators running 1000 other things on the relay machine, eating the resources randomly :).
This also highlights the massive importance of stable relays on the network, so that its load balancing can adjust and converge to an equilibrium without having to re-adjust because 1000 relays on pi4 went down for the night :).
Hope this answers your question!
Cheers! David