On 12/11/20 08:04, David Goulet wrote:
we are talking a good 2-4 years minimum once the feature is stable, so we have to get this out soon if we hope it will be useful in the foreseeable future.
Right - the slow feedback cycle between deploying new logging and trying to use it is all the more reason to plan ahead and try to ensure the data will actually be suitable for the intended use :). Granted, we can presumably at least *start* prototyping usage of the data sooner than 2-4 years, but it'll probably still be some months before any useful data starts arriving, right?
On to your next question about the hour problem. So yes, you are correct that in the timeframe between informing the world that I'm not overloaded anymore and the world noticing, you might end up under-utilized, but you might also end up "just utilized enough".
All in all, we are stuck with a network that "morphs" every hour (new consensus), but bandwidth scanners actually take much longer to scan the entire network (on the order of days), so the under-utilization can actually last much more than an hour :S.
So there will always be that gap where we back off from a relay - possibly too much - until the scanner notices and gives it a bit more. But over time, as the backoff happens and the bandwidth scanner makes corrections, they should reach an equilibrium where the scanner finds the value that is just enough for you to stop advertising overload, or in other words, it finds the sweet spot.
That is likely to require time, and the relay to be maximally stable, as in 99% uptime and without large CPU/network fluctuations.
But also, as we back off from overloaded relays, we will send traffic to potentially under-utilized relays. So yes, it will be a bumpy road at first, but after some days/weeks the network should stabilize, and we should actually see very few "overload-reached" events after that point (except for operators running 1000 other things on the relay machine, eating resources at random :).
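The back-and-forth described above can be sketched as a toy simulation. This is illustrative only - not Tor's actual scanner or client behavior - and every parameter value (the backoff fraction, the raise factor, the 72-hour scan interval) is a made-up assumption:

```python
# Toy model of the backoff/scanner equilibrium discussed above.
# Not Tor's actual algorithm; all constants are illustrative assumptions.

def simulate(capacity=100.0, weight=160.0, scan_every=72, hours=1000,
             backoff=0.75, raise_factor=1.05):
    """Return the relay's weight after `hours` hourly consensus rounds.

    capacity      -- traffic level above which the relay reports overload
    scan_every    -- hours between scanner measurements of this relay
    backoff       -- multiplicative cut applied while the relay is overloaded
    raise_factor  -- gentle increase applied by the scanner when not overloaded
    """
    for hour in range(hours):
        if weight > capacity:
            weight *= backoff          # clients shift traffic away
        elif hour % scan_every == 0:
            weight *= raise_factor     # scanner nudges the weight back up
    return weight

# The weight overshoots and oscillates at first, then settles near the
# capacity "sweet spot", mostly just below it.
final = simulate()
```

The interesting property is the asymmetry: the backoff reacts every hour, while the scanner only corrects every few days, so the weight spends most of its time slightly under capacity rather than over it, which matches the "under-utilized until the scanner notices" gap described above.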
Thanks for the explanation! IIUC, the new consensus computed every hour includes weights based on the latest data from the bandwidth scanners, but an individual relay is only scanned once every x days?
In the proposal, maybe it'd be enough to briefly explain the choice of parameters and any relevant tradeoffs - one hour for granularity, 72 hours for history (any others?). It might also be helpful to have a strawman example of how the data could be used in the congestion control algorithm, with some discussion like the above ^, though I could also see that potentially getting too far from the core of the proposal.
Btw, maybe it's worth explicitly explaining how the data *won't* be useful for attackers? I'd assumed (possibly incorrectly) that the history wasn't being kept at a finer granularity to avoid being able to correlate events across relays, and from there perhaps be able to infer something about individual circuit paths. If that sort of attack is worth worrying about, should relays also suppress reporting events for the current partial hour to avoid an attacker being able to probe the metrics port to find out if an overload just happened?
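To make the suppression idea concrete, here's a hypothetical sketch of what such coarsened reporting could look like: truncate overload events to hour granularity, keep only 72 hours of history, and drop the current partial hour so probing the metrics port can't reveal that an overload just happened. All names here are illustrative, not Tor's actual implementation:

```python
# Hypothetical sketch of coarsened overload reporting, per the suggestion
# above. Constants and names are illustrative, not from Tor's codebase.

HOUR = 3600
HISTORY_HOURS = 72

def reportable_overloads(event_timestamps, now):
    """Return sorted hour-truncated timestamps safe to publish at `now`.

    Events in the current (partial) hour are suppressed, and anything
    older than the 72-hour history window is discarded.
    """
    current_hour = (now // HOUR) * HOUR
    oldest = current_hour - HISTORY_HOURS * HOUR
    hours = {
        (t // HOUR) * HOUR                # truncate to hour granularity
        for t in event_timestamps
        if oldest <= t < current_hour     # drop partial hour and stale data
    }
    return sorted(hours)

# An event in the current hour (997500 at now=1000000) is withheld;
# one from a completed hour is reported at hour granularity.
published = reportable_overloads([997500, 996000], now=1_000_000)
```

Under this sketch, a probe of the metrics port only ever learns about completed hours, so the freshest thing an attacker can confirm is at least one hour stale.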
Hope this answers your question!
Very helpful, thanks!