Re: [tor-dev] Proposal 328: Make Relays Report When They Are Overloaded

3 Mar 2021

On 02 Mar (20:58:43), Mike Perry wrote:
...
On 3/2/21 6:01 PM, George Kadianakis wrote:
...
David Goulet <dgoulet@torproject.org> writes:
...
Greetings,
Attached is a proposal from Mike Perry and me. Merge request is here:
https://gitlab.torproject.org/tpo/core/torspec/-/merge_requests/22
Hello all,
while working on this proposal I had to change it slightly to add a few
more metrics and also to simplify some engineering issues that we would
encounter. You can find the changes here:
https://gitlab.torproject.org/asn/torspec/-/commit/b57743b9764bd8e6ef8de689d...
Mike, based on your comments in the #40222 ticket, I would appreciate
comments on the way the DNS issues will be reported. David argued that
they should not be part of the "overload-general" line because they are
not an overload and it's not the fault of the network in any way. This
is why we added them as separate lines. Furthermore, David suggested we
turn them into a threshold: "only report if 25% of the total requests
have timed out" instead of "only report if at least one timeout has
occurred", since that would be more useful.
I'm confused by this confusion. There's pretty clear precedent for
treating packet drops as a sign of network capacity overload. We've also
seen it experimentally specifically with respect to DNS, during Rob's
experiment. We discussed this on Monday.
However, I agree there's a chance that a single packet drop can be
spurious, and/or could be due to ephemeral overload of the kind TCP
congestion causes. But 25% is waaaaaaaaaay too high. Even 1% is high IMO, but is
more reasonable. We should ask some exits what they see now. The fact
that our DNS scanners are not currently seeing this at all, and the
issue appeared only for the exact duration of Rob's experiment, suggests
that DNS packet drops are extremely rare in healthy network conditions.
Ok, likely 25% is way too high indeed.

The idea behind this was simply that a network hiccup or a temporarily faulty
DNS server should not move traffic away from the Exit for a 72h period
(reminder that the "overload-general" line sticks for 72h in the extrainfo
once hit).
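
To make the threshold idea concrete, here is a rough sketch in Python (not
Tor's actual code). Every name in it is hypothetical. The 1% threshold is just
the figure floated above, not a decided value, and the minimum sample size is
my own assumption so that one timeout out of a handful of requests cannot trip
the flag. The only number taken from the proposal discussion is the 72h
stickiness.

```python
from dataclasses import dataclass
from typing import Optional

STICKY_SECONDS = 72 * 60 * 60   # the line sticks for 72h once hit (from the thread)
TIMEOUT_THRESHOLD = 0.01        # hypothetical: 1%, the figure floated above
MIN_REQUESTS = 100              # hypothetical floor on the sample size


@dataclass
class DnsTimeoutTracker:
    """Hypothetical tracker: flags overload on a timeout ratio, not one drop."""
    threshold: float = TIMEOUT_THRESHOLD
    min_requests: int = MIN_REQUESTS
    total: int = 0
    timeouts: int = 0
    last_hit: Optional[float] = None

    def record(self, timed_out: bool, now: float) -> None:
        """Count one DNS request and note the time if the ratio is exceeded."""
        self.total += 1
        if timed_out:
            self.timeouts += 1
        enough_samples = self.total >= self.min_requests
        if enough_samples and self.timeouts / self.total >= self.threshold:
            self.last_hit = now

    def should_report(self, now: float) -> bool:
        """True while we are inside the sticky window after the last hit."""
        return self.last_hit is not None and now - self.last_hit < STICKY_SECONDS
```

With this shape a single spurious timeout never sets the flag, but once the
ratio is crossed the relay keeps reporting for the full 72h, which is exactly
why the threshold value matters so much.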
...
Furthermore, revealing the specific type of overload condition
increases the ability for the adversary to use this information for
various attacks. I'd rather it be combined in all cases, so that the
specific cause is not visible. In all cases, the reaction of our systems
should be the same: direct less load to relays with this line. If we
need to dig, that's what MetricsPort is for.
In fact, this DNS packet drop signal may be particularly useful in
traffic analysis attacks. Its reporting, and likely all of this overload
reporting, should probably be delayed until something like the top of
the hour after it happens. We may even want this delay to be a consensus
parameter. Something like "Report only after N minutes", or "Report only
N minute windows", perhaps?
Yes, definitely, and I would even add a random component to this so that not
all relays report an overload on a predictable timeframe, which would defeat
the "if the line appears, I know it was hit N hours ago" type of calculation.
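
Roughly what I have in mind, as a Python sketch rather than anything final:
round the event up to the next reporting window and then add a random offset.
The function name, the one hour window and the 30 minute maximum jitter are
all made up for illustration. The window length is the kind of value that
could become a consensus parameter as suggested above.

```python
import random


def delayed_report_time(event_time, window_seconds=3600,
                        max_jitter_seconds=1800, rng=random):
    """Return the earliest time at which the overload may be reported.

    Hypothetical helper: the event is first pushed to the start of the next
    window (e.g. the top of the hour), then a uniform random delay is added
    so the reporting time does not reveal when the overload was hit.
    """
    # Start of the window that follows the one containing the event.
    next_window = (int(event_time) // window_seconds + 1) * window_seconds
    # Random component so relays do not all report on a predictable schedule.
    return next_window + rng.uniform(0, max_jitter_seconds)
```

An observer who sees the line appear then only learns that the overload
happened at some point before the previous window boundary, with the jitter
blurring it further.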

Cheers!
David

-- 
QlSpNB+aSzOYvM3E0etjbW84Wyx4/7PrwKfWOtmEgE0=