commit 421bea4058c95a2055f043ab4f97fa978c9c4c97
Author: George Kadianakis <desnacked@riseup.net>
Date:   Tue Mar 2 19:48:36 2021 +0200
Improve some parts of prop#328.
- Rename 'overload-reached' to 'overload-general'
- Simplify 'overload-ratelimits' for engineering reasons
- Add versioning on the extra-info fields
- Add a few more metrics
---
 proposals/328-relay-overload-report.md | 39 ++++++++++++++++++++++------------
 1 file changed, 26 insertions(+), 13 deletions(-)
diff --git a/proposals/328-relay-overload-report.md b/proposals/328-relay-overload-report.md
index b05d289..f5d3193 100644
--- a/proposals/328-relay-overload-report.md
+++ b/proposals/328-relay-overload-report.md
@@ -36,20 +36,22 @@ the future and thus this is not an exhaustive list.
 The general overload line indicates that a relay has reached an "overloaded
 state" which can be one or many of the following load metrics:
-  - Any OOMkiller invocation due to memory pressure
-  - Any onionskins are dropped
-  - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+  - Any OOM invocation due to memory pressure
+  - Any ntor onionskins are dropped
   - TCP port exhaustion
+  - DNS timeout reached
+  - CPU utilization of Tor's mainloop CPU core above 90% for 60 sec
+  - Control port overload (too many messages queued)
The format of the overloaded line added in the extra-info document is as follow:
 ```
-"overload-reached" YYYY-MM-DD HH:MM:SS NL
+"overload-general" SP version SP YYYY-MM-DD HH:MM:SS NL
   [At most once.]
 ```
-The timestamp is when a at least one metrics was detected. It should always be
+The timestamp is when at least one metrics was detected. It should always be
 at the hour and thus, as an example, "2020-01-10 13:00:00" is an expected
 timestamp. Because this is a binary state, if the line is present, we consider
 that it was hit at the very least once somewhere between the provided
@@ -60,27 +62,35 @@ The overload field should remain in place for 72 hours since last triggered. If
 the limits are reached again in this period, the timestamp is updated, and this
 72 hour period restarts.
+The 'version' field is set to '1' for the initial implementation of this
+proposal which includes all the above overload metrics except from the CPU and
+control port overload. The first version also uses a primitive logic for
+detecting DNS timeouts (only if libevent failed a set of 3 DNS requests/retries
+in a row).
+
 # 1.2. Token bucket size
 Relays should report the 'BandwidthBurst' and 'BandwidthRate' limits in their
 descriptor, as well as the number of times these limits were reached, for read
-and write, in the past 24 hours starting at the provided timestamp rounded
-down to the hour.
+and write, in the past 24 hours starting at the provided timestamp rounded down
+to the hour.
 ```
 "overload-ratelimits" SP YYYY-MM-DD SP HH:MM:SS SP rate-limit SP burst-limit
-  SP read-rate-count SP read-burst-count
-  SP write-rate-count SP write-burst-count NL
+  SP read-overload-count SP write-overload-count NL
   [At most once.]
 ```
The "rate-limit" and "burst-limit" are the raw values from the BandwidthRate and BandwidthBurst found in the torrc configuration file.
-The "{read|write}-rate-count" and "{read|write}-burst-count" are the counts of
-how many times the reported limits were exhausted and thus the maximum between
-the read and write count occurances.
+The "{read|write}-overload-count" are the counts of how many times the reported
+limits of burst/rate were exhausted and thus the maximum between the read and
+write count occurances.
+
+The 'version' field is set to '1' for the initial implementation of this
+proposal.
# 1.3. File Descriptor Exhaustion
@@ -91,7 +101,7 @@ notice which relay has a value too small and we can notify them.
 This should be published in this format:
 ```
-"overload-fd-exhausted" YYYY-MM-DD HH:MM:SS NL
+"overload-fd-exhausted" SP version YYYY-MM-DD HH:MM:SS NL
   [At most once.]
 ```
@@ -102,6 +112,9 @@ This overload field should remain in place for 72 hours since last triggered.
 If the limits are reached again in this period, the timestamp is updated, and
 this 72 hour period restarts.
+The 'version' field is set to '1' for the initial implementation of this
+proposal which detects fd exhaustion only when a socket open fails.
+
 # 2. Load Metrics
This section proposes a series of metrics that should be collected and
tor-commits@lists.torproject.org
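
For readers who consume extra-info documents, here is a minimal sketch of how the renamed "overload-general" line from this commit might be parsed. It is not taken from tor or any existing library. The names `OverloadGeneral`, `parse_overload_general` and `still_reported` are hypothetical, and two behaviours are assumptions rather than anything the proposal mandates: rejecting timestamps that are not at the hour, and applying the 72-hour retention window on the consumer side.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class OverloadGeneral(NamedTuple):
    """Hypothetical container for a parsed 'overload-general' line."""
    version: int
    timestamp: datetime


def parse_overload_general(line: str) -> Optional[OverloadGeneral]:
    """Parse: overload-general SP version SP YYYY-MM-DD HH:MM:SS NL.

    Returns None when the line does not match that shape.
    """
    parts = line.rstrip("\n").split(" ")
    # Expected tokens: keyword, version, date, time.
    if len(parts) != 4 or parts[0] != "overload-general":
        return None
    try:
        version = int(parts[1])
        timestamp = datetime.strptime(
            parts[2] + " " + parts[3], "%Y-%m-%d %H:%M:%S"
        )
    except ValueError:
        return None
    # Assumption: treat a timestamp that is not at the hour as malformed,
    # since the proposal says it "should always be at the hour".
    if timestamp.minute != 0 or timestamp.second != 0:
        return None
    return OverloadGeneral(version, timestamp)


def still_reported(entry: OverloadGeneral, now: datetime) -> bool:
    """True while 'now' is within 72 hours of the last trigger (assumed check)."""
    return timedelta(0) <= now - entry.timestamp <= timedelta(hours=72)
```

A line in the old 'overload-reached' format has the wrong keyword and token count, so the sketch returns `None` for it instead of guessing a version.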