On 03 Feb (00:24:00), nusenu wrote:
Hi,
Hello nusenu,
Thanks for this email. I think exporting more metrics on the control port is a great idea. I have wanted that for a while, so I'm quite happy you got the ball rolling :).
There are safety questions to ask ourselves here before blindly exporting many stats. I know the metrics team also has opinions on that; I talked with irl about this very recently.
Exporting many stats to the control port unfortunately means that any relay operator can create fancy graphs and make them public which, depending on the stat, can be harmful.
Furthermore, graphing stats can also mean that over time the relay operator stores historical data of everything that happened within the relay, and that can be used in many ways to pull off attacks (ex: a subpoena by LE to access such a database).
The Heartbeat log has a minimum period of 30 minutes and a default of 6 hours. Whatever stats we end up exporting, I think keeping delays like that is a hard requirement: we would sort of "bin" those aggregated stats over a "long enough period" instead of having a very fine-grained stream of stats that would make it trivial to spot spikes down to the minute.
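A minimal sketch of that binning idea, assuming events arrive as Unix timestamps. The names `bin_events` and `BIN_SECONDS` are illustrative only and do not exist in tor; the 6 hour width simply mirrors the default Heartbeat period:

```python
from collections import Counter

# Assumption: 6 hours, mirroring the default Heartbeat period.
BIN_SECONDS = 6 * 60 * 60


def bin_events(timestamps, bin_seconds=BIN_SECONDS):
    """Count events per coarse time bin.

    timestamps: iterable of Unix timestamps (seconds), one per event.
    Returns a dict mapping the start of each bin to the number of
    events that fell into it.
    """
    counts = Counter()
    for ts in timestamps:
        # Round the timestamp down to the start of its bin.
        bin_start = int(ts) - (int(ts) % bin_seconds)
        counts[bin_start] += 1
    return dict(counts)


if __name__ == "__main__":
    # Three events in the first 6-hour bin, one in the second.
    events = [10, 70, 3600, 6 * 3600 + 5]
    print(bin_events(events))
```

With bins this wide, a consumer only sees one count per period, so a spike within a given minute is no longer visible in the exported data.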
Some of the stats below, like the memory usage, are safe in my opinion, but most of them need to be looked at in terms of safety from two standpoints: how fine-grained the data is, and what happens once that data becomes historical data.
I'll stop for now but I will follow up on this once I have thought a bit more about it so I don't say too many stupid things right now :).
Cheers! David
Every now and then I'm in contact with relay operators about the "health" of their relays. Following these 1:1 discussions and the discussion on tor-relays@ I'd like to raise two issues with you (the developers) with the goal to help improve relay operations and end user experience in the long term:
DNS (exits only)
tor relay health data
DNS
Current situation: Arthur Edelstein provides public measurements to tor exit relay operators via his page at: https://arthuredelstein.net/exits/ This page is updated once daily.
The process operators follow to use that data looks like this:
- first they check Arthur's measurement results
- if their failure rate is non-zero they try to tweak/improve/change their setup
- wait for another 24 hours (next measurement)
This feedback loop is slow and somewhat suboptimal, and the data is probably also less accurate and less valuable than the data the tor process itself can provide.
Suggestion for improvement:
Expose the following DNS status information via tor's controlport to help debug and detect DNS issues on exit relays:
(total numbers since startup)
- amount of DNS queries sent to the resolver
- amount of DNS queries sent to the resolver due to a RESOLVE request
- amount of DNS queries sent to the resolver due to a reverse RESOLVE request
- amount of queries that did not result in any answer from the resolver
- breakdown of number of responses by response code (RCODE)
https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-par...
- max amount of DNS queries sent per circuit
If this causes a significant performance impact this feature should be disabled by default.
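To illustrate how a controller script could consume such cumulative counters, here is a small sketch. Everything in it is hypothetical: tor does not currently export these numbers, and `DnsCounters`, `delta` and `unanswered_ratio` are made-up names. Only the RCODE names come from the IANA registry. Since the counters would be totals since startup, a monitoring script would diff two snapshots to see what happened in between:

```python
from dataclasses import dataclass, field
from typing import Dict

# RCODE names from the IANA DNS parameters registry (subset).
RCODE_NAMES = {
    0: "NoError",
    1: "FormErr",
    2: "ServFail",
    3: "NXDomain",
    4: "NotImp",
    5: "Refused",
}


@dataclass
class DnsCounters:
    """Hypothetical snapshot of cumulative DNS counters since startup."""
    queries_sent: int = 0
    unanswered: int = 0
    by_rcode: Dict[int, int] = field(default_factory=dict)


def delta(old, new):
    """Return the counters accumulated between two snapshots."""
    by_rcode = {
        rcode: count - old.by_rcode.get(rcode, 0)
        for rcode, count in new.by_rcode.items()
    }
    return DnsCounters(
        queries_sent=new.queries_sent - old.queries_sent,
        unanswered=new.unanswered - old.unanswered,
        by_rcode=by_rcode,
    )


def unanswered_ratio(counters):
    """Fraction of queries that got no answer (0.0 if none were sent)."""
    if counters.queries_sent == 0:
        return 0.0
    return counters.unanswered / counters.queries_sent


if __name__ == "__main__":
    old = DnsCounters(100, 5, {0: 90, 2: 5})
    new = DnsCounters(300, 15, {0: 270, 2: 10, 3: 5})
    d = delta(old, new)
    for rcode, count in sorted(d.by_rcode.items()):
        print(RCODE_NAMES.get(rcode, str(rcode)), count)
    print("unanswered ratio:", unanswered_ratio(d))
```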
- general relay health metrics
Compared to other server daemons (webserver, DNS server, ..) tor provides little data for operators to detect operational issues and anomalies.
I'd suggest to provide the following stats via the control port: (most of them are already written to logfiles by default but not accessible via the controlport as far as I've seen)
- total amount of memory used by the tor process
- amount of currently open circuits
- circuit handshake stats (TAP / NTor)
- DoS mitigation stats
  - amount of circuits killed with too many cells
  - amount of circuits rejected
  - marked addresses
  - amount of connections closed
  - amount of single hop clients refused
- amount of closed/failed circuits broken down by their reason value
https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n1402 https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n1994
- amount of closed/failed OR connections broken down by their reason value
https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n2205
If this causes a significant performance impact this feature should be disabled by default.
cell stats
- extra info cell stats
as defined in: https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1072
This data should be useful to answer the following questions:
- High level questions: Is the tor relay healthy?
- is it hitting any resource limits?
- is the tor process under unusual load?
- why is tor using more memory?
- is it slower than usual at handling circuits?
- can the DNS resolver handle the amount of DNS queries tor is sending it?
This data could help prevent errors from occurring or provide additional data when trying to narrow down issues.
When it comes to the question: **Is it "safe" to make this data accessible via the controlport?**
I assume it is safe for all information that current versions of tor write to logfiles or even publish as part of their extra info descriptor.
Should tor provide this or similar data I'm planning to write scripts for operators to make use of that data (for example a munin plugin that connects to tor's controlport).
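A rough sketch of what the core of such a munin plugin might look like. It only covers parsing single-line `GETINFO` replies and printing them in munin's `field.value N` format; connecting and authenticating to the controlport is left out. The keys proposed in this mail do not exist yet, so the example uses the existing `traffic/read` and `traffic/written` keys as stand-ins, and all function names are made up for illustration:

```python
import re


def parse_getinfo_reply(lines):
    """Parse single-line GETINFO replies such as '250-traffic/read=123'.

    Multi-line ('250+key=') replies are not handled in this sketch.
    """
    values = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("250-") and "=" in line:
            key, value = line[4:].split("=", 1)
            values[key] = value
    return values


def munin_field(key):
    """Turn a GETINFO key into a munin-safe field name."""
    return re.sub(r"[^A-Za-z0-9_]", "_", key)


def munin_config(keys, title="tor control port stats"):
    """Text a plugin would print when munin calls it with 'config'."""
    out = ["graph_title " + title, "graph_category tor"]
    for key in keys:
        out.append("%s.label %s" % (munin_field(key), key))
    return "\n".join(out)


def munin_values(values):
    """Text a plugin would print on a normal (fetch) run."""
    return "\n".join(
        "%s.value %s" % (munin_field(key), value)
        for key, value in sorted(values.items())
    )


if __name__ == "__main__":
    # Example reply to: GETINFO traffic/read traffic/written
    reply = [
        "250-traffic/read=1024\r\n",
        "250-traffic/written=2048\r\n",
        "250 OK\r\n",
    ]
    print(munin_values(parse_getinfo_reply(reply)))
```

Keeping the parsing separate from the socket handling means the same functions could be reused by other monitoring tools, not only munin.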
I'm happy to help write updates for control-spec should these features seem reasonable to you.
Looking forward to hearing your feedback. nusenu
-- https://twitter.com/nusenu_ https://mastodon.social/@nusenu
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev