Re: [tor-dev] tor relay process health data for operators (controlport)

3 Feb 2019

      On 03 Feb (00:24:00), nusenu wrote:
...
Hi,
Hello nusenu,

Thanks for this email. I think exporting more metrics on the control port is a
great idea. I have wanted that for a while, so I'm quite happy you got the
ball rolling :).

There are safety questions to ask ourselves here before blindly
exporting many stats. I know the metrics team also has opinions on that;
I had a talk very recently with irl about this.

Exporting many stats to the control port unfortunately means that any
relay operator can create fancy graphs and make them public
which, depending on the stat, can be harmful.

Furthermore, graphing stats can also mean that over time the relay
operator stores historical data of everything that happened within the
relay, and that can be used in many ways to pull off attacks (e.g. a
subpoena by LE to access such a database).

The Heartbeat log has a minimum period of 30 minutes but a default of 6
hours. Whatever stats we end up exporting, I strongly think that
keeping delays like that is a requirement, because we would sort
of "bin" those aggregated stats by a "long enough period" instead of
having a very fine-grained stream of stats that would make it trivial to
spot spikes down to the minute.
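
To illustrate the binning idea, here is a minimal Python sketch. The class
and method names are hypothetical and this is not how tor implements its
heartbeat; it only shows a counter that exposes totals for completed
periods, so a reader never sees anything finer than the bin width. It
assumes timestamps arrive in non-decreasing order.

```python
class BinnedCounter:
    """Hypothetical counter that only reveals totals per completed period."""

    def __init__(self, period_seconds=6 * 60 * 60):
        # 6 hours mirrors the default heartbeat period mentioned above.
        self.period = period_seconds
        self._current_bin = None   # index of the bin currently being filled
        self._current_total = 0    # events in the unfinished bin (never exported)
        self.completed = []        # list of (bin_start_time, total) tuples

    def record(self, timestamp, count=1):
        """Add `count` events observed at `timestamp` (seconds)."""
        bin_index = int(timestamp // self.period)
        if self._current_bin is None:
            self._current_bin = bin_index
        if bin_index != self._current_bin:
            # The previous period is over: publish its total and start anew.
            self.completed.append(
                (self._current_bin * self.period, self._current_total)
            )
            self._current_bin = bin_index
            self._current_total = 0
        self._current_total += count

    def export(self):
        """Return totals for completed periods only."""
        return list(self.completed)
```

With a one-hour period, three events in the first hour stay invisible
until an event in a later hour closes that bin; only then does export()
report the aggregated total for the first hour.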

Some of the stats below are safe in my opinion, like the memory usage, but
most of them need to be looked at in terms of safety from two standpoints:
what very fine-grained precision reveals, and what happens
when that data becomes historical data.

I'll stop for now but I will follow up on this once I have thought a bit
more about it so I don't say too many stupid things right now :).

Cheers!
David
...
every now and then I'm in contact with relay operators
about the "health" of their relays.
Following these 1:1 discussions and the discussion on tor-relays@
I'd like to raise two issues with you (the developers) with the goal
to help improve relay operations and end user experience in the long term:
1) DNS (exits only)
2) tor relay health data
1) DNS
------
Current situation: 
Arthur Edelstein provides public measurements to tor exit relay operators via
his page at: https://arthuredelstein.net/exits/
This page is updated once daily.
The process operators follow to use that data looks like this:
- first, they check Arthur's measurement results
- if their failure rate is non-zero they try to tweak/improve/change their setup
- wait for another 24 hours (next measurement)
This is a somewhat suboptimal and slow feedback loop, and the data is
probably also less accurate and less valuable than the data the tor
process itself can provide.
Suggestion for improvement:
Expose the following DNS status information
via tor's controlport to help debug and detect DNS issues on exit relays:
(total numbers since startup)
- number of DNS queries sent to the resolver
- number of DNS queries sent to the resolver due to a RESOLVE request
- number of DNS queries sent to the resolver due to a reverse RESOLVE request
- number of queries that did not result in any answer from the resolver
- breakdown of number of responses by response code (RCODE)
https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-par...
- max number of DNS queries sent per circuit
If this causes a significant performance impact this feature should be disabled
by default.
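
The counters listed above could be modelled roughly as follows. This is a
minimal Python sketch for illustration only: the class, field, and method
names are hypothetical and do not correspond to anything in tor's code or
in control-spec. RCODE values follow the IANA registry (0 = NOERROR,
2 = SERVFAIL, 3 = NXDOMAIN).

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DnsStats:
    """Hypothetical since-startup DNS counters for an exit relay."""
    queries_sent: int = 0
    resolve_queries: int = 0          # queries caused by RESOLVE requests
    reverse_resolve_queries: int = 0  # queries caused by reverse RESOLVE requests
    unanswered: int = 0               # queries the resolver never answered
    rcodes: Counter = field(default_factory=Counter)       # RCODE -> responses
    per_circuit: Counter = field(default_factory=Counter)  # circuit id -> queries

    def record_query(self, circuit_id, kind="connect"):
        """Count one query sent to the resolver on behalf of a circuit."""
        self.queries_sent += 1
        self.per_circuit[circuit_id] += 1
        if kind == "resolve":
            self.resolve_queries += 1
        elif kind == "reverse_resolve":
            self.reverse_resolve_queries += 1

    def record_response(self, rcode):
        """Count one outcome; rcode=None means no answer was received."""
        if rcode is None:
            self.unanswered += 1
        else:
            self.rcodes[rcode] += 1

    @property
    def max_queries_per_circuit(self):
        # Only the maximum would be exported, not the per-circuit table.
        return max(self.per_circuit.values(), default=0)
```

Only aggregate values (totals, the RCODE breakdown, and the per-circuit
maximum) are meant to leave the process in this sketch; the per-circuit
table exists solely to compute the maximum.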
2) general relay health metrics
--------------------------------
Compared to other server daemons (webserver, DNS server, ..)
tor provides little data for operators to detect operational issues
and anomalies.
I'd suggest to provide the following stats via the control port:
(most of them are already written to logfiles by default but not accessible
via the controlport as far as I've seen)
- total amount of memory used by the tor process
- number of currently open circuits
- circuit handshake stats (TAP / NTor)
DoS mitigation stats
- number of circuits killed with too many cells
- number of circuits rejected
- marked addresses
- number of connections closed
- number of single hop clients refused
- number of closed/failed circuits broken down by their reason value
https://gitweb.torproject.org/torspec.git/tree/tor-spec.txt#n1402
https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n1994
- number of closed/failed OR connections broken down by their reason value
https://gitweb.torproject.org/torspec.git/tree/control-spec.txt#n2205
If this causes a significant performance impact this feature should be disabled
by default.
cell stats
- extra info cell stats
as defined in:
https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n1072
This data should be useful to answer the following questions:
- High level questions: Is the tor relay healthy?
- is it hitting any resource limits? 
- is the tor process under unusual load?
- why is tor using more memory?
- is it slower than usual at handling circuits?
- can the DNS resolver handle the amount of DNS queries tor is sending it?
This data could help prevent errors from occurring or provide
additional data when trying to narrow down issues.
When it comes to the question: 
**Is it "safe" to make this data accessible via the controlport?**
I assume it is safe for all information that current versions of
tor write to logfiles or even publish as part of the extra info descriptor.
Should tor provide this or similar data,
I'm planning to write scripts for operators to make use
of that data (for example a munin plugin that connects to tor's controlport).
I'm happy to help write updates for control-spec should these features 
seem reasonable to you.
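
As a rough idea of what such a script might do, here is a minimal Python
sketch that turns a control port GETINFO reply into munin's
"field.value N" output. The function names are hypothetical. It only
handles single-line "250-key=value" entries, and the example uses the
existing traffic/read and traffic/written keys because the stats proposed
above have no GETINFO keys yet.

```python
import re


def parse_getinfo_reply(reply):
    """Parse single-line '250-key=value' entries from a GETINFO reply."""
    values = {}
    for line in reply.splitlines():
        # The final '250 OK' line and any multi-line entries are skipped.
        if line.startswith("250-") and "=" in line:
            key, _, value = line[4:].partition("=")
            values[key] = value
    return values


def to_munin(values):
    """Format parsed values as munin 'fieldname.value N' lines."""
    lines = []
    for key in sorted(values):
        # Munin field names may only contain letters, digits and '_'.
        fieldname = re.sub(r"[^A-Za-z0-9_]", "_", key)
        lines.append(f"{fieldname}.value {values[key]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Example reply text as it would arrive after 'GETINFO traffic/read traffic/written'.
    sample = "250-traffic/read=1000\r\n250-traffic/written=2000\r\n250 OK\r\n"
    print(to_munin(parse_getinfo_reply(sample)))
```

A real plugin would additionally have to open the control connection and
authenticate before sending GETINFO; that part is left out here.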
Looking forward to hearing your feedback.
nusenu
-- 
https://twitter.com/nusenu_
https://mastodon.social/@nusenu

...
-- 
UfBKIa+1kdl7DdvHs4X6EOXF+4kISRk8P8gM6dH/i1E=


David Goulet