Hi Tom,
On 27/01/16 20:02, Tom Ritter wrote:
> [feel free to reply adding tor-project or whomever]
Sure, let me copy tor-dev@.
> Remember a while ago I lamented that I wished there was some monitoring service that could tell me when my metrics service; relay; or bwauth went down? I finally built one. I'm still kicking the tires on it, and I intend to improve it more over the next week or two - but I think it's here to stay.
> https://github.com/tomrittervg/checker
> Right now I have it running several monitoring jobs, with a second instance running with no jobs but serving as a peer. I have it checking a number of TCP ports (to see if my relays are still up), and I have custom jobs for metrics and the bwauth. They're in the samplejobs folder. They're very simplistic and bare-bones. My hope is that they can be fleshed out over time to account for more imaginative ways things could fail.
> I'm already discovering that my bwauth file sometimes gets more than two hours behind....
> But I think the most useful thing here is that now I have a minimal framework for writing simplistic python jobs and having it monitor things for me. Maybe it would be useful for more people?
Yes! Well, I can't speak for other people, but having a monitoring system for Metrics-related services would be very useful for me. In fact, it's been on my list for a long time now. This seems like a great opportunity to give it more thought.
I'm not sure if I mentioned this before, but we're using Nagios to monitor Onionoo. The Nagios script we're using makes a tiny request to Onionoo to see whether the contained timestamps are still recent. That's an indirect way to notice problems with the data back-end, and it has helped with detecting numerous problems in the past. More details here:
https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/ch...
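To illustrate the idea of that timestamp check, here is a minimal Python sketch. It is not the actual Nagios script linked above. It assumes the Onionoo response has already been fetched and parsed into a dict, that it carries a "relays_published" field formatted as "YYYY-MM-DD hh:mm:ss" in UTC, and that 3 hours is an acceptable age; the function name and the threshold are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed timestamp format of the "relays_published" field.
ONIONOO_TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def onionoo_is_fresh(document, now, max_age=timedelta(hours=3)):
    """Return True if the document's relays_published time is recent enough.

    `document` is an already-parsed Onionoo response, `now` is the current
    UTC time as a naive datetime, and `max_age` is a hypothetical threshold.
    """
    published = datetime.strptime(document["relays_published"], ONIONOO_TS_FORMAT)
    return now - published <= max_age
```

A caller would fetch a small Onionoo document, pass it in together with the current UTC time, and raise an alert when the function returns False.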
So, I don't know Nagios well enough to say how it compares to your system. But I could imagine that we write a similar check for CollecTor that runs on your system and notifies us of problems. And maybe it's possible to write that script in such a way that it can also be deployed on Nagios.
Here's what I could imagine that the script would do: every, say, 10 minutes it would fetch CollecTor's index.json which is specified here:
https://collector.torproject.org/#index-json
The script would then run a series of checks and report one of the statuses OK, WARNING, CRITICAL, or UNKNOWN:
- host is unreachable or index.json cannot be found for at least 30 minutes (CRITICAL)
- index.json contains invalid JSON for all checks in the last 30 minutes (CRITICAL)
- the contained "index_created" timestamp is older than 30 minutes (WARNING) or older than 3 hours (CRITICAL)
- when concatenating "path" fields of nested objects, most recent "last_modified" timestamp by path prefix is more than X behind "index_created" (CRITICAL):
  - /archive/bridge-descriptors/: 5 days
  - /archive/exit-lists/: 5 days
  - /archive/relay-descriptors/certs.tar.xz: 5 days
  - /archive/relay-descriptors/consensuses/: 5 days
  - /archive/relay-descriptors/extra-infos/: 5 days
  - /archive/relay-descriptors/microdescs/: 5 days
  - /archive/relay-descriptors/server-descriptors/: 5 days
  - /archive/relay-descriptors/votes/: 5 days
  - /archive/torperf/: 5 days
  - /recent/torperf/: 12 hours
  - /recent/bridge-descriptors/extra-infos/: 3 hours
  - /recent/bridge-descriptors/server-descriptors/: 3 hours
  - /recent/bridge-descriptors/statuses/: 3 hours
  - /recent/exit-lists/: 3 hours
  - /recent/relay-descriptors/consensuses/: 1.5 hours
  - /recent/relay-descriptors/extra-infos/: 1.5 hours
  - /recent/relay-descriptors/microdescs/consensus-microdesc/: 1.5 hours
  - /recent/relay-descriptors/microdescs/micro/: 1.5 hours
  - /recent/relay-descriptors/server-descriptors/: 1.5 hours
  - /recent/relay-descriptors/votes/: 1.5 hours
In the detailed checks above, the script would not warn if index.json does not contain any files with a given prefix (so that you can run the script on your CollecTor instance that doesn't collect all the things). And ideally, the script would include all warnings in its output, not just the first.
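The staleness checks described above could be sketched in Python roughly as follows. This is only a starting point under several assumptions: index.json has already been fetched and parsed, nested objects use "directories" and "files" lists, timestamps look like "YYYY-MM-DD HH:MM", and the numeric status codes follow the usual Nagios plugin convention. The function names are made up for illustration. The two "for at least 30 minutes" conditions are left out because they need state kept across runs.

```python
from datetime import datetime, timedelta

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # usual Nagios plugin codes
TS_FORMAT = "%Y-%m-%d %H:%M"  # assumed timestamp format in index.json

# Maximum lag of the newest file behind "index_created", by path prefix.
MAX_LAG = {}
for name in ("bridge-descriptors/", "exit-lists/",
             "relay-descriptors/certs.tar.xz",
             "relay-descriptors/consensuses/",
             "relay-descriptors/extra-infos/",
             "relay-descriptors/microdescs/",
             "relay-descriptors/server-descriptors/",
             "relay-descriptors/votes/", "torperf/"):
    MAX_LAG["/archive/" + name] = timedelta(days=5)
MAX_LAG["/recent/torperf/"] = timedelta(hours=12)
for name in ("bridge-descriptors/extra-infos/",
             "bridge-descriptors/server-descriptors/",
             "bridge-descriptors/statuses/", "exit-lists/"):
    MAX_LAG["/recent/" + name] = timedelta(hours=3)
for name in ("consensuses/", "extra-infos/",
             "microdescs/consensus-microdesc/", "microdescs/micro/",
             "server-descriptors/", "votes/"):
    MAX_LAG["/recent/relay-descriptors/" + name] = timedelta(hours=1.5)


def walk_files(node, prefix=""):
    """Yield (full_path, last_modified) for every file below node."""
    for entry in node.get("files", []):
        modified = datetime.strptime(entry["last_modified"], TS_FORMAT)
        yield prefix + "/" + entry["path"], modified
    for sub in node.get("directories", []):
        yield from walk_files(sub, prefix + "/" + sub["path"])


def check_index(index, now):
    """Return (status, messages), collecting all problems, not just the first."""
    problems = []
    created = datetime.strptime(index["index_created"], TS_FORMAT)
    age = now - created
    if age > timedelta(hours=3):
        problems.append((CRITICAL, "index_created is %s old" % age))
    elif age > timedelta(minutes=30):
        problems.append((WARNING, "index_created is %s old" % age))

    # Track the most recent last_modified per configured prefix.
    newest = {}
    for path, modified in walk_files(index):
        for prefix in MAX_LAG:
            if path.startswith(prefix):
                if prefix not in newest or modified > newest[prefix]:
                    newest[prefix] = modified

    # Prefixes with no files never appear in `newest`, so they are skipped.
    for prefix, modified in sorted(newest.items()):
        lag = created - modified
        if lag > MAX_LAG[prefix]:
            problems.append((CRITICAL, "%s is %s behind index_created"
                             % (prefix, lag)))

    status = max((code for code, _ in problems), default=OK)
    return status, [message for _, message in problems]
```

A wrapper would download index.json every 10 minutes, report CRITICAL or UNKNOWN on fetch or parse failures as appropriate, and otherwise print the returned messages and exit with the returned status.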
That's one check, and it would probably catch most problems where things get stale. I would like to add more checks, but those would need more access to the CollecTor host than its index.json which is publicly available. (That could mean that we export more status in a debug.json or in another public place.) Some examples follow, and I'm only listing them for later:
- Is the host soon going to run out of disk space or inodes? (This can easily be done with Nagios, I think.)
- Did the relay-descriptor part of CollecTor fail to parse a descriptor using metrics-lib and hence did not store it to disk? (I'm receiving hourly cron mails in this case, but I'd prefer something better.)
- Are we missing more than a certain threshold of relay descriptors referenced from other relay descriptors? For example, are we missing more than 0.5% of server descriptors referenced from the consensus? (I'm also receiving hourly cron mails, but the checks there are somewhat broken and can create a lot of noise.)
- Same as above but for sanitized bridge descriptors. (I don't have such checks in place right now.)
- Does the bridge sanitizer have trouble sanitizing bridge IP addresses in particular storing secret keys for the keyed hash function? (Cron would send mail, but that never happened so far.)
- Did the bridge sanitizer run into an unknown line and hence refrain from storing the sanitized version of a bridge descriptor? (Cron sends mail which happens every now and then.)
- Do the exit lists downloaded from TorDNSEL contain recent publication times but no recent scan times?
- Does one or more of the Torperf hosts not report recent measurements?
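As one example from the list above, the missing-descriptor threshold could be expressed along these lines. This is a hypothetical sketch: it assumes the referenced and stored descriptor digests have already been collected into two iterables (how CollecTor would expose them is not decided here), and mapping a violation to WARNING is an arbitrary choice.

```python
OK, WARNING = 0, 1  # subset of the usual Nagios plugin codes


def missing_fraction(referenced, stored):
    """Fraction of referenced digests that are not among the stored ones."""
    referenced = set(referenced)
    if not referenced:
        return 0.0
    return len(referenced - set(stored)) / len(referenced)


def check_missing(referenced, stored, threshold=0.005):
    """Return (status, message); threshold 0.005 is the 0.5% example above."""
    fraction = missing_fraction(referenced, stored)
    message = "%.2f%% of referenced descriptors missing" % (fraction * 100)
    if fraction > threshold:
        return WARNING, message  # severity is an assumption
    return OK, message
```

The same function would work for sanitized bridge descriptors, given the corresponding sets of digests.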
There, that's a long list. And it's just the list for CollecTor. Next would be more detailed checks for Onionoo that look into its internal operation or into problems with its web-facing part; Metrics including the various data-aggregating scripts running in the background; and ExoneraTor with its database importer to make sure it always has fresh and complete descriptors.
Hope this was not too overwhelming. If you're still reading this and still want to help, let's talk more about writing that first script for CollecTor.
Thanks!
All the best, Karsten