[tor-dev] Monitoring Service

Tue Feb 2 10:34:03 UTC 2016

Hi Tom,

On 27/01/16 20:02, Tom Ritter wrote:
> [feel free to reply adding tor-project or whomever]

Sure, let me copy tor-dev.

> Remember a while ago I lamented that I wished there was some 
> monitoring service that could tell me when my metrics service;
> relay; or bwauth went down?  I finally built one. I'm still kicking
> the tires on it, and I intend to improve it more over the next week
> or two - but I think it's here to stay.
> 
> https://github.com/tomrittervg/checker
> 
> Right now I have it running several monitoring jobs, with a second 
> instance running with no jobs but serving as a peer. I have it 
> checking a number of TCP ports (to see if my relays are still up),
> and I have custom jobs for metrics and the bwauth.  They're in the 
> samplejobs folder.  They're very simplistic and bare-bones.  My
> hope is that they can be fleshed out over time to account for more 
> imaginative ways things could fail.
> 
> I'm already discovering that my bwauth file sometimes gets more
> than two hours behind....
> 
> But I think the most useful thing here is that now I have a
> minimal framework for writing simplistic python jobs and having it
> monitor things for me. Maybe it would be useful for more people?

Yes!  Well, I can't speak for other people, but having a monitoring
system for Metrics-related services would be very useful for me.  In
fact, it's been on my list for a long time now.  This seems like a
great opportunity to give it more thought.

I'm not sure if I mentioned this before, but we're using Nagios to
monitor Onionoo.  The Nagios script we're using makes a tiny request
to Onionoo to see whether the contained timestamps are still recent.
That's an indirect way to notice problems with the data back-end, and
it has helped with detecting numerous problems in the past.  More
details here:

https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/checks/tor-check-onionoo

I don't know Nagios well enough to say how it compares to your system.
But I could imagine that we write a similar check for CollecTor that
runs on your system and notifies us of problems.  And maybe it's
possible to write that script in a way that it can also be deployed on
Nagios.
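
To make the "also deployable on Nagios" idea a bit more concrete, here
is a minimal sketch in Python.  It assumes the check itself is a plain
function that returns a status and a list of messages, and that a thin
wrapper turns this into what a Nagios plugin conventionally produces:
one line of output plus an exit code of 0 (OK), 1 (WARNING),
2 (CRITICAL), or 3 (UNKNOWN).  The function names and the service
label are made up for illustration, and I don't know how your checker
expects results to be reported, so that side is left out.

```python
import sys

# Conventional Nagios plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def format_result(service, status, messages):
    """Return (exit_code, output_line) for a finished check.

    All messages are joined into the single output line, so that
    every problem is reported and not just the first one.
    """
    detail = "; ".join(messages) if messages else "all checks passed"
    return EXIT_CODES[status], "%s %s - %s" % (service, status, detail)


def report(service, status, messages):
    """Print the status line and exit the way a Nagios plugin would."""
    code, line = format_result(service, status, messages)
    print(line)
    sys.exit(code)
```

With this split, the check logic stays independent of the monitoring
system, and only the small wrapper would differ between deployments.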

Here's what I could imagine that the script would do: every, say, 10
minutes it would fetch CollecTor's index.json which is specified here:

https://collector.torproject.org/#index-json

The script would then run a series of checks and report one of the
statuses OK, WARNING, CRITICAL, or UNKNOWN:

 - host is unreachable or index.json cannot be found for at least 30
minutes (CRITICAL)
 - index.json contains invalid JSON for all checks in the last 30
minutes (CRITICAL)
 - the contained "index_created" timestamp is older than 30 minutes
(WARNING) or older than 3 hours (CRITICAL)
 - when concatenating "path" fields of nested objects, the most recent
"last_modified" timestamp by path prefix is more than X behind
"index_created" (CRITICAL):
   - /archive/bridge-descriptors/: 5 days
   - /archive/exit-lists/: 5 days
   - /archive/relay-descriptors/certs.tar.xz: 5 days
   - /archive/relay-descriptors/consensuses/: 5 days
   - /archive/relay-descriptors/extra-infos/: 5 days
   - /archive/relay-descriptors/microdescs/: 5 days
   - /archive/relay-descriptors/server-descriptors/: 5 days
   - /archive/relay-descriptors/votes/: 5 days
   - /archive/torperf/: 5 days
   - /recent/torperf/: 12 hours
   - /recent/bridge-descriptors/extra-infos/: 3 hours
   - /recent/bridge-descriptors/server-descriptors/: 3 hours
   - /recent/bridge-descriptors/statuses/: 3 hours
   - /recent/exit-lists/: 3 hours
   - /recent/relay-descriptors/consensuses/: 1.5 hours
   - /recent/relay-descriptors/extra-infos/: 1.5 hours
   - /recent/relay-descriptors/microdescs/consensus-microdesc/: 1.5 hours
   - /recent/relay-descriptors/microdescs/micro/: 1.5 hours
   - /recent/relay-descriptors/server-descriptors/: 1.5 hours
   - /recent/relay-descriptors/votes/: 1.5 hours

In the detailed checks above, the script would not warn if index.json
does not contain any files with a given prefix (so that you can run
the script on your CollecTor instance that doesn't collect all the
things).  And ideally, the script would include all warnings in its
output, not just the first.
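
A rough sketch of the timestamp part of this check, in Python, might
look like the following.  It works on an already parsed index.json and
makes a few assumptions that should be verified against the
specification linked above: that nested objects are found under
"directories" and "files" keys, and that timestamps have the form
"YYYY-MM-DD HH:MM" in UTC.  Only a few of the prefixes from the list
are included, the function names are made up, and fetching the file,
handling malformed entries, and remembering results across runs (for
the "at least 30 minutes" conditions) are left out.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed format of the timestamps

# Subset of the per-prefix limits from the list above.
MAX_LAG = {
    "/archive/torperf/": timedelta(days=5),
    "/recent/torperf/": timedelta(hours=12),
    "/recent/exit-lists/": timedelta(hours=3),
    "/recent/relay-descriptors/consensuses/": timedelta(hours=1.5),
}


def walk_files(node, prefix=""):
    """Yield (concatenated_path, last_modified) for all nested files."""
    for entry in node.get("files", []):
        modified = datetime.strptime(entry["last_modified"], TIME_FORMAT)
        yield prefix + "/" + entry["path"], modified
    for sub in node.get("directories", []):
        yield from walk_files(sub, prefix + "/" + sub["path"])


def check_index(index, now):
    """Return (status, messages) for a parsed index.json."""
    try:
        created = datetime.strptime(index["index_created"], TIME_FORMAT)
    except (KeyError, TypeError, ValueError):
        return "UNKNOWN", ["index_created missing or unparseable"]
    status, messages = "OK", []
    age = now - created
    if age > timedelta(hours=3):
        status = "CRITICAL"
    elif age > timedelta(minutes=30):
        status = "WARNING"
    if status != "OK":
        messages.append("index_created is %s old" % age)
    # Most recent last_modified per known prefix; prefixes without any
    # files never show up here and hence never trigger a warning.
    newest = {}
    for path, modified in walk_files(index):
        for prefix in MAX_LAG:
            if path.startswith(prefix):
                newest[prefix] = max(modified, newest.get(prefix, modified))
    for prefix in sorted(newest):
        lag = created - newest[prefix]
        if lag > MAX_LAG[prefix]:
            status = "CRITICAL"
            messages.append("%s is %s behind" % (prefix, lag))
    return status, messages
```

Passing "now" as a parameter keeps the function easy to test with a
saved copy of index.json.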

That's one check, and it would probably catch most problems where
things get stale.  I would like to add more checks, but those would
need more access to the CollecTor host than its index.json which is
publicly available.  (That could mean that we export more status in a
debug.json or in another public place.)  Some examples follow; I'm
only listing them for later:

 - Is the host soon going to run out of disk space or inodes?  (This
can easily be done with Nagios, I think.)
 - Did the relay-descriptor part of CollecTor fail to parse a
descriptor using metrics-lib and hence did not store it to disk?  (I'm
receiving hourly cron mails in this case, but I'd prefer something
better.)
 - Are we missing more than a certain threshold of relay descriptors
referenced from other relay descriptors?  For example, are we missing
more than 0.5% of server descriptors referenced from the consensus?
(I'm also receiving hourly cron mails, but the checks there are
somewhat broken and can create a lot of noise.)
 - Same as above but for sanitized bridge descriptors.  (I don't have
such checks in place right now.)
 - Does the bridge sanitizer have trouble sanitizing bridge IP
addresses, in particular storing secret keys for the keyed hash
function?  (Cron would send mail, but that never happened so far.)
 - Did the bridge sanitizer run into an unknown line and hence refrain
from storing the sanitized version of a bridge descriptor?  (Cron
sends mail which happens every now and then.)
 - Do the exit lists downloaded from TorDNSEL contain recent
publication times but no recent scan times?
 - Has one or more of the Torperf hosts stopped reporting recent
measurements?
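
Two of these are simple enough to sketch already.  The following
Python snippet is only an illustration under assumptions: the 10%
free-space limit is a placeholder that does not come from any existing
setup, the function names are made up, and the sets of descriptor
digests would have to be provided by CollecTor in some form that does
not exist yet.  The 0.5% default is the example threshold from the
list above.

```python
import os


def disk_status(path, min_free=0.10):
    """Check free blocks and inodes on the filesystem holding path.

    Uses os.statvfs, which is only available on Unix-like systems.
    """
    st = os.statvfs(path)
    # Some filesystems report zero totals; treat that as "no limit".
    free_blocks = st.f_bavail / st.f_blocks if st.f_blocks else 1.0
    free_inodes = st.f_favail / st.f_files if st.f_files else 1.0
    messages = []
    if free_blocks < min_free:
        messages.append("only %.1f%% disk space free" % (free_blocks * 100))
    if free_inodes < min_free:
        messages.append("only %.1f%% inodes free" % (free_inodes * 100))
    return ("CRITICAL" if messages else "OK"), messages


def missing_fraction(referenced, stored):
    """Fraction of referenced descriptor digests that are not stored."""
    referenced = set(referenced)
    if not referenced:
        return 0.0
    return len(referenced - set(stored)) / len(referenced)


def too_many_missing(referenced, stored, threshold=0.005):
    """True if more than threshold (default 0.5%) are missing."""
    return missing_fraction(referenced, stored) > threshold
```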

There, that's a long list.  And it's just the list for CollecTor.
Next would be more detailed checks for Onionoo that look into its
internal operation or into problems with its web-facing part; Metrics
including the various data-aggregating scripts running in the
background; and ExoneraTor with its database importer to make sure it
always has fresh and complete descriptors.

Hope this was not too overwhelming.  If you're still reading this and
still want to help, let's talk more about writing that first script
for CollecTor.

Thanks!

All the best,
Karsten
