[metrics-team] Brainstorming better notifications for operational issues on Wednesday, November 1, 14:30 UTC

Karsten Loesing karsten at torproject.org
Wed Nov 1 15:47:40 UTC 2017


The following notes are the result of a 1.25-hour brainstorming session.
They should not be seen as final decisions on anything, but rather as
starting points for further discussions.

Requirements for a notification service:
 - Notification service can be run locally for a start but should be run
on a Tor host in the longer term.

How could we monitor our services?
1. Log warnings or errors whenever we realize we're having an issue, and
somehow send those errors/warnings via email.
2. Periodically request public resources via web interfaces and perform
basic smoke tests, without adding specific information just for the sake
of better monitoring.
3. Locally run checks on the hosts, including whether a given process is
still running.
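
As a rough illustration of the local process check in item 3, here is a
minimal Python sketch. It assumes a Unix host and that the process ID is
already known (for example from a PID file, which is a hypothetical
detail not discussed in the session). The function name
is_process_alive is made up for this example.

```python
import os


def is_process_alive(pid):
    """Return True if a process with the given PID exists (Unix only)."""
    try:
        # Signal 0 performs the existence and permission checks
        # without actually delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No process with this PID.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A cron job could call this and print a warning when it returns False,
which would fit into the existing cron email workflow.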

What notifications do we already get, and what should we add (by service)?
1. CollecTor
 - notifications about errors or warnings in logs
 - learn when the disk is almost full (currently provided by Tor's
Nagios and by a warning in the logs)
 - learn when a collector process has died, either by checking locally
whether the process still exists, by looking at logs for regular
info/notice level entries, or by fetching the index.json and checking
whether the "index_created" timestamp is older than 30 minutes/3 hours
 - learn when a data source has become stale by looking at
"last_modified" timestamps contained in index.json or by looking at the logs
2. OnionPerf
 - Does one or more of the OnionPerf hosts not report recent measurements?
3. Onionoo
 - [deployed] Onionoo has a Nagios warning that fetches a minimal
response and checks timestamps (which is the only way we notice
problems with the bridge authority), but cf. #23984
 - nusenu suggests via email (mostly as an onionoo user):
   - reachability (TCP)
   - service working (HTTP 200 vs. 404, 500,...) (via active probes and
via log monitoring. Increase in 500 status codes?)
   - response times (significantly higher than usual?)
   - data updated? (i.e. onionoo data older than 4-5 hours should
trigger an alert)
   - minimal sanity checks (i.e. /details should contain more than 5k
relay records) [KL: note that we wouldn't have to fetch 5k records for
this, we could just parse relays_skipped.]
4. Statistics (part of metrics-web)
 - [deployed] metrics-web sends a short log twice per day.
5. ExoneraTor
 - [deployed] ExoneraTor sends a message when it finds an existing lock
file, etc.
6. Website (Tor Metrics, plus Atlas, ExoneraTor, Compass etc. until
they're migrated)
7. Bot
8. Notification service
 - Learn when the notification service itself goes down!
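
The "index_created" check listed under CollecTor above could look
roughly like the following Python sketch. It works on an already parsed
index.json, so fetching is left out. It assumes the timestamp is in UTC
and formatted as "YYYY-MM-DD HH:MM"; that format, the function names,
and the 3-hour default are assumptions made for this example, not
decisions from the session.

```python
from datetime import datetime, timedelta, timezone


def is_stale(timestamp, max_age, now=None, fmt="%Y-%m-%d %H:%M"):
    """Return True if a UTC timestamp string is older than max_age."""
    # Assumption: the timestamp is given in UTC without a zone suffix.
    created = datetime.strptime(timestamp, fmt).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created > max_age


def check_index(index, max_age=timedelta(hours=3), now=None):
    """Return a list of warning strings for a parsed index.json dict."""
    warnings = []
    created = index["index_created"]
    if is_stale(created, max_age, now):
        warnings.append("index.json is stale: index_created=" + created)
    return warnings
```

The same is_stale helper could be reused for the "last_modified"
timestamps of individual data sources, or for the Onionoo "data older
than 4-5 hours" suggestion, by passing a different max_age and format.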

What tools should we use to notify us?
 - Right now, we're depending on cron emails and in one case Tor's
Nagios service to notify us. Most of the current notifications are
workarounds and cheap hacks, and they're highly dependent on karsten and
don't scale to other folks in the team.
 - We could add more scripts to Tor's Nagios instance.
 - We could run our own Nagios instance.
 - We could write our own notification service. But how do we access
system information that is not provided publicly via a (web) interface?
 - It doesn't always need a special instance; logback also provides
mailing (I used that on the gone CollecTor instance).
 - Consider Tom's notification tool as a lightweight alternative to
Nagios et al. (see his message to the list)

