[metrics-team] Brainstorming better notifications for operational issues on Wednesday, November 1, 14:30 UTC

Wed Nov 1 16:48:57 UTC 2017

Hi,

On 01/11/17 15:47, Karsten Loesing wrote:
> The following notes are the result of a 1.25-hour brainstorming session.
> They should not be seen as final decisions on anything, but rather as
> starting points for further discussions.

Sorry I missed this. I'm still getting the hang of this daylight saving
time thing.

> How we could monitor our services?
> 1. Log warnings or errors whenever we realize we're having an issue, and
> somehow send those errors/warnings via email.

metrics-bot can listen on an Onion service and then relay notifications
to IRC. I'm planning to move the "production" metrics-bot to a DO
droplet soon so it would also be possible to have the notifications sent
over the Internet (ACL restricted).

> 2. Periodically request public resources via web interfaces and perform
> basic smoke tests, without adding specific information just for the sake
> of better monitoring.

metrics-bot already does this for generating microblog status updates
(from Onionoo) and will start doing this for metrics-web CSV files in
the future. It only does this for things that it consumes, though, so
monitoring is a side effect.

> 3. Locally run checks on the hosts, including whether a given process is
> still running.

We can use Nagios Remote Plugin Executor (NRPE) for this if the sysadmin
team is happy with that.

> 1. CollecTor
>  - notifications about errors or warnings in logs

Is there a regular expression we can match on?
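
To make the question concrete, here is a minimal sketch of what such
matching could look like. The log line layout ("date time level
message") is purely an assumption for illustration, as is the helper
name find_problems; the real CollecTor log format may well differ and
the pattern would need adjusting.

```python
import re

# ASSUMPTION: lines look like "2017-11-01 14:31:00,456 WARN  some message".
# This layout is hypothetical; adapt the pattern to the real log format.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\S*\s+"
    r"(?P<level>WARN|ERROR)\b\s*(?P<msg>.*)$"
)


def find_problems(lines):
    """Return (timestamp, level, message) for every WARN or ERROR line."""
    problems = []
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match:
            problems.append(
                (match.group("ts"), match.group("level"), match.group("msg"))
            )
    return problems
```

The returned tuples could then be relayed by whatever notification
channel ends up being chosen.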

>  - learn when the disk almost runs full (currently provided by Tor's
> Nagios and by a warning in the logs)

Is this using NRPE already?

>  - learn when a collector process has died, either by checking locally
> whether the process still exists, by looking at logs for regular
> info/notice level entries, or by fetching the index.json and looking
> whether the "index_created" timestamp is older than 30 minutes/3 hours

We can write a Nagios check for this. It would look very similar to the
existing check for Onionoo (fetching and parsing JSON).
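
A rough sketch of the core of such a check, under a few assumptions:
that "index_created" is a "YYYY-MM-DD HH:MM" string in UTC, and that the
30 minutes/3 hours from the notes map to warning/critical thresholds.
The function name check_index_age is made up here. The return codes are
the usual Nagios plugin exit codes.

```python
import json
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_index_age(index_json, now,
                    warn=timedelta(minutes=30), crit=timedelta(hours=3)):
    """Return (exit_code, message) based on the age of index_created.

    ASSUMPTION: index_created looks like "2017-11-01 15:45" and is UTC.
    """
    try:
        index = json.loads(index_json)
        created = datetime.strptime(
            index["index_created"], "%Y-%m-%d %H:%M"
        ).replace(tzinfo=timezone.utc)
    except (ValueError, KeyError, TypeError) as exc:
        return UNKNOWN, "UNKNOWN: cannot read index_created (%s)" % exc
    age = now - created
    if age > crit:
        return CRITICAL, "CRITICAL: index is %s old" % age
    if age > warn:
        return WARNING, "WARNING: index is %s old" % age
    return OK, "OK: index is %s old" % age
```

A plugin wrapper would fetch index.json over HTTP, call this with
datetime.now(timezone.utc), print the message and exit with the code.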

>  - learn when a data source has become stale by looking at
> "last_modified" timestamps contained in index.json or by looking at the logs

As above.
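
For the staleness case the check would have to walk the index rather
than read a single field. A minimal sketch, assuming index.json nests
"directories" and "files" entries, each file carrying a "last_modified"
string in "YYYY-MM-DD HH:MM" UTC. The function names and the max_age
threshold are hypothetical; the notes do not give a threshold for this.

```python
from datetime import datetime, timedelta, timezone

FORMAT = "%Y-%m-%d %H:%M"  # ASSUMPTION: timestamp layout used in index.json


def newest_last_modified(node):
    """Return the newest last_modified found below node, or None."""
    stamps = [
        datetime.strptime(f["last_modified"], FORMAT).replace(tzinfo=timezone.utc)
        for f in node.get("files", [])
        if "last_modified" in f
    ]
    for sub in node.get("directories", []):
        newest = newest_last_modified(sub)
        if newest is not None:
            stamps.append(newest)
    return max(stamps) if stamps else None


def stale_sources(node, now, max_age):
    """Return paths of node's subdirectories with nothing newer than max_age."""
    stale = []
    for sub in node.get("directories", []):
        newest = newest_last_modified(sub)
        if newest is None or now - newest > max_age:
            stale.append(sub["path"])
    return sorted(stale)
```

The caller picks which node of the index to inspect, so that each
subdirectory of that node corresponds to one data source.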

> 2. OnionPerf
>  - Does one or more of the OnionPerf hosts not report recent measurements?

As above, but parsing the HTML (my preference would be to do this with
bs4, it's in Debian stable).

> 3. Onionoo
>  - [deployed] Onionoo has a Nagios warning that fetches a minimal
> response and checks timestamps (which is the only way how we notice
> problems with the bridge authority), but cf. #23984
>  - nusenu suggests via email (mostly as an onionoo user):
>    - reachability (TCP)
>    - service working (HTTP 200 vs. 404, 500,...) (via active probes and
> via log monitoring. Increase in 500 status codes?)
>    - response times (significantly higher than usual?)
>    - data updated? (i.e. onionoo data older than 4-5 hours should
> trigger an alert)
>    - minimal sanity checks (i.e. /details should contain more than 5k
> relay records) [KL: note that we wouldn't have to fetch 5k records for
> this, we could just parse relays_skipped.]

All of this could be implemented in the Nagios check.
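
As an illustration of the freshness and sanity parts, here is a sketch
that evaluates an already-parsed Onionoo response. It assumes the
request used an offset so that the response carries "relays_skipped",
and that "relays_published" is a "YYYY-MM-DD HH:MM:SS" UTC string. The
function name and defaults are made up; the 5 hours and 5000 relays come
from the suggestions quoted above. This is not the deployed check.

```python
from datetime import datetime, timedelta, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_onionoo(doc, now, max_age=timedelta(hours=5), min_relays=5000):
    """Return (exit_code, message) for a parsed Onionoo response dict.

    ASSUMPTION: the request skipped relays via an offset parameter, so
    relays_skipped tells us how many relays exist without fetching them.
    """
    try:
        published = datetime.strptime(
            doc["relays_published"], "%Y-%m-%d %H:%M:%S"
        ).replace(tzinfo=timezone.utc)
    except (KeyError, ValueError, TypeError) as exc:
        return UNKNOWN, "UNKNOWN: cannot read relays_published (%s)" % exc
    age = now - published
    if age > max_age:
        return CRITICAL, "CRITICAL: data is %s old" % age
    skipped = doc.get("relays_skipped", 0)
    if skipped < min_relays:
        return CRITICAL, "CRITICAL: only %d relays skipped" % skipped
    return OK, "OK: data is %s old, %d relays skipped" % (age, skipped)
```

Reachability, status codes and response times would be handled by the
code that performs the HTTP request before this function is called.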

> 4. Statistics (part of metrics-web)
>  - [deployed] metrics-web sends a short log twice per day,

Is the log secret? Is there a regex we can match on?

If we can publish the log and have it fetched by a Nagios plugin, no one
has to read it every time.

> 5. ExoneraTor
>  - [deployed] ExoneraTor sends a message when it finds an existing lock
> file, etc.

Does this happen often?

> 6. Website (Tor Metrics, plus Atlas, ExoneraTor, Compass etc. until
> they're migrated)

We should come up with a list of test URLs and expected responses,
response times, etc.
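
Such a list could be kept as plain data and run by a small loop. The
sketch below assumes a table of checks with expected status, an expected
substring and a maximum response time; the single entry shown and all
names and thresholds are placeholders, not an agreed list.

```python
import time
import urllib.error
import urllib.request

# HYPOTHETICAL example entry; the real list still has to be written.
CHECKS = [
    {"url": "https://metrics.torproject.org/", "status": 200,
     "contains": "Tor Metrics", "max_seconds": 5.0},
]


def fetch(url, timeout=30):
    """Return (status, body, seconds); status is None if unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.status
            body = resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        status, body = err.code, ""
    except urllib.error.URLError:
        status, body = None, ""
    return status, body, time.monotonic() - start


def run_checks(checks, fetch=fetch):
    """Return a list of (url, reason) for every check that failed."""
    failures = []
    for check in checks:
        status, body, seconds = fetch(check["url"])
        if status != check["status"]:
            failures.append((check["url"], "status %s" % status))
        elif check.get("contains") and check["contains"] not in body:
            failures.append((check["url"], "missing expected text"))
        elif seconds > check["max_seconds"]:
            failures.append((check["url"], "took %.1fs" % seconds))
    return failures
```

Passing the fetch function in as a parameter keeps the loop testable
without network access.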

> 7. Bot

This could be complicated, as there are many functions in the bot. For
now I don't think that this needs to be considered, and we can revisit
if/when it moves to a Tor machine.

> 8. Notification service
>  - Learn when the notification service itself goes down!

What would we test for and how? This would depend on the tool.

I'd rather not start thinking about the exact tool just yet, but that
was a good list of options that we can think about in the future.

Thanks,
Iain.
