[metrics-team] Brainstorming better notifications for operational issues on Wednesday, November 1, 14:30 UTC

Karsten Loesing karsten at torproject.org
Thu Nov 2 10:22:27 UTC 2017


On 2017-11-01 17:48, Iain R. Learmonth wrote:
> Hi,

Hi!

> On 01/11/17 15:47, Karsten Loesing wrote:
>> The following notes are the result of a 1.25-hour brainstorming session.
>> They should not be seen as final decisions on anything, but rather as
>> starting points for further discussions.
> 
> Sorry I missed this. I'm still getting the hang of this daylight saving
> time thing.

No worries. And thanks for adding your thoughts below. I'll respond to
some of them inline, but we should probably take this thread to another
pad session, with all those open questions.

>> How could we monitor our services?
>> 1. Log warnings or errors whenever we realize we're having an issue, and
>> somehow send those errors/warnings via email.
> 
> metrics-bot can listen on an Onion service and then relay notifications
> to IRC. I'm planning to move the "production" metrics-bot to a DO
> droplet soon so it would also be possible to have the notifications sent
> over the Internet (ACL restricted).

(I'm not entirely sure what you have in mind here.)

>> 2. Periodically request public resources via web interfaces and perform
>> basic smoke tests, without adding specific information just for the sake
>> of better monitoring.
> 
> metrics-bot already does this for generating microblog status updates
> (from Onionoo) and will start doing this for metrics-web CSV files in
> the future. It's only doing it for things that it consumes though,
> monitoring is a side-effect.

Sounds good.

>> 3. Locally run checks on the hosts, including whether a given process is
>> still running.
> 
> We can use Nagios Remote Plugin Executor (NRPE) for this if the sysadmin
> team is happy with that.

Fine question; we could find out.
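
If they are happy with it, the local process check could be a small
NRPE-executed plugin along these lines. This is only a sketch: matching
on a cmdline substring ("collector" below would be a placeholder) is an
assumption about how the process is started, and scanning /proc makes it
Linux-only.

```python
import os

# Sketch of a local "is the collector process still running?" check
# that NRPE could execute on the host.  The substring to match on is
# an assumption about how the process is started.

def pids_matching(substring):
    """Return the pids whose command line contains substring."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % entry, "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue  # process exited while we were scanning
        if substring in cmdline:
            pids.append(int(entry))
    return pids

def nagios_result(pids):
    """Map the match count to a Nagios (exit_code, message) pair:
    0 = OK, 2 = CRITICAL."""
    if pids:
        return 0, "OK: %d matching process(es)" % len(pids)
    return 2, "CRITICAL: process not running"
```

A real plugin would just call sys.exit() with the first element of
nagios_result(pids_matching("collector")).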

>> 1. CollecTor
>>  - notifications about errors or warnings in logs
> 
> Is there a regular expression we can match on?

We could probably create one, yes.
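
For logback-style output, the filter could be as simple as this sketch
(the "timestamp LEVEL message" layout is an assumption about our actual
log format and would need checking against real CollecTor logs):

```python
import re

# Match lines at warn level or above; the timestamped logback-style
# layout is an assumption about CollecTor's log format.
ALERT_LINE = re.compile(r'^\d{4}-\d{2}-\d{2} [\d:,.]+ (WARN|ERROR)\b')

def find_alerts(lines):
    """Return only the log lines we would want a notification about."""
    return [line for line in lines if ALERT_LINE.match(line)]
```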

>>  - learn when the disk almost runs full (currently provided by Tor's
>> Nagios and by a warning in the logs)
> 
> Is this using NRPE already?

That's again a question for the admins, which we should ask when we
have a better idea of what we need.

>>  - learn when a collector process has died, either by checking locally
>> whether the process still exists, by looking at logs for regular
>> info/notice level entries, or by fetching the index.json and looking
>> whether the "index_created" timestamp is older than 30 minutes/3 hours
> 
> We can write a Nagios check for this. It would look very similar to the
> existing check for Onionoo (fetching and parsing JSON).

True. I'm a big fan of that idea, because it doesn't require us to make
any changes to existing instances.

I think iwakeh is more in favor of doing something with logs, which
would allow us to monitor things more closely but would require access
to the hosts.
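
To illustrate the Nagios variant, the freshness logic could look like
this sketch. The "YYYY-MM-DD HH:MM" format of "index_created" is an
assumption to verify against a real index.json, and the thresholds are
the 30 minutes/3 hours from the notes above.

```python
from datetime import datetime, timedelta

# Nagios-style freshness check for CollecTor's index.json.  The
# timestamp format is an assumption; check a real index.json first.
WARN = timedelta(minutes=30)
CRIT = timedelta(hours=3)

def index_status(index, now):
    """Map the age of index_created to a Nagios state.

    index -- the parsed index.json as a dict
    now   -- current time as a datetime (UTC)
    Returns (exit_code, message): 0 = OK, 1 = WARNING, 2 = CRITICAL.
    """
    created = datetime.strptime(index["index_created"], "%Y-%m-%d %H:%M")
    age = now - created
    if age > CRIT:
        return 2, "CRITICAL: index_created %s old" % age
    if age > WARN:
        return 1, "WARNING: index_created %s old" % age
    return 0, "OK: index_created %s old" % age
```

A complete plugin would fetch and json.loads() the index over HTTPS and
sys.exit() with the returned code.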

>>  - learn when a data source has become stale by looking at
>> "last_modified" timestamps contained in index.json or by looking at the logs
> 
> As above.
> 
>> 2. OnionPerf
>>  - Does one or more of the OnionPerf hosts not report recent measurements?
> 
> As above, but parsing the HTML (my preference would be to do this with
> bs4, it's in Debian stable).
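
Makes sense. A bs4 sketch for pulling the measurement file names out of
a directory listing could look like this; the Apache-style index page
and the .tpf suffix are assumptions about how the hosts publish their
results.

```python
from bs4 import BeautifulSoup  # python3-bs4 in Debian stable

def measurement_files(html):
    """Extract measurement file names from an OnionPerf directory
    listing.  The listing format and .tpf suffix are assumptions."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get("href") for a in soup.find_all("a")
            if a.get("href", "").endswith(".tpf")]
```

If the file names embed dates, the freshness check would then just
compare the newest name (or its Last-Modified header) against a
threshold, as in the index.json case.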
> 
>> 3. Onionoo
>>  - [deployed] Onionoo has a Nagios warning that fetches a minimal
>> response and checks timestamps (which is the only way we notice
>> problems with the bridge authority), but cf. #23984
>>  - nusenu suggests via email (mostly as an onionoo user):
>>    - reachability (TCP)
>>    - service working (HTTP 200 vs. 404, 500,...) (via active probes and
>> via log monitoring. Increase in 500 status codes?)
>>    - response times (significantly higher than usual?)
>>    - data updated? (i.e. onionoo data older than 4-5 hours should
>> trigger an alert)
>>    - minimal sanity checks (i.e. /details should contain more than 5k
>> relay records) [KL: note that we wouldn't have to fetch 5k records for
>> this, we could just parse relays_skipped.]
> 
> All of this could be implemented in the Nagios check.

Agreed.
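
As a sketch of how nusenu's checks could be folded into one plugin:
fetching /details?limit=0&offset=5000 and reading "relays_skipped"
(to confirm at least 5000 relays without downloading them) follows the
remark in the notes, but the exact parameters, the 5-hour staleness
threshold, and the 10-second response-time budget are assumptions.

```python
import json
from datetime import datetime, timedelta

MAX_AGE = timedelta(hours=5)      # nusenu: alert after 4-5 hours
MIN_RELAYS = 5000                 # minimal sanity check
SLOW = 10.0                       # seconds; threshold is an assumption

def onionoo_status(http_code, response_time, body, now):
    """Combine nusenu's checks into one Nagios (exit_code, message).

    http_code     -- status of the /details?limit=0&offset=5000 request
    response_time -- seconds the request took
    body          -- response body (JSON text)
    now           -- current time as a datetime (UTC)
    """
    if http_code != 200:
        return 2, "CRITICAL: HTTP %d" % http_code
    if response_time > SLOW:
        return 1, "WARNING: slow response (%.1fs)" % response_time
    doc = json.loads(body)
    published = datetime.strptime(doc["relays_published"],
                                  "%Y-%m-%d %H:%M:%S")
    if now - published > MAX_AGE:
        return 2, "CRITICAL: data %s old" % (now - published)
    if doc.get("relays_skipped", 0) < MIN_RELAYS:
        return 1, "WARNING: fewer than %d relays" % MIN_RELAYS
    return 0, "OK"
```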

>> 4. Statistics (part of metrics-web)
>>  - [deployed] metrics-web sends a short log twice per day,
> 
> Is the log secret?

Fine question! Maybe! It may contain parts that we found too sensitive
to keep in sanitized descriptors, and those are certainly secret. We
could split up such log messages into secret ones on info level and
non-secret ones on warn level, and only publish warn and error logs. But
we might miss something there. Maybe we should assume that logs remain
secret.

> Is there a regex we can match on?

Not really. It's log output from various tools. But I think that's
nothing we should attempt to solve in the monitoring tool; it's
something we need to solve by cleaning up metrics-web first.

> If we can publish the log and have it fetched by a Nagios plugin, no one
> has to read them every time.
> 
>> 5. ExoneraTor
>>  - [deployed] ExoneraTor sends a message when it finds an existing lock
>> file, etc.
> 
> Does this happen often?

Only when it breaks. Every few months?

>> 6. Website (Tor Metrics, plus Atlas, ExoneraTor, Compass etc. until
>> they're migrated)
> 
> We should come up with a list of test URLs and expected responses,
> response times, etc.

Yes, good idea.
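
The list could start as something like this sketch; the URLs are just
the obvious front pages and the status codes, timeout and response-time
budget are placeholders to be filled in once we agree on the list.

```python
import time
import urllib.request

# Smoke-test URL list; entries and thresholds are placeholders.
TESTS = [
    ("https://metrics.torproject.org/", 200),
    ("https://atlas.torproject.org/", 200),
    ("https://exonerator.torproject.org/", 200),
]

def check_url(url, expected_status, timeout=10.0):
    """Fetch url and report (ok, detail) against the expectation."""
    start = time.time()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status = resp.getcode()
    except Exception as exc:
        return False, "%s: %s" % (url, exc)
    elapsed = time.time() - start
    if status != expected_status:
        return False, "%s: got %d, expected %d" % (url, status,
                                                   expected_status)
    return True, "%s: %d in %.1fs" % (url, status, elapsed)
```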

>> 7. Bot
> 
> This could be complicated, as there are many functions in the bot. For
> now I don't think that this needs to be considered, and we can revisit
> if/when it moves to a Tor machine.

Okay. I guess I thought of something very simple, like seeing if it's
still alive, just like the website checks above. But I'm happy to keep
this out for now.

>> 8. Notification service
>>  - Learn when the notification service itself goes down!
> 
> What would we test for and how? This would depend on the tool.

Test that the notification service is still alive. Bad news if it dies
and we don't get any notifications about all the other stuff.

> I'd rather not start thinking about the exact tool just yet, but that
> was a good list of options that we can think about in the future.

Sounds good!

Let's schedule a follow-up meeting to move this forward. I'll bring this
up today at the team meeting (attention: the UTC time stayed the same,
your clock may have changed! :))

> Thanks,
> Iain.

All the best,
Karsten
