On 2018-03-16 19:26, nusenu wrote:
> would be great if you could reply to metrics-alerts notifications with the
> reason for the outage once it is solved.
> I'd like to collect the reasons, maybe we can use
> them to improve and reduce the outages.

In most cases the reason has been that the machine hosting the CollecTor
virtual machine had an issue at a time when no sysadmin was around.

One option to improve the situation is to move to a host that is more
stable than the one we're currently on. This may be as simple as asking
to move the virtual machine elsewhere, but only if there's another host

Another option is to stop relying as much on a single host. We already
took a huge step in this direction by syncing from a backup CollecTor
instance, which is also the reason why we're not losing data. But this
doesn't solve the issue if the primary CollecTor instance goes down.
Adding even more redundancy requires writing more code, which is
something we might not have the time for in the next 6 months.

Maybe there are other, simpler options.

I'd say let's discuss this more at this week's team meeting.

