[metrics-team] collecting onionoo outage reason stats

Karsten Loesing karsten at torproject.org
Mon Mar 19 10:50:23 UTC 2018


On 2018-03-16 19:26, nusenu wrote:
> Hi,

Hi nusenu,

> would be great if you could reply to metrics-alerts notifications with the
> reason for the outage once it is solved.
> I'd like to collect the reasons, maybe we can use
> them to improve and reduce the outages.

In most cases the reason has been that the machine hosting the CollecTor
virtual machine had an issue at a time when no sysadmin was around.

One option to improve the situation is to move to a host that is more
stable than the one we're currently on. This may be as simple as asking
to move the virtual machine elsewhere, but only if there's another host
available.

Another option is to stop relying as much on a single host. We already
made a huge step in this direction by syncing from a backup CollecTor
instance, which is also the reason why we're not losing data. But this
doesn't solve the issue if the primary CollecTor instance goes down.
Adding even more redundancy requires writing more code, which is
something we might not have the time for in the next 6 months.

Maybe there are other, simpler options.

I'd say let's discuss this more at this week's team meeting.

> thanks,
> nusenu

All the best,
Karsten
