[metrics-team] collecting onionoo outage reason stats
karsten at torproject.org
Tue Apr 3 15:38:03 UTC 2018
On 2018-03-19 11:50, Karsten Loesing wrote:
> On 2018-03-16 19:26, nusenu wrote:
> Hi nusenu,
>> would be great if you could reply to metrics-alerts notifications with the
>> reason for the outage once it is solved.
>> I'd like to collect the reasons, maybe we can use
>> them to improve and reduce the outages.
> In most cases the reason has been that the machine hosting the CollecTor
> virtual machine had an issue at a time when no sysadmin was around.
> One option to improve the situation is to move to a host that is more
> stable than the one we're currently on. This may be as simple as asking
> to move the virtual machine elsewhere, but only if there's another host
> Another option is to stop relying as much on a single host. We already
> made a huge step in this direction by syncing from a backup CollecTor
> instance, which is also the reason why we're not losing data. But this
> doesn't solve the issue if the primary CollecTor instance goes down.
> Adding even more redundancy requires writing more code, which is
> something we might not have the time for in the next 6 months.
> Maybe there are other, simpler options.
> I'd say let's discuss this more at this week's team meeting.
Quick update: at last week's team meeting we discussed another option,
which is to have Onionoo fetch descriptors from two CollecTor instances.
If one of them goes down temporarily, Onionoo won't be affected. I wrote
some code to do this, and early results look promising. See ticket
#25700 for details.
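To illustrate the idea only, here is a rough Python sketch of fetching from
several instances and tolerating the failure of any one of them. This is not
the code from the ticket: the function name fetch_from_all, the Fetcher
type, and the choice to key descriptors by path are all assumptions made up
for this example.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical type: a fetcher contacts one CollecTor instance and returns
# a mapping from descriptor path to descriptor contents.
Fetcher = Callable[[], Dict[str, bytes]]


def fetch_from_all(fetchers: Iterable[Fetcher]) -> Dict[str, bytes]:
    """Merge descriptors from every reachable instance.

    An instance that raises is skipped; the call only fails if no
    instance could be reached at all.
    """
    merged: Dict[str, bytes] = {}
    errors: List[Exception] = []
    succeeded = 0
    for fetch in fetchers:
        try:
            descriptors = fetch()
        except Exception as exc:
            # One instance being down must not abort the whole run.
            errors.append(exc)
            continue
        succeeded += 1
        for path, content in descriptors.items():
            # Keep the first copy seen; later instances only fill gaps.
            merged.setdefault(path, content)
    if succeeded == 0:
        raise RuntimeError(f"all CollecTor instances failed: {errors}")
    return merged
```

The sketch assumes that the same path on two instances refers to the same
descriptor, so keeping the first copy is enough.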
All the best,