Hi nusenu,
On 2018-01-22 18:57, nusenu wrote:
>> Looks like the primary CollecTor instance had a problem between 22:00 and 08:00 UTC. It works again now, as does Onionoo.
> Karsten, thanks for the fast reaction.
>> We didn't lose any data, because the primary CollecTor instance obtained all descriptors it had missed earlier from the backup CollecTor instance.
> Since I'm archiving Onionoo data I'm "losing" data (causing blind spots) every time a "relays_published" timestamp is skipped. In theory one could spin up an Onionoo instance to generate data for skipped timestamps, but in practice this is hard (it requires a lot of resources). (I know, you are probably talking about not losing any raw CollecTor data, but I wanted to mention that nonetheless.)
Right, I meant not losing any raw CollecTor data. Your use case of archiving Onionoo data is special. It's okay that you do this, but it's not what Onionoo was designed for. Most people will find Onionoo data that is 6 or 12 hours behind still useful. But if we had lost 6 or 12 hours of CollecTor data, that would have been pretty bad.
What we can do, though, is think about providing more history in Onionoo, so that you can give up on archiving Onionoo data. After all, Onionoo already provides quite a bit of history, including graph data such as that in bandwidth documents, the time when a relay last changed its IP address or port, the time it was first seen, and so on. If you have ideas for what else would be valuable to have history for, please open a ticket.
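For example, that history is already visible in a single details document. Here's a minimal sketch in Python, assuming the public instance at https://onionoo.torproject.org and the third-party requests library; the nickname is just an example search term:

import requests

# Fetch the details document for one relay; "search" and "limit" are
# standard Onionoo parameters, the nickname is only an example.
resp = requests.get("https://onionoo.torproject.org/details",
                    params={"search": "moria1", "limit": 1})
resp.raise_for_status()
doc = resp.json()

# The timestamp archivers watch; a skipped value here is exactly the
# "blind spot" described above.
print("relays_published:", doc["relays_published"])

for relay in doc["relays"]:
    # History fields Onionoo already provides per relay.
    print("first_seen:", relay.get("first_seen"))
    print("last_changed_address_or_port:",
          relay.get("last_changed_address_or_port"))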
> Do you monitor Onionoo for such problems ("relays_published" timestamp remaining unchanged for >1-2 hours)? Would you find something like that useful?
We do have such monitoring, yes. Here's the Nagios script we're using:
https://gitweb.torproject.org/admin/tor-nagios.git/tree/tor-nagios-checks/ch...
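The script itself isn't reproduced here (the URL above is truncated), but a check in the same spirit could look like the following sketch, which polls the summary document and alerts when "relays_published" falls too far behind. The thresholds and the use of Python are assumptions for illustration, not taken from the actual script:

import sys
from datetime import datetime, timedelta

import requests

WARN_AFTER = timedelta(hours=2)   # assumed threshold
CRIT_AFTER = timedelta(hours=6)   # assumed threshold

try:
    # limit=0 skips the per-relay data but keeps the top-level fields,
    # including relays_published.
    resp = requests.get("https://onionoo.torproject.org/summary",
                        params={"limit": 0}, timeout=30)
    resp.raise_for_status()
    published = datetime.strptime(resp.json()["relays_published"],
                                  "%Y-%m-%d %H:%M:%S")
except Exception as exc:
    print("UNKNOWN: could not query Onionoo: %s" % exc)
    sys.exit(3)

# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
age = datetime.utcnow() - published
if age > CRIT_AFTER:
    print("CRITICAL: relays_published is %s old" % age)
    sys.exit(2)
if age > WARN_AFTER:
    print("WARNING: relays_published is %s old" % age)
    sys.exit(1)
print("OK: relays_published is %s old" % age)
sys.exit(0)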
> Thanks for keeping it running alongside all the other things you do.
> I'm wondering whether the admin team could cover such cases, to reduce the operations load on developers.
The admin team already handles operational issues with the hosts, though the metrics team is still in charge of running the services. I think that's a fine separation, and it has worked quite well for the last couple of years.
> kind regards, nusenu
All the best, Karsten