On Thu, Mar 10, 2022 at 08:33:07AM +0000, Georg Koppen wrote:
Hello!
As you might know, we are doing regular (at the moment weekly) scans of exit nodes to find and help with misconfigurations or errors that have potentially serious effects on Tor network usability and performance. So far, after more than a year of scanning, the results show roughly single-digit numbers of exit relays per week with problems, mostly DNS configuration issues (unbound crashed, etc.).
However, this week we suddenly found almost 80 exit relays with malfunctioning DNS resolution[1], which was surprising. Additionally, after some of the servers got fixed, the issue returned. DrWhax (thanks!) pointed us to a possible explanation tweeted by the unredacted folks:
https://twitter.com/unredacted_org/status/1501458345219215363
It seems that someone (intentionally or not) is overwhelming unbound leading to DNS resolution issues for those exit operators that do run this local resolver, which we currently recommend.
I find it interesting that it is possible to crash/DoS unbound through Tor circuits to an exit relay. I would have assumed that other factors would become the limit before unbound did. They posted some CPU graphs on the Twitter page, but it would have been interesting to see some requests/s numbers, if someone has any to share.
We've opened a ticket[2] for further investigation, but I hope this email raises some awareness so that exit operators can keep an eye on the situation.
Feel free to add insights you have to the ticket. Additionally, I bet if someone would share how they do monitoring for such a problem on their exits then a lot of exit operators would be happily picking up that setup and the Tor network would win. :)
I'm using Grafana + Prometheus + node_exporter to monitor my relays. Grafana is a web UI for visualising data. Prometheus is a data collector that scrapes data from node_exporter and stores it for Grafana to fetch. node_exporter is a service that collects a bunch of data and presents it in the same format as the new Tor metrics function.
(When I eventually get Tor daemons recent enough to get anything but emptiness out of the metrics port, I'll add them to Prometheus for scraping as well.)
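For anyone who has not seen what such an exporter presents, here is a rough Python sketch of reading the plain-text format that Prometheus scrapes. The function name parse_metrics and the SAMPLE text are made up for illustration, and the sketch assumes lines carry no trailing timestamps; in a real setup Prometheus does this parsing itself.

```python
# Hypothetical sample of what an exporter might present on its metrics page.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
# TYPE node_network_receive_bytes_total counter
node_network_receive_bytes_total{device="eth0"} 123456
"""


def parse_metrics(text):
    """Return a dict mapping each series (name plus labels) to its value.

    Assumes one sample per line and no trailing timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

Each line is simply a series name, optional labels in braces, and a number, which is why the same tooling can scrape node_exporter and other exporters alike.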
Grafana is great and one can build dashboards that show pertinent information and give a good overview. It is also possible to configure alerts if metrics go outside of specified bounds. I have alerts configured to mail me for a few statistics.
When it comes to unbound monitoring, I use unbound_exporter from the letsencrypt project on GitHub[3]. It works the same way node_exporter does, but exports unbound metrics and can be scraped by Prometheus. To visualise the data, I use a pre-made dashboard for Grafana[4] that I have tweaked a bit.
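As for the requests/s numbers mentioned earlier: exporters generally present query counts as ever-increasing counters, so a rate has to be derived from two samples. Below is a minimal Python sketch of that arithmetic. The function name per_second_rate and the sample numbers are hypothetical, and treating a drop in the counter as a restart is an assumption on my part; in practice Prometheus or Grafana would compute this for you.

```python
def per_second_rate(prev_value, prev_time, cur_value, cur_time):
    """Approximate events per second between two samples of a counter.

    Times are in seconds. A counter that went down is assumed to have been
    reset to zero in between (for example because the resolver restarted),
    so only the current value is counted in that case.
    """
    elapsed = cur_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = cur_value - prev_value
    if delta < 0:
        # Assumed counter reset: count what has accumulated since the reset.
        delta = cur_value
    return delta / elapsed


if __name__ == "__main__":
    # Hypothetical numbers: 3000 more queries seen over a 60 second interval.
    print(per_second_rate(1000.0, 0.0, 4000.0, 60.0))  # 50.0
```

A sudden jump in this number compared to the usual baseline is the kind of thing an alert could be configured for.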
Cordially, Andreas Kempe
[1] https://gitlab.torproject.org/tpo/network-health/team/-/issues/197
[2] https://gitlab.torproject.org/tpo/network-health/analysis/-/issues/30
[3] https://github.com/letsencrypt/unbound_exporter
[4] https://grafana.com/grafana/dashboards/9604