[tor-bugs] #30006 [Applications/Quality Assurance and Testing]: Monitor "aliveness" of default bridges in Tor Browser

Wed Apr 3 20:12:04 UTC 2019

#30006: Monitor "aliveness" of default bridges in Tor Browser
-------------------------------------------------+-------------------------
 Reporter:  phw                                  |          Owner:  phw
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Applications/Quality Assurance and   |        Version:
  Testing                                        |
 Severity:  Normal                               |     Resolution:
 Keywords:  default bridge                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by anarcat):

 Some more details on how this works. Prometheus is just a
 scraping/alerting system and relies on "exporters" to do the work. For
 example, we have "node exporters" installed on every TPA machine which
 provide stats like disk, CPU, and memory usage and also have "apache
 exporters" which provide internal stats on webservers as well. Details of
 that deployment are in #29681.

 The exporter that seems to fit the bill of "probe a TCP port for liveness"
 is the [https://github.com/prometheus/blackbox_exporter blackbox
 exporter]. It could be deployed on the Prometheus server and check each
 public tor bridge for reachability. The blackbox exporter is not very well
 documented (not surprising considering its name), so I found more
 documentation on how it works
 [https://utcc.utoronto.ca/~cks/space/blog/sysadmin/PrometheusBlackboxNotes
 here] and
 [https://michael.stapelberg.ch/posts/2016-01-01-prometheus-blackbox-exporter/ here].

 The example you pasted was run on my home workstation, and was simply a
 matter of running:

 {{{
 apt install prometheus-blackbox-exporter
 }}}

 The exporter supports probing arbitrary hosts on the fly like this. The
 final targets would need to be added to the
 [https://github.com/prometheus/blackbox_exporter/blob/master/CONFIGURATION.md
 configuration file] (see also
 [https://github.com/prometheus/blackbox_exporter/blob/master/example.yml
 this example]). This could all be done somewhat automatically as well,
 with a cron job polling the list of bridges from some canonical location.
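
 A minimal sketch of what an on-the-fly probe looks like from a script,
 assuming the exporter listens on its default port (9115) on localhost and
 that a `tcp_connect` module is defined in its configuration, as in the
 upstream example file. The helper names are made up for illustration.

```python
from urllib.parse import urlencode


def probe_url(target, module="tcp_connect", exporter="http://localhost:9115"):
    """Build the blackbox exporter URL that probes `target` on demand.

    The exporter address and module name are assumptions; adjust them to
    match the actual deployment.
    """
    query = urlencode({"target": target, "module": module})
    return f"{exporter}/probe?{query}"


def probe_succeeded(metrics_text):
    """Return True if the exporter's text output reports `probe_success 1`."""
    for line in metrics_text.splitlines():
        # Comment lines start with '#', so they never match this prefix.
        if line.startswith("probe_success "):
            return float(line.split()[1]) == 1.0
    # No probe_success sample at all: treat the probe as failed.
    return False
```

 The text returned by fetching `probe_url("192.0.2.1:443")` (for example
 with `urllib.request.urlopen`) can be passed straight to
 `probe_succeeded()`.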

 The blackbox exporter is pretty powerful: in theory, we could make it do a
 simple send/expect dialog to verify the other end is really a Tor server,
 if that would be useful.

 Once the exporter is set up, the Prometheus server would be configured to
 scrape those metrics, which would be collected every "scrape interval"
 (currently 15 seconds).
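
 One possible way to feed the bridge list to the Prometheus server is its
 file-based service discovery, which reads a JSON list of target groups.
 The sketch below only generates that JSON; it assumes the scrape job
 pointing at the blackbox exporter is configured separately (including the
 usual relabeling that passes each address to the exporter as the `target`
 parameter). The function name and label values are hypothetical.

```python
import json


def file_sd_document(targets, labels=None):
    """Return a file_sd JSON string listing `targets` as one target group.

    `targets` is an iterable of "host:port" strings; duplicates are
    dropped and the result is sorted so that repeated cron runs produce
    identical files when the bridge list has not changed.
    """
    group = {"targets": sorted(set(targets)), "labels": labels or {}}
    return json.dumps([group], indent=2)
```

 A cron job could write this output to a file that the scrape job lists
 under `file_sd_configs`; Prometheus picks up changes to such files without
 a restart.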

 Note that we do not have alerting capabilities yet: this is still handled
 by Icinga (a fork of Nagios; see #29864 and #29863 for that
 discussion). Instead, we could make a Grafana dashboard that displays
 those metrics. There are a few dashboards that exist already that process
 those metrics out of the box, but they would probably require at least
 some tweaking:

  * https://grafana.com/dashboards/5990
  * https://grafana.com/dashboards/5345
  * https://grafana.com/dashboards/7587
  * full list:
 https://grafana.com/dashboards?dataSource=prometheus&search=blackbox

 I'm not sure alerting is really a necessity. It might be sufficient to
 check that dashboard as part of the release process, for example.

 The open questions for me are:

  1. is this the metrics team's responsibility? or TPA's?
  2. what is the canonical reference for the list of public bridges?
 [https://gitweb.torproject.org/builders/tor-browser-build.git/plain/projects/tor-browser/Bundle-Data/PTConfigs/bridge_prefs.js
 this javascript file]? how stable is that file format? do I need to parse
 it as javascript or can I get away with a regex?
  3. what is the threshold for failure? say we ping the bridge every 15
 seconds: how many failed probes over what time period count as a
 failure? one example would be fewer than 50% of probes succeeding in the
 last day. we could also check latency
  4. are latency metrics sensitive? currently, the Prometheus metrics are
 more or less publicly accessible, so if this is implemented, it would
 expose the latency of those hosts which could be leveraged for correlation
 attacks (although arguably *anyone* could run a similar setup and do a
 similar attack). if we are worried about this, a separate Prometheus
 server could be deployed with stronger security. (see also the discussion
 in #29863)
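
 To make questions 2 and 3 more concrete, here is a rough sketch of the
 "regex" option and of the 50% threshold. It assumes each bridge appears in
 the file as a `pref("...default_bridge.<type>.<n>", "<transport>
 <addr>:<port> ...");` line; that shape has not been checked against the
 current file and could change. The names `BRIDGE_RE`,
 `extract_bridge_targets` and `is_failing` are made up for illustration.

```python
import re

# Assumed line shape (hypothetical, not verified against the real file):
#   pref("extensions.torlauncher.default_bridge.obfs4.1",
#        "obfs4 192.0.2.1:443 <fingerprint> cert=... iat-mode=0");
BRIDGE_RE = re.compile(
    r'pref\("[^"]*default_bridge\.[^"]*",\s*'
    r'"(?P<transport>\S+)\s+'
    r'(?P<addr>\[[0-9A-Fa-f:]+\]:\d+|[0-9.]+:\d+)'
)


def extract_bridge_targets(js_text):
    """Return (transport, "host:port") pairs found in bridge_prefs.js text."""
    return [(m.group("transport"), m.group("addr"))
            for m in BRIDGE_RE.finditer(js_text)]


def is_failing(results, threshold=0.5):
    """True if the share of successful probes in `results` is below threshold.

    `results` is a sequence of booleans, one per probe in the window.
    An empty window counts as failing, since nothing showed the bridge
    to be alive.
    """
    if not results:
        return True
    return sum(results) / len(results) < threshold
```

 Entries whose address is only a placeholder (as may be the case for some
 transports) would need to be filtered out before probing. If the data lives
 in Prometheus, the same threshold could probably be written as a query
 along the lines of `avg_over_time(probe_success[1d]) < 0.5` instead of
 being computed in a script.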

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/30006#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online