[tor-bugs] #32135 [Metrics/Statistics]: Write BridgeDB metrics parser and analyse existing data

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Dec 16 18:34:39 UTC 2019


#32135: Write BridgeDB metrics parser and analyse existing data
--------------------------------+--------------------------------
 Reporter:  phw                 |          Owner:  phw
     Type:  task                |         Status:  needs_revision
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:  s30-o21a1           |  Actual Points:
Parent ID:  #31274              |         Points:  2
 Reviewer:                      |        Sponsor:
--------------------------------+--------------------------------
Changes (by phw):

 * status:  needs_review => needs_revision


Comment:

 Thanks for your work on this!

 Replying to [comment:6 karsten]:
 > Okay, I finished a first
 [https://gitweb.torproject.org/user/karsten/metrics-web.git/commit/?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d
 patch] that processes BridgeDB metrics once per day to produce a .csv file
 and that adds two graphs to Tor Metrics. Can you please take a look at
 that patch, not regarding the Java/R code, but regarding user-facing
 documentation of the two new graphs? In particular, please take a look at
 the `TODO`s in that patch. (irl, I'll ask you to review a revised branch
 for the code portions once the documentation parts are all set.)
 [[br]]
 [https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/json/metrics.json?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d
 Commit 93f2500c]:

 For bridgedb-transport, I would change the title to:
 {{{
 "BridgeDB requests for each bridge type"
 }}}
 ...and the description to:
 {{{
 "<p>This graph shows the number of BridgeDB requests for each bridge type.
 BridgeDB requests over Tor and unsuccessful requests (e.g., invalid emails
 or incorrect CAPTCHAs) are not included in these numbers.</p>"
 }}}

 For bridgedb-distribution, I would change the title to:
 {{{
 "BridgeDB requests for each distribution method"
 }}}
 ...and the description to:
 {{{
 "<p>This graph shows the number of BridgeDB requests for each distribution
 method. HTTPS requests over Tor and unsuccessful requests (e.g., invalid
 emails or incorrect CAPTCHAs) are not included in these numbers.</p>"
 }}}

 Here are my changes to
 [https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/jsps/reproducible-metrics.jsp?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d
 commit 93f2500c]:

 {{{
 <h3 id="bridgedb-stats" class="hover">BridgeDB requests
 <a href="#bridgedb-stats" class="anchor">#</a>
 </h3>

 <p>BridgeDB metrics contain aggregated information about requests to the
 BridgeDB service.  BridgeDB keeps track of each request per distribution
 method (HTTPS, moat, email), per bridge type (e.g., vanilla or obfs4), per
 country code or email provider (e.g., "ru" or "gmail"), and per request
 success ("success" or "fail").  Every 24 hours, BridgeDB writes these
 metrics to disk and then begins a new measurement interval.</p>

 <p>The following description applies to the following graphs:</p>

 <ul>
 <li>BridgeDB requests by bridge type <a href="/bridgedb-transport.html"
 class="btn btn-primary btn-xs"><i class="fa fa-chevron-right"
 aria-hidden="true"></i> graph</a></li>
 <li>BridgeDB requests by distribution method <a
 href="/bridgedb-distribution.html" class="btn btn-primary btn-xs"><i
 class="fa fa-chevron-right" aria-hidden="true"></i> graph</a></li>
 </ul>

 <h4>Step 1: Parse BridgeDB metrics to obtain reported request numbers</h4>

 <p>Obtain BridgeDB metrics from <a
 href="/collector.html#type-bridgedb-metrics">CollecTor</a>.
 Refer to the <a
 href="https://gitweb.torproject.org/bridgedb.git/tree/doc/bridgedb-metrics-spec.txt">BridgeDB
 metrics specification</a> for details on the descriptor format.</p>

 <h4>Step 2: Skip requests coming in over Tor exits</h4>

 <p>Skip any request counts with <code>zz</code> as their
 <code>CC/EMAIL</code> metrics key part.  We use the <code>zz</code> pseudo
 country code for requests originating from Tor exit relays.  We're
 discarding these requests because <a
 href="https://bugs.torproject.org/32117">bots use the Tor network to crawl
 BridgeDB</a> and including bot requests would provide a false sense of how
 users interact with BridgeDB.  Note that BridgeDB maintains a separate
 distribution pool for requests coming from Tor exit relays.</p>

 <h4>Step 3: Aggregate requests by date, distribution method, and bridge
 type</h4>

 <p>BridgeDB metrics contain request numbers broken down by distribution
 method, bridge type, and a few more dimensions.  For our purposes we only
 care about total request numbers by date and either distribution method or
 bridge type.  We're using request sums by these three dimensions as
 aggregates.  As date we're using the date of the BridgeDB metrics interval
 end.  If we encounter more than one BridgeDB metrics interval end on the
 same UTC date (which shouldn't be possible with an interval length of 24
 hours), we arbitrarily keep whichever we process first.</p>

 </div>

 <div class="container">
 }}}
 I wasn't sure what `TODO If we're supposed to "unbin" numbers, this is
 probably where we should say that.` meant, so I deleted the line. Is this
 about the `bin_size/2` modification you mentioned above?
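
 To make Steps 2 and 3 of the text above easier to check, here is a minimal
 sketch in Python (not the Java code from the patch). Everything about its
 shape is an assumption for illustration: the function name
 `aggregate_requests` is hypothetical, descriptors are assumed to be
 already parsed into (interval end date, {metrics key: count}) pairs, and
 the metrics key is assumed to be dot-separated with distribution method,
 bridge type, CC/EMAIL, and success/fail as its first four parts. Please
 check that layout against the BridgeDB metrics specification. Dropping
 "fail" counts follows the graph descriptions proposed above, not Step 2.

```python
from collections import defaultdict


def aggregate_requests(descriptors):
    """Sum request counts by (date, distribution) and (date, bridge type).

    `descriptors` is an iterable of (interval_end_date, counts) pairs, where
    `counts` maps an assumed dot-separated metrics key to a request count.
    """
    by_distribution = defaultdict(int)
    by_bridge_type = defaultdict(int)
    seen_dates = set()

    for end_date, counts in descriptors:
        # More than one interval end on the same UTC date should not happen;
        # if it does, keep whichever descriptor was processed first.
        if end_date in seen_dates:
            continue
        seen_dates.add(end_date)

        for key, count in counts.items():
            # Assumed key layout: DISTRIBUTION.BRIDGE_TYPE.CC/EMAIL.OUTCOME[...]
            distribution, bridge_type, cc_or_email, outcome = key.split(".")[:4]
            # Step 2: skip requests that came in over Tor exits ("zz").
            if cc_or_email == "zz":
                continue
            # Unsuccessful requests are excluded from the graphs.
            if outcome != "success":
                continue
            # Step 3: aggregate by date and one further dimension.
            by_distribution[(end_date, distribution)] += count
            by_bridge_type[(end_date, bridge_type)] += count

    return dict(by_distribution), dict(by_bridge_type)
```

 The two returned dictionaries correspond to the two graphs: one keyed by
 date and distribution method, the other by date and bridge type.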

 In
 [https://gitweb.torproject.org/user/karsten/metrics-web.git/diff/src/main/resources/web/jsps/stats.jsp?h=task-32135&id=93f2500cabf22fdf03d109bb7855445b18afd62d
 commit 93f2500c], I would replace "transport" with "bridge type" (because
 we include vanilla, which is technically the absence of a transport
 protocol) and "distribution" with "distribution method". I would also
 change:
 {{{
 <li><b>transport:</b> Name of the pluggable transport protocol, which
 includes <code>"obfs2"</code>, <code>"obfs3"</code>, <code>"obfs4"</code>,
 <code>"scramblesuit"</code>, and <code>"fte"</code>, and which will change
 in the future.</li>
 }}}
 to
 {{{
 <li><b>transport:</b> Name of the bridge type, which includes
 <code>"vanilla"</code>, <code>"obfs2"</code>, <code>"obfs3"</code>,
 <code>"obfs4"</code>, <code>"scramblesuit"</code>, and <code>"fte"</code>,
 and which will change in the future.</li>
 }}}
 We may want to change the column's name to something like "bridge_type"
 but I think it's also ok to keep it.
 [[br]]
 > By the way, while reading your code, I found that you're only looking at
 BridgeDB metrics files in CollecTor's `recent/` directory. There's
 currently a (minor) bug in CollecTor where we never remove files from that
 directory. I'm going to fix that at some point, and then your script will
 only provide the latest three files. A possible fix would be to also
 process files in CollecTor's `archive/` directory. Not sure how much of an
 issue that is when these graphs exist on Tor Metrics, but I thought I
 should let you know.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/32135#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

