[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu May 24 19:32:47 UTC 2018


#26035: Streamline sample quantile types used in the various modules
--------------------------------+--------------------------------
 Reporter:  karsten             |          Owner:  karsten
     Type:  enhancement         |         Status:  needs_revision
 Priority:  High                |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:  Sponsor13
--------------------------------+--------------------------------
Changes (by iwakeh):

 * status:  needs_review => needs_revision


Comment:

 Replying to [comment:15 karsten]:
 > Replying to [comment:14 iwakeh]:
 > > Taking you up on your offer from comment:13, so I can concentrate on
 reviews and tickets of CollecTor.
 >
 > Alright, happy to implement this change.

 Thanks!

 >
 > Please review
 [https://gitweb.torproject.org/karsten/metrics-web.git/log/?h=task-26035 my task-26035 branch]
 with three commits:
 >
 >  - [https://gitweb.torproject.org/karsten/metrics-web.git/commit/?h=task-26035&id=4f92894a1ee5315b9e4a17b38f3cdb229612f0f1 4f92894]
 changes how we're computing median and inter-quartile range in
 the censorship detector code, which is still written in Python. I tested
 the change by running on our user number estimates. I found that it
 changes 159 of 2447 days in our data (6.5%) and leaves the remaining days
 entirely unchanged. This also makes sense: with a slightly different
 median and inter-quartile range we either include a value or exclude it as
 outlier. I'd say we cannot conclude that one of the implementations is
 correct and the other is not. The new implementation will simply be more
 consistent throughout our code base.

 This looks fine and will make it easier to transfer this into Java later.

 >
 >  - [https://gitweb.torproject.org/karsten/metrics-web.git/commit/?h=task-26035&id=2685c78f13cbf9402d5ba0b4380df03f246e86e5 2685c78]
 makes the same change to our advertised bandwidth statistics.
 Obviously, this changes results a bit, because we're now interpolating
 between actually reported advertised bandwidths rather than returning a
 value that was actually reported by one of the relays. Still, for the sake
 of consistency throughout our code base, we should switch.
 >
 >  - [https://gitweb.torproject.org/karsten/metrics-web.git/commit/?h=task-26035&id=f9c24cab1006bf5999c662e9d06767c59c71a3e6 f9c24ca]
 makes the third change in this series, this time to the
 connbidirect module. The change is quite significant in years 2011 and
 2012 where we had just a handful of relays reporting these statistics.
 Then it does make a difference whether we're interpolating or not. Same
 argument in favor of doing it now.
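
 To make the interpolation effect concrete, here is a minimal sketch in
 plain Java. The class and method names are hypothetical, and the
 non-interpolating variant (taking the element at a truncated index) is
 only one possible convention, assumed here for illustration; it is not
 necessarily what the old module code did.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not code from the task-26035 branch. */
final class InterpolationSketch {

  private InterpolationSketch() {
  }

  /** Median without interpolation: always returns a value from the input. */
  static double observedMedian(double[] values) {
    double[] sorted = values.clone(); // leave the caller's array untouched
    Arrays.sort(sorted);
    return sorted[(sorted.length - 1) / 2];
  }

  /** Median with linear interpolation between the two middle values. */
  static double interpolatedMedian(double[] values) {
    double[] sorted = values.clone();
    Arrays.sort(sorted);
    double position = (sorted.length - 1) / 2.0;
    int lower = (int) Math.floor(position);
    int upper = (int) Math.ceil(position);
    return sorted[lower]
        + (position - lower) * (sorted[upper] - sorted[lower]);
  }
}
```

 With four values 20, 90, 40, 50 the first method returns 40 and the
 second returns 45; with an odd number of values both agree. The fewer
 values there are, the more such differences matter.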

 The advbwdist module has a new static method, which should be made more
 visible to facilitate re-use as well as testing.
 Of course, for re-use it needs to be made more generic and maybe also
 placed in a different class (maybe `**.stats.Utility`).

 Remarks & suggestions in no particular order:
 * the sorting step in advbwdist changes an input parameter, which is bad
 practice.
 * commons-math Percentile class doesn't require the input data to be
 sorted. (The javadoc comment only talks about sorting in order to explain
 what will happen for edge cases.)
 * Maybe rather use `doubleValue()` instead of casting a Number sub-type to
 a primitive.
 * Casting of percentile results could be performed by the caller, which
 could guarantee that only values of, for example, type short are passed
 in (see connbidirect).  Or provide special utility methods that re-use
 code internally.
 * connbidirect uses similar code as advbwdist for almost the same
 computations. The input fraction list also gets changed by the
 unnecessary sorting step (this might not matter in that case, but still).
 * The Java re-implementation of the Python detector will also benefit from
 a percentile function.
 * The percentile input parameter `int[] percentiles` could be changed to
 `int ... percentiles`.
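
 A rough sketch of what such a shared helper could look like, combining
 the suggestions above: varargs for the percentiles, `doubleValue()`
 instead of casts, no modification of the input, and casting of results
 left to the caller. The class name, method name, and signature are
 hypothetical. It is written in plain Java with linear interpolation
 between order statistics as an assumed quantile type, so that it runs
 without dependencies; the real code would presumably delegate to the
 commons-math Percentile class with whichever estimation type we settle
 on.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical utility class; name and placement are assumptions. */
final class PercentileSketch {

  private PercentileSketch() {
  }

  /**
   * Computes the requested percentiles (0 to 100) of the given values,
   * interpolating linearly between neighboring sorted values.
   */
  static double[] percentiles(Collection<? extends Number> values,
      int... percentiles) {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("No values given.");
    }
    // Copy into a new array, so that the caller's collection is never
    // reordered, and use doubleValue() rather than casting.
    double[] sorted = new double[values.size()];
    int index = 0;
    for (Number value : values) {
      sorted[index++] = value.doubleValue();
    }
    Arrays.sort(sorted);
    double[] results = new double[percentiles.length];
    for (int i = 0; i < percentiles.length; i++) {
      int percentile = percentiles[i];
      if (percentile < 0 || percentile > 100) {
        throw new IllegalArgumentException(
            "Percentile out of range: " + percentile);
      }
      double position = (sorted.length - 1) * percentile / 100.0;
      int lower = (int) Math.floor(position);
      int upper = (int) Math.ceil(position);
      results[i] = sorted[lower]
          + (position - lower) * (sorted[upper] - sorted[lower]);
    }
    return results;
  }
}
```

 A caller that knows its inputs are all of type short could then write
 `(short) PercentileSketch.percentiles(values, 50)[0]` and keep the cast
 on its side. Having the function in one place also means a handful of
 unit tests would cover advbwdist, connbidirect, and a future Java
 detector at once.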

 Encapsulation and testability of this type of functionality that is needed
 throughout the code is essential and will also make documentation now and
 in future much easier.
 The functionality should especially be tested because such changes have a
 large impact, i.e., re-computation of everything.  This should be revised.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:16>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

