[tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

Tue May 8 15:34:47 UTC 2018

#26035: Streamline sample quantile types used in the various modules
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  metrics-team
     Type:  enhancement         |         Status:  new
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:  Sponsor13
--------------------------------+------------------------------

Comment (by iwakeh):

 Long description, true.

 Let's look at the postgresql implementation (using source of 9.6.8, but
 assuming it is not subject to frequent change,
 ./src/backend/utils/adt/orderedsetaggs.c).

 The computation of percentile_* as C-ish pseudo-code (which could easily
 be used for implementation in Java or python):
 {{{
 #!C
 /*
    All values are sorted and indexed according to the order.
    first and second are indices; val(k) is the value at index k;
    percentile is the wanted percentile.  N is the count of values.
 */

 first = floor(percentile * N);
 second = ceil(percentile * N);
 /* if first==second the value of first is used. */
 if (first==second) {
   result = val(first);
 } else { /* if first and second differ the following interpolation is
 used. */
   proportion = (percentile * N) - first;
   /* the value is chosen between the values of first and second */
   result = val(first) + (proportion * (val(second) - val(first)));
 }

 /*-----------------------------------*/
 /* For comparison percentile_disc:  */
 result = val( ceil(N*percentile));
 }}}

 For values, where fractions make sense, e.g. seconds for onionperf
 results, the interpolation or continuous method could be used and for all
 else, e.g. user numbers etc., the simpler calculation (refered to as
 discontinuous) could be used.

 The classification R-x is not really a DIN and I wouldn't rely on an
 implementation without checking the source.
 Thus, using our own implementation in Java might be less trouble and
 smaller dependency counts/space used.  Python is not used anymore soon,
 and R is only used for the median and either could document the median
 calculation in R or implement the 0.5-percentile.

 Possible steps:
 * keep using postgresql percentile_cont
 * implement the equivalent Java functionality and replace commons-math, if
 this is the only functionality why it is included.
 * document percentile calculation once for Java together with postgresql
 * adapt the R median calculation, if necessary

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online