[tor-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

Wed May 9 08:38:28 UTC 2018

#26035: Streamline sample quantile types used in the various modules
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  metrics-team
     Type:  enhancement         |         Status:  new
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:  Sponsor13
--------------------------------+------------------------------

Comment (by karsten):

 Thanks for this thoughtful response! I have just a few more thoughts on
 this:

 I'm not entirely sure how that pseudocode handles `percentile` values of
 0.0 and 1.0. Is `val()` 0-indexed or 1-indexed? Either way, one of the two
 percentiles ends up indexing outside the range of values. (PostgreSQL's
 `PERCENTILE_CONT` does support percentiles 0.0 and 1.0.)
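
 To make the boundary question concrete, here is a minimal sketch of `R-7`
 with 0-based indexing, where the upper index is clamped so that 1.0 stays
 inside the values. The name `quantile_r7` is made up for illustration and
 is not taken from the pseudocode or from any of our code.

```python
import math

def quantile_r7(values, p):
    """Linear-interpolation quantile (R type 7), 0-based indexing."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    h = (len(xs) - 1) * p            # fractional 0-based position
    lo = math.floor(h)
    hi = min(lo + 1, len(xs) - 1)    # clamp so that p = 1.0 stays in range
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

print(quantile_r7([3, 1, 2, 4], 0.0))  # smallest value
print(quantile_r7([3, 1, 2, 4], 0.5))  # halfway between 2 and 3
print(quantile_r7([3, 1, 2, 4], 1.0))  # largest value
```

 With this indexing, 0.0 maps to the first sorted value and 1.0 to the last
 one, so neither percentile falls outside of the values.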

 I also looked more closely at types `R-1` and `R-2` and figured that
 PostgreSQL's `PERCENTILE_DISC` is `R-1`, because it does not produce any
 averages. So, PostgreSQL implements `R-1` and `R-7`.
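
 For comparison, a rough sketch of `R-1` (the inverse of the empirical
 distribution function) as I understand it. It only ever returns actual
 sample values. The name `quantile_r1` is hypothetical, and I have not
 checked this against PostgreSQL's output.

```python
import math

def quantile_r1(values, p):
    """Inverse empirical CDF (R type 1): never interpolates."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    xs = sorted(values)
    # 1-based rank of the first value whose cumulative share reaches p;
    # p = 0.0 is mapped to the first value. Note that n * p is a float,
    # so rounding error could matter for some values of p.
    rank = max(math.ceil(len(xs) * p), 1)
    return xs[rank - 1]

print(quantile_r1([1, 2, 3, 4], 0.5))  # a sample value, not an average
```

 For the sample 1, 2, 3, 4 and percentile 0.5 this returns 2, whereas `R-7`
 would interpolate to 2.5.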

 (By the way, when I refer to them as `R-x`, that's mainly to simplify our
 discussion here. I'm happy to specify them with their formulas and only
 mention that they are what R defines as type x in their `quantile()`
 function.)

 Regarding R's `median()` function, that produces the same result as `R-2`
 and `R-7`, right?
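
 A quick way to check this is to specialize both types to percentile 0.5
 and compare them with a plain median. The sketch below uses Python's
 `statistics.median` as a stand-in for R's `median()`, and the helper names
 `r2_median` and `r7_median` are made up for this comparison.

```python
from statistics import median

def r2_median(values):
    """R type 2 at p = 0.5: average the two middle values if n is even."""
    xs = sorted(values)
    n = len(xs)
    if n % 2 == 0:
        return (xs[n // 2 - 1] + xs[n // 2]) / 2
    return xs[n // 2]

def r7_median(values):
    """R type 7 at p = 0.5: interpolate at position (n - 1) / 2."""
    xs = sorted(values)
    n = len(xs)
    h = (n - 1) / 2
    lo = (n - 1) // 2
    hi = min(lo + 1, n - 1)
    return xs[lo] + (h - lo) * (xs[hi] - xs[lo])

for sample in ([5, 1, 3], [5, 1, 3, 7]):
    # All three values agree for both the odd and the even sample size.
    print(median(sample), r2_median(sample), r7_median(sample))
```

 Both an odd and an even sample size are covered, because the even case is
 the one where averaging or interpolation comes into play.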

 I wonder if, for the sake of simplicity, we should avoid using
 `PERCENTILE_DISC` (which we're not using yet, AFAIK) and only use
 `PERCENTILE_CONT` and R's `median()`. That is, use `R-7` everywhere.

 I do agree that interpolation between two integers representing user
 numbers doesn't make as much sense. But we can always truncate or round
 results, if we believe that integers are less confusing.

 (I could imagine that if we were to compute percentiles of truly discrete
 variables like the number of tires mounted on trucks, we wouldn't want to
 return 7, but only actual sample values. I don't think that we need to
 worry about that here.)

 Regarding Apache Commons Math, we're not using that yet, and I don't feel
 strongly about whether we add it as a dependency or implement this quite
 simple function ourselves, say, in metrics-lib. Worth adding tests, I guess.

 Regarding Python, I'm amending my statement above a little bit. It's true
 that we're going to replace our last remaining Python code. Still, if we
 want to make our numbers reproducible, we'll have to accept that many of
 our users will want to reproduce them using Python. We should at least
 take a brief look at how this would work.
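
 As a first look, and assuming Python 3.8 or later: the standard library's
 `statistics.quantiles` with `method="inclusive"` treats the smallest value
 as the 0th and the largest as the 100th percentile, which as far as I can
 tell matches `R-7`. The wrapper name `quartiles_inclusive` is made up for
 this sketch.

```python
from statistics import quantiles

def quartiles_inclusive(values):
    """Return the three quartile cut points using the 'inclusive' method."""
    # n=4 asks for the cut points between four equal-probability groups.
    return quantiles(values, n=4, method="inclusive")

print(quartiles_inclusive([10, 20, 30, 40]))  # [17.5, 25.0, 32.5]
```

 Note that `statistics.quantiles` only returns the cut points between
 groups, so percentiles 0.0 and 1.0 would still have to be taken from the
 minimum and maximum directly.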

 So, your possible steps make sense. Is this something you'd like to work
 on?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online