[metrics-bugs] #26035 [Metrics/Statistics]: Streamline sample quantile types used in the various modules

Sat May 12 08:20:12 UTC 2018

#26035: Streamline sample quantile types used in the various modules
--------------------------------+---------------------------
 Reporter:  karsten             |          Owner:  iwakeh
     Type:  enhancement         |         Status:  accepted
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:  Sponsor13
--------------------------------+---------------------------

Comment (by karsten):

 Thanks, very useful! Let me first try to answer the open questions:

  - What's up with a) and c) using slightly different percentile
 implementations? The reason is that a) includes the 0th percentile
 (minimum) and the 100th percentile (maximum), whereas c) does not. It's
 entirely possible that what we're using right now for a) is a terrible
 hack. Maybe we should instead use the formula from c) in a) and handle
 percentiles 0 and 100 as special cases, or do whatever the other
 implementations do.

  - What's up with e) and f) not being quartiles? There we're computing
 ''weighted'' quartiles. And again, it might be a hack that we should
 rewrite. The goal should be to implement a weighted trimmed mean. The
 technical report probably has a better definition. What we cannot do,
 though, is use the exact same percentile definition that we're using in
 the other places.

  - I think you left out the Python code that is our current censorship
 detector. Which is fine, as I see how we could change that code to match
 what we're doing elsewhere.
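
 To make the ''weighted'' quartile point from the second item a bit more
 concrete, here is a minimal sketch of one possible definition: sort by
 value, accumulate weights, and return the first value whose cumulative
 weight reaches the requested fraction of the total. The function name and
 this particular definition are assumptions for illustration only; they are
 not taken from e), f), or the technical report.

```python
def weighted_quantile(values, weights, q):
    """Hypothetical weighted quantile: smallest value whose cumulative
    weight is at least q times the total weight (no interpolation)."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    # Sort observations by value, keeping each weight with its value.
    pairs = sorted(zip(values, weights))
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    threshold = q * total
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= threshold:
            return value
    # Only reachable through floating-point rounding; fall back to the maximum.
    return pairs[-1][0]
```

 With equal weights this behaves like a non-interpolating quantile; once one
 observation carries most of the weight, it dominates the result, which is
 why the unweighted definition cannot simply be reused.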

 So, I guess the decision we need to make is whether we want to use R-1 or
 R-7 everywhere, right?

 I'm slightly leaning towards R-7 here.
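
 For reference, here is a rough sketch of the two candidates in Python,
 following the usual textbook definitions of R-1 (inverse empirical CDF)
 and R-7 (linear interpolation between order statistics). The function
 names are made up for this example, and the sketch ignores the small
 floating-point fuzz that R applies, so treat it as an illustration rather
 than a drop-in implementation.

```python
import math


def quantile_r1(values, p):
    """R-1: inverse of the empirical CDF; always returns an observed value."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # 1-based rank ceil(n * p); p = 0 is mapped to the smallest element.
    rank = max(math.ceil(len(ordered) * p), 1)
    return ordered[rank - 1]


def quantile_r7(values, p):
    """R-7: linear interpolation between the two closest order statistics."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    # 0-based fractional position in the sorted list.
    h = (len(ordered) - 1) * p
    lower = math.floor(h)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (h - lower) * (ordered[upper] - ordered[lower])
```

 For the values 1, 2, 3, 4 the R-1 median is 2, whereas the R-7 median is
 2.5. In this formulation R-7 yields the minimum at p = 0 and the maximum at
 p = 1 without any extra branch.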

 One reason is that, if we used R-1, we couldn't use R's default `median()`
 anymore, because that interpolates. I found a non-interpolating median
 implementation in Python, called
 [https://docs.python.org/3/library/statistics.html#statistics.median_low
 median_low] (or median_high). And I think the Tor daemon uses a low median
 for some things related to directory authority voting. But I believe a low
 median is not the standard definition.
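
 A small illustration of the difference, using only Python's standard
 `statistics` module. The sample values are made up for this example.

```python
import statistics

even_sample = [1, 2, 3, 4]
odd_sample = [1, 2, 3]

# Interpolating median: averages the two middle values of an even-sized sample.
interpolated = statistics.median(even_sample)

# Non-interpolating variants: pick the lower or upper of the two middle values.
low = statistics.median_low(even_sample)
high = statistics.median_high(even_sample)

# For an odd-sized sample all three variants agree on the middle element.
odd_results = (
    statistics.median(odd_sample),
    statistics.median_low(odd_sample),
    statistics.median_high(odd_sample),
)

print(interpolated, low, high, odd_results)
```

 The variants only differ for even-sized samples, where the interpolating
 median can return a value that does not occur in the data.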

 So, if we use R-7, we should have good tool support.

 Except for Java, where we'd have to implement something ourselves, and
 that implementation would also have to handle the special cases 0 and 100.

 By the way, do you feel strongly about avoiding Apache Commons Math? We'd
 only have to add it to metrics-web, and it would save us half a day of
 writing code and testing it. After all, we also rely on libraries for
 things like base64 encoding, even though that would not be rocket science
 to implement ourselves. We wouldn't have to add it to the metrics-web .war file!

 P.S.: Did I write something about trucks? I meant insect legs! Unless
 those have a spare leg mounted somewhere, too, in which case I'll think
 even harder about a good example. ;)

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26035#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online