[metrics-bugs] #26022 [Metrics/Statistics]: Fix a flaw in the noise-removing code in our onion service statistics

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed May 16 14:00:04 UTC 2018


#26022: Fix a flaw in the noise-removing code in our onion service statistics
--------------------------------+------------------------------
 Reporter:  karsten             |          Owner:  metrics-team
     Type:  defect              |         Status:  needs_review
 Priority:  Medium              |      Milestone:
Component:  Metrics/Statistics  |        Version:
 Severity:  Normal              |     Resolution:
 Keywords:                      |  Actual Points:
Parent ID:                      |         Points:
 Reviewer:                      |        Sponsor:
--------------------------------+------------------------------

Comment (by karsten):

 Replying to [comment:10 amj703]:
 > > Hmm, I believe that disallowing negative values by rounding them up to
 zero would bias the result unnecessarily. For example, imagine an extreme
 case where we picked a large `bin_size` value and an even larger `delta_f`
 value: most relays would observe values between 0 and `bin_size`, and they
 would add very large positive or negative noise values. In such a case we
 need both negative and positive values to come up with a result close to
 0.
 >
 > Sure, that makes sense as a reason to try and produce unbiased estimates
 for each relay’s value. It seems to me that there is another way to do
 this, which would involve adding the relay values first and then adjusting
 the result, given that the noise has already been summed and thus has zero
 bias (aka an expectation of zero). But the current method seems to be
 working just fine!

 We could sum up relay values first and then adjust the result. However,
 we'd lose the ability to discard outliers, which we're doing extensively
 with onion service statistics. After all, we're throwing out 2 times 25%
 of reported values which we'd then include again.

 > > Hmm, I'm not so sure about this one, either. Remember that we
 implemented the code at relays to report the ''right side'' of a bin, not
 the ''center''. Take another example where a relay observes exactly 0
 events over two days: on day 1 it adds a small positive noise value and
 reports `bin_size` as number, and on day 2 it adds a small negative noise
 value and reports 0 as number. If we subtract `bin_size / 2` from those
 reported values, we'll end up with 0 on average, which would be correct.
 But if we didn't, we'd end up with `bin_size / 2` as average, which seems
 wrong.
 >
 > As another example, consider that on day 1 a positive noise value in
 (bin_size/2, bin_size] is added, which results in bin_size being reported,
 and on day 2 a large-magnitude negative value in [-bin_size, -bin_size/2)
 is reported, which results in -bin_size being reported. Then subtracting
 bin_size/2 from those reported values ends up with -bin_size/2 as the
 average, which seems wrong.

 Hang on. Relays always round ''up'' to the next multiple of `bin_size`.
 So, everything in `(-bin_size, 0]` will be reported as `0` and ''not'' as
 `-bin_size`.

 > I don’t think the “right side” rounding is happening with current use of
 the floor function, if it ever was. Maybe I’m wrong, but as I understand
 it Math.floorDiv((reportedNumber + binSize / 2) will round -0.75*binSize
 to -binSize.

 This part is correct. (The full "formula" is
 `Math.floorDiv((reportedNumber + binSize / 2), binSize) * binSize`.)

 > > Does this make sense? If so, I think I'd prefer to leave the general
 formula to remove noise unchanged and only focus on fixing the
 implementation bug related to integer truncation.
 >
 > It definitely makes sense to just focus on the immediate issue
 discovered. I was just thinking through how this was supposed to work
 again and thought I would let you know how it appeared to me.

 Sure! It's good to revisit our design decisions from a few years back and
 to fix any other flaws coming up.

 So, if the truncate/floor bug remains the only flaw in our current code,
 and when the reprocessed numbers are finally done here, I'll merge and
 update the numbers.

 Thanks so much for sharing your thoughts here!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/26022#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list