[tor-bugs] #18167 [Metrics]: Don't trust "bridge-ips" blindly for user number estimates

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Jan 27 19:38:37 UTC 2016


#18167: Don't trust "bridge-ips" blindly for user number estimates
-------------------------+-----------------
     Reporter:  karsten  |      Owner:
         Type:  defect   |     Status:  new
     Priority:  Medium   |  Milestone:
    Component:  Metrics  |    Version:
     Severity:  Major    |   Keywords:
Actual Points:           |  Parent ID:
       Points:           |    Sponsor:
-------------------------+-----------------
 I think I found a bug in the user number estimates that led to the
 [https://trac.torproject.org/projects/tor/ticket/13171#comment:14
 confusion on #13171].

 When I developed the
 [https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf algorithm for estimating user numbers],
 bridges only reported how many directory requests they responded
 to (`"dirreq-v3-resp"`), but not how these directory requests were
 distributed to countries (`"dirreq-v3-reqs"`).  What they did report was
 how many different IP addresses by country connected to the bridge
 (`"bridge-ips"`).  The goal back then was to provide better user numbers
 per country, so I put in the assumption that the geographic distributions
 of directory responses and connecting IP addresses would be roughly the
 same.  And I think that assumption is still valid for most cases.

 However, the meek version ''before'' the #13171 fix broke this assumption.
 Here's an example from a meek bridge that didn't have this fix yet
 (descriptor digest `462a2bcc..`):

 {{{
 extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
 published 2015-12-09 22:53:48
 dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
 bridge-ips de=16,cn=8,us=8
 }}}

 It's rather unlikely that 17656 responses were sent back to 32 IP
 addresses or fewer.  Still, following the assumption above, we're saying
 that half of those 17656 responses were sent back to Germany and one
 quarter each to China and the U.S.A., and that seems dangerously wrong.
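
 As a rough sketch of the assumption described above (not the actual metrics code), the following splits the `ok` responses across countries in proportion to the `"bridge-ips"` counts.  The function names `parse_counts` and `apportion_by_ips` are made up for this illustration; the numbers are the ones from the descriptor above.

```python
from fractions import Fraction


def parse_counts(value):
    """Parse a comma-separated 'key=count' list such as 'de=16,cn=8,us=8'."""
    counts = {}
    for item in value.split(","):
        key, _, number = item.partition("=")
        counts[key] = int(number)
    return counts


def apportion_by_ips(ok_responses, bridge_ips):
    """Split ok_responses across countries in proportion to bridge-ips.

    This encodes the assumption that directory responses and connecting
    IP addresses have the same geographic distribution.
    """
    total_ips = sum(bridge_ips.values())
    return {cc: ok_responses * Fraction(n, total_ips)
            for cc, n in bridge_ips.items()}


bridge_ips = parse_counts("de=16,cn=8,us=8")
shares = apportion_by_ips(17656, bridge_ips)
for cc, responses in shares.items():
    print(cc, float(responses))
# de 8828.0
# cn 4414.0
# us 4414.0
```

 With only 32 reported IP addresses, each group of 8 addresses moves a quarter of all responses from one country to another, which is why the result is so fragile here.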

 I'm going to attach a scatter plot in a minute,
 `dirreq-resp-by-bridge-ips-2016-01-27.png`, that plots the numbers of
 `"dirreq-v3-resp ok=..."` against `"bridge-ips"` for statistics reported
 between December 1, 2015 and last week.  The two meek bridges `88F7..` and
 `AA03..` stand out quite a bit there as clusters close to the y axis.

 I have a few possible fixes in mind.  The first part would be to ignore
 all statistics where the reported directory requests amount to, say, 10
 or more per unique IP address.  That would remove all dots to the left of
 the dashed line in the graph.
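
 A minimal sketch of that filter, assuming the threshold of 10 from the text and a hypothetical helper name `passes_sanity_check`.  The second call uses the `"bridge-ip-versions"` counts from the `MeekGoogle` descriptor further down as a rough stand-in for the total number of IP addresses, because the `"bridge-ips"` line is elided there; that substitution is an assumption of this example only.

```python
MAX_RESPONSES_PER_IP = 10  # "say, 10" in the text; not a tuned value


def passes_sanity_check(ok_responses, ip_counts,
                        max_per_ip=MAX_RESPONSES_PER_IP):
    """Return True if ok_responses is smaller than max_per_ip times the
    total number of reported IP addresses."""
    return ok_responses < max_per_ip * sum(ip_counts.values())


# UtahMeekBridge, before the #13171 fix: 17656 responses, 32 addresses.
print(passes_sanity_check(17656, {"de": 16, "cn": 8, "us": 8}))  # False

# MeekGoogle, after the fix: 22016 responses, 8752 + 64 addresses
# (taken from bridge-ip-versions as a stand-in, see above).
print(passes_sanity_check(22016, {"v4": 8752, "v6": 64}))  # True
```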

 The second part of the fix would be to stop combining
 `"dirreq-v3-resp"` and `"bridge-ips"` numbers and instead use the reported
 distributions of directory requests to countries (`"dirreq-v3-reqs"`).
 These were not available 3.5 years ago, but
 [https://trac.torproject.org/projects/tor/ticket/5824#comment:17 starting
 roughly 2 years ago], these statistics have been published by more and
 more bridges.

 Here's a descriptor (`fe171d40..`) that was published last week by the
 same bridge as above, now named `MeekGoogle`, after the meek-specific
 #13171 fix:

 {{{
 extra-info MeekGoogle 88F745840F47CE0C6A4FE61D827950B06F9E4534
 published 2016-01-22 13:11:10
 dirreq-v3-reqs us=7200,ru=1576,de=1520,[..],cn=88,[..]
 dirreq-v3-resp ok=22016,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6016,busy=0
 bridge-ips us=3016,ru=632,gb=536,de=528,[..],cn=40,[..]
 bridge-ip-versions v4=8752,v6=64
 bridge-ip-transports <OR>=8,meek=8808
 }}}

 I'm attaching a second scatter plot,
 `dirreq-resp-by-dirreq-reqs-2016-01-27.png`, that compares the numbers of
 `"dirreq-v3-resp ok=..."` to `"dirreq-v3-reqs"`.  The correlation is close
 to linear, which
 makes sense, because the number of directory requests should roughly match
 the number of directory responses.  I think we can make the user number
 estimates a bit more accurate by making this switch.  We would still fall
 back to `"bridge-ips"` if `"dirreq-v3-reqs"` is empty, but that would
 mostly affect older statistics.

 Part three of the plan would be to remove the `"bridge-ips"` line entirely
 from little-t-tor, because we wouldn't use it anymore.  It's worth noting
 that we'd lose the ability to filter out meek bridges that don't have the
 #13171 fix and that don't report usable `"dirreq-v3-reqs"` statistics.  Or
 rather, we wouldn't spot future meek-like bridges affected by a similar
 bug.

 Here's why.  The first bridge descriptor above also contained a
 `"dirreq-v3-reqs"` line that I left out before:

 {{{
 extra-info UtahMeekBridge 88F745840F47CE0C6A4FE61D827950B06F9E4534
 published 2015-12-09 22:53:48
 dirreq-v3-resp ok=17656,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=6160,busy=0
 dirreq-v3-reqs us=17648,cn=8
 bridge-ips de=16,cn=8,us=8
 }}}

 We wouldn't be able to filter out this bridge without the `"bridge-ips"`
 line.  We would have to assume that the vast majority of requests to this
 bridge came from the U.S.A., and a tiny minority from China.
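
 The arithmetic behind that conclusion can be checked with the numbers from the descriptor above.  This is only a worked example; the variable names are made up.

```python
# Values copied from the UtahMeekBridge descriptor above.
ok_responses = 17656
dirreq_reqs = {"us": 17648, "cn": 8}

total_reqs = sum(dirreq_reqs.values())
fractions = {cc: n / total_reqs for cc, n in dirreq_reqs.items()}

# In this descriptor the request total equals the "ok" response count.
print(total_reqs == ok_responses)  # True
for cc, fraction in fractions.items():
    print(f"{cc}: {fraction:.2%}")
# us: 99.95%
# cn: 0.05%
```

 Compare this with the 50/25/25 split between Germany, China, and the U.S.A. that the `"bridge-ips"` line of the same descriptor would suggest.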

 I think this is acceptable, because the purpose of statistics shouldn't be
 to validate the correctness of other statistics.

 To summarize my plan, here's what I'd like to do:

  1. If a bridge reports both a `"dirreq-v3-resp"` and a `"bridge-ips"`
 line, check if the number of `ok` responses is smaller than 10 times the
 total number of reported IP addresses; if not, ignore the
 directory-request statistics reported by this bridge.

  2. If a bridge only reports a `"bridge-ips"` line and no
 `"dirreq-v3-reqs"` line, assume that the country distributions are the
 same, which is what we're doing right now.

  3. If a bridge reports a `"dirreq-v3-reqs"` line, use that for user
 number estimates and ignore the `"bridge-ips"` line in case it's present.
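
 The three steps above could be sketched roughly as follows.  This is an illustration under stated assumptions, not the code used for the published estimates: the function name `country_fractions`, the use of `None` for "ignore this bridge", and the made-up numbers in the last two calls are all hypothetical.

```python
def country_fractions(ok_responses, bridge_ips=None, dirreq_reqs=None,
                      max_per_ip=10):
    """Return per-country fractions for one bridge, or None to ignore it.

    bridge_ips and dirreq_reqs map country codes to counts; pass None
    (or an empty dict) if the descriptor lacks the respective line.
    """
    # Step 1: sanity check, only possible if bridge-ips is present.
    if bridge_ips:
        if not ok_responses < max_per_ip * sum(bridge_ips.values()):
            return None
    # Step 3: prefer dirreq-v3-reqs; step 2: fall back to bridge-ips.
    source = dirreq_reqs if dirreq_reqs else bridge_ips
    if not source:
        return None
    total = sum(source.values())
    if total == 0:
        return None
    return {cc: n / total for cc, n in source.items()}


# UtahMeekBridge before the #13171 fix: dropped by step 1.
print(country_fractions(17656, {"de": 16, "cn": 8, "us": 8},
                        {"us": 17648, "cn": 8}))  # None

# Made-up bridge reporting both lines: dirreq-v3-reqs wins (step 3).
print(country_fractions(800, {"us": 64, "de": 32},
                        {"us": 600, "de": 200}))
# {'us': 0.75, 'de': 0.25}

# Made-up older bridge with only bridge-ips (step 2).
print(country_fractions(800, {"us": 64, "de": 64}))
# {'us': 0.5, 'de': 0.5}
```

 Note that a bridge reporting `"dirreq-v3-reqs"` but no `"bridge-ips"` line skips step 1 entirely in this sketch, which matches the trade-off described earlier for removing `"bridge-ips"`.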

 Hope this report was not too confusing.  Feedback much appreciated.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/18167>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

