[tor-bugs] #16555 [Metrics Website]: Make user statistics more robust against outliers

Tor Bug Tracker & Wiki blackhole at torproject.org
Sat Jul 11 15:50:19 UTC 2015


#16555: Make user statistics more robust against outliers
-----------------------------+---------------------
 Reporter:  karsten          |          Owner:
     Type:  defect           |         Status:  new
 Priority:  normal           |      Milestone:
Component:  Metrics Website  |        Version:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
-----------------------------+---------------------
 '''tl;dr:''' From June 11 to 13, 2015, the [https://metrics.torproject.org
 /userstats-bridge-country.html?graph=userstats-bridge-
 country&start=2015-06-01&end=2015-06-30&country=all number of bridge
 users] briefly went up from around 20k to 140k.  A closer investigation of
 the underlying data revealed that the aggregate statistics reported by a
 single bridge were responsible for this major spike.  The
 [https://research.torproject.org/techreports/counting-daily-bridge-
 users-2012-10-24.pdf estimation method used for user statistics] should be
 made robust against outliers, possibly by applying the more recently
 developed [https://research.torproject.org/techreports/extrapolating-
 hidserv-stats-2015-01-31.pdf techniques that are used to extrapolate
 hidden-service statistics].

 Here are more details about that single bridge reporting almost
 unbelievably high statistics: It's the bridge with nickname
 "solemnizersfiaun" and hashed fingerprint
 [https://globe.torproject.org/#/bridge/420C39C86B0E71F653E18552B28B9189DA2F1377
 420C39C86B0E71F653E18552B28B9189DA2F1377] that reported having served up
 to 80k users.  But from the bandwidth statistics it looks like that bridge
 actually answered a huge number of consensus requests during those days in
 June.  It pushed up to 20 MB/s, which is probably rather unusual for a
 bridge.  A closer look at the descriptor tells us that most of these bytes
 were used to answer directory requests.  (I didn't do the math on whether
 such a burst over a few hours would be sufficient to write 800k compressed
 consensuses.)  So, either the bridge is telling us the truth, or it's
 lying to us in a very sophisticated way.
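
 The math left open in the parenthesis above can be roughed out as follows.
 This is only a sketch: the 800k responses and the 20 MB/s peak come from
 the ticket text, while the compressed consensus size of about 500 kB is an
 assumption made for illustration and is not taken from the bridge's
 descriptors.  The helper name hours_to_serve is likewise hypothetical.

```python
# Back-of-the-envelope check; the consensus size below is an assumption,
# not a number taken from the bridge's descriptors.
RESPONSES = 800_000                 # compressed consensuses, from the ticket
RATE_BYTES_PER_SEC = 20 * 10**6     # peak of 20 MB/s, from the ticket
ASSUMED_CONSENSUS_BYTES = 500_000   # hypothetical compressed consensus size


def hours_to_serve(responses, size_bytes, rate_bytes_per_sec):
    """Return the hours needed to write `responses` documents at a constant rate."""
    total_bytes = responses * size_bytes
    return total_bytes / rate_bytes_per_sec / 3600


hours = hours_to_serve(RESPONSES, ASSUMED_CONSENSUS_BYTES, RATE_BYTES_PER_SEC)
print(f"{hours:.1f} hours at peak rate")
```

 Under these assumptions the bridge would need to sustain its peak rate for
 about 5.6 hours, so a burst over a few hours is at least the right order of
 magnitude.  A smaller or larger real consensus size shifts the result
 proportionally.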

 And it's not only that bridge that reported very high statistics in June.
 There's another bridge with nickname "Unnamed" and hashed fingerprint
 [https://globe.torproject.org/#/bridge/82F37B9A8400A1E0C0730D8E4639150AE11AC640
 82F37B9A8400A1E0C0730D8E4639150AE11AC640] that reported having served
 around 10k users on June 18 and 22.  Similarly, that bridge reported
 extremely high traffic during those days.  I didn't look for more bridges,
 but it's possible that others reported unusual numbers that
 didn't stand out as much as these.

 So, I'm not sure if we'll find out what exactly happened there, but it
 seems very unlikely that these directory requests were generated by
 actual human users.  That's why I think we should remove such outliers in
 our estimation method.
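
 One possible shape of such an outlier filter is sketched below.  It uses a
 simple median/MAD cutoff, which is a generic robust-statistics rule and not
 the technique from either of the linked tech reports.  The function name
 drop_outliers, the threshold of 10, and all sample numbers are hypothetical;
 a real implementation would also have to account for how long each bridge
 reported statistics.

```python
from statistics import median


def drop_outliers(reported, threshold=10.0):
    """Split per-bridge values into (kept, dropped) using a median/MAD rule.

    A value is dropped when it lies more than `threshold` scaled MADs
    above the median.  Only high outliers are removed.
    """
    if not reported:
        return {}, {}
    values = list(reported.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that a single huge
    # value cannot inflate the way it inflates a standard deviation.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against, so keep everything.
        return dict(reported), {}
    cutoff = med + threshold * 1.4826 * mad
    kept = {k: v for k, v in reported.items() if v <= cutoff}
    dropped = {k: v for k, v in reported.items() if v > cutoff}
    return kept, dropped


# Made-up per-bridge daily user counts; "bridgeE" plays the role of the spike.
sample = {"bridgeA": 120, "bridgeB": 90, "bridgeC": 150,
          "bridgeD": 110, "bridgeE": 80_000}
kept, dropped = drop_outliers(sample)
print(sum(kept.values()), sorted(dropped))
```

 With the made-up sample, the median is 120 and the MAD is 30, so only the
 80k value exceeds the cutoff and the remaining four bridges sum to 470.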

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/16555>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

