[metrics-team] New way of counting Direct Users per country

Fri May 6 13:11:06 UTC 2016

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi David,

thanks for working on improving our user number estimates!  And to be
clear, there's room for improvement.  So let's look at your
suggestions; comments inline.

On 05/05/16 18:16, David Goulet wrote:
> Hi everyone!
> 
> (TL;DR; I describe an algorithm for counting direct users using
> dirreq requests and ignoring dirreq bytes history.)
> 
> Last week, armadev and I wanted to reproduce the metrics results
> for the daily number of users per country[1]. We looked at the tech
> report[2] and the metrics code[3].
> 
> It seems that we are counting direct users (not bridge users) with
> the same technique as we use for bridges, that is, counting dirreqs
> and extrapolating using bytes history. For bridges, it makes sense
> (see why in the tech report) but for direct users, statistics come
> from relays in the consensus so _maybe_ there is a better approach
> for estimating the number of users per country.

By better, do you mean the algorithm would be simpler and hence more
intuitive, or that results would be more accurate, or both, or
something else?

> I'll be describing what armadev and I came up with. Maybe it's
> crazy, maybe some pieces are missing, maybe it's not at all better
> than what metrics does. This is why I'm writing this email, to see
> if all this makes sense.
> 
> 1) For each relay, we'll compute the BW fraction for the dirreq
> stats period (dirreq-stats-end) for the interval.

So, rather than directory-request byte histories you're using total
transferred byte histories?  Why would that be better (and I'm asking,
because I assume there's a reason I'm overlooking, not to imply that
it's not better)?  And what about relays that don't act as directory?

> We make a bandwidth average for that period (average of all bw
> values of a relay in that interval). We then divide that value by
> the total bandwidth at that time in the network: R1_bw / (R1_bw +
> R2_bw + ... + Rn_bw) ...where n is the total number of relays in
> the network.
> 
> We'll split that value in two in case the time period overlaps
> between two days (it actually happens all the time). Here is a
> great ascii art!! showing you the dirreq stats period P between the
> 4th and 5th of some month:
> 
> 4                      5                      6
> +----------|-----------+----------|-----------+
> P:     ^_______________.______^
> 
> For the period P, we have 16 hours on the 4th and 8 hours on the
> 5th so using our BW fraction for a relay, we can split that
> fraction in two fractions for each day.
> 
> 2) For each relay, we count all requests per country using
> "dirreq-v3-reqs" from the extra-info document. Since the period
> overlaps two days, we need to split it in two as well, as in step 1).
> For instance, if we have 32 clients for "ao" then on the 4th we
> have (32 - 4) * (16/24) and on the 5th, we end up with (32 - 4) *
> (8/24).
> 
> (See technical report on why 4 is subtracted here[4])
> 
> 3) At this step, for each relay reporting dirreq-v3-reqs stats, we
> have a BW fraction per day, basically the chance of being picked by
> a client. We also have a count per country code of clients seen per
> day as well.
> 
> For a relay, take the per-country per-day client number, divide it
> by the bw fraction and then divide it by 10 (again see tech report
> on why but basically we estimate a client, over 24h, will do
> between 8 and 12 directory requests). Summing up that value for each
> relay gives us the final client number for that country.
> 
> For the 4th: R1: (cc-users[4th] / bw_fraction[4th]) / 10

That would be for reports by one relay, right?  How do you combine all
reports for one day?
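
The per-relay arithmetic in steps 1) to 3) can be written down as a
minimal Python sketch. It is one reading of the description above, not
the PoC script: the function names, the 08:00 start time, and the
bandwidth fraction of 0.001 are made up for illustration, and it leaves
open both how step 1) splits the bandwidth fraction across days and how
reports from several relays are combined.

```python
from datetime import datetime, timedelta


def split_by_day(start, end):
    """Map each UTC date to the share of the period [start, end) on that date."""
    total = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        next_midnight = datetime(cursor.year, cursor.month, cursor.day) + timedelta(days=1)
        chunk_end = min(end, next_midnight)
        shares[cursor.date()] = (chunk_end - cursor).total_seconds() / total
        cursor = chunk_end
    return shares


def relay_estimate(reported_reqs, day_share, bw_fraction,
                   reqs_per_client=10, rounding_offset=4):
    """Estimate clients for one country and day from ONE relay's report.

    reported_reqs: per-country value from "dirreq-v3-reqs"
    day_share: share of the stats period that falls on the day in question
    bw_fraction: the relay's share of total network bandwidth (assumed given)
    """
    reqs_on_day = (reported_reqs - rounding_offset) * day_share
    return reqs_on_day / bw_fraction / reqs_per_client


# Hypothetical 24-hour stats period with 16 hours on the 4th, 8 on the 5th.
start = datetime(2016, 3, 4, 8, 0)
end = start + timedelta(hours=24)
shares = split_by_day(start, end)

# 32 reported requests for "ao"; bw_fraction=0.001 is an invented value.
estimate_4th = relay_estimate(32, shares[start.date()], bw_fraction=0.001)
estimate_5th = relay_estimate(32, shares[end.date()], bw_fraction=0.001)
print(shares, estimate_4th, estimate_5th)
```

With these inputs the request counts per day come out as (32 - 4) *
(16/24) and (32 - 4) * (8/24), matching the "ao" example in step 2).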

> Relays already have the number of clients they've seen per country
> so the approach here is super simple, take advantage of that and
> extrapolate using the relay weight during the stats period.
> 
> Maybe this is overly simplistic, maybe it's been thought out before.
> However, the results are the interesting part. For March 5th of
> 2016, here are the two estimates for the "de" country, from
> metrics[5] and this algorithm:
> 
> Metrics: 2016-03-05,relay,de,,,158830,204819,183596,71 --> 183596
> is the number of estimated clients.
> 
> Email: 2016-03-05 - 95745 estimated clients.
> 
> As you can see, the difference is almost half! I ran the numbers
> for other smaller countries and we are closer to what metrics says
> usually with countries < 10k users. For instance Iran "ir":
> 
> Metrics: 2016-03-05,relay,ir,,,5748,7971,7044,71 --> 7044 is the
> number of estimated clients.
> 
> Email: 2016-03-05 - 5913 estimated clients.

Glad to see you implemented your approach, because that requires
thinking it through to the end.  But I can't say much about your new
numbers and whether they look more correct than the current numbers.
Let's postpone the evaluation until we have a clear understanding of
which parts need improvement and how we would be able to improve them.
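
For reference, the size of the gap between the two sets of numbers
quoted above can be computed with a few lines of Python. The column
names are an assumption about the layout of clients.csv (only the
position of the client estimate is stated in the quoted text), and
reading the two columns before it as a lower and upper bound is also
an assumption.

```python
import csv
import io

# Assumed column names for clients.csv; only "clients" is confirmed above.
FIELDS = ["date", "node", "country", "transport", "version",
          "lower", "upper", "clients", "frac"]

# The two metrics rows and the two new estimates quoted in this thread.
metrics_rows = """2016-03-05,relay,de,,,158830,204819,183596,71
2016-03-05,relay,ir,,,5748,7971,7044,71"""
new_estimates = {"de": 95745, "ir": 5913}

ratios = {}
within_range = {}
for row in csv.DictReader(io.StringIO(metrics_rows), fieldnames=FIELDS):
    country = row["country"]
    new_value = new_estimates[country]
    # Ratio of the new estimate to the metrics estimate.
    ratios[country] = new_value / int(row["clients"])
    # Whether the new estimate lies between the assumed bounds.
    within_range[country] = int(row["lower"]) <= new_value <= int(row["upper"])
    print(country, round(ratios[country], 3), within_range[country])
```

The ratio for "de" is a little over one half, as noted above, while
the one for "ir" is much closer to 1.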

> Now, lots and lots might have gone wrong above with my PoC script
> or with the algorithm itself, so this is why I would like the
> metrics team to pinpoint obvious issues with the algorithm and
> maybe suggest a way to improve it! At least it's out there now :).

Here are the two things in the current algorithm that need improvement
most:

 1. Ideally, to estimate the fraction of directory requests seen by a
relay, we wouldn't have to rely on data that relays publish in
extra-info descriptors and that is, as a result, not vetted by the
directory authorities.  It would be much better to only rely on the
consensus (and maybe fallback consensuses) to come up with that fraction.

 2. When combining reports by relays we should extrapolate numbers
similar to how we do it for hidden|onion-service statistics.  That is,
we extrapolate all numbers, remove outliers, and aggregate the
remaining extrapolations to get our result.  Right now we're including
each and every report, and that makes our algorithm less robust
against liars than it should be.  I still have some notes here from a
discussion with Ian for making the extrapolation better, but I didn't
have the chance to implement them.  Maybe this is a good opportunity.
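
To illustrate the idea in 2., here is a minimal Python sketch of
"extrapolate, remove outliers, aggregate".  It is not the code used
for the hidden|onion-service statistics and it does not include the
notes mentioned above; the weighted interquartile mean is just one
simple choice of outlier removal, and the function name, parameters,
and example reports are hypothetical.

```python
def robust_total(reports, lower_q=0.25, upper_q=0.75):
    """Aggregate per-relay reports into one network-wide estimate.

    reports: list of (observed_count, fraction_seen) pairs, where
    fraction_seen is the share of all requests the relay should have seen.
    Returns None if no report has a positive fraction.
    """
    # Step 1: extrapolate every report to a network total.
    usable = sorted((count / frac, frac) for count, frac in reports if frac > 0)
    total_weight = sum(frac for _, frac in usable)
    lo, hi = lower_q * total_weight, upper_q * total_weight

    # Steps 2 and 3: keep only the weight lying between the two
    # quantiles and take the weighted mean of what remains.
    kept_sum = kept_weight = cumulative = 0.0
    for extrapolated, weight in usable:
        overlap = max(0.0, min(cumulative + weight, hi) - max(cumulative, lo))
        kept_sum += extrapolated * overlap
        kept_weight += overlap
        cumulative += weight
    return kept_sum / kept_weight if kept_weight > 0 else None


# Four consistent reports and one inflated report (invented numbers).
reports = [(100, 0.01)] * 4 + [(100000, 0.01)]
naive = sum(c for c, _ in reports) / sum(f for _, f in reports)
robust = robust_total(reports)
print(naive, robust)
```

In this example the single inflated report dominates the naive
estimate, while the trimmed estimate stays at the value the other four
reports agree on.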

Again, thanks for looking into this!

All the best,
Karsten

> 
> Thanks! David
> 
> [1] https://metrics.torproject.org/userstats-relay-country.html
> [2] https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf
> [3] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/init-userstats.sql
> [4] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/src/org/torproject/metrics/clients/Main.java#n132
> [5] https://metrics.torproject.org/stats/clients.csv

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBAgAGBQJXLJfqAAoJEC3ESO/4X7XB+HwH/1BWSsI5UqJbae50sR7gTTWB
xBA9esgEuMBYH8ZCmhZQd4zTbKplWR1czk3uHdWKZsu7/pJJiF1dPQwdSnTNdsl1
76GoyX/BLYs68fFdmulRINptUWAPwZIMu8gHMbnxD4gReuCN1NC1PU9hKUXj7eDj
//JoN4BouX1YtVth1T4eoKbjY0gDsKdUDPkdMZ7yqWDnyEiksZMGHHdnQx05M0Du
YKmTSNGNbWwqv33HMXV9cw4bK8KUL68nU7xaUMZ2CP1QehH238atgjyTR6rUBOmD
07wnbYQk2tXdFn9+6UsFolYihH5YJVUxEWJUJ5OyW+bSH2zolrsY3N59n09WknY=
=DbEy
-----END PGP SIGNATURE-----