[metrics-team] New way of counting Direct Users per country

David Goulet dgoulet at torproject.org
Thu May 5 16:16:00 UTC 2016


Hi everyone!

(TL;DR; I describe an algorithm for counting direct users using dirreq requests
and ignoring dirreq bytes history.)

Last week, armadev and I wanted to reproduce the metrics results for the daily
number of users per country[1]. We looked at the tech report[2] and the metrics
code[3].

It seems that we are counting direct users (not bridge users) using the same
technique as we do for bridges, that is, counting dirreqs and extrapolating
using bytes history. For bridges, it makes sense (see why in the tech report)
but for direct users, the statistics come from relays in the consensus so
_maybe_ there is a better approach to estimating the number of users per
country.

I'll be describing what armadev and I came up with. Maybe it's crazy, maybe
some pieces are missing, maybe it's not at all better than what metrics does.
This is why I'm writing this email, to see if all this makes sense.

1) For each relay, we'll compute the BW fraction for the dirreq stats period,
   that is the interval given by dirreq-stats-end. We average the relay's
   bandwidth over that period (average of all bw values of the relay in that
   interval). We then divide that value by the total bandwidth in the network
   at that time:
        R1_bw / (R1_bw + R2_bw + ... + Rn_bw)
   ...where n is the total number of relays in the network.

   We'll split that value in two in case the time period spans two days (it
   actually happens all the time). Here is some great ASCII art!! showing you
   the dirreq stats period P between the 4th and 5th of some month:

        4                      5                      6
        +----------|-----------+----------|-----------+
            P: ^_______________.______^

   For the period P, we have 16 hours on the 4th and 8 hours on the 5th so,
   using our BW fraction for a relay, we can split that fraction into two
   fractions, one for each day.

2) For each relay, we count all requests per country using "dirreq-v3-reqs"
   from the extra-info document. Since the period spans two days, we need to
   split the counts in two as well, like in step 1). For instance, if we have
   32 clients for "ao" then on the 4th we have (32 - 4) * (16/24) and on the
   5th, we end up with (32 - 4) * (8/24).

   (See the technical report on why 4 is subtracted here[4])

3) At this step, for each relay reporting dirreq-v3-reqs stats, we have a BW
   fraction per day, basically the chance of being picked by a client. We also
   have a per-day count of clients seen per country code.

   For a relay, take the per-country per-day client number, divide it by the BW
   fraction and then divide it by 10 (again see the tech report on why, but
   basically we estimate that a client, over 24h, will do between 8 and 12
   directory requests). Summing up that value for each relay gives us the final
   client number for that country.

   For the 4th:
        R1: (cc-users[4th] / bw_fraction[4th]) / 10
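
A rough Python sketch of the three steps above, for a single relay. All
function names are made up for illustration, timestamps are assumed to be
naive UTC datetimes, and the relay's period-wide BW fraction is used as the
per-day fraction in step 3, which is only one possible reading of the
description. Treat it as a sketch, not as what the PoC script or metrics does.

```python
from datetime import date, datetime, timedelta


def parse_dirreq_v3_reqs(line):
    """Parse 'dirreq-v3-reqs ao=32,de=1056' into {'ao': 32, 'de': 1056}."""
    _, _, body = line.strip().partition(" ")
    pairs = (item.split("=") for item in body.split(",") if item)
    return {cc: int(n) for cc, n in pairs}


def split_by_day(start, end):
    """Return {day: share of the [start, end) period that falls on that day}."""
    total = (end - start).total_seconds()
    shares = {}
    cursor = start
    while cursor < end:
        # Midnight following `cursor`, capped at the end of the period.
        midnight = datetime.combine(cursor.date(), datetime.min.time())
        chunk_end = min(midnight + timedelta(days=1), end)
        shares[cursor.date()] = (chunk_end - cursor).total_seconds() / total
        cursor = chunk_end
    return shares


def bw_fraction(relay_bw, all_relay_bws):
    """Step 1: the relay's average bandwidth over the total network bandwidth."""
    return relay_bw / sum(all_relay_bws)


def requests_per_day(reqs, day_shares):
    """Step 2: subtract 4 from each country count, then split it across days."""
    return {day: {cc: (n - 4) * share for cc, n in reqs.items()}
            for day, share in day_shares.items()}


def relay_estimate(cc_requests, bw_fraction_for_day):
    """Step 3: (cc-users / bw_fraction) / 10 for one relay, country and day."""
    return (cc_requests / bw_fraction_for_day) / 10


# Made-up example: one relay holding 1% of the network bandwidth whose stats
# period covers 16 hours of the 4th and 8 hours of the 5th.
shares = split_by_day(datetime(2016, 3, 4, 8), datetime(2016, 3, 5, 8))
per_day = requests_per_day(parse_dirreq_v3_reqs("dirreq-v3-reqs ao=32"), shares)
# Assumption: the period-wide fraction stands in for bw_fraction[4th].
fraction = bw_fraction(100, [100, 9900])
print(relay_estimate(per_day[date(2016, 3, 4)]["ao"], fraction))
```

The per-country total for a day would then be the sum of relay_estimate()
over every relay that reported stats for that day.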

Relays already have the number of clients they've seen per country, so the
approach here is super simple: take advantage of that and extrapolate using the
relay weight during the stats period.

Maybe this is overly simplistic, maybe it's been thought of before. However,
the results are the interesting part. For March 5th of 2016, here are the two
estimates for the "de" country, from metrics[5] and this algorithm:

    Metrics: 2016-03-05,relay,de,,,158830,204819,183596,71
        --> 183596 is the number of estimated clients.

    Email: 2016-03-05 - 95745 estimated clients.

As you can see, the new estimate is almost half of what metrics reports! I ran
the numbers for other, smaller countries and we are usually closer to what
metrics says for countries with < 10k users. For instance Iran "ir":

    Metrics: 2016-03-05,relay,ir,,,5748,7971,7044,71
        --> 7044 is the number of estimated clients

    Email: 2016-03-05 - 5913 estimated clients.
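
To put the two comparisons side by side, here is a tiny Python snippet that
computes this email's estimate as a share of the metrics estimate. The numbers
are copied from the lines above; the variable names are only illustrative.

```python
# (country, metrics estimate, this email's estimate) for 2016-03-05.
estimates = [("de", 183596, 95745), ("ir", 7044, 5913)]

# Share of the metrics estimate that the new algorithm arrives at.
ratios = {cc: email / metrics for cc, metrics, email in estimates}

for cc, ratio in ratios.items():
    print(f"{cc}: {ratio:.2f} of the metrics estimate")
# de: 0.52 of the metrics estimate
# ir: 0.84 of the metrics estimate
```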

Now, lots and lots might have gone wrong above with my PoC script, or there may
be issues in the algorithm itself, so I would like the metrics team to pinpoint
obvious issues with the algorithm and maybe suggest ways to improve it! At
least it's out there now :).

Thanks!
David

[1] https://metrics.torproject.org/userstats-relay-country.html
[2] https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf
[3] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/init-userstats.sql
[4] https://gitweb.torproject.org/metrics-web.git/tree/modules/clients/src/org/torproject/metrics/clients/Main.java#n132
[5] https://metrics.torproject.org/stats/clients.csv