[metrics-team] Where do the by-transport counts come from in userstats-bridge-combined?

Wed May 31 03:00:59 UTC 2017

On Mon, May 29, 2017 at 10:04:25AM +0200, Karsten Loesing wrote:
> On 24.05.17 23:15, David Fifield wrote:
> > I am referring to https://bugs.torproject.org/19544:
> >> What we could also do as first approximation is find a lower and upper
> >> bound of users by country and transport. The lower bound would
> >> probably be defined as something like max(0, PT + CC - 1) (not just 0
> >> to account for cases where CC > 1 - PT) and the upper bound as min(PT,
> >> CC), even though I could be convinced that other formulas are even
> >> more correct.
> > 
> > I thought I understood this but I guess I do not. Does PT come from
> > bridge-ip-transports and CC from dirreq-v3-reqs? That wouldn't make
> > sense to me, because they are measuring different things. Or is it that
> > CC is still using bridge-ips (I don't know the current status of that;
> > see https://bugs.torproject.org/18167).
> 
> We're still using dirreq-v3-resp and either one of the bridge-ip* lines
> combined to get the number of requests per country or transport or IP
> version.

Okay--I think I see. It's as covered in Section 5 "Breaking down to user
numbers by country" in
https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf
	"We sum up unique IP addresses and calculate a fraction of IP
	addresses for every country and day."

So if I understand correctly, suppose we had
	dirreq-v3-resp ok=96,not-enough-sigs=0,unavailable=0,not-found=0,not-modified=8,busy=0
	bridge-ips aa=24,bb=24,cc=24,dd=24
	bridge-ip-transports obfs3=8,obfs4=32
	bridge-ip-versions v4=8,v6=8
Then of the bridge's 96 total responses, we would say that
	25% (24/96) were from country aa, 25% from bb, 25% from cc, 25% from dd
	20% (8/40) were using obfs3, 80% obfs4
	50% (8/16) were using IPv4, 50% using IPv6
In other words, the by-country, by-transport, and by-version numbers of
responses are assumed to be proportional to the corresponding numbers of
unique IP addresses.
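
To make the arithmetic concrete, here is a rough Python sketch of that
proportional split, using the made-up example lines above. The helper
names (parse_kv, shares) are hypothetical and are not taken from the
metrics code; this only mirrors my reading of the tech report.

```python
def parse_kv(line):
    """Split 'keyword k1=v1,k2=v2' into (keyword, {k1: int, k2: int})."""
    keyword, _, rest = line.strip().partition(" ")
    counts = {}
    for item in rest.split(","):
        key, _, value = item.partition("=")
        counts[key] = int(value)
    return keyword, counts


def shares(counts):
    """Fraction of the total attributed to each key (empty if total is 0)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: value / total for key, value in counts.items()}


# The hypothetical descriptor lines from the example above.
lines = [
    "dirreq-v3-resp ok=96,not-enough-sigs=0,unavailable=0,"
    "not-found=0,not-modified=8,busy=0",
    "bridge-ips aa=24,bb=24,cc=24,dd=24",
    "bridge-ip-transports obfs3=8,obfs4=32",
    "bridge-ip-versions v4=8,v6=8",
]
stats = dict(parse_kv(line) for line in lines)

# Assumption: only the "ok" responses are split up.
total_responses = stats["dirreq-v3-resp"]["ok"]

# Split the response total in proportion to unique IP addresses.
estimates = {
    name: {key: frac * total_responses
           for key, frac in shares(stats[name]).items()}
    for name in ("bridge-ips", "bridge-ip-transports", "bridge-ip-versions")
}

for name, breakdown in estimates.items():
    print(name, breakdown)
```

Each breakdown is normalized by the sum of its own line, so the three
denominators (96, 40, and 16 here) do not need to agree with each other.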

When you say you are "still" using dirreq-v3-resp and either one of the
bridge-ip* lines, is that because there now exists dirreq-v3-reqs, which
breaks down the countries by number of directory requests, rather than
number of unique IP addresses? (I.e., the subject of #18167.) If I'm not
mistaken, there's no counterpart to dirreq-v3-reqs for transport and IP
version, so even if dirreq-v3-reqs were used for countries, it would
still be necessary to combine dirreq-v3-resp and bridge-ip-transports or
bridge-ip-versions for transports and IP versions.

> If you think the result will be interesting for Metrics website
> visitors, would you want to start working on a similar patch?

My immediate goal is just to be able to compare the IPv4 and IPv6 usage
of a single bridge, one of the default obfs4 bridges, the only one that
has an IPv6 address: https://bugs.torproject.org/22429. I was originally
going to ask the operator to use a separate fingerprint for IPv4 and
IPv6, to make it easier, but then I thought that it would be possible to
get bounds using the existing statistics. It looks like for this, all I
have to do is look at the ratio of v4 and v6 in bridge-ip-versions.
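
For the case where two breakdowns do have to be combined (say, country
and transport), the bounds quoted from #19544 at the top of this message
could be sketched as below. The function name joint_bounds is
hypothetical, and the inputs are the fractions from my made-up example,
not real data.

```python
from fractions import Fraction


def joint_bounds(pt, cc):
    """Bounds on the share of users matching both a transport and a country.

    pt and cc are fractions in [0, 1]. The lower bound is
    max(0, PT + CC - 1) and the upper bound is min(PT, CC), as proposed
    in the quoted ticket text.
    """
    lower = max(Fraction(0), pt + cc - 1)
    upper = min(pt, cc)
    return lower, upper


# Fractions from the made-up example: obfs4 is 32/40, country aa is 24/96.
pt_obfs4 = Fraction(32, 40)
cc_aa = Fraction(24, 96)

low, high = joint_bounds(pt_obfs4, cc_aa)
print(f"obfs4 users in aa: between {float(low):.2%} and {float(high):.2%}")
```

Fraction is used only to keep the arithmetic exact; plain floats would
work too, apart from small rounding noise.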

And, it looks like Onionoo already does what I was thinking of:
https://onionoo.torproject.org/clients?fingerprint=D9C805C955CB124D188C0D44F271E9BE57DE2109
	"versions":{"v4":0.9999944}