On 17/4/22 2:16, David Fifield wrote:
I am trying to reproduce the "frac" computation from the Reproducible Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
Which is also Section 3 in the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012...
       h(R∧H) * n(H) + h(H) * n(R\H)
frac = -----------------------------
                h(H) * n(N)
My minor goal is to reproduce the "frac" column from the Metrics web site (which I assume is the same as the frac above, expressed as a percentage):
https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&...

date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
I'm having trouble with the computation of n(R\H) and h(R∧H). I understand that R is the subset of relays that report directory request counts (i.e. that have dirreq-stats-end in their extra-info descriptors) and H is the subset of relays that report directory request byte counts (i.e. that have dirreq-write-history in their extra-info descriptors). R and H partially overlap: there are relays that are in R but not H, others that are in H but not R, and others that are in both.
The computations depend on some values that are directly from descriptors:
    n(R) = sum of hours, for relays with directory request counts
    n(H) = sum of hours, for relays with directory write histories
    h(H) = sum of written bytes, for relays with directory write histories
Compute n(R\H) as the number of hours for which responses have been reported but no written directory bytes. This fraction is determined by summing up all interval lengths and then subtracting the written directory bytes interval length from the directory response interval length. Negative results are discarded.
I interpret this to mean: add up all the dirreq-stats-end intervals (this is n(R)), add up all the dirreq-write-history intervals (this is n(H)), and compute n(R\H) as n(R) − n(H). This seems wrong: it would only be true when H is a subset of R.
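A minimal sketch of why the subtraction only holds when H is a subset of R. The relay names and hour counts are invented for illustration, both function names are hypothetical (not taken from the metrics code), and "negative results are discarded" is assumed to mean clamping to zero:

```python
# Hypothetical per-relay reporting hours for one day (illustrative values only).
# r_hours: hours covered by dirreq-stats-end intervals (membership in R)
# h_hours: hours covered by dirreq-write-history intervals (membership in H)
r_hours = {"relayA": 24, "relayB": 24, "relayC": 0}
h_hours = {"relayA": 24, "relayB": 0, "relayC": 24}


def n_r_minus_h_by_subtraction(r_hours, h_hours):
    """n(R minus H) as n(R) - n(H) over the totals, clamped at zero (assumption)."""
    diff = sum(r_hours.values()) - sum(h_hours.values())
    return max(diff, 0)


def n_r_minus_h_per_relay(r_hours, h_hours):
    """n(R minus H) relay by relay: hours in R not matched by hours in H."""
    return sum(max(r_hours[k] - h_hours.get(k, 0), 0) for k in r_hours)


subtraction = n_r_minus_h_by_subtraction(r_hours, h_hours)
per_relay = n_r_minus_h_per_relay(r_hours, h_hours)
print(subtraction)  # 0: relayC's H-only hours cancel relayB's R-only hours
print(per_relay)    # 24: relayB reports requests but no byte history
```

The two results agree only when no relay contributes hours to H without also contributing them to R. The per-relay variant itself assumes that a relay's two kinds of intervals overlap as much as possible, which the descriptors may not guarantee.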
Compute h(R∧H) as the number of written directory bytes for the fraction of time when a server was reporting both written directory bytes and directory responses. As above, this fraction is determined by first summing up all interval lengths and then computing the minimum of both sums divided by the sum of reported written directory bytes.
This seems to be saying to compute h(R∧H) (a count of bytes) as min(n(R), n(H)) / h(H). This is dimensionally wrong: the units are hours / bytes. What would be more natural to me is min(n(R), n(H)) / max(n(R), n(H)) × h(H); i.e., divide the smaller of n(R) and n(H) by the larger, then multiply this ratio by the observable byte count. But this, too, only works when H is a subset of R.
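The alternative proposed here can be written down as a short sketch. The function name and the sample numbers are hypothetical; this illustrates the interpretation in this message, not what the metrics code is known to do:

```python
def h_r_and_h(n_r, n_h, h_h):
    """Bytes attributed to time when both statistics were reported.

    Scales h(H) by min(n(R), n(H)) / max(n(R), n(H)), so the result is
    (hours / hours) * bytes = bytes.
    """
    larger = max(n_r, n_h)
    if larger == 0:
        return 0.0  # no reporting time at all, so no bytes to attribute
    return min(n_r, n_h) / larger * h_h


# Illustrative values: 12 hours of request counts, 24 hours of byte history,
# 1000 bytes written in total.
example = h_r_and_h(12, 24, 1000)
print(example)  # 500.0
```

Unlike min(n(R), n(H)) / h(H), this has units of bytes, but it still treats the shorter total as if it were fully contained in the longer one.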
Where is this computation done in the metrics code? I would like to refer to it, but I could not find it.
Using the formulas and assumptions above, here's my attempt at computing recent "frac" values:
date       `n(N)`  `n(H)`  `h(H)`  `n(R)` `n(R\H)` `h(R∧H)`  frac
2022-04-01 166584 177638. 2.24e13 125491.        0  1.59e13 0.753
2022-04-02 166951 177466. 2.18e13 125686.        0  1.54e13 0.753
2022-04-03 167100 177718. 2.27e13 127008.        0  1.62e13 0.760
2022-04-04 166970 177559. 2.43e13 126412.        0  1.73e13 0.757
2022-04-05 166729 177585. 2.44e13 125389.        0  1.72e13 0.752
2022-04-06 166832 177470. 2.39e13 127077.        0  1.71e13 0.762
2022-04-07 166532 177210. 2.48e13 127815.        0  1.79e13 0.768
2022-04-08 167695 176879. 2.52e13 127697.        0  1.82e13 0.761
The "frac" column does not match the CSV. Also notice that n(N) < n(H), which should be impossible because H is supposed to be a subset of N (N is the set of all relays). But this is what I get when I estimate n(N) from a network-status-consensus-3 and n(H) from extra-info documents. Also notice that n(R) < n(H), which means that H cannot be a subset of R, contrary to the observations above.
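As a cross-check, the 2022-04-01 row of the table can be reproduced from its rounded inputs. This is a sketch under the interpretations discussed in this message (n(R\H) as a clamped difference of totals, h(R∧H) as h(H) scaled by min/max); the function name is hypothetical and none of this is taken from the metrics SQL:

```python
def frac_estimate(n_n, n_h, h_h, n_r):
    """frac under the interpretations discussed in this message (assumptions)."""
    n_r_minus_h = max(n_r - n_h, 0)                      # negatives discarded
    h_r_and_h = min(n_r, n_h) / max(n_r, n_h) * h_h      # min/max scaling of h(H)
    return (h_r_and_h * n_h + h_h * n_r_minus_h) / (h_h * n_n)


# Rounded inputs from the 2022-04-01 row: n(N), n(H), h(H), n(R).
frac = frac_estimate(n_n=166584, n_h=177638, h_h=2.24e13, n_r=125491)
print(round(frac, 3))  # 0.753
```

Whenever n(R) < n(H), these formulas reduce to frac = n(R) / n(N): h(H) cancels and n(H) drops out. The computed column therefore cannot reach the 91 to 93 percent in the CSV unless n(R) or n(N) is estimated differently.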
Hi David,
These computations are a bit hidden in the metrics code. Specifically, they are in the website repository, in the SQL init scripts.
This is the view that is responsible for computing the data that are then published in the csv:
https://gitlab.torproject.org/tpo/network-health/metrics/website/-/blob/mast...
Personally, I am not sure what the rationale behind this was. I will go through the SQL and the Reproducible Metrics page myself and give you an answer.
Meanwhile I have opened an issue to track this: https://gitlab.torproject.org/tpo/network-health/analysis/-/issues/35
tor-dev mailing list tor-dev@lists.torproject.org https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev