On Sat, Apr 16, 2022 at 06:16:23PM -0600, David Fifield wrote:
I am trying to reproduce the "frac" computation from the Reproducible Metrics instructions:
https://metrics.torproject.org/reproducible-metrics.html#relay-users
The same computation also appears as Section 3 of the tech report on counting bridge users:
https://research.torproject.org/techreports/counting-daily-bridge-users-2012...
       h(R∧H) * n(H) + h(H) * n(R\H)
frac = -----------------------------
                h(H) * n(N)
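As a minimal sketch, the formula above can be written as a small Python function. The function and parameter names are hypothetical (they are not taken from the Tor Metrics code), and the sample inputs are the rounded 2022-04-01 values from the tables later in this message, so the result only agrees with the tabulated "frac" to about two decimal places.

```python
def frac(h_r_and_h, n_h, h_h, n_r_minus_h, n_n):
    """Evaluate frac = (h(R∧H) * n(H) + h(H) * n(R\\H)) / (h(H) * n(N)).

    Parameter names are illustrative only:
      h_r_and_h    h(R∧H), bytes written by relays in both R and H
      n_h          n(H), hours for relays with directory write histories
      h_h          h(H), bytes written by relays in H
      n_r_minus_h  n(R\\H), hours for relays in R but not in H
      n_n          n(N), the network-wide total used as the denominator
    """
    denominator = h_h * n_n
    if denominator == 0:
        raise ValueError("h(H) and n(N) must both be nonzero")
    return (h_r_and_h * n_h + h_h * n_r_minus_h) / denominator


# Rounded 2022-04-01 inputs from the two tables in this message.
frac_formula_based = frac(1.59e13, 177638.0, 2.24e13, 0.0, 166584)
frac_intersection_based = frac(1.96e13, 177638.0, 2.24e13, 90.9, 166584)
print(f"{frac_formula_based:.3f} {frac_intersection_based:.3f}")
```

Because the inputs are rounded to three significant figures, small differences from the tabulated values are expected.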
My minor goal is to reproduce the "frac" column from the Metrics web site (which I assume is the same as the frac above, expressed as a percentage):
https://metrics.torproject.org/userstats-relay-country.csv?start=2022-04-01&...
date,country,users,lower,upper,frac
2022-04-01,,2262557,,,92
2022-04-02,,2181639,,,92
2022-04-03,,2179544,,,93
2022-04-04,,2350360,,,93
2022-04-05,,2388772,,,93
2022-04-06,,2356170,,,93
2022-04-07,,2323184,,,93
2022-04-08,,2310170,,,91
I'm having trouble with the computation of n(R\H) and h(R∧H). I understand that R is the subset of relays that report directory request counts (i.e. that have dirreq-stats-end in their extra-info descriptors) and H is the subset of relays that report directory request byte counts (i.e. that have dirreq-write-history in their extra-info descriptors). R and H partially overlap: there are relays that are in R but not H, others that are in H but not R, and others that are in both.
The computations depend on some values that are directly from descriptors:
    n(R) = sum of hours, for relays with directory request counts
    n(H) = sum of hours, for relays with directory write histories
    h(H) = sum of written bytes, for relays with directory write histories
...
Using the formulas and assumptions above, here's my attempt at computing recent "frac" values:
date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
2022-04-01  166584  177638.  2.24e13  125491.  0       1.59e13  0.753
2022-04-02  166951  177466.  2.18e13  125686.  0       1.54e13  0.753
2022-04-03  167100  177718.  2.27e13  127008.  0       1.62e13  0.760
2022-04-04  166970  177559.  2.43e13  126412.  0       1.73e13  0.757
2022-04-05  166729  177585.  2.44e13  125389.  0       1.72e13  0.752
2022-04-06  166832  177470.  2.39e13  127077.  0       1.71e13  0.762
2022-04-07  166532  177210.  2.48e13  127815.  0       1.79e13  0.768
2022-04-08  167695  176879.  2.52e13  127697.  0       1.82e13  0.761
I tried computing n(R\H) and h(R∧H) from the definitions, rather than by using the formulas in the Reproducible Metrics guide. This achieves an almost matching "frac" column, though it is still about 1% too high.
date        n(N)    n(H)     h(H)     n(R)     n(R\H)  h(R∧H)   frac
2022-04-01  166584  177638.  2.24e13  125491.  90.9    1.96e13  0.930
2022-04-02  166951  177466.  2.18e13  125686.  181.    1.92e13  0.937
2022-04-03  167100  177718.  2.27e13  127008.  154.    2.00e13  0.942
2022-04-04  166970  177559.  2.43e13  126412.  134.    2.14e13  0.936
2022-04-05  166729  177585.  2.44e13  125389.  94.6    2.15e13  0.938
2022-04-06  166832  177470.  2.39e13  127077.  162.    2.11e13  0.940
2022-04-07  166532  177210.  2.48e13  127815.  102.    2.18e13  0.938
2022-04-08  167695  176879.  2.52e13  127697.  158.    2.21e13  0.926
I got this by taking an explicit set intersection between the R and H time intervals. So, for example, if the intervals making up n(R) and n(H) are (with their lengths):
n(R)    [---10---]  [----12----]          [---9---]
n(H)         [----12----]    [------16------]        [--7--]
Then the intersection n(R∧H) is:
n(R∧H)       [-5-]  [-5-]    [3]          [3]
h(R∧H) comes from pro-rating the n(H) intervals, each of which is associated with an h(H) byte count. Suppose the [----12----] interval represents 1000 bytes. Then each of the [-5-] intervals that result from it in the intersection is worth 5/12 × 1000 = 417 bytes.
We get n(R\H) from n(R) − n(R∧H):
n(R\H)  [-5-]            [4-]                [-6--]
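The interval arithmetic in this example can be sketched in Python as follows. This is an illustration under stated assumptions, not the code that produced the tables: the start and end positions are hypothetical values chosen only to reproduce the interval lengths in the diagrams, the 1600 and 700 byte counts for the second and third n(H) intervals are made up (only the 1000 bytes of the first is from the example), and the function name is invented here.

```python
def intersect(r_intervals, h_intervals):
    """Return a sorted list of (start, end, h_index) overlaps between R and H.

    r_intervals: list of (start, end) pairs.
    h_intervals: list of (start, end, byte_count) triples.
    A simple quadratic scan; adequate for a small illustration.
    """
    overlaps = []
    for r_start, r_end in r_intervals:
        for index, (h_start, h_end, _byte_count) in enumerate(h_intervals):
            start, end = max(r_start, h_start), min(r_end, h_end)
            if start < end:
                overlaps.append((start, end, index))
    return sorted(overlaps)


# Hypothetical positions that give R lengths 10, 12, 9 and H lengths 12, 16, 7.
R = [(0, 10), (12, 24), (34, 43)]
# Byte counts: 1000 is from the example; 1600 and 700 are assumed.
H = [(5, 17, 1000), (21, 37, 1600), (45, 52, 700)]

overlaps = intersect(R, H)
lengths = [end - start for start, end, _ in overlaps]

n_r = sum(end - start for start, end in R)
n_r_and_h = sum(lengths)
n_r_minus_h = n_r - n_r_and_h  # n(R\H) = n(R) - n(R∧H)

# Pro-rate each H interval's bytes by the share of it that overlaps R.
h_r_and_h = 0.0
for start, end, index in overlaps:
    h_start, h_end, byte_count = H[index]
    h_r_and_h += (end - start) / (h_end - h_start) * byte_count

print(lengths, n_r_and_h, n_r_minus_h, round(h_r_and_h, 1))
```

Each of the two 5-hour overlaps with the 12-hour interval receives 5/12 of its 1000 bytes, matching the 417-byte figure above.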
This seems overall more correct, though it required a more elaborate computation than the Reproducible Metrics guide prescribes. I'm still not sure why it does not match exactly, and I would still appreciate a pointer to where Tor Metrics does the "frac" computation.
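To put a number on the remaining mismatch, the short sketch below compares the published "frac" column with the values from the second table. Both lists are copied from this message; the variable names are illustrative. Note that the published column is rounded to a whole percent, so each difference carries up to half a percentage point of rounding uncertainty.

```python
# Published frac (percent) from userstats-relay-country.csv, 2022-04-01..08.
published_percent = [92, 92, 93, 93, 93, 93, 93, 91]
# frac from the interval-intersection computation, same dates.
computed_fraction = [0.930, 0.937, 0.942, 0.936, 0.938, 0.940, 0.938, 0.926]

# Difference in percentage points, computed minus published.
differences = [
    100 * computed - published
    for computed, published in zip(computed_fraction, published_percent)
]
mean_difference = sum(differences) / len(differences)

for difference in differences:
    print(f"{difference:+.1f}")
print(f"mean: {mean_difference:+.2f} percentage points")
```

Every difference is positive and the mean is close to one percentage point, consistent with the "about 1% too high" observation.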
I was initially interested in this for the purpose of better estimating the number of Snowflake users. But now I've decided "frac" is not useful for that purpose: since there is only one bridge we care about, it does not make sense to adjust the numbers to account for other bridges that may not report the same set of statistics. I don't plan to take this investigation any further for the time being, but here is source code to reproduce the above tables. You will need:
https://collector.torproject.org/archive/relay-descriptors/consensuses/conse...
https://collector.torproject.org/archive/relay-descriptors/extra-infos/extra...
    ./relay_uptime.py consensuses-2022-04.tar.xz > relay_uptime.csv
    ./relay_dir.py extra-infos-2022-04.tar.xz > relay_dir.csv
    ./frac.py relay_uptime.csv relay_dir.csv