[tor-commits] [metrics-web/master] Estimate bridge users by country based on requests.

karsten at torproject.org karsten at torproject.org
Sun Apr 12 13:43:46 UTC 2020


commit 999874057e462a99885e77a380e93bf0b23a3e1d
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Wed Mar 25 17:28:54 2020 +0100

    Estimate bridge users by country based on requests.
    
    Estimate bridge users by country based on requests by country, if
    available, to get more accurate numbers than those obtained from
    unique IP address counts.
    
    Fixes #18167.
---
 CHANGELOG.md                                          |  3 +++
 .../org/torproject/metrics/stats/clients/Main.java    |  7 +++++--
 src/main/resources/web/jsps/reproducible-metrics.jsp  | 19 ++++++++++++-------
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index fa77766..817aafa 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -10,6 +10,9 @@
    - Estimate relay users by country based on responses to directory
      requests to reduce the overall effect of binning and to make
      relay and bridge user estimates more comparable.
+   - Estimate bridge users by country based on requests by country, if
+     available, to get more accurate numbers than those obtained from
+     unique IP address counts.
 
  * Minor changes
    - Make Jetty host configurable.
diff --git a/src/main/java/org/torproject/metrics/stats/clients/Main.java b/src/main/java/org/torproject/metrics/stats/clients/Main.java
index 9dc7d8c..f26250d 100644
--- a/src/main/java/org/torproject/metrics/stats/clients/Main.java
+++ b/src/main/java/org/torproject/metrics/stats/clients/Main.java
@@ -260,6 +260,7 @@ public class Main {
     parseBridgeDirreqV3Resp(fingerprint, publishedMillis,
         dirreqStatsEndMillis, dirreqStatsIntervalLengthMillis,
         descriptor.getDirreqV3Resp(),
+        descriptor.getDirreqV3Reqs(),
         descriptor.getBridgeIps(),
         descriptor.getBridgeIpTransports(),
         descriptor.getBridgeIpVersions());
@@ -272,6 +273,7 @@ public class Main {
       long publishedMillis, long dirreqStatsEndMillis,
       long dirreqStatsIntervalLengthMillis,
       SortedMap<String, Integer> responses,
+      SortedMap<String, Integer> requests,
       SortedMap<String, Integer> bridgeIps,
       SortedMap<String, Integer> bridgeIpTransports,
       SortedMap<String, Integer> bridgeIpVersions) throws SQLException {
@@ -301,7 +303,8 @@ public class Main {
         database.insertIntoImported(fingerprint, "bridge", "responses", "", "",
             "", fromMillis, toMillis, resp * intervalFraction);
         parseBridgeRespByCategory(fingerprint, fromMillis, toMillis, resp,
-            dirreqStatsIntervalLengthMillis, "country", bridgeIps);
+            dirreqStatsIntervalLengthMillis, "country",
+            null != requests ? requests : bridgeIps);
         parseBridgeRespByCategory(fingerprint, fromMillis, toMillis, resp,
             dirreqStatsIntervalLengthMillis, "transport",
             bridgeIpTransports);
@@ -331,7 +334,7 @@ public class Main {
     /* If we're not told any frequencies, or at least none of them are
      * greater than 4, put in a default that we'll attribute all responses
      * to. */
-    if (total == 0) {
+    if (frequenciesCopy.isEmpty()) {
       switch (category) {
         case "country":
           frequenciesCopy.put("??", 4.0);
diff --git a/src/main/resources/web/jsps/reproducible-metrics.jsp b/src/main/resources/web/jsps/reproducible-metrics.jsp
index 209cb5b..922e458 100644
--- a/src/main/resources/web/jsps/reproducible-metrics.jsp
+++ b/src/main/resources/web/jsps/reproducible-metrics.jsp
@@ -198,10 +198,11 @@ As above, refer to the <a href="/bridge-descriptors.html">Tor bridge descriptors
 
 <p>Parse the <code>"dirreq-write-history"</code> line containing written bytes spent on answering directory requests. If the contained statistics end time is more than 1 week older than the descriptor publication time in the <code>"published"</code> line, skip this line to avoid including statistics in the aggregation that have very likely been reported in earlier descriptors and processed before. If a statistics interval spans more than 1 UTC date, split observations to the covered UTC dates by assuming a linear distribution of observations.</p>
 
-<p>Parse the <code>"dirreq-stats-end"</code> and <code>"dirreq-v3-resp"</code> lines containing directory-request statistics.
+<p>Parse the <code>"dirreq-stats-end"</code>, <code>"dirreq-v3-resp"</code>, and <code>"dirreq-v3-reqs"</code> lines containing directory-request statistics.
 If the statistics end time in the <code>"dirreq-stats-end"</code> line is more than 1 week older than the descriptor publication time in the <code>"published"</code> line, skip these directory request statistics for the same reason as given above: to avoid including statistics in the aggregation that have very likely been reported in earlier descriptors and processed before.
 Also skip statistics with an interval length other than 1 day.
-Parse successful requests from the <code>"ok"</code> part of the <code>"dirreq-v3-resp"</code> line. Subtract <code>4</code> to undo the binning operation that has been applied by the bridge. Discard the resulting number if it's zero or negative.
+Parse successful requests from the <code>"ok"</code> part of the <code>"dirreq-v3-resp"</code> line, subtract <code>4</code> to undo the binning operation that has been applied by the bridge, and discard the resulting number if it's zero or negative.
+Parse successful requests by country from the <code>"dirreq-v3-reqs"</code> line, subtract <code>4</code> from each number to undo the binning operation that has been applied by the bridge, and discard the resulting number if it's zero or negative.
 Split observations to the covered UTC dates by assuming a linear distribution of observations.</p>
 
 <p>Parse the <code>"bridge-ips"</code>, <code>"bridge-ip-versions"</code>, and <code>"bridge-ip-transports"</code> lines containing unique connecting IP addresses by country, IP version, and transport. From each number of unique IP addresses, subtract 4 to undo the binning operation that has been applied by the bridge. Discard the resulting number if it's zero or negative.</p>
@@ -210,9 +211,15 @@ Split observations to the covered UTC dates by assuming a linear distribution of
 
 <h4>Step 3: Approximate directory requests by country, transport, and IP version</h4>
 
-<p>Bridges, unlike relays, do not report directory request numbers by country, transport, or IP version.
-However, bridges do report unique IP address counts by country, by transport, and by IP version.
-We approximate directory request numbers by multiplying the fraction of unique IP addresses from a given country, transport, or IP version with the total number of successful requests.</p>
+<p>Older bridges did not report directory requests by country but only total requests and unique IP address counts by country.
+In that case we approximate directory requests by country by multiplying the total number with the fraction of unique IP addresses from a given country.
+For newer bridges that do report directory requests by country we still take total requests as starting point and multiply with the fraction of requests by country.
+Otherwise, if we had used directory requests by country directly, totals by country, transport, and IP version would not match.
+If a bridge reports neither directory requests by country nor unique IP addresses by country, we attribute all requests to "??" which stands for Unknown Country.</p>
+
+<p>Bridges do not report directory requests by transport or IP version.
+We approximate these numbers by multiplying the total number of requests with the fraction of unique IP addresses by transport or IP version.
+If a bridge does not report unique IP addresses by transport or IP version, we attribute all requests to the default onion-routing protocol or to IPv4, respectively.</p>
 
 <p>As a special case, we also approximate lower and upper bounds for directory requests by country <em>and</em> transport.
 This approximation is based on the fact that most bridges only provide a small number of transports.
@@ -223,8 +230,6 @@ This allows us to combine unique IP address sets by country and by transport and
 <li>We calculate the upper bound as <code>min(C(b), T(b))</code> with the definitions from above. Reasoning: There cannot be more requests by country and transport than there are requests by either of the two numbers.
 </ul>
 
-<p>If a bridge does not report unique IP addresses by country, transport, or IP version, we attribute all requests to "??" which stands for Unknown Country, to the default onion-routing protocol, or to IPv4.</p>
-
 <h4>Step 4: Estimate fraction of reported directory-request statistics</h4>
 
 <p>The step for estimating the fraction of reported directory-request statistics is pretty much the same for bridges and for relays.
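
The by-country approximation documented in this patch can be illustrated outside the metrics-web code base. The following is a minimal sketch, not the project's implementation: the class name BridgeCountrySketch, the method approximateByCountry, the constant BIN_CORRECTION, and the placeholder weight of 1.0 for "??" are assumptions made for illustration, and splitting observations across UTC dates is left out. It prefers requests by country over unique IP addresses by country, subtracts 4 from each number, discards results that are zero or negative, and distributes the total in proportion to what remains.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical standalone sketch; not part of metrics-web.
class BridgeCountrySketch {

  // Value subtracted from each reported number to undo binning,
  // as described in reproducible-metrics.jsp.
  static final double BIN_CORRECTION = 4.0;

  // Distributes totalRequests over countries. Either map may be null.
  static SortedMap<String, Double> approximateByCountry(double totalRequests,
      SortedMap<String, Integer> requests,
      SortedMap<String, Integer> bridgeIps) {
    // Same preference as in the patch: requests by country if reported,
    // unique IP addresses by country otherwise.
    SortedMap<String, Integer> frequencies =
        null != requests ? requests : bridgeIps;
    SortedMap<String, Double> weights = new TreeMap<>();
    double weightSum = 0.0;
    if (null != frequencies) {
      for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
        double weight = e.getValue() - BIN_CORRECTION;
        if (weight > 0.0) {
          // Keep only numbers that stay positive after the correction.
          weights.put(e.getKey(), weight);
          weightSum += weight;
        }
      }
    }
    if (weights.isEmpty()) {
      // Nothing usable reported: attribute everything to Unknown Country.
      // The weight 1.0 is an arbitrary placeholder chosen for this sketch.
      weights.put("??", 1.0);
      weightSum = 1.0;
    }
    SortedMap<String, Double> result = new TreeMap<>();
    for (Map.Entry<String, Double> e : weights.entrySet()) {
      result.put(e.getKey(), totalRequests * e.getValue() / weightSum);
    }
    return result;
  }

  public static void main(String[] args) {
    SortedMap<String, Integer> requests = new TreeMap<>();
    requests.put("de", 12);
    requests.put("us", 28);
    // 64 total requests split by the corrected weights 8 and 24.
    System.out.println(approximateByCountry(64.0, requests, null));
  }
}
```

Because the result is always scaled to the total number of requests, the per-country values sum to the same total that the transport and IP version breakdowns use, which is the reason the documentation gives for not using requests by country directly.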