[tor-bugs] #19544 [Metrics/Metrics website]: Add graph on bridge users by country and transport

Mon Jul 18 17:55:01 UTC 2016

#19544: Add graph on bridge users by country and transport
-------------------------------------+------------------------------
 Reporter:  karsten                  |          Owner:  karsten
     Type:  enhancement              |         Status:  needs_review
 Priority:  Medium                   |      Milestone:
Component:  Metrics/Metrics website  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:                           |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+------------------------------

Comment (by iwakeh):

 It is a lot of code ...
 Hope the following makes some sense.

 === SQL
 Still looks fine.

 Just two typos, and I'd really like to have explicit column aliases in the
 select-stmts and group-by-clauses.  The latter improves readability and
 safeguards in case of later column reordering.  The former makes the SQL
 independent of postgresql's default column naming scheme.  The following
 diff is just to give an example as I didn't run it on a db:

 {{{
 diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-
 userstats.sql
 index 9d39c5e..a369575 100644
 --- a/modules/clients/init-userstats.sql
 +++ b/modules/clients/init-userstats.sql
 @@ -373,7 +373,7 @@ BEGIN
            WHERE id = last_id;

        -- If the new entry ends right when an existing entry starts and
 -      -- there's no gap between when the previousl processed entry ends
 +      -- there's no gap between when the previously processed entry ends
        -- and when the new entry starts, merge the new entry with the
 other
        -- two entries.  This happens by deleting the previous entry and
        -- expanding the subsequent entry to cover all three entries.
 @@ -538,7 +538,7 @@ BEGIN
      WHERE aggregated.date = aggregated_bytes_responses.date
      AND aggregated.node = aggregated_bytes_responses.node;

 -  -- Update results based on notes reporting responses but no bytes.
 +  -- Update results based on nodes reporting responses but no bytes.
    UPDATE aggregated
      SET nrh = aggregated_responses_bytes.nrh
      FROM (
 @@ -582,7 +582,7 @@ BEGIN
    -- meant to re-compute this query multiple times.
    CREATE TEMPORARY TABLE update2 AS
      SELECT fingerprint, country, transport,
 -           DATE(stats_start), SUM(val) AS val
 +           DATE(stats_start) as stats_date, SUM(val) AS val
      FROM merged
      WHERE node = 'bridge'
      AND metric = 'responses'
 @@ -590,7 +590,7 @@ BEGIN
      -- Note: Comment out the following condition to initialize table!
      AND DATE(stats_start) IN (
          SELECT DISTINCT DATE(stats_start) FROM imported)
 -    GROUP BY 1, 2, 3, 4;
 +    GROUP BY fingerprint, country, transport, stats_date;

    -- Delete all entries from the combined table that we're about to
    -- re-compute.
 @@ -603,7 +603,8 @@ BEGIN
    -- transport.  These response numbers will later be transformed into
    -- user number estimates in the combined view.
    INSERT INTO combined_country_transport
 -    SELECT country.date, country.country, transport.transport,
 +    SELECT country.date as date, country.country as country,
 +           transport.transport as transport,
             SUM(GREATEST(0, transport.val + country.val - total.val))
               AS low,
             SUM(LEAST(transport.val, country.val)) AS high
 @@ -623,7 +624,7 @@ BEGIN
      AND country.fingerprint = transport.fingerprint
      AND country.fingerprint = total.fingerprint
      AND transport.fingerprint = total.fingerprint
 -    GROUP BY 1, 2, 3;
 +    GROUP BY date, country, transport;

    -- We're done combining new data.
    RAISE NOTICE '% Finishing combine step.', timeofday();
 }}}

 The following comments for stmts not touched by this change seem
 confusing, unless I remove the 'of seconds'.
 Does that make sense?

 {{{
 diff --git a/modules/clients/init-userstats.sql b/modules/clients/init-
 userstats.sql
 index 9d39c5e..66edbcc 100644
 --- a/modules/clients/init-userstats.sql
 +++ b/modules/clients/init-userstats.sql
 @@ -119,7 +119,7 @@ CREATE TABLE aggregated (
    -- tech report.
    rrx DOUBLE PRECISION NOT NULL DEFAULT 0,

 -  -- Total number of seconds of nodes reporting responses, possibly
 broken
 +  -- Total number of nodes reporting responses, possibly broken
    -- down by country, transport, or version if either of them is not ''.
    -- This would be referred to as n(R) in the tech report, though it's
 not
    -- used there.
 @@ -128,7 +128,7 @@ CREATE TABLE aggregated (
    -- Total number of reported bytes.  See h(H) in the tech report.
    hh DOUBLE PRECISION NOT NULL DEFAULT 0,

 -  -- Total number of seconds of nodes in the status.  See n(N) in the
 tech
 +  -- Total number of nodes in the status.  See n(N) in the tech
    -- report.
    nn DOUBLE PRECISION NOT NULL DEFAULT 0,

 @@ -136,11 +136,11 @@ CREATE TABLE aggregated (
    -- bytes.  See h(R intersect H) in the tech report.
    hrh DOUBLE PRECISION NOT NULL DEFAULT 0,

 -  -- Number of seconds of nodes reporting bytes.  See n(H) in the tech
 +  -- Number of nodes reporting bytes.  See n(H) in the tech
    -- report.
    nh DOUBLE PRECISION NOT NULL DEFAULT 0,

 -  -- Number of seconds of nodes reporting responses but no bytes.  See
 +  -- Number of nodes reporting responses but no bytes.  See
    -- n(R \ H) in the tech report.
    nrh DOUBLE PRECISION NOT NULL DEFAULT 0
  );
 }}}

 === R
 Why are the top three transports decided by the /difference/ of upper and
 lower bound?
 Or, did I miss-read the script?
 Here the excerpt from the [https://gitweb.torproject.org/karsten/metrics-
 web.git/tree/website/rserve/graphs.R?h=task-19544-2&id=10e4e26c23dd2e1ff402e141f8316f3f9a6fa72f#n935
 R script]:
 {{{
     a <- aggregate(list(mid = u$high - u$low),
                    by = list(transport = u$transport), FUN = sum)
     a <- a[order(a$mid, decreasing = TRUE)[1:top], ]
 }}}
 Shouldn't it be highest upper bound and ties decided by highest lower
 bound?

 === Data explanation
 (reply to comment 7) Maybe something like the following (which can be
 improved):

 '''low:''' Lower bound of users by country and transport, calculated as
 sum over all bridges having reports for the given country and transport,
 i.e., the sum of M(b), where for each bridge '''b''' define '''M(b) :=
 max(0, C(b) + T(b) - S(b))''' using the following definitions:
 * '''C(b)''' is the number of users from a given country reported by
 '''b'''
 * '''T(b)''' is the number of users using a given transport reported by
 '''b'''
 * '''S(b)''' is the total numbers of responses reported by '''b'''
 Reasoning: If the sum C(b) + T(b) exceeds the total number of users from
 all countries and transports '''S(b)''', there must be users from that
 country and transport.  And if that is not the case, 0 is the lower limit.

 '''high:'''  Upper bound of users by country and transport, calculated  as
 sum over all bridges having reports for the given country and transport,
 i.e., the sum of m(b), where for each bridge '''b''' define
 '''m(b):=min(C(b), T(b))''' where we use the definitions from '''low'''
 (above).
 Reasoning: there cannot be more users by country and transport than there
 are users by either of the two numbers.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/19544#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online