[tor-dev] Export BridgeDB usage statistics

Wed Apr 24 00:50:02 UTC 2019

Hi Karsten,

I'm working on <https://bugs.torproject.org/9316>, which will make
BridgeDB export usage statistics.  I would like these statistics to be
public, privacy-preserving, and -- ideally -- added to Tor Metrics.  I
wanted to hear your thoughts on 1) what statistics we should collect,
2) how we can collect these statistics safely, and 3) what format these
statistics should have.

Broadly speaking, these statistics should answer the following
questions:

  * How many requests does BridgeDB see per day?
  * What obfuscation protocols are the most popular?
  * What bridge distribution mechanisms are the most popular?
  * From what countries do we see the most bridge requests?
  * How many BridgeDB requests fail and succeed, respectively?
  * How many requests does BridgeDB see from Yahoo/Gmail/Riseup?
  * How many HTTPS requests are coming from proxies?
  * How many requests are suspicious, and likely issued by bots?

Each request to BridgeDB carries with it some information, which allows
us to answer the above questions.  I suggest that we collect the
following:

  * The distribution mechanism.  Currently, this is HTTPS, email, or
    Moat.

  * The requested transport.  Currently, this is vanilla, fte, obfs3,
    obfs4, or scramblesuit.

  * The request's origin.  For Moat and HTTPS, it's the two-letter
    country code, e.g., IT for Italy.  For email, it's the user's email
    domain (Gmail, Yahoo, or Riseup).

  * Whether the request was successful or unsuccessful, i.e., resulted
    in BridgeDB handing out bridges or not.

  * Whether the request was issued by a user or a bot.
    David suggested heuristics that would allow us to estimate whether
    a request came from a bot:
    <https://bugs.torproject.org/9316#comment:19>.  I like these
    suggestions, but I'm not sure yet how to encode them -- it's more
    complex than a simple binary flag.
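
To make the list above concrete, here is a minimal Python sketch of how
one request could be mapped to a bucket and counted in memory.  The
names RequestKey, record, and counts are hypothetical and not part of
BridgeDB, and the bot state is reduced to a boolean purely for
illustration, since its real encoding is still open.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestKey:
    """Hypothetical bucket key; field names are illustrative only."""
    mechanism: str  # "https", "email", or "moat"
    transport: str  # "vanilla", "fte", "obfs3", "obfs4", or "scramblesuit"
    origin: str     # country code (HTTPS/Moat) or email domain (email)
    success: bool   # True if BridgeDB handed out bridges
    is_bot: bool    # simplification: the real encoding is undecided


# One counter per export interval; keys are buckets, values are raw counts.
counts = Counter()


def record(key):
    """Count one request in the bucket identified by key."""
    counts[key] += 1


record(RequestKey("https", "obfs4", "it", True, False))
record(RequestKey("https", "obfs4", "it", True, False))
record(RequestKey("email", "vanilla", "riseup.net", False, False))
```

Because the key is frozen (and therefore hashable), only buckets that
actually saw a request ever appear in the counter.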

Combining these dimensions yields ~16,800 buckets (3 mechanisms * 5
transports * ~280 ISO country codes * 2 success states * 2 bot
states).  We only need to export non-empty buckets.  To protect users
whose request is the only one in a given bucket (e.g., there may be
only one user in Turkmenistan who successfully requested an FTE bridge
over HTTPS on 2019-04-02), we should bin the statistics by rounding
each count up to the next multiple of, say, 10.  We should also export
statistics infrequently -- maybe once a day.
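
As a rough sketch of the arithmetic and the binning step, assuming a
bin size of 10 and assuming that a count which is already an exact
multiple stays unchanged (the function name bin_count is hypothetical):

```python
BIN_SIZE = 10  # assumed bin size; the "say, 10" from the text

# Upper bound on the number of buckets, using the approximate figures above.
total_buckets = 3 * 5 * 280 * 2 * 2  # mechanisms * transports * countries
                                     # * success states * bot states


def bin_count(count, bin_size=BIN_SIZE):
    """Round a positive count up to the next multiple of bin_size.

    Empty buckets stay at 0 because they are not exported anyway.
    """
    if count <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding issues.
    return -(-count // bin_size) * bin_size


print(total_buckets)  # 16800
print(bin_count(1))   # 10
print(bin_count(11))  # 20
```

With this rounding, a bucket containing a single request is exported
with the same count as a bucket containing ten requests.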

Here's an example of a simple CSV format that takes into account the
above:

  timestamp,mechanism,transport,country|domain,success,count,origin
  1555977600,https,vanilla,it,successful,40,user
  1555977600,https,obfs4,ca,unsuccessful,10,user
  1555977600,email,vanilla,yahoo.com,successful,50,user
  ...
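
A minimal sketch of how such a file could be written with Python's
standard csv module.  The column names are copied from the example
header above; the function name export_csv and the choice to skip
empty buckets at this point are assumptions for illustration only.

```python
import csv
import io

# Column order as in the example header above.
HEADER = ["timestamp", "mechanism", "transport", "country|domain",
          "success", "count", "origin"]
COUNT_COLUMN = HEADER.index("count")


def export_csv(rows):
    """Serialize already-binned rows to CSV text, skipping empty buckets."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if row[COUNT_COLUMN] > 0:
            writer.writerow(row)
    return buf.getvalue()


rows = [
    (1555977600, "https", "vanilla", "it", "successful", 40, "user"),
    (1555977600, "https", "obfs4", "ca", "unsuccessful", 10, "user"),
    (1555977600, "email", "vanilla", "yahoo.com", "successful", 50, "user"),
]
output = export_csv(rows)
print(output, end="")
```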

What are your thoughts?

Thanks,
Philipp