[tor-dev] Export BridgeDB usage statistics
teor at riseup.net
Wed Apr 24 02:58:36 UTC 2019
Hi Philipp, Karsten,
> On 24 Apr 2019, at 10:50, Philipp Winter <phw at nymity.ch> wrote:
> I'm working on <https://bugs.torproject.org/9316>, which will make
> BridgeDB export usage statistics. I would like these statistics to be
> public, privacy-preserving, and -- ideally -- added to Tor Metrics. I
> wanted to hear your thoughts on 1) what statistics we should collect,
> 2) how we can collect these statistics safely, and 3) what format these
> statistics should have.
> Broadly speaking, these statistics should answer the following questions:
> * How many requests does BridgeDB see per day?
> * What obfuscation protocols are the most popular?
> * What bridge distribution mechanisms are the most popular?
> * From what countries do we see the most bridge requests?
> * How many BridgeDB requests fail and succeed, respectively?
> * How many requests does BridgeDB see from Yahoo/Gmail/Riseup?
> * How many HTTPS requests are coming from proxies?
> * How many requests are suspicious, and likely issued by bots?
> Each request to BridgeDB carries with it some information, which allows
> us to answer the above questions. I suggest that we collect the following:
> * The distribution mechanism. Currently, this is HTTPS, email, or Moat.
> * The requested transport. Currently this is vanilla, fte, obfs3,
> obfs4, or scramblesuit.
> * The request's origin. For Moat and HTTPS, it's the two-letter
> country code, e.g., IT for Italy. For email, it's the user's email
> domain (Gmail, Yahoo, or Riseup).
> * Whether the request was successful or unsuccessful, i.e., resulted
> in BridgeDB handing out bridges or not.
> * Whether the request was issued by a user or a bot.
> David suggested heuristics that would allow us to estimate if a
> request came from a bot:
> <https://bugs.torproject.org/9316#comment:19> I like these
> suggestions but I'm not sure yet how to encode them -- it's more
> complex than a simple binary flag.
> The combination of these statistics results in ~16,800 buckets (3
> mechanisms * 5 transports * ~280 ISO country codes * 2 success states *
> 2 bot states). We only need to export statistics with non-empty
> buckets. To protect users whose request is the only one in a given
> bucket (e.g., there may be only one user in Turkmenistan who
> successfully requested an FTE bridge over HTTPS on 2019-04-02), we
> should bin the statistics by rounding them up to the next multiple of,
> say, 10. We should further export statistics infrequently -- maybe once
> a day.
> Here's an example of a simple CSV format that takes into account the
> What are your thoughts?
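To make the quoted bucket scheme concrete, here is a minimal sketch of counting requests per bucket and rounding each non-empty bucket up to a multiple of 10. This is not BridgeDB code: the function names, the tuple layout, and the sample requests are all hypothetical, and it assumes a count that is already a multiple of 10 stays unchanged.

```python
from collections import Counter

BIN_SIZE = 10  # rounding granularity suggested in the proposal


def round_up(count, bin_size=BIN_SIZE):
    """Round a positive count up to a multiple of bin_size.

    Assumption: exact multiples are left as they are (10 stays 10).
    """
    return -(-count // bin_size) * bin_size


def bin_requests(requests, bin_size=BIN_SIZE):
    """Count requests per bucket and round each non-empty bucket up."""
    counts = Counter(requests)
    return {bucket: round_up(n, bin_size) for bucket, n in counts.items()}


# Hypothetical requests: (mechanism, transport, origin, success, bot)
requests = [
    ("https", "fte", "tm", True, False),    # a lone request in its bucket
    ("https", "obfs4", "it", True, False),
] + [("email", "obfs4", "gmail", True, False)] * 11

binned = bin_requests(requests)

# Upper bound on the number of buckets, as computed in the proposal.
max_buckets = 3 * 5 * 280 * 2 * 2
```

Only non-empty buckets appear in `binned`, so the lone request is reported as 10 and is indistinguishable from any bucket holding 1 to 10 requests.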
Over the next few months, Nick and I are going to work on
PrivCount for statistics generated by tor relays and bridges.
(I'll be on leave from today until late May.)
We haven't done the detailed design of PrivCount's API yet.
For Tor relay/bridge statistics, we'll have some Rust code
embedded in the tor binary (Data Collectors), which will
add noise, bin, and blind the statistics.
Then we'll have some aggregation servers (Tally Reporters)
which will aggregate and un-blind the results.
If we design the interfaces correctly, we should be able to
re-use the noise and bin code for BridgeDB. (The blinding is
redundant, until we have more than one BridgeDB.)
I imagine we could pass results to a command-line tool for
noise and binning. This tool would also be useful for tests.
(Tests are *so* much simpler when there's no network in the
That way, all of Tor's relay, bridge, and BridgeDB statistics
will be noised, binned, and reported in the same way.
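As a rough idea of what a shared noise-and-bin step might look like, here is a small sketch. It is only an illustration under stated assumptions: PrivCount's API has not been designed yet, so the function name, the choice of Gaussian noise, and the `sigma` and `bin_size` parameters are all hypothetical.

```python
import random


def noise_and_bin(count, sigma, bin_size, rng):
    """Add zero-mean Gaussian noise to a count, then round to the nearest bin.

    Assumptions: Gaussian noise with standard deviation sigma, and
    negative results clamped to zero after binning.
    """
    noisy = count + rng.gauss(0.0, sigma)
    binned = bin_size * round(noisy / bin_size)
    return max(0, binned)


# A seeded generator keeps test runs reproducible; a real deployment
# would need a cryptographically secure source of randomness.
rng = random.Random(0)
reported = noise_and_bin(count=137, sigma=25.0, bin_size=10, rng=rng)
```

Because the function takes its random source as an argument, tests can pass a seeded generator or set `sigma` to zero and check the binning on its own, with no network involved.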
I'm not sure if the timeframes will work out though: I'll be
doing the noise and binning when I get back at the end of May.
So we might need to do something quick and dirty until then.