[or-cvs] r14578: Add proposed methodology for tracking national usage trends. (tor/trunk/doc/spec/proposals/ideas)

nickm at seul.org nickm at seul.org
Thu May 8 04:13:37 UTC 2008


Author: nickm
Date: 2008-05-08 00:13:36 -0400 (Thu, 08 May 2008)
New Revision: 14578

Added:
   tor/trunk/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
Log:
Add proposed methodology for tracking national usage trends.

Added: tor/trunk/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt
===================================================================
--- tor/trunk/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt	                        (rev 0)
+++ tor/trunk/doc/spec/proposals/ideas/xxx-geoip-survey-plan.txt	2008-05-08 04:13:36 UTC (rev 14578)
@@ -0,0 +1,88 @@
+
+
+Abstract
+
+   This document explains how to estimate roughly how many Tor users
+   there are, and how many there are in each country.  Statistics are
+   involved.
+
+Motivation
+
+   There are a few reasons we need to keep track of which countries
+   Tor users (in aggregate) are coming from:
+
+      - Resource allocation.  Knowing about underserved countries with
+        lots of users can let us know about where we need to direct
+        translation and outreach efforts.
+
+      - Anticensorship.  Sudden drops in usage on a national basis can
+        indicate the arrival of a censorious firewall.
+
+      - Sponsor outreach and self-evaluation.  Many people and
+        organizations who are interested in funding The Tor Project's
+        work want to know that we're successfully serving parts of the
+        world they're interested in, and that efforts to expand our
+        userbase are actually succeeding.  So, when you come right
+        down to it, do we.
+
+Goals
+
+   We want to know roughly how many Tor users there are, and which
+   countries they're in, even in the presence of a hypothetical
+   "directory guard" feature.  Some uncertainty is okay, but we'd like
+   to be able to put a bound on the uncertainty.
+
+   We need to make sure this information isn't exposed in a way that
+   helps an adversary.
+
+Methods
+
+   Every client downloads network status documents.  There are
+   currently three methods (one hypothetical) for clients to get them:
+      - 0.1.2.x clients (and earlier) fetch a v2 networkstatus
+        document about every NETWORKSTATUS_CLIENT_DL_INTERVAL [30
+        minutes].
+
+      - 0.2.0.x clients fetch a v3 networkstatus consensus document
+        at a random interval between when their current document is no
+        longer freshest, and when their current document is about to
+        expire.
+
+        [In both of the above cases, clients choose a directory cache at
+        random with odds roughly proportional to its bandwidth.]
+
+      - In some future version, clients will choose directory caches
+        to serve as their "directory guards" to avoid profiling
+        attacks, similarly to how clients currently start all their
+        circuits at guard nodes.
+
+    We assume that a directory cache can tell which of these three
+    categories a client is in by the format of its status request.
+
+    A directory cache can be made to count distinct client IP
+    addresses that make a certain request of it in a given timeframe.
+    For the first two cases, a cache can get a picture of the overall
+    number and countries of users in the network by dividing its IP
+    count by the probability with which it would be chosen.  Assume
+    that the cache's listed bandwidth is such that it expects to be
+    chosen with probability P for any given request, and that it has
+    counted IPs for long enough that the average client has made N
+    requests.  If each request picks a cache independently, a client
+    will have visited the cache at least once with probability
+    P' = 1-(1-P)^N, so the cache divides its IP counts by P'.
+
+    If directory guards are in use, directory guards get a picture of
+    all those users who chose them as a guard when they were listed
+    as a good choice for a guard, and who are also on the network
+    now.  The cleanest data here will come from nodes that were listed
+    as good new-guard choices for a while, and have not been so for a
+    while longer (to study decay rates); nodes that have been listed
+    as good new-guard choices consistently for a long time (to get a
+    sample of the network); and nodes that have been listed as good
+    new-guard choices only recently (to get a sample of new users and
+    users whose guards have died out).
+
+    Note that these measurements *shouldn't* be taken at directory
+    authorities: their picture of the network is too skewed by the
+    special cases in which clients fetch from them directly.
+
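
The estimate described under Methods for the first two client
categories (dividing a cache's distinct-IP counts by
P' = 1-(1-P)^N) can be sketched roughly as follows.  This is an
illustrative Python sketch only, not code from Tor: the function
names, the per-country dictionary, and the example numbers are all
hypothetical, and it assumes that each request picks a cache
independently with the same probability P.

```python
from typing import Dict


def visit_probability(p: float, n: float) -> float:
    """Probability that a client making n independent requests reaches
    this cache at least once, given per-request probability p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    if n <= 0:
        raise ValueError("n must be positive")
    return 1.0 - (1.0 - p) ** n


def estimate_users(ip_counts: Dict[str, int], p: float, n: float) -> Dict[str, float]:
    """Scale per-country distinct-IP counts seen at one cache up to a
    network-wide estimate by dividing each count by P'."""
    p_prime = visit_probability(p, n)
    return {country: count / p_prime for country, count in ip_counts.items()}


if __name__ == "__main__":
    # Hypothetical figures: this cache is chosen for 2% of requests, and
    # the average client made about 48 requests while we were counting.
    observed = {"de": 1200, "us": 2500, "cn": 300}
    for country, estimate in sorted(estimate_users(observed, 0.02, 48).items()):
        print(country, round(estimate))
```

As N grows, P' approaches 1 and the correction shrinks: a cache that
counts for long enough expects to see nearly every active client at
least once, while short counting periods lean more heavily on an
accurate value of P.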


