[tor-commits] [metrics-web/master] Retire old user number estimates.

karsten at torproject.org
Mon Oct 28 14:45:10 UTC 2013


commit b9ce7127ccb722bfe2a368450ba4569d68cd11e3
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Mon Oct 28 15:44:37 2013 +0100

    Retire old user number estimates.
---
 web/WEB-INF/users.jsp |  407 +++++++++++++++++--------------------------------
 1 file changed, 143 insertions(+), 264 deletions(-)

diff --git a/web/WEB-INF/users.jsp b/web/WEB-INF/users.jsp
index 06d7f4a..788b3a8 100644
--- a/web/WEB-INF/users.jsp
+++ b/web/WEB-INF/users.jsp
@@ -16,238 +16,11 @@
 <h2>Tor Metrics Portal: Users</h2>
 <br>
 
-<a name="direct-users"></a>
-<h3><a href="#direct-users" class="anchor">Directly connecting Tor
-users</a></h3>
-<br>
-<p>After being connected to the Tor network, users need to refresh their
-list of running relays on a regular basis. They send their requests to one
-out of a few hundred directory mirrors to save bandwidth of the directory
-authorities. The following graphs show an estimate of recurring Tor users
-based on the requests seen by a few dozen directory mirrors.</p>
-<p><b>Daily directly connecting users:</b></p>
-<img src="direct-users.png${direct_users_url}"
-     width="576" height="360" alt="Direct users graph">
-<form action="users.html#direct-users">
-  <div class="formrow">
-    <input type="hidden" name="graph" value="direct-users">
-    <p>
-    <label>Start date (yyyy-mm-dd):</label>
-      <input type="text" name="start" size="10"
-             value="<c:choose><c:when test="${fn:length(direct_users_start) == 0}">${default_start_date}</c:when><c:otherwise>${direct_users_start[0]}</c:otherwise></c:choose>">
-    <label>End date (yyyy-mm-dd):</label>
-      <input type="text" name="end" size="10"
-             value="<c:choose><c:when test="${fn:length(direct_users_end) == 0}">${default_end_date}</c:when><c:otherwise>${direct_users_end[0]}</c:otherwise></c:choose>">
-    </p><p>
-      Source: <select name="country">
-        <option value="all"<c:if test="${direct_users_country[0] eq 'all'}"> selected</c:if>>All users</option>
-        <c:forEach var="country" items="${countries}" >
-          <option value="${country[0]}"<c:if test="${direct_users_country[0] eq country[0]}"> selected</c:if>>${country[1]}</option>
-        </c:forEach>
-      </select>
-    </p><p>
-      Show possible censorship events if available (<a
-      href="http://research.torproject.org/techreports/detector-2011-09-09.pdf">BETA</a>)
-      <select name="events">
-        <option value="off">Off</option>
-        <option value="on"<c:if test="${direct_users_events[0] eq 'on'}"> selected</c:if>>On: both points and expected range</option>
-        <option value="points"<c:if test="${direct_users_events[0] eq 'points'}"> selected</c:if>>On: points only, no expected range</option>
-      </select>
-    </p><p>
-    <input class="submit" type="submit" value="Update graph">
-    </p>
-  </div>
-</form>
-<p>Download graph as
-<a href="direct-users.pdf${direct_users_url}">PDF</a> or
-<a href="direct-users.svg${direct_users_url}">SVG</a>.</p>
-<hr>
-<a name="direct-users-table"></a>
-<p><b>Top-10 countries by directly connecting users:</b></p>
-<form action="users.html#direct-users-table">
-  <div class="formrow">
-    <input type="hidden" name="table" value="direct-users">
-    <p>
-    <label>Start date (yyyy-mm-dd):</label>
-      <input type="text" name="start" size="10"
-             value="<c:choose><c:when test="${fn:length(direct_users_start) == 0}">${default_start_date}</c:when><c:otherwise>${direct_users_start[0]}</c:otherwise></c:choose>">
-    <label>End date (yyyy-mm-dd):</label>
-      <input type="text" name="end" size="10"
-             value="<c:choose><c:when test="${fn:length(direct_users_end) == 0}">${default_end_date}</c:when><c:otherwise>${direct_users_end[0]}</c:otherwise></c:choose>">
-    </p><p>
-    <input class="submit" type="submit" value="Update table">
-    </p>
-  </div>
-</form>
-<br>
-<table>
-  <tr>
-    <th>Country</th>
-    <th>Mean daily users</th>
-  </tr>
-  <c:forEach var="row" items="${direct_users_tabledata}">
-    <tr>
-      <td><a href="users.html?graph=direct-users&country=${row['cc']}#direct-users">${row['country']}</a> </td>
-      <td>${row['abs']} (<fmt:formatNumber type="number" minFractionDigits="2" value="${row['rel']}" /> %)</td>
-    </tr>
-  </c:forEach>
-</table>
-<hr>
-<a name="censorship-events"></a>
-<p><b>Top-10 countries by possible censorship events (<a
-      href="http://research.torproject.org/techreports/detector-2011-09-09.pdf">BETA</a>):</b></p>
-<form action="users.html#censorship-events">
-  <div class="formrow">
-    <input type="hidden" name="table" value="censorship-events">
-    <p>
-    <label>Start date (yyyy-mm-dd):</label>
-      <input type="text" name="start" size="10"
-             value="<c:choose><c:when test="${fn:length(censorship_events_start) == 0}">${default_start_date}</c:when><c:otherwise>${censorship_events_start[0]}</c:otherwise></c:choose>">
-    <label>End date (yyyy-mm-dd):</label>
-      <input type="text" name="end" size="10"
-             value="<c:choose><c:when test="${fn:length(censorship_events_end) == 0}">${default_end_date}</c:when><c:otherwise>${censorship_events_end[0]}</c:otherwise></c:choose>">
-    </p><p>
-    <input class="submit" type="submit" value="Update table">
-    </p>
-  </div>
-</form>
-<br>
-<table>
-  <tr>
-    <th>Country</th>
-    <th>Downturns</th>
-    <th>Upturns</th>
-  </tr>
-  <c:forEach var="row" items="${censorship_events_tabledata}">
-    <tr>
-      <td><a href="users.html?graph=direct-users&country=${row['cc']}&events=on#direct-users">${row['country']}</a> </td>
-      <td>${row['downturns']}</td>
-      <td>${row['upturns']}</td>
-    </tr>
-  </c:forEach>
-</table>
-<hr>
-<p><a href="csv/direct-users.csv">CSV</a> file containing daily directly
-connecting users by country.</p>
-<p><a href="csv/monthly-users-peak.csv">CSV</a> file containing peak daily
-Tor users (direct and bridge) per month by country.</p>
-<p><a href="csv/monthly-users-average.csv">CSV</a> file containing average
-daily Tor users (direct and bridge) per month by country.</p>
-<br>
-
-<a name="bridge-users"></a>
-<h3><a href="#bridge-users" class="anchor">Tor users via bridges</a></h3>
-<br>
-<p>Users who cannot connect directly to the Tor network instead connect
-via bridges, which are non-public relays. The following graphs display an
-estimate of Tor users via bridges based on the unique IP addresses as seen
-by a few hundred bridges.</p>
-<img src="bridge-users.png${bridge_users_url}"
-     width="576" height="360" alt="Bridge users graph">
-<form action="users.html#bridge-users">
-  <div class="formrow">
-    <input type="hidden" name="graph" value="bridge-users">
-    <p>
-    <label>Start date (yyyy-mm-dd):</label>
-      <input type="text" name="start" size="10"
-             value="<c:choose><c:when test="${fn:length(bridge_users_start) == 0}">${default_start_date}</c:when><c:otherwise>${bridge_users_start[0]}</c:otherwise></c:choose>">
-    <label>End date (yyyy-mm-dd):</label>
-      <input type="text" name="end" size="10"
-             value="<c:choose><c:when test="${fn:length(bridge_users_end) == 0}">${default_end_date}</c:when><c:otherwise>${bridge_users_end[0]}</c:otherwise></c:choose>">
-    </p><p>
-      Source: <select name="country">
-        <option value="all"<c:if test="${bridge_users_country[0] eq 'all'}"> selected</c:if>>All users</option>
-        <c:forEach var="country" items="${countries}" >
-          <option value="${country[0]}"<c:if test="${bridge_users_country[0] eq country[0]}"> selected</c:if>>${country[1]}</option>
-        </c:forEach>
-      </select>
-    </p><p>
-    <input class="submit" type="submit" value="Update graph">
-    </p>
-  </div>
-</form>
-<p>Download graph as
-<a href="bridge-users.pdf${bridge_users_url}">PDF</a> or
-<a href="bridge-users.svg${bridge_users_url}">SVG</a>.</p>
-<hr>
-<a name="bridge-users-table"></a>
-<p><b>Top-10 countries by bridge users:</b></p>
-<form action="users.html#bridge-users-table">
-  <div class="formrow">
-    <input type="hidden" name="table" value="bridge-users">
-    <p>
-    <label>Start date (yyyy-mm-dd):</label>
-      <input type="text" name="start" size="10"
-             value="<c:choose><c:when test="${fn:length(bridge_users_start) == 0}">${default_start_date}</c:when><c:otherwise>${bridge_users_start[0]}</c:otherwise></c:choose>">
-    <label>End date (yyyy-mm-dd):</label>
-      <input type="text" name="end" size="10"
-             value="<c:choose><c:when test="${fn:length(bridge_users_end) == 0}">${default_end_date}</c:when><c:otherwise>${bridge_users_end[0]}</c:otherwise></c:choose>">
-    </p><p>
-    <input class="submit" type="submit" value="Update table">
-    </p>
-  </div>
-</form>
-<br>
-<table>
-  <tr>
-    <th>Country</th>
-    <th>Mean daily users</th>
-  </tr>
-  <c:forEach var="row" items="${bridge_users_tabledata}">
-    <tr>
-      <td><a href="users.html?graph=bridge-users&country=${row['cc']}#bridge-users">${row['country']}</a> </td>
-      <td>${row['abs']} (<fmt:formatNumber type="number" minFractionDigits="2" value="${row['rel']}" /> %)</td>
-    </tr>
-  </c:forEach>
-</table>
-<hr>
-<p><a href="csv/bridge-users.csv">CSV</a> file containing all data.</p>
-<p><a href="csv/monthly-users-peak.csv">CSV</a> file containing peak daily
-Tor users (direct and bridge) per month by country.</p>
-<p><a href="csv/monthly-users-average.csv">CSV</a> file containing average
-daily Tor users (direct and bridge) per month by country.</p>
-<br>
-
-<hr>
-<hr>
-
-<a name="userstats"></a>
-<h3><a href="#userstats" class="anchor">New approach to estimating daily
-Tor users (BETA)</a></h3>
-<br>
-<p>As of April 2013, we are experimenting with a new approach to estimating
-daily Tor users.
-The new approach works very similar to the existing approach to estimate
-directly connecting users, but can also be applied to bridge users.
-This new approach can break down user numbers by country, pluggable
-transport, and IP version.
-See the tech report on
-<a href="https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf">counting daily bridge users</a>
-and the
-<a href="https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-8462">source code</a>
-for details.
-
 <a name="userstats-relay-country"></a>
-<p><b>Direct users by country (BETA):</b></p>
-
-<font color="red">
-<p>This graph is quite similar to the graphs above,
-except for the following differences:</p>
-<ul>
-<li>In contrast to the graphs above, this graph is based on
-requests to directory mirrors <i>and</i> directory authorities.
-The idea is that we want to estimate both new and recurring users.
-That is why the numbers here are higher.</li>
-<li>This graph uses byte histories for written <i>directory bytes</i>
-rather than general byte history to weight what fraction of directory
-requests a relay has answered in the network.</li>
-<li>The implementation behind this graph is much more efficient, which
-reduces time to graph from about 3 days to about 1 day.</li>
-</ul>
-</font>
+<p><b>Direct users by country:</b></p>
 
 <img src="userstats-relay-country.png${userstats_relay_country_url}"
-     width="576" height="360" alt="Direct users by country graph (BETA)">
+     width="576" height="360" alt="Direct users by country graph">
 <form action="users.html#userstats-relay-country">
   <div class="formrow">
     <input type="hidden" name="graph" value="userstats-relay-country">
@@ -283,7 +56,7 @@ reduces time to graph from about 3 days to about 1 day.</li>
 <a href="userstats-relay-country.svg${userstats_relay_country_url}">SVG</a>.</p>
 <hr>
 <a name="userstats-relay-table"></a>
-<p><b>Top-10 countries by directly connecting users (BETA):</b></p>
+<p><b>Top-10 countries by directly connecting users:</b></p>
 <form action="users.html#userstats-relay-table">
   <div class="formrow">
     <input type="hidden" name="table" value="userstats-relay">
@@ -349,16 +122,10 @@ reduces time to graph from about 3 days to about 1 day.</li>
 <hr>
 
 <a name="userstats-bridge-country"></a>
-<p><b>Bridge users by country (BETA):</b></p>
-
-<p>
-<font color="red">In contrast to the bridge-user graph above, this graph
-uses directory requests to estimate user numbers, not unique IP address sets.
-It's yet to be decided which approach is more correct.</font>
-</p>
+<p><b>Bridge users by country:</b></p>
 
 <img src="userstats-bridge-country.png${userstats_bridge_country_url}"
-     width="576" height="360" alt="Bridge users by country graph (BETA)">
+     width="576" height="360" alt="Bridge users by country graph">
 <form action="users.html#userstats-bridge-country">
   <div class="formrow">
     <input type="hidden" name="graph" value="userstats-bridge-country">
@@ -386,7 +153,7 @@ It's yet to be decided which approach is more correct.</font>
 <a href="userstats-bridge-country.svg${userstats_bridge_country_url}">SVG</a>.</p>
 <hr>
 <a name="userstats-bridge-table"></a>
-<p><b>Top-10 countries by bridge users (BETA):</b></p>
+<p><b>Top-10 countries by bridge users:</b></p>
 <form action="users.html#userstats-bridge-table">
   <div class="formrow">
     <input type="hidden" name="table" value="userstats-bridge">
@@ -418,19 +185,10 @@ It's yet to be decided which approach is more correct.</font>
 <hr>
 
 <a name="userstats-bridge-transport"></a>
-<p><b>Bridge users by transport (BETA):</b></p>
-
-<p>
-<font color="red">Almost none of the currently running bridges report the
-transport name of connecting users, which is why non-OR transport usage is
-so low.
-By default, we consider all users of a bridge OR transport users, unless told
-otherwise.
-Non-OR transport numbers will become more accurate over time.</font>
-</p>
+<p><b>Bridge users by transport:</b></p>
 
 <img src="userstats-bridge-transport.png${userstats_bridge_transport_url}"
-     width="576" height="360" alt="Bridge users by transport graph (BETA)">
+     width="576" height="360" alt="Bridge users by transport graph">
 <form action="users.html#userstats-bridge-transport">
   <div class="formrow">
     <input type="hidden" name="graph" value="userstats-bridge-transport">
@@ -460,18 +218,10 @@ Non-OR transport numbers will become more accurate over time.</font>
 <hr>
 
 <a name="userstats-bridge-version"></a>
-<p><b>Bridge users by IP version (BETA):</b></p>
-
-<p>
-<font color="red">Not all of the currently running bridges report the
-IP version of connecting users.
-By default, we consider all users of a bridge IPv4 users, unless told
-otherwise.
-IPv6 numbers will become more accurate over time.</font>
-</p>
+<p><b>Bridge users by IP version:</b></p>
 
 <img src="userstats-bridge-version.png${userstats_bridge_version_url}"
-     width="576" height="360" alt="Bridge users by IP version graph (BETA)">
+     width="576" height="360" alt="Bridge users by IP version graph">
 <form action="users.html#userstats-bridge-version">
   <div class="formrow">
     <input type="hidden" name="graph" value="userstats-bridge-version">
@@ -498,14 +248,143 @@ IPv6 numbers will become more accurate over time.</font>
 <hr>
 
 <p><a href="csv/userstats.csv">CSV</a> file containing new user
-estimates (BETA).</p>
+estimates.</p>
 <p><a href="csv/monthly-userstats-peak.csv">CSV</a> file containing peak
-daily Tor users (direct and bridge) per month by country (BETA).</p>
+daily Tor users (direct and bridge) per month by country.</p>
 <p><a href="csv/monthly-userstats-average.csv">CSV</a> file containing
-average daily Tor users (direct and bridge) per month by country
-(BETA).</p>
+average daily Tor users (direct and bridge) per month by country.</p>
 <br>
 
+<hr>
+<a name="questions-and-answers"></a>
+<p><b>Questions and answers</b></p>
+<p>
+Q: How is it even possible to count users in an anonymity network?<br/>
+A: We actually don't count users, but we count requests to the directories
+that clients make periodically to update their list of relays and estimate
+user numbers indirectly from there.
+</p>
+<p>
+Q: Do all directories report these directory request numbers?<br/>
+A: No, but we can see what fraction of directories reported them, and then
+we can extrapolate the total number in the network.
+</p>
+
+<p>
+Q: How do you get from these directory requests to user numbers?<br/>
+A: We assume that the average client makes 10 such requests per day.  A
+tor client that is connected 24/7 makes about 15 requests per day, but
+not all clients are connected 24/7, so we picked the number 10 for the
+average client.  We simply divide directory requests by 10 and consider
+the result as the number of users.
+</p>
+
+<p>
+Q: So, are these distinct users per day, average number of users connected
+over the day, or what?<br/>
+A: Average number of users connected over the day.  We can't say how many
+distinct users there are.
+</p>
+
+<p>
+Q: Are these tor clients or users?  What if there's more than one user
+behind a tor client?<br/>
+A: Then we count those users as one.  We really count clients, but it's
+more intuitive for most people to think of users, which is why we say
+users and not clients.
+</p>
+
+<p>
+Q: What if a user runs tor on a laptop and changes their IP address a few
+times per day?  Don't you overcount that user?<br/>
+A: No, because that user updates their list of relays as often as a user
+that doesn't change IP address over the day.
+</p>
+
+<p>
+Q: How do you know which countries users come from?<br/>
+A: The directories resolve IP addresses to country codes and report these
+numbers in aggregate form.  This is one of the reasons why tor ships with
+a GeoIP database.
+</p>
+
+<p>
+Q: Why are there so few bridge users that are not using the default OR
+protocol or that are using IPv6?<br/>
+A: Very few bridges report data on transports or IP versions yet, and by
+default we consider requests to use the default OR protocol and IPv4.
+Once more bridges report these data, the numbers will become more
+accurate.
+</p>
+
+<p>
+Q: Why do the graphs end 2 days in the past and not today?<br/>
+A: Relays and bridges report some of the data in 24-hour intervals, which
+may end at any time of the day.  After such an interval is over, relays
+and bridges might take another 18 hours to report the data.  We cut off
+the last two days from the graphs so that the last data point in a graph
+does not suggest a recent trend change which is in fact just an artifact
+of the algorithm.
+</p>
+
+<p>
+Q: But I noticed that the last data point went up/down a bit since I last
+looked a few hours ago.  Why is that?<br/>
+A: You're an excellent observer!  The reason is that we publish user
+numbers once we're confident enough that they won't change significantly
+anymore.  But it's always possible that a directory reports data a few
+hours after we were confident enough, and that data then slightly changes
+the graph.
+</p>
+
+<p>
+Q: Why are no numbers available before September 2011?<br/>
+A: We do have descriptor archives from before that time, but those
+descriptors didn't contain all the data we use to estimate user numbers.
+We do have older user numbers from an earlier estimation approach here
+(add link), but we believe the current approach is more accurate.
+</p>
+
+<p>
+Q: Why do you believe the current approach to estimate user numbers is
+more accurate?<br/>
+A: For direct users, we include all directories, which we didn't do in the
+old approach.  We also use histories that only contain bytes written to
+answer directory requests, which is more precise than using general byte
+histories.
+</p>
+
+<p>
+Q: And what about the advantage of the current approach over the old one
+when it comes to bridge users?<br/>
+A: Oh, that's a whole different story.  We wrote a 13-page
+<a href="https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf">technical
+report</a> explaining the reasons for retiring the old approach.  But the
+old data is still <a href="/data/old-user-number-estimates.tar.gz">available</a>.
+tl;dr: in the old approach we measured the wrong thing, and now we measure
+the right thing.
+</p>
+
+<p>
+Q: Are the data and the source code for estimating these user numbers
+available?<br/>
+A: Sure, <a href="/data.html">data</a> and
+<a href="https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-8462">source
+code</a> are publicly available.
+</p>
+
+<p>
+Q: What are these red and blue dots indicating possible censorship
+events?<br/>
+A: We run an anomaly-based censorship-detection system that looks at
+estimated user numbers over a series of days and predicts the user number
+for the following days.  If the actual number is lower or higher than
+expected, this might indicate a censorship event or the release of
+censorship.  For more details, see our
+<a href="https://research.torproject.org/techreports/detector-2011-09-09.pdf">technical
+report</a>.
+</p>
+
     </div>
   </div>
   <div class="bottom" id="bottom">
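
The estimation described in the questions and answers added by this commit
can be sketched in a few lines.  This is a minimal illustration, not the
metrics-web implementation: the function names, parameters, and sample
figures are hypothetical.  Only the divisor of 10 requests per client per
day, the extrapolation from the fraction of reporting directories, and the
two-day cutoff are taken from the text above.

```python
from datetime import date, timedelta

# Assumption stated in the Q&A: the average client makes 10 directory
# requests per day.
REQUESTS_PER_CLIENT_PER_DAY = 10

# The Q&A says the graphs end 2 days in the past.
GRAPH_CUTOFF_DAYS = 2


def estimate_daily_users(reported_requests, reporting_fraction):
    """Estimate the average number of connected clients for one day.

    reported_requests:  directory requests seen by the directories that
                        reported statistics (hypothetical input).
    reporting_fraction: share of directories that reported, in (0, 1].
    """
    if not 0 < reporting_fraction <= 1:
        raise ValueError("reporting_fraction must be in (0, 1]")
    # Extrapolate from the reporting directories to the whole network.
    total_requests = reported_requests / reporting_fraction
    # Divide by the assumed number of requests per client per day.
    return total_requests / REQUESTS_PER_CLIENT_PER_DAY


def last_graph_date(today):
    """Return the most recent date that would still be plotted."""
    return today - timedelta(days=GRAPH_CUTOFF_DAYS)


if __name__ == "__main__":
    # Hypothetical figures: half of the directories reported 400,000
    # requests in total for the day.
    print(estimate_daily_users(400_000, 0.5))
    print(last_graph_date(date(2013, 10, 28)))
```

The result is an average number of clients connected over the day, not a
count of distinct users, as the answers above point out.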


