[tor-commits] [metrics-web/master] Move Q-and-A about user statistics to a text file.

Thu Jun 26 14:48:14 UTC 2014

commit f995b0f64febfe288af0f09efe0ef3b68c02100d
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Thu Jun 26 16:43:03 2014 +0200

    Move Q-and-A about user statistics to a text file.
    
    These questions and answers are likely read by less than 10% of visitors,
    and the remaining 90% wonder what that wall of text is.  We should write a
    more general Q-and-A section covering the entire website.  Whoever cares
    about how user statistics are calculated can read the text file.
    
    Tweak the Q's and A's a tiny bit while converting them to plain text.
---
 doc/users-q-and-a.txt         |   94 +++++++++++++++++++++++++++++
 website/web/WEB-INF/users.jsp |  131 +----------------------------------------
 2 files changed, 96 insertions(+), 129 deletions(-)
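
The estimation arithmetic described in the Q-and-A text can be sketched
as follows.  This is a minimal illustration, not the metrics-web
implementation: the function and constant names are hypothetical, and
dividing by the reporting fraction is an assumed simplification of
"extrapolate the total number in the network".

```python
# Hypothetical sketch of the user-number estimate described in the Q-and-A.
# The constant comes from the text; everything else is an assumption.

REQUESTS_PER_CLIENT_PER_DAY = 10  # average assumed in the Q-and-A


def estimate_users(reported_requests: float, reporting_fraction: float) -> float:
    """Estimate average concurrent users from one day of directory requests.

    reported_requests: requests counted by the directories that reported.
    reporting_fraction: share of directories that reported, in (0, 1].
    """
    if not 0 < reporting_fraction <= 1:
        raise ValueError("reporting_fraction must be in (0, 1]")
    # Assumed extrapolation: scale reported requests up to the whole network.
    total_requests = reported_requests / reporting_fraction
    # Divide by the assumed number of requests per client per day.
    return total_requests / REQUESTS_PER_CLIENT_PER_DAY


def implied_online_minutes() -> float:
    """Minutes online that one request stands for (one tenth of a day)."""
    return 24 * 60 / REQUESTS_PER_CLIENT_PER_DAY
```

With these assumptions, 500,000 reported requests from half of the
directories correspond to an estimate of 100,000 users, and one request
stands for 144 minutes, which is the 2 hours and 24 minutes mentioned in
the text.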

diff --git a/doc/users-q-and-a.txt b/doc/users-q-and-a.txt
new file mode 100644
index 0000000..15a1084
--- /dev/null
+++ b/doc/users-q-and-a.txt
@@ -0,0 +1,94 @@
+Questions and answers about user statistics
+===========================================
+
+Q: How is it even possible to count users in an anonymity network?
+A: We actually don't count users, but we count requests to the directories
+that clients make periodically to update their list of relays and estimate
+user numbers indirectly from there.
+
+Q: Do all directories report these directory request numbers?
+A: No, but we can see what fraction of directories reported them, and then
+we can extrapolate the total number in the network.
+
+Q: How do you get from these directory requests to user numbers?
+A: We assume that the average client makes 10 such requests per day.  A
+tor client that is connected 24/7 makes about 15 requests per day, but not
+all clients are connected 24/7, so we picked the number 10 for the average
+client.  We simply divide directory requests by 10 and consider the result
+as the number of users.  Another way of looking at it is that we assume
+that each request represents a client that stays online for one tenth of a
+day, so 2 hours and 24 minutes.
+
+Q: So, are these distinct users per day, average number of users connected
+over the day, or what?
+A: Average number of concurrent users, estimated from data collected over
+a day.  We can't say how many distinct users there are.
+
+Q: Are these tor clients or users?  What if there's more than one user
+behind a tor client?
+A: Then we count those users as one.  We really count clients, but it's
+more intuitive for most people to think of users, which is why we say
+users and not clients.
+
+Q: What if a user runs tor on a laptop and changes their IP address a few
+times per day?  Don't you overcount that user?
+A: No, because that user updates their list of relays as often as a user
+that doesn't change IP address over the day.
+
+Q: How do you know which countries users come from?
+A: The directories resolve IP addresses to country codes and report these
+numbers in aggregate form.  This is one of the reasons why tor ships with
+a GeoIP database.
+
+Q: Why are there so few bridge users that are not using the default OR
+protocol or that are using IPv6?
+A: Very few bridges report data on transports or IP versions yet, and by
+default we consider requests to use the default OR protocol and IPv4.
+Once more bridges report these data, the numbers will become more
+accurate.
+
+Q: Why do the graphs end 2 days in the past and not today?
+A: Relays and bridges report some of the data in 24-hour intervals which
+may end at any time of the day.  After such an interval is over, relays
+and bridges might take another 18 hours to report the data.  We cut off
+the last two days from the graphs so that the last data point in a graph
+does not suggest a recent trend change which is in fact just an artifact
+of the algorithm.
+
+Q: But I noticed that the last data point went up/down a bit since I last
+looked a few hours ago.  Why is that?
+A: The reason is that we publish user numbers once we're confident enough
+that they won't change significantly anymore.  But it's always possible
+that a directory reports data a few hours after we were confident enough,
+and those data then slightly change the graph.
+
+Q: Why are no numbers available before September 2011?
+A: We do have descriptor archives from before that time, but those
+descriptors didn't contain all the data we use to estimate user numbers.
+
+Q: Why do you believe the current approach to estimate user numbers is
+more accurate?
+A: For direct users, we include all directories, which we didn't do in the
+old approach.  We also use histories that only contain bytes written to
+answer directory requests, which is more precise than using general byte
+histories.
+
+Q: And what about the advantage of the current approach over the old one
+when it comes to bridge users?
+A: Oh, that's a whole different story.  We wrote a 13-page technical
+report explaining the reasons for retiring the old approach.  tl;dr: in
+the old approach we measured the wrong thing, and now we measure the right
+thing.
+
+  https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf
+
+Q: What are these red and blue dots indicating possible censorship
+events?
+A: We run an anomaly-based censorship-detection system that looks at
+estimated user numbers over a series of days and predicts the user number
+in the next days.  If the actual number is higher or lower, this might
+indicate a possible censorship event or release of censorship.  For more
+details, see our technical report.
+
+  https://research.torproject.org/techreports/detector-2011-09-09.pdf
+
diff --git a/website/web/WEB-INF/users.jsp b/website/web/WEB-INF/users.jsp
index 84cab43..0a31569 100644
--- a/website/web/WEB-INF/users.jsp
+++ b/website/web/WEB-INF/users.jsp
@@ -269,136 +269,9 @@ estimates.</p>
 <br>
 
 <hr>
-<a name="questions-and-answers"></a>
-<p><b>Questions and answers</b></p>
-<p>
-Q: How is it even possible to count users in an anonymity network?<br/>
-A: We actually don't count users, but we count requests to the directories
-that clients make periodically to update their list of relays and estimate
-user numbers indirectly from there.
-</p>
-<p>
-Q: Do all directories report these directory request numbers?<br/>
-A: No, but we can see what fraction of directories reported them, and then
-we can extrapolate the total number in the network.
-</p>
 
-<p>
-Q: How do you get from these directory requests to user numbers?<br/>
-A: We put in the assumption that the average client makes 10 such requests
-per day.  A tor client that is connected 24/7 makes about 15 requests per
-day, but not all clients are connected 24/7, so we picked the number 10
-for the average client.  We simply divide directory requests by 10 and
-consider the result as the number of users.  Another way of looking at it,
-is that we assume that each request represents a client that stays online
-for 2 hours and 24 minutes.
-</p>
-
-<p>
-Q: So, are these distinct users per day, average number of users connected
-over the day, or what?<br/>
-A: Average number of concurrent users, estimated from data collected over
-a day.  We can't say how many distinct users there are.
-</p>
-
-<p>
-Q: Are these tor clients or users?  What if there's more than one user
-behind a tor client?<br/>
-A: Then we count those users as one.  We really count clients, but it's
-more intuitive for most people to think of users, that's why we say users
-and not clients.
-</p>
-
-<p>
-Q: What if a user runs tor on a laptop and changes their IP address a few
-times per day?  Don't you overcount that user?<br/>
-A: No, because that user updates their list of relays as often as a user
-that doesn't change IP address over the day.
-</p>
-
-<p>
-Q: How do you know which countries users come from?<br/>
-A: The directories resolve IP addresses to country codes and report these
-numbers in aggregate form.  This is one of the reasons why tor ships with
-a GeoIP database.
-</p>
-
-<p>
-Q: Why are there so few bridge users that are not using the default OR
-protocol or that are using IPv6?<br/>
-A: Very few bridges report data on transports or IP versions yet, and by
-default we consider requests to use the default OR protocol and IPv4.
-Once more bridges report these data, the numbers will become more
-accurate.
-</p>
-
-<p>
-Q: Why do the graphs end 2 days in the past and not today?<br/>
-A: Relays and bridges report some of the data in 24-hour intervals which
-may end at any time of the day.  And after such an interval is over relays
-and bridges might take another 18 hours to report the data.  We cut off
-the last two days from the graphs, because we want to avoid that the last
-data point in a graph indicates a recent trend change which is in fact
-just an artifact of the algorithm.
-</p>
-
-<p>
-Q: But I noticed that the last data point went up/down a bit since I last
-looked a few hours ago.  Why is that?<br/>
-A: You're an excellent observer!  The reason is that we publish user
-numbers once we're confident enough that they won't change significantly
-anymore.  But it's always possible that a directory reports data a few
-hours after we were confident enough, but which then slightly changed the
-graph.
-</p>
-
-<p>
-Q: Why are no numbers available before September 2011?<br/>
-A: We do have descriptor archives from before that time, but those
-descriptors didn't contain all the data we use to estimate user numbers.
-We do have older user numbers from an earlier estimation approach
-<a href="/data/old-user-number-estimates.tar.gz">here</a>, but we believe
-the current approach is more accurate.
-</p>
-
-<p>
-Q: Why do you believe the current approach to estimate user numbers is
-more accurate?<br/>
-A: For direct users, we include all directories which we didn't do in the
-old approach.  We also use histories that only contain bytes written to
-answer directory requests, which is more precise than using general byte
-histories.
-</p>
-
-<p>
-Q: And what about the advantage of the current approach over the old one
-when it comes to bridge users?<br/>
-A: Oh, that's a whole different story.  We wrote a 13 page long
-<a href="https://research.torproject.org/techreports/counting-daily-bridge-users-2012-10-24.pdf">technical
-report</a> explaining the reasons for retiring the old approach.
-tl;dr: in the old approach we measured the wrong thing, and now we measure
-the right thing.
-</p>
-
-<p>
-Q: Are the data and the source code for estimating these user numbers
-available?<br/>
-A: Sure, <a href="/data.html">data</a> and
-<a href="https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-8462">source
-code</a> are publicly available.
-</p>
-
-<p>
-Q: What are these red and blue dots indicating possible censorship
-events?<br/>
-A: We run an anomaly-based censorship-detection system that looks at
-estimated user numbers over a series of days and predicts the user number
-in the next days.  If the actual number is higher or lower, this might
-indicate a possible censorship event or release of censorship.  For more
-details, see our
-<a href="https://research.torproject.org/techreports/detector-2011-09-09.pdf">technical
-report</a>.
-</p>
+<p><a href="https://gitweb.torproject.org/metrics-web.git/blob/HEAD:/doc/users-q-and-a.txt">Questions
+and answers about user statistics</a></p>
 
     </div>
   </div>