commit c8a55881b2f7debfa57fd4b17df101f8ceca8aae
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Sun Oct 14 08:39:23 2012 -0400
    Add bridge-stats report from April 2012.
---
 2012/bridge-report-usage-stats/.gitignore          |    3 +
 .../bridge-report-usage-stats.bib                  |   10 +
 .../bridge-report-usage-stats.tex                  |  217 ++++++++++++++++++++
 2012/bridge-report-usage-stats/bridgeusers.png     |  Bin 0 -> 69368 bytes
 2012/bridge-report-usage-stats/discarded.png       |  Bin 0 -> 78228 bytes
 2012/bridge-report-usage-stats/notreported.png     |  Bin 0 -> 102217 bytes
 2012/bridge-report-usage-stats/reported.png        |  Bin 0 -> 85090 bytes
 2012/bridge-report-usage-stats/tortechrep.cls      |    1 +
 8 files changed, 231 insertions(+), 0 deletions(-)
diff --git a/2012/bridge-report-usage-stats/.gitignore b/2012/bridge-report-usage-stats/.gitignore
new file mode 100644
index 0000000..c787837
--- /dev/null
+++ b/2012/bridge-report-usage-stats/.gitignore
@@ -0,0 +1,3 @@
+bridge-report-usage-stats.pdf
+bridge-report-usage-stats-2012-04-30.pdf
+
diff --git a/2012/bridge-report-usage-stats/bridge-report-usage-stats.bib b/2012/bridge-report-usage-stats/bridge-report-usage-stats.bib
new file mode 100644
index 0000000..289b32b
--- /dev/null
+++ b/2012/bridge-report-usage-stats/bridge-report-usage-stats.bib
@@ -0,0 +1,10 @@
+@techreport{tor-2010-11-001,
+  author = {Sebastian Hahn and Karsten Loesing},
+  title = {Privacy-preserving Ways to Estimate the Number of {Tor} Users},
+  institution = {The Tor Project},
+  number = {2010-11-001},
+  year = {2010},
+  month = {November},
+  url = {https://research.torproject.org/techreports/countingusers-2010-11-30.pdf}
+}
+
diff --git a/2012/bridge-report-usage-stats/bridge-report-usage-stats.tex b/2012/bridge-report-usage-stats/bridge-report-usage-stats.tex
new file mode 100644
index 0000000..969d63a
--- /dev/null
+++ b/2012/bridge-report-usage-stats/bridge-report-usage-stats.tex
@@ -0,0 +1,217 @@
+\documentclass{tortechrep}
+\usepackage{url}
+\usepackage{graphicx}
+\begin{document}
+
+\title{What fraction of our bridges are\\not reporting usage statistics?}
+\author{Karsten Loesing}
+\contact{karsten@torproject.org}
+\reportid{2012-04-001}
+\date{April 30, 2012}
+\maketitle
+
+\section{Introduction}
+
+Tor's current approach to counting daily bridge users is probably broken.
+The estimate of daily bridge users from all countries ranges from a few
+hundred to half a million in the time between mid-2008 and early 2012 (see
+Figure~\ref{fig:bridgeusers}).
+We have little idea whether the real number is closer to the lower or the
+upper end.
+It's probably ``somewhere in the middle.''
+
+\begin{figure}[t]
+\includegraphics[width=\textwidth]{bridgeusers.png}
+\caption{Estimated bridge users from all countries between 2008 and 2012.}
+\label{fig:bridgeusers}
+\end{figure}
+
+The current approach to estimating the number of bridge users is based on
+bridges reporting the number of unique IP addresses they see in a given
+24-hour timeframe to the bridge authority.
+We collect all reports, sum up unique IP addresses per day, and interpret
+the result as the estimated number of users.
+
+We already identified two shortcomings in this
+approach~\cite{tor-2010-11-001}: The first shortcoming is that the
+assumption that a bridge user only connects to a single bridge is very
+likely false.
+As a result we may over-count bridge clients connecting to two or more
+bridges.
+The second shortcoming is that we're excluding an as yet unknown fraction
+of bridges which don't report usage statistics to the bridge authority.
+A possible reason for not reporting statistics is the minimum uptime of
+24~hours before publishing statistics, which has the purpose of hiding
+exact connection times to protect the users' privacy.
+
+In this report we want to focus on the second shortcoming by analyzing
+what fraction of bridges are not reporting usage statistics.
+Obviously, whether this fraction is at 20\% or at 80\% has a major impact
+on the estimated number of bridge users.
+But in addition to that we hope to learn something more general about how
+bridges report statistics to the bridge authority that we can apply to new
+approaches that estimate daily bridge users.
+
+In the following we discuss reasons for discarding reported bridge
+statistics and possible causes for bridges not to report statistics.
+We then look into the bridge descriptor archives to quantify what fraction
+of bridges are affected by these cases.
+We conclude with ideas for increasing the fraction of included statistics.
+
+\section{Reasons for missing bridge usage statistics}
+
+There are two categories of reasons for missing bridge usage statistics:
+either a bridge reports statistics which are discarded, or the bridge does
+not report statistics at all.
+Reasons for discarding reported statistics are:
+
+\begin{enumerate}
+\item \textbf{Running as non-bridge relay:} We exclude all statistics from
+bridges that have been running as non-bridge relays before.
+The reason is that non-bridge clients may still connect to such a bridge.
+We expect there to be many more directly connecting users than bridge
+users, so including these statistics might lead to greatly overestimating
+the number of bridge users.
+We currently exclude statistics from bridges which have been running as
+relay at any time in the past, even months ago.
+We had cases where excluding such a bridge removed a sudden increase in
+bridge user numbers which could not be explained otherwise.
+\item \textbf{Known bug in statistics code:} There are a few Tor versions
+which had bugs in their statistics implementation.
+We exclude these statistics, too.
+\item \textbf{Missing geoip file:} We recently discovered that bridges
+which don't have a geoip file still report bridge usage statistics with
+all zeros.
+For the current approach where we sum up all observations, this isn't a
+problem.
+But it's still interesting to learn how widespread the problem of missing
+geoip files on bridges is.
+Only bridges running Tor version 0.2.3.1-alpha or higher report whether
+they have a geoip file configured or not.
+\end{enumerate}
+
+In addition to these cases, there are a few possible causes for bridges
+not reporting statistics:
+
+\begin{enumerate}
+\setcounter{enumi}{3}
+\item \textbf{Less than 24 hours uptime:} Bridges which have an uptime of
+less than 24 hours don't report statistics for this period of time.
+This has to do with the requirement to aggregate observations for a
+sufficient amount of time to hide exact connection times and protect the
+users' privacy.
+\item \textbf{Descriptor publication delay:} Some bridges may even
+complete a 24-hour interval and prepare statistics to be reported in their
+next descriptor.
+But then they go offline and don't publish that descriptor.
+Bridges look at previously finished statistics intervals when starting up,
+but either a bridge decides that its previous statistics are too old to be
+published, or a bridge never shows up again.
+The fix here might be to make bridges publish a new descriptor immediately
+after finishing a statistics interval, which is suggested as enhancement
+\#4142.
+We should probably find out how many bridges are affected by this problem
+before implementing the fix.
+\item \textbf{Other reasons:} There may be other causes for a bridge not
+reporting statistics which we did not identify.
+\end{enumerate}
+
+\section{Fraction of missing bridge usage statistics}
+
+After listing reasons for reported observations being discarded and for
+bridges not reporting statistics at all, we now want to quantify how many
+bridges are affected by which case.
+
+\begin{figure}[t]
+\includegraphics[width=\textwidth]{reported.png}
+\caption{Fraction of bridges that reported statistics which were either
+used or discarded, or that did not report statistics.}
+\label{fig:reported}
+\end{figure}
+
+Figure~\ref{fig:reported} shows the fraction of bridges that did or did
+not report usage statistics and how many of these reports had to be
+discarded.
+The graph shows an almost monotonic downward trend of non-reported
+statistics from 2008 to early 2010 to around 20\%.
+This fraction went up only slightly in 2010 and 2011 to 40\% and is now
+back at 20\%.
+The fraction of reported and discarded statistics was between 10\% and
+20\% for most of the time between 2008 and today.
+As a result, the fraction of reported and used statistics went up from
+around 35\% in early 2009 to around 75\% in early 2012.
+
+These results are much better than we expected.
+A bridge usage statistic that is based on 75\% of all bridges at least
+rules out inaccuracies from too small a sample size.
+We still want to look into reasons for discarded or not reported
+statistics.
+
+\begin{figure}[t]
+\includegraphics[width=\textwidth]{discarded.png}
+\caption{Reasons for discarding reported usage statistics.}
+\label{fig:discarded}
+\end{figure}
+
+Figure~\ref{fig:discarded} shows what fractions of reported bridge
+statistics were discarded for what reasons.
+The fractions of statistics that had to be discarded because of the
+geoip-stats bug in Tor 0.2.2.x or because of missing geoip files are at
+almost 0\% for most of the time.
+Only in late 2009, the geoip-stats bug affected up to 5\% of bridges.
+But the fraction of discarded statistics because of bridges previously
+running as non-bridge relays is quite high at 10\% to 20\%.
+
+It's quite likely that we could reduce this fraction by being less strict
+about bridges running as non-bridge relays.
+In theory, a delay of a few days between running as relay and running as
+bridge should be sufficient to exclude directly connecting clients from the
+statistics.
+However, this requires further analysis.
+
+\begin{figure}[t]
+\includegraphics[width=\textwidth]{notreported.png}
+\caption{Reasons for bridges not reporting usage statistics.}
+\label{fig:notreported}
+\end{figure}
+
+Figure~\ref{fig:notreported} shows what fractions of bridges don't report
+statistics for what reasons.
+For most of the time, the fraction of bridges not reporting statistics
+because they went offline before their 24-hour interval ended was about as
+large as the fraction of ``other reasons.''
+Only recently, the ``other reasons'' dropped to almost 0\%.
+The fraction of missing statistics due to a delay between completing a
+statistics interval and publishing the descriptor containing those
+statistics is almost at 0\% for most of the time.
+There's probably not much that we can do about the first category of
+bridges which go offline before their 24-hour interval ends.
+This interval is there to hide exact connection times and to protect the
+users' privacy.
+Any algorithm will have to cope with 15\% to 25\% missing statistics due
+to the 24-hour interval requirement.
+Fortunately, the fraction of non-reported statistics due to the descriptor
+publication delay is almost at 0\%, so we don't have to fix that.
+It's unclear what other reasons led to bridges not publishing statistics.
+Given that this fraction is almost at 0\%, there's no immediate need to
+investigate.
+
+\section{Conclusion}
+
+In this report we analyzed what fraction of bridges are not reporting
+usage statistics, which might affect our daily bridge user estimates.
+The analysis of bridge descriptor archives resulted in a fraction of up to
+75\% of bridges reporting usage statistics that get used to estimate user
+numbers.
+This fraction might even be increased by discarding fewer statistics from
+bridges that were seen as non-bridge relays before.
+We conclude that too small a sample size is not the cause of our probably
+wrong bridge user numbers.
+We think that a new approach that will be based on bridges reporting their
+findings in 24-hour intervals has a good chance of leading to quite
+reliable user numbers.
+
+\bibliography{bridge-report-usage-stats}
+
+\end{document}
+
diff --git a/2012/bridge-report-usage-stats/bridgeusers.png b/2012/bridge-report-usage-stats/bridgeusers.png
new file mode 100755
index 0000000..a77e87a
Binary files /dev/null and b/2012/bridge-report-usage-stats/bridgeusers.png differ
diff --git a/2012/bridge-report-usage-stats/discarded.png b/2012/bridge-report-usage-stats/discarded.png
new file mode 100755
index 0000000..a96a604
Binary files /dev/null and b/2012/bridge-report-usage-stats/discarded.png differ
diff --git a/2012/bridge-report-usage-stats/notreported.png b/2012/bridge-report-usage-stats/notreported.png
new file mode 100755
index 0000000..d44f352
Binary files /dev/null and b/2012/bridge-report-usage-stats/notreported.png differ
diff --git a/2012/bridge-report-usage-stats/reported.png b/2012/bridge-report-usage-stats/reported.png
new file mode 100755
index 0000000..fa5640f
Binary files /dev/null and b/2012/bridge-report-usage-stats/reported.png differ
diff --git a/2012/bridge-report-usage-stats/tortechrep.cls b/2012/bridge-report-usage-stats/tortechrep.cls
new file mode 120000
index 0000000..4c24db2
--- /dev/null
+++ b/2012/bridge-report-usage-stats/tortechrep.cls
@@ -0,0 +1 @@
+../../tortechrep.cls
\ No newline at end of file
tor-commits@lists.torproject.org