commit 75f5c0c13c2a858ad77309b4b468b39f1003721c Author: Karsten Loesing karsten.loesing@gmx.net Date: Sun Feb 8 19:19:10 2015 +0100
Tweak extrapolation report before publication. --- .../extrapolating-hidserv-stats.tex | 482 ++++++++++++-------- 1 file changed, 285 insertions(+), 197 deletions(-)
diff --git a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex index 053a081..bef857f 100644 --- a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex +++ b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex @@ -4,32 +4,69 @@ \usepackage{url} \begin{document}
-\title{Extrapolating network totals from hidden-service statistics} +\title{Extrapolating network totals\\from hidden-service statistics}
-\author{yet unnamed authors} +\author{George Kadianakis and Karsten Loesing}
-\reportid{DRAFT} -\date{to be published in January 2015} +\contact{ +\href{mailto:asn@torproject.org}{asn@torproject.org},% +\href{mailto:karsten@torproject.org}{karsten@torproject.org}} + +\reportid{2015-01-001} +\date{January 31, 2015}
\maketitle
\begin{abstract} Starting on December 19, 2014, we added two new statistics to the Tor software that shall give us some first insights into hidden-service usage. -The first is the number of .onion addresses observed by a hidden-service -directory, and the second is the number of cells on rendezvous circuits -observed by a rendezvous point. +The first statistic is the number of cells on rendezvous circuits observed +by a rendezvous point, and the second is the number of unique .onion +addresses observed by a hidden-service directory. Each relay that opts in to reporting these statistics publishes these two numbers for 24-hour intervals of operation. -In the following, we explain our approach for extrapolating network totals +In the following, we describe an approach for extrapolating network totals from these statistics. -The goal is to learn how many unique .onion addresses exist in the network -and what amount of traffic can be attributed to hidden-service usage. -We show that we can extrapolate network totals from hidden-service -statistics with reasonable accuracy as long as at least 1% of relays -report these statistics. +The goal is to learn what amount of traffic can be attributed to +hidden-service usage and how many unique .onion addresses exist in the +network. +We show that we can extrapolate network totals with reasonable accuracy as +long as at least 1% of relays report these statistics. \end{abstract}
+\section*{Introduction} + +As of December 19, 2014, a small number of relays have started reporting +statistics on hidden-service usage. +Similar to other statistics, these statistics are based solely on what the +reporting relay observes, without exchanging observations with other +relays. +In this report we describe a method for extrapolating these statistics to +network totals. + +\begin{figure} +\centering +\includegraphics[width=.8\textwidth]{overview.pdf} +\caption{Overview of the method used for extrapolating network totals +from hidden-service statistics.} +\label{fig:overview} +\end{figure} + +Figure~\ref{fig:overview} gives an overview of the extrapolation method, +where each step corresponds to a section in this report. +In step~1 we parse the statistics that relays report in their extra-info +descriptors. +These statistics contain noise that was added by relays to obfuscate +original observations, which we attempt to remove in step~2. +In step~3 we process consensuses to derive network fractions of reporting +relays, that is, what fraction of hidden-service usage a relay should have +observed. +We use these fractions to remove implausible statistics in step~4. +Then we extrapolate network totals in step~5, where each extrapolation is +based on the report from a single relay. +Finally, in step~6 we select daily averages from these network totals, +which constitute our results. + \section{Parsing reported statistics}
There are two types of documents produced by Tor relays that we consider @@ -40,6 +77,19 @@ The second are consensuses that indicate what fraction of hidden-service descriptors a hidden-service directory has observed and what fraction of rendezvous circuits a relay has handled.
+We start by describing how we're parsing and processing hidden-service +statistics from extra-info descriptors. +Figure~\ref{fig:num-reported-stats} shows the number of statistics +reported by day, and Figure~\ref{fig:extrainfo} shows a sample. +The relevant parts for this analysis are: + +\begin{figure}[b] +\centering +\includegraphics[width=\textwidth]{graphics/num-reported-stats.pdf} +\caption{Number of reported hidden-service statistics.} +\label{fig:num-reported-stats} +\end{figure} + % SAMPLE: % fingerprint F528DED21EACD2E4E9301EC0AABD370EDCAD2C47 % stats_start 2014-12-31 16:17:33 @@ -49,7 +99,7 @@ rendezvous circuits a relay has handled. % prob_rend_point 0.01509326 % frac_hsdesc 0.00069757
-\begin{figure}[b] +\begin{figure} \begin{verbatim} extra-info ryroConoha F528DED21EACD2E4E9301EC0AABD370EDCAD2C47 [...] @@ -62,12 +112,6 @@ descriptor.} \label{fig:extrainfo} \end{figure}
-We start by describing how we're parsing and processing hidden-service -statistics from extra-info descriptors. -Figure~\ref{fig:extrainfo} shows a sample of hidden-service statistics as -contained in extra-info descriptors. -The relevant parts for this analysis are: - \begin{itemize} \item The \verb+extra-info+ line tells us which relay reported these statistics, which we need to know to derive what fraction of @@ -81,21 +125,14 @@ The value for \verb+bin_size+ is the bin size used for rounding up the originally observed cell number, and the values for \verb+delta_f+ and \verb+epsilon+ are inputs for the additive noise following a Laplace distribution. +For more information on how obfuscation is performed, please see Tor +proposal 238.% +\footnote{\url{https://gitweb.torproject.org/torspec.git/tree/proposals/238-hs-relay-stats.... \item And finally, the \verb+hidserv-dir-onions-seen+ line tells us the number of .onion addresses that the relay observed in published hidden-service descriptors in its role as hidden-service directory. \end{itemize}
-\begin{figure} -\centering -\includegraphics[width=\textwidth]{graphics/num-reported-stats.pdf} -\caption{Number of relays reporting hidden-service statistics.} -\label{fig:num-reported-stats} -\end{figure} - -Figure~\ref{fig:num-reported-stats} shows the number of statistics -reported by day. - \section{Removing previously added noise}
When processing hidden-service statistics, we need to handle the fact that @@ -112,24 +149,19 @@ Following these steps, the statistics reported in Figure~\ref{fig:extrainfo} are processed to 152599040~cells and 84~.onion addresses. For the subsequent analysis we're also converting cells/day to -bytes/second by multiplying cell numbers with 512~bytes/cell, dividing by -86400~seconds/day, and dividing by 2 to account for the fact that -statistics include cells in both incoming and outgoing direction. -As a result we obtain 452~KB/s in the given sample. +bits/second by multiplying cell numbers with 512~bytes/cell, multiplying +with 8~bits/byte, dividing by 86400~seconds/day, and dividing by 2 to +account for the fact that statistics include cells in both incoming and +outgoing direction. +As a result we obtain 3.6~Mbit/s in the given sample.
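The unit conversion for the sample value can be restated in a few lines of Python, a plain transcription of the arithmetic described in the text:

```python
cells_per_day = 152599040           # denoised sample value from the text
bytes_per_day = cells_per_day * 512     # 512 bytes per cell
bits_per_day = bytes_per_day * 8        # 8 bits per byte
bits_per_second = bits_per_day / 86400  # 86400 seconds per day
bandwidth = bits_per_second / 2     # cells are counted in both directions
print(round(bandwidth / 1e6, 1))    # 3.6 (Mbit/s)
```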
Figure~\ref{fig:stats-by-day} shows parsed values after removing previously added noise. Negative values are the result of relays adding negative -Laplace-distributed noise values to very small observed values. -We will describe an attempt to remove such values shortly. -\footnote{A plausible step three in the previously described process could -have been to round negative values to 0, because that represents the most -likely rounded value before Laplace noise was added. -However, removing negative values would add bias to the result, because it -would only remove negative noise without being able to detect and remove -positive noise. -That's why we'd rather want to remove implausible values based on other -criteria.} +Laplace-distributed noise values to very small observed values, which we +cannot remove easily. +We will describe an attempt to remove such values in +Sections~\ref{sec:implausible} and \ref{sec:averages}.
\begin{figure} \centering @@ -144,12 +176,12 @@ Laplace-distributed noise values to very small observed values.} \section{Deriving network fractions from consensuses}
The second document type that we consider in our analysis are consensuses. -Not all hidden-service directories observe the same number of -hidden-service descriptors, and the probability of chosing a relay as -rendezvous point is even less uniformly distributed. -Fortunately, we can derive what fraction of descriptors a directory was -responsible for and what fraction of rendezvous circuits a relay has -handled. +The probability of choosing a relay as rendezvous point varies a lot +between relays, and not all hidden-service directories handle the same +number of hidden-service descriptors. +Fortunately, we can derive what fraction of rendezvous circuits a relay +has handled and what fraction of descriptors a directory was responsible +for.
\begin{figure} \begin{verbatim} @@ -179,11 +211,33 @@ directories preceding it.} \end{figure}
Figure~\ref{fig:consensusentry} shows the consensus entry of the relay -that submitted the sample hidden-service statistics mentioned above. +that submitted the sample hidden-service statistics mentioned above, plus +neighboring consensus entries. + +The first fraction that we compute is the probability of a relay to be +selected as rendezvous point. +Clients only select relays with the \verb+Fast+ flag and in some cases the +\verb+Stable+ flag, and they weight relays differently based on their +bandwidth and depending on whether they have the \verb+Exit+ and/or +\verb+Guard+ flags. +(Clients require relays to have the \verb+Stable+ flag if they attempt to +establish a long-running connection, e.g., to a hidden SSH server, but in +the following analysis, we assume that most clients establish connections +that don't need to last for long, e.g., to hidden webservers.) +Clients weight the bandwidth value contained in the consensus entry with +the value of \verb+Wmg+, \verb+Wme+, \verb+Wmd+, or \verb+Wmm+, depending +on whether the relay has only the \verb+Guard+ flag, only the \verb+Exit+ +flag, both such flags, or neither of them. + +Our sample relay, \texttt{ryroConoha}, has the \verb+Fast+ flag, a +bandwidth value of 117000, and neither \verb+Guard+ nor \verb+Exit+ flag. +Its probability for being selected as rendezvous point is calculated as +$117000 \times 10000/10000$ divided by the sum of all such weights in the +consensus, in this case $1.42%$.
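This weighting can be sketched as follows. The consensus-wide sum of weights below is a hypothetical stand-in chosen so that the sample works out to the 1.42% from the text; it is not taken from a real consensus.

```python
def rend_weight(bandwidth, wm):
    # Consensus bandwidth weights Wmg/Wme/Wmd/Wmm are expressed in
    # units of 1/10000.
    return bandwidth * wm / 10000

# ryroConoha has neither the Guard nor the Exit flag, so Wmm applies.
relay_weight = rend_weight(117000, 10000)
total_weight = 8_239_437  # hypothetical sum over all eligible relays
print(round(relay_weight / total_weight, 4))  # 0.0142
```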
-The first fraction that we can derive from this entry is the fraction of -descriptor space that this relay was responsible for in its role as -hidden-service directory. +The second fraction that we can derive from this consensus entry is the +fraction of descriptor space that this relay was responsible for in its +role as hidden-service directory. The Tor Rendezvous Specification\footnote{\url{https://gitweb.torproject.org/torspec.git/tree/rend-spec.txt}} contains the following definition that is relevant here: @@ -195,68 +249,66 @@ three identity digests of HSDir relays following the descriptor ID in a circular list.} \end{quote}
+Based on the fraction of descriptor space that a directory was responsible +for we can compute the fraction of descriptors that this directory has +seen. +Intuitively, one might think that these fractions are the same. +However, this is not the case: each descriptor that is published to a +directory is also published to two other directories. +As a result we need to divide the fraction of descriptor space by +\emph{three} to obtain the fraction of descriptors observed by the +directory. +Note that, without dividing by three, fractions of all directories would +not add up to 100%. + In the sample consensus entry, we'd extract the base64-encoded fingerprint of the statistics-reporting relay, \verb+9Sje0h6...+, and the fingerprint of the hidden-service directory that precedes the relay by three positions, \verb+9PodlaV...+, and compute what fraction of descriptor -space that is, in this case $0.07%$. +space that is, in this case $0.071%$. +So, the relay has observed $0.024%$ of descriptors in the network.
-The second fraction that we compute is the probability of a relay to be -selected as rendezvous point. -Clients select only relays with the \verb+Fast+ and in some cases the -\verb+Stable+ flag, and they weigh relays differently based on their -bandwidth and depending on whether they have the \verb+Exit+ and/or -\verb+Guard+ flags. -(Clients further require relays to have the \verb+Stable+ flag if they -attempt to establish a long-running connection, e.g., to a hidden SSH -server, but in the following analysis, we assume that most clients -establish connections that don't need to last for long, e.g., to a hidden -webserver.) -Clients weigh the bandwidth value contained in the consensus with the -value of \verb+Wmg+, \verb+Wme+, \verb+Wmd+, or \verb+Wmm+, depending on -whether the relay has only the \verb+Guard+ flag, only the \verb+Exit+ -flag, both such flags, or neither of them. - -Our sample relay has the \verb+Fast+ flag, a bandwidth value of 117,000, -and neither \verb+Guard+ nor \verb+Exit+ flag. -Its probability for being selected as rendezvous point is calculated as -$117000 \times 10000/10000$ divided by the sum of all such weights in the -consensus, in this case $1.42%$ +% 9Sje0h6... -> F528DED2 -> 4113096402 +% 9PodlaV... -> F4FA1D95 -> - 4110032277 +% = 3064125 +% / 4294967296 +% = 0.00071342 +% / 3 +% = 0.00023781
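The computation shown in the comment can be reproduced with a short sketch. Like the comment, it compares only the first 32 bits of the identity digests as an approximation; the actual descriptor space is the full 160-bit ring.

```python
def descriptor_space_fraction(hsdir_hex, predecessor_hex):
    # Distance between the two fingerprint prefixes on a circular ring
    # of size 2**32, using only the first 32 bits of each digest.
    a = int(hsdir_hex, 16)
    b = int(predecessor_hex, 16)
    return ((a - b) % 2**32) / 2**32

space = descriptor_space_fraction("F528DED2", "F4FA1D95")
onions_fraction = space / 3  # each descriptor goes to three directories
print(round(space * 100, 3))            # 0.071 (percent of descriptor space)
print(round(onions_fraction * 100, 3))  # 0.024 (percent of descriptors seen)
```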
\begin{figure} \centering \includegraphics[width=\textwidth]{graphics/probs-by-relay.pdf} -\caption{Calculated probabilities for observing hidden-service activity.} +\caption{Calculated network fractions of relays observing hidden-service activity.} \label{fig:probs-by-relay} \end{figure}
-Figure~\ref{fig:probs-by-relay} shows calculated probabilities of -observing hidden-service activities of relays reporting hidden-service +Figure~\ref{fig:probs-by-relay} shows calculated fractions of +hidden-service activity observed by relays that report hidden-service statistics. -That figure shows that most relays have roughly the same (small) -probability for observing a hidden-service descriptor with only few -outliers. -The probability for being selected as rendezvous point is much smaller for -most relays, with only the outliers having a realistic chance of being +The probability for being selected as rendezvous point is very small for +most relays, with only very few relays having a realistic chance of being selected. +In comparison, most relays have roughly the same (small) probability for +observing a hidden-service descriptor with only a few exceptions.
\section{Removing implausible statistics} +\label{sec:implausible}
A relay that opts in to gathering hidden-service statistics reports them even if it couldn't plausibly have observed them. -In particular, a relay that did not have the \verb+HSDir+ flag could not -have observed a single .onion address, and a relay with the \verb+Exit+ -flag could not have been selected as rendezvous point as long as -\verb+Wmd+ and \verb+Wme+ are zero. +In particular, a relay with the \verb+Exit+ flag could not have been +selected as rendezvous point as long as \verb+Wmd+ and \verb+Wme+ are +zero, and a relay that did not have the \verb+HSDir+ flag could not have +observed a single .onion address. + Figure~\ref{fig:zero} shows distributions of reported statistics of relays -with calculated probabilities of exactly zero. +with calculated fractions of exactly zero. These reported values approximately follow the plotted Laplace distributions with $\mu=0$ and $b=2048/0.3$ or $b=8/0.3$ as defined for -the respective statistics. -We can assume that the vast majority of these reported values are just -noise. -In the following analysis, we exclude relays with calculated probabilities -of exactly 0. +the respective statistics, which gives us confidence that the vast +majority of these reported values are just noise. +In the following analysis, we exclude relays with calculated fractions of +exactly 0.
\begin{figure} \centering @@ -271,38 +323,36 @@ of exactly 0. \caption{Statistics reported by relays with calculated probabilities of observing these statistics of zero. The blue lines show Laplace distributions with $\mu=0$ and $b=2048/0.3$ or -$b=8/0.3$ as defined for the respective statistics.} +$b=8/0.3$ as defined for the respective statistics. +The lowest 1% and highest 1% of values have been removed for display +purposes.} \label{fig:zero} \end{figure}
-Another cause for implausible statistics could be very large positive or -negative noise added by the Laplace distribution. +Another kind of implausible statistics consists of very high or very low +absolute reported numbers. +These numbers could be the result of adding very large positive or +negative numbers from the Laplace distribution. In theory, a single relay, with non-zero probability of observing hidden-service activity, could have added noise from $-\infty$ to -$\infty$, which could derail statistics for the entire day. -These extreme values could be removed by calculating an interval of -plausible values for each relay, based on the probability of observing -hidden-service activity, and discarding values outside that interval. -Another option for avoiding these extreme values would be to cap noise -added at relays by adapting concepts from ($\epsilon,\delta)$-differential -privacy to the noise-generating code used by relays.% -\footnote{Whether or not either of these approaches is necessary depends -on whether or not our extrapolation method can handle outliers.} - -\section{Extrapolating hidden-service traffic in the network} - -We start the extrapolation of network totals with reported cells on -rendezvous circuits. -We do this by summing up all observations per day and dividing by the -total fraction of observations made by all reporting relays. -The underlying assumption of this approach is that reported statistics -grow linearly with calculated fractions. -Figure~\ref{fig:corr-probs-by-relay}~(left) shows that this is roughly -the case. -Figure~\ref{fig:corr-probs-by-day}~(left) shows total reported -statistics and calculated probabilities per day, and -Figure~\ref{fig:extrapolated-network-totals}~(bottom) shows extrapolated -network totals based on daily sums. +$\infty$. +Further, relays could lie about hidden-service usage and report very low +or very high absolute values in their statistics in an attempt to derail +statistics.
+It seems difficult to define a range of plausible values, and such a range +might change over time. +It seems easier to handle these extreme values by treating a certain +fraction of extrapolated statistics as outliers, which is what we're going +to do in Section~\ref{sec:averages}. + +\section{Extrapolating network totals} + +We are now ready to extrapolate network totals from reported statistics. +We do this by dividing reported statistics by the calculated fraction of +observations made by the reporting relay. +The underlying assumption is that statistics grow linearly with calculated +fractions. +Figure~\ref{fig:corr-probs-by-relay} shows that this is roughly the case.
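For the sample relay, this single-relay extrapolation looks as follows, using the `prob_rend_point` value from the sample statistics; the resulting network total is only illustrative, since it comes from one relay on one day.

```python
stat_mbit_s = 3.6        # denoised rendezvous-cell statistic, in Mbit/s
prob_rend = 0.01509326   # this relay's rendezvous-point probability
network_total = stat_mbit_s / prob_rend
print(round(network_total))  # 239 (Mbit/s of rendezvous traffic network-wide)
```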
\begin{figure} \centering @@ -319,32 +369,9 @@ calculated probability for observing such activity.} \label{fig:corr-probs-by-relay} \end{figure}
-\begin{figure} -\centering -\begin{subfigure}{.5\textwidth} -\centering -\includegraphics[width=\textwidth]{graphics/corr-probs-cells-by-day.pdf} -\end{subfigure}% -\begin{subfigure}{.5\textwidth} -\centering -\includegraphics[width=\textwidth]{graphics/corr-probs-onions-by-day.pdf} -\end{subfigure}% -\caption{Correlation between the sum of all reports per day and the sum of -calculated probabilities for observing such activity per day.} -\label{fig:corr-probs-by-day} -\end{figure} - -\begin{figure} -\centering -\includegraphics[width=\textwidth]{graphics/extrapolated-network-totals.pdf} -\caption{Extrapolated network totals.} -\label{fig:extrapolated-network-totals} -\end{figure} - -\section{Estimating unique .onion addresses in the network} - -Estimating the number of .onion addresses in the network is slightly more -difficult. +While we can expect this method to work as described for extrapolating +cells on rendezvous circuits, we need to take another step for estimating +the number of unique .onion addresses in the network. The reason is that a .onion address is not only known to a single relay, but to a couple of relays, all of which include that .onion address in their statistics. @@ -369,49 +396,111 @@ statistics. However, for the subsequent analysis, we assume that neither of these cases affects results substantially.
-Similar to the analysis of hidden-service traffic, we want to compute the -fraction of hidden-service activity that a directory observes, where -hidden-service activity means publication of a hidden-service descriptor. -We define this fraction as the part of descriptor space that the directory -is responsible for, divided by \emph{three}, because each descriptor -published to this descriptor is also published to two other directories. -Note that, without dividing the fraction of a relay's descriptor space by -three, fractions would not add up to 100%. -Figure~\ref{fig:corr-probs-by-relay}~(right) shows the correlation of -reported .onion addresses and fraction of hidden-service activity. - -We can now extrapolate reported unique .onion addresses to network totals: -we sum up all reported statistics for a given day, divide by the fraction -of hidden-service activity that we received statistics for on that day, -and divide the result by twelve, following the assumption from above that -each service publishes its descriptor to twelve hidden-service -directories. -Figure~\ref{fig:corr-probs-by-day}~(right) and -\ref{fig:extrapolated-network-totals}~(top) show results. +We can now extrapolate reported unique .onion addresses to network totals. +Figure~\ref{fig:extrapolated} shows the distributions of extrapolated +network totals for all days in the analysis period.
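A single-relay extrapolation of the .onion statistic can be sketched under the assumption, also used in the second simulation in the evaluation, that each hidden service publishes its descriptor to twelve directories. The fraction below is the sample directory's fraction of descriptors computed earlier.

```python
onions_reported = 84           # denoised .onion statistic from the sample
frac_descriptors = 0.00023781  # this directory's fraction of descriptors
per_directory_total = onions_reported / frac_descriptors
network_total = per_directory_total / 12  # 12 directories per service
print(round(network_total))    # on the order of 30,000 unique .onion addresses
```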
-\section{Simulating extrapolation methods} +\begin{figure} +\centering +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/extrapolated-cells.pdf} +\end{subfigure}% +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/extrapolated-onions.pdf} +\end{subfigure}% +\caption{Distribution of extrapolated network totals for all days in the +analysis period, excluding lowest 1% and highest 1% for display +purposes.} +\label{fig:extrapolated} +\end{figure} + +\section{Selecting daily averages} +\label{sec:averages} + +As the last step in the analysis, we aggregate extrapolated network totals +for a given day to obtain a daily average. +We considered a few options for calculating the average, each of which +has its advantages and drawbacks. + +We started looking at the \emph{weighted mean} of extrapolated network +totals, which is the mean of all values but which uses relay fractions as +weights, so that smaller relays cannot influence the overall result too +much. +This metric is equivalent to summing up all reported statistics and +dividing by the sum of network fractions of reporting relays. +The nice property of this metric is that it considers all statistics +reported by relays on a given day. +But this property is also the biggest disadvantage: single extreme +statistics can affect the overall result. +For example, relays that added very large noise values to their statistics +cannot be filtered out. +The same holds for relays that lie about their statistics. + +Another metric we looked at was the \emph{weighted median}, which also +takes into account that relays contribute different fractions to the +overall statistic. +While this metric is not affected by outliers, basing the daily statistics +on the data from a single relay doesn't seem very robust. + +In the end we decided to pick the \emph{weighted interquartile mean} as +metric for the daily average.
+For this metric we order extrapolated network totals by their value, +discard the lower and the upper quartile by weight, and compute the +weighted mean of the remaining values. +This metric is robust against noisy statistics and lying relays and +considers half of the reported statistics. + +We further define a threshold of 1% for the total fraction of relays +reporting statistics. +If less than 1% of relays report statistics on a given day, we don't +display that day in the end results. +Figure~\ref{fig:probs-by-day} shows total calculated network fractions per +day, and Figure~\ref{fig:extrapolated-network-totals} shows the weighted +interquartile means of the extrapolated network totals per day. + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/probs-by-day.pdf} +\caption{Total calculated network fractions per day.} +\label{fig:probs-by-day} +\end{figure} + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/extrapolated-network-totals.pdf} +\caption{Daily averages of extrapolated network totals, calculated as +weighted interquartile means of extrapolations based on statistics by +single relays.} +\label{fig:extrapolated-network-totals} +\end{figure} + +\section*{Evaluation} + +We conducted two simulations to demonstrate that the extrapolation method +used here delivers approximately correct results and to gain some sense +of confidence in the results if only very few relays report +statistics.
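The daily-average metric from the previous section can be sketched as follows; this is our own illustration of a weighted interquartile mean, not the code used for the report.

```python
def weighted_interquartile_mean(values, weights):
    """Order values, discard the lower and upper quartile by weight,
    and return the weighted mean of the remaining values."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    lower, upper = 0.25 * total, 0.75 * total
    acc = num = den = 0.0
    for v, w in pairs:
        # Keep only the part of this relay's weight that falls inside
        # the middle [lower, upper] weight interval.
        start, end = acc, acc + w
        keep = max(0.0, min(end, upper) - max(start, lower))
        num += v * keep
        den += keep
        acc = end
    return num / den
```

With equal weights this reduces to the ordinary interquartile mean, so a single extreme extrapolation, say from a lying relay, is discarded rather than dragging the daily average.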
-We conducted two simulations to demonstrate that the extrapolation methods -used here deliver approximately correct results. In the first simulation we created a network of 3000 middle relays with consensus weights following an exponential distribution. We then randomly selected relays as rendezvous points and assigned them, -in total, $10^9$ cells containing hidden-service traffic. -Each relay obfuscated its real cell count and reported obfuscated +in total, $10^9$ cells containing hidden-service traffic in chunks with +chunk sizes following an exponential distribution with $\lambda=0.0001$. +Each relay obfuscated its observed cell count and reported obfuscated statistics. Finally, we picked different fractions of reported statistics and extrapolated total cell counts in the network based on these. -Figure~\ref{fig:sim}~(left) shows the median and the 95%~confidence -interval for the extrapolation. -As long as we included at least 1% of relays by consensus weight in the -extrapolation, network totals did not deviate by more than 10% in -positive or negative direction. - We also conducted a second simulation with 3000 hidden-service directories -and 40000 hidden services. -Similar to the first simulation, Figure~\ref{fig:sim}~(right) shows that -our extrapolation is roughly accurate if we include statistics from at -least 1% of hidden-service directories. +and 40000 hidden services, each of them publishing descriptors to 12 +directories. + +Figure~\ref{fig:sim} shows the median and the range between 2.5th and +97.5th percentile for the extrapolation. +As long as we included at least 1% of relays by consensus weight in the +extrapolation, network totals did not deviate by more than 5% in positive +or negative direction.
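A compressed version of the first simulation behaves similarly. This is our own sketch under simplifying assumptions: it omits the Laplace obfuscation step and lets relays report in random order until the requested weight fraction is reached.

```python
import bisect
import random
from itertools import accumulate

def simulate(seed, n_relays=3000, total_cells=10**9, report_fraction=0.01):
    """Assign cells to weight-proportionally chosen rendezvous points in
    Exp(lambda=0.0001)-sized chunks, then extrapolate a network total
    from roughly report_fraction of relays by consensus weight."""
    rng = random.Random(seed)
    weights = [rng.expovariate(1.0) for _ in range(n_relays)]
    cum = list(accumulate(weights))
    total_weight = cum[-1]
    cells = [0] * n_relays
    assigned = 0
    while assigned < total_cells:
        chunk = min(int(rng.expovariate(0.0001)) + 1, total_cells - assigned)
        # Pick a rendezvous point proportional to consensus weight.
        r = bisect.bisect(cum, rng.random() * total_weight)
        cells[r] += chunk
        assigned += chunk
    # Relays report in random order until the weight threshold is reached.
    seen_w = seen_cells = 0
    for i in rng.sample(range(n_relays), n_relays):
        seen_w += weights[i]
        seen_cells += cells[i]
        if seen_w >= report_fraction * total_weight:
            break
    return seen_cells / (seen_w / total_weight)
```

With about 1% of relays by weight reporting, the extrapolated total lands close to the true $10^9$ cells, consistent with the simulation results described above.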
\begin{figure} \centering @@ -423,26 +512,25 @@ least 1% of hidden-service directories. \centering \includegraphics[width=\textwidth]{graphics/sim-onions.pdf} \end{subfigure}% -\caption{Median and confidence interval of simulated extrapolations.} +\caption{Median and range from 2.5th to 97.5th percentile of simulated +extrapolations.} \label{fig:sim} \end{figure}
-\section{Open questions} +\section*{Conclusion}
-\begin{itemize} -\item Maybe we should switch back to the first extrapolation method, where -we're extrapolating from single observations, and then take the weighted -mean as best extrapolation result. -This has some advantages for handling outliers. -We'll want to run new simulations using this method. -\item The ribbon in Figure~\ref{fig:extrapolated-network-totals} implies a -confidence interval of some sort, but it's really only the standard error -of the local regression algorithm added by the graphing software. -We should instead calculate the confidence interval of our extrapolation, -similar to the simulation, and graph that. -One option might be to run simulations as part of the extrapolation -process. -\end{itemize} +In this report we described a method for extrapolating network totals from +the two recently added hidden-service statistics. +We showed that we can extrapolate network totals with reasonable accuracy +as long as at least 1% of relays report these statistics. + +\section*{Acknowledgements} + +Thanks to Aaron Johnson for providing invaluable feedback on extrapolating +statistics and on running simulations. +Thanks to the relay operators who enabled the new hidden-service +statistics on their relays and provided us with the data to write this +report.
\end{document}