commit 6511b8b34d30f63c370e369458149b054b423934
Author: Karsten Loesing <karsten.loesing(a)gmx.net>
Date: Sat Jan 17 19:21:49 2015 +0100
Add hidserv-stats extrapolation tech report draft.
---
2015/extrapolating-hidserv-stats/.gitignore | 2 +
.../extrapolating-hidserv-stats.tex | 424 ++++++++++++++++++++
2015/extrapolating-hidserv-stats/tortechrep.cls | 1 +
3 files changed, 427 insertions(+)
diff --git a/2015/extrapolating-hidserv-stats/.gitignore b/2015/extrapolating-hidserv-stats/.gitignore
new file mode 100644
index 0000000..75d2c50
--- /dev/null
+++ b/2015/extrapolating-hidserv-stats/.gitignore
@@ -0,0 +1,2 @@
+extrapolating-hidserv-stats.pdf
+
diff --git a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex
new file mode 100644
index 0000000..60507a0
--- /dev/null
+++ b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex
@@ -0,0 +1,424 @@
+\documentclass{tortechrep}
+\usepackage{graphicx}
+\usepackage{subcaption}
+\usepackage{url}
+\begin{document}
+
+\title{Extrapolating network totals from hidden-service statistics}
+
+\author{yet unnamed authors}
+
+\reportid{DRAFT}
+\date{to be published in January 2015}
+
+\maketitle
+
+\begin{abstract}
+Starting on December 19, 2014, we added two new statistics to the Tor
+software that shall give us some first insights into hidden-service usage.
+The first is the number of .onion addresses observed by a hidden-service
+directory, and the second is the number of cells on rendezvous circuits
+observed by a rendezvous point.
+Each relay that opts in to reporting these statistics publishes these two
+numbers for 24-hour intervals of operation.
+In the following, we explain our approach for extrapolating network totals
+from these statistics.
+The goal is to learn how many unique .onion addresses exist in the network
+and what amount of traffic can be attributed to hidden-service usage.
+We show that we can extrapolate network totals from hidden-service
+statistics with reasonable accuracy as long as at least 1\% of relays
+report these statistics.
+\end{abstract}
+
+\section{Parsing reported statistics}
+
+There are two types of documents produced by Tor relays that we consider
+in our analysis.
+The first are extra-info descriptors that contain hidden-service
+statistics if a relay opts in to reporting them.
+The second are consensuses that indicate what fraction of hidden-service
+descriptors a hidden-service directory has observed and what fraction of
+rendezvous circuits a relay has handled.
+
+% SAMPLE:
+% fingerprint F528DED21EACD2E4E9301EC0AABD370EDCAD2C47
+% stats_start 2014-12-31 16:17:33
+% stats_end 2015-01-01 16:17:33
+% hidserv_rend_relayed_cells 152599040
+% hidserv_dir_onions_seen 84
+% prob_rend_point 0.01509326
+% frac_hsdesc 0.00069757
+
+\begin{figure}[b]
+\begin{verbatim}
+extra-info ryroConoha F528DED21EACD2E4E9301EC0AABD370EDCAD2C47
+[...]
+hidserv-stats-end 2015-01-01 16:17:33 (86400 s)
+hidserv-rend-relayed-cells 152599508 delta_f=2048 epsilon=0.30 bin_size=1024
+hidserv-dir-onions-seen 91 delta_f=8 epsilon=0.30 bin_size=8
+\end{verbatim}
+\caption{Sample hidden-service statistics contained in an extra-info
+descriptor.}
+\label{fig:extrainfo}
+\end{figure}
+
+We start by describing how we're parsing and processing hidden-service
+statistics from extra-info descriptors.
+Figure~\ref{fig:extrainfo} shows a sample of hidden-service statistics as
+contained in extra-info descriptors.
+The relevant parts for this analysis are:
+
+\begin{itemize}
+\item The \verb+extra-info+ line tells us which relay reported these
+statistics, which we need to know to derive what fraction of
+hidden-service activity this relay has observed.
+\item The \verb+hidserv-stats-end+ line tells us when the statistics
+interval ended, and, together with the interval length, when it started.
+\item The \verb+hidserv-rend-relayed-cells+ line tells us the number of
+cells that the relay handled on rendezvous circuits, and it tells us how
+this number has been obfuscated by the relay.
+The value for \verb+bin_size+ is the bin size used for rounding up the
+originally observed cell number, and the values for \verb+delta_f+ and
+\verb+epsilon+ are inputs for the additive noise following a Laplace
+distribution.
+\item And finally, the \verb+hidserv-dir-onions-seen+ line tells us the
+number of .onion addresses that the relay observed in published
+hidden-service descriptors in its role as hidden-service directory.
+\end{itemize}
+
+\begin{figure}
+\centering
+\includegraphics[width=\textwidth]{graphics/num-reported-stats.pdf}
+\caption{Number of relays reporting hidden-service statistics.}
+\label{fig:num-reported-stats}
+\end{figure}
+
+Figure~\ref{fig:num-reported-stats} shows the number of statistics
+reported by day.
+
+\section{Removing previously added noise}
+
+When processing hidden-service statistics, we need to handle the fact that
+they have been obfuscated by relays.
+As first step, we're attempting to remove the additive Laplace-distributed
+noise by rounding up or down to the nearest multiple of \verb+bin_size+.
+The idea is that it's most likely that noise was added to the closest
+right side of a bin than to the right side of another bin.
+In step two, we're subtracting half of \verb+bin_size+, because the relay
+added between 0 and \verb+bin_size+$ - 1$ to the originally observed
+value.
+
+Following these steps, the statistics reported in
+Figure~\ref{fig:extrainfo} are processed to 152599040~cells and 84~.onion
+addresses.
+For the subsequent analysis we're also converting cells/day to
+bytes/second by multiplying cell numbers with 512~bytes/cell, dividing by
+86400~seconds/day, and dividing by 2 to account for the fact that
+statistics include cells in both incoming and outgoing direction.
+As a result we obtain 452~KB/s in the given sample.
+
+Figure~\ref{fig:stats-by-day} shows parsed values after removing
+previously added noise.
+Negative values are the result of relays adding negative
+Laplace-distributed noise values to very small observed values.
+We will describe an attempt to remove such values shortly.
+
+\begin{figure}
+\centering
+\includegraphics[width=\textwidth]{graphics/stats-by-day.pdf}
+\caption{Hidden-service statistics reported by single relays after
+removing noise that was added by relays.
+Negative values are the result of relays adding negative
+Laplace-distributed noise values to very small observed values.}
+\label{fig:stats-by-day}
+\end{figure}
+
+\section{Deriving network fractions from consensuses}
+
+The second document type that we consider in our analysis are consensuses.
+Not all hidden-service directories observe the same number of
+hidden-service descriptors, and the probability of chosing a relay as
+rendezvous point is even less uniformly distributed.
+Fortunately, we can derive what fraction of descriptors a directory was
+responsible for and what fraction of rendezvous circuits a relay has
+handled.
+
+\begin{figure}
+\begin{verbatim}
+valid-after 2015-01-01 12:00:00
+[...]
+r adc 9PodlaV/b092ts4wpVLjq5d6mOs [...]
+s Fast HSDir Running Stable V2Dir Valid
+[...]
+r AlongProxy 9QtBPk7CqqYP7WHNJJfgSVpJcyo [...]
+s Fast HSDir Running Stable V2Dir Valid
+[...]
+r horizons3 9ScwmKcR+EXl0aJPnTj5O4ag8iA [...]
+s Fast HSDir Running Stable V2Dir Valid
+[...]
+r ryroConoha 9Sje0h6s0uTpMB7Aqr03DtytLEc [...]
+s Fast HSDir Running Stable V2Dir Valid
+v Tor 0.2.6.1-alpha-dev
+w Bandwidth=117000
+p reject 1-65535
+[...]
+bandwidth-weights [...] Wmd=0 Wme=0 Wmg=3701 Wmm=10000
+\end{verbatim}
+\caption{Sample consensus entries of relay \texttt{ryroConoha} that
+reports hidden-service statistics and of the three hidden-service
+directories preceding it.}
+\label{fig:consensusentry}
+\end{figure}
+
+Figure~\ref{fig:consensusentry} shows the consensus entry of the relay
+that submitted the sample hidden-service statistics mentioned above.
+
+The first fraction that we can derive from this entry is the fraction of
+descriptor space that this relay was responsible for in its role as
+hidden-service directory.
+The Tor Rendezvous
+Specification\footnote{\url{https://gitweb.torproject.org/torspec.git/tree/rend-spec.txt}}
+contains the following definition that is relevant here:
+
+\begin{quote}
+\textit{A hidden service directory is deemed responsible for a descriptor
+ID if it has the HSDir flag and its identity digest is one of the first
+three identity digests of HSDir relays following the descriptor ID in a
+circular list.}
+\end{quote}
+
+In the sample consensus entry, we'd extract the base64-encoded fingerprint
+of the statistics-reporting relay, \verb+9Sje0h6...+, and the fingerprint
+of the hidden-service directory that precedes the relay by three
+positions, \verb+9PodlaV...+, and compute what fraction of descriptor
+space that is, in this case $0.07\%$.
+
+The second fraction that we compute is the probability of a relay to be
+selected as rendezvous point.
+Clients select only relays with the \verb+Fast+ and in some cases the
+\verb+Stable+ flag, and they weigh relays differently based on their
+bandwidth and depending on whether they have the \verb+Exit+ and/or
+\verb+Guard+ flags.
+(Clients further require relays to have the \verb+Stable+ flag if they
+attempt to establish a long-running connection, e.g., to a hidden SSH
+server, but in the following analysis, we assume that most clients
+establish connections that don't need to last for long, e.g., to a hidden
+webserver.)
+Clients weigh the bandwidth value contained in the consensus with the
+value of \verb+Wmg+, \verb+Wme+, \verb+Wmd+, or \verb+Wmm+, depending on
+whether the relay has only the \verb+Guard+ flag, only the \verb+Exit+
+flag, both such flags, or neither of them.
+
+Our sample relay has the \verb+Fast+ flag, a bandwidth value of 117,000,
+and neither \verb+Guard+ nor \verb+Exit+ flag.
+Its probability for being selected as rendezvous point is calculated as
+$117000 \times 10000/10000$ divided by the sum of all such weights in the
+consensus, in this case $1.42\%$
+
+\begin{figure}
+\centering
+\includegraphics[width=\textwidth]{graphics/probs-by-relay.pdf}
+\caption{Calculated probabilities for observing hidden-service activity.}
+\label{fig:probs-by-relay}
+\end{figure}
+
+Figure~\ref{fig:probs-by-relay} shows calculated probabilities of
+observing hidden-service activities of relays reporting hidden-service
+statistics.
+That figure shows that most relays have roughly the same (small)
+probability for observing a hidden-service descriptor with only few
+outliers.
+The probability for being selected as rendezvous point is much smaller for
+most relays, with only the outliers having a realistic chance of being
+selected.
+
+\section{Removing implausible statistics}
+
+A relay that opts in to gathering hidden-service statistics reports them
+even if it couldn't plausibly have observed them.
+In particular, a relay that did not have the \verb+HSDir+ flag could not
+have observed a single .onion address, and a relay with the \verb+Exit+
+flag could not have been selected as rendezvous point as long as
+\verb+Wmd+ and \verb+Wme+ are zero.
+Figure~\ref{fig:zero} shows distributions of reported statistics of relays
+with calculated probabilities of exactly zero.
+These reported values approximately follow the plotted Laplace
+distributions with $\mu=0$ and $b=2048/0.3$ or $b=8/0.3$ as defined for
+the respective statistics.
+We can assume that the vast majority of these reported values are just
+noise.
+In the following analysis, we exclude relays with calculated probabilities
+of exactly 0.
+
+\begin{figure}
+\centering
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/zero-prob-cells.pdf}
+\end{subfigure}%
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/zero-prob-onions.pdf}
+\end{subfigure}
+\caption{Statistics reported by relays with calculated probabilities of
+observing these statistics of zero.
+The blue lines show Laplace distributions with $\mu=0$ and $b=2048/0.3$ or
+$b=8/0.3$ as defined for the respective statistics.}
+\label{fig:zero}
+\end{figure}
+
+\section{Extrapolating hidden-service traffic in the network}
+
+We start the extrapolation of network totals with reported cells on
+rendezvous circuits.
+We do this by summing up all observations per day and dividing by the
+total fraction of observations made by all reporting relays.
+The underlying assumption of this approach is that reported statistics
+grow linearly with calculated fractions.
+Figure~\ref{fig:corr-probs-by-relay}~(left) shows that this is roughly
+the case.
+Figure~\ref{fig:corr-probs-by-day}~(left) shows total reported
+statistics and calculated probabilities per day, and
+Figure~\ref{fig:extrapolated-network-totals}~(bottom) shows extrapolated
+network totals based on daily sums.
+
+\begin{figure}
+\centering
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/corr-probs-cells-by-relay.pdf}
+\end{subfigure}%
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/corr-probs-onions-by-relay.pdf}
+\end{subfigure}%
+\caption{Correlation between reported hidden-service activity and
+calculated probability for observing such activity.}
+\label{fig:corr-probs-by-relay}
+\end{figure}
+
+\begin{figure}
+\centering
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/corr-probs-cells-by-day.pdf}
+\end{subfigure}%
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/corr-probs-onions-by-day.pdf}
+\end{subfigure}%
+\caption{Correlation between the sum of all reports per day and the sum of
+calculated probabilities for observing such activity per day.}
+\label{fig:corr-probs-by-day}
+\end{figure}
+
+\begin{figure}
+\centering
+\includegraphics[width=\textwidth]{graphics/extrapolated-network-totals.pdf}
+\caption{Extrapolated network totals.}
+\label{fig:extrapolated-network-totals}
+\end{figure}
+
+\section{Estimating unique .onion addresses in the network}
+
+Estimating the number of .onion addresses in the network is slightly more
+difficult.
+The reason is that a .onion address is not only known to a single relay,
+but to a couple of relays, all of which include that .onion address in
+their statistics.
+We need to subtract out the multiple counting of .onion addresses to come
+up with a network-wide number of unique .onion addresses.
+
+As an approximation, we assume that a hidden service publishes its
+descriptor to \emph{twelve} directories over a 24-hour period:
+the service stores \emph{two} replicas per descriptor using different
+descriptor identifiers, both descriptor replicas get stored to
+\emph{three} different hidden-service directories each, and the service
+changes descriptor identifiers once every 24 hours which leads to
+\emph{two} different descriptor identifiers per replica.
+
+To be clear, this approximation is not entirely accurate.
+For example, the two replicas or the descriptors with changed descriptor
+identifiers could have been stored to the same directory.
+As another example, hidden service directories might have joined or left
+the network and other directories might have become responsible for
+storing a descriptor which also include that .onion address in their
+statistics.
+However, for the subsequent analysis, we assume that neither of these
+cases affects results substantially.
+
+Similar to the analysis of hidden-service traffic, we want to compute the
+fraction of hidden-service activity that a directory observes, where
+hidden-service activity means publication of a hidden-service descriptor.
+We define this fraction as the part of descriptor space that the directory
+is responsible for, divided by \emph{three}, because each descriptor
+published to this descriptor is also published to two other directories.
+Figure~\ref{fig:corr-probs-by-relay}~(right) shows the correlation of
+reported .onion addresses and fraction of hidden-service activity.
+
+We can now extrapolate reported unique .onion addresses to network totals:
+we sum up all reported statistics for a given day, divide by the fraction
+of hidden-service activity that we received statistics for on that day,
+and divide the result by twelve, following the assumption from above that
+each service publishes its descriptor to twelve hidden-service
+directories.
+Figure~\ref{fig:corr-probs-by-day}~(right) and
+\ref{fig:extrapolated-network-totals}~(top) show results.
+
+\section{Simulating extrapolation methods}
+
+We conducted two simulations to demonstrate that the extrapolation methods
+used here deliver approximately correct results.
+In the first simulation we created a network of 3000 middle relays with
+consensus weights following an exponential distribution.
+We then randomly selected relays as rendezvous points and assigned them,
+in total, $10^9$ cells containing hidden-service traffic.
+Each relay obfuscated its real cell count and reported obfuscated
+statistics.
+Finally, we picked different fractions of reported statistics and
+extrapolated total cell counts in the network based on these.
+Figure~\ref{fig:sim}~(left) shows the median and the 95\%~confidence
+interval for the extrapolation.
+As long as we included at least 1\% of relays by consensus weight in the
+extrapolation, network totals did not deviate by more than 10\% in
+positive or negative direction.
+
+We also conducted a second simulation with 3000 hidden-service directories
+and 40000 hidden services.
+Similar to the first simulation, Figure~\ref{fig:sim}~(right) shows that
+our extrapolation is roughly accurate if we include statistics from at
+least 1\% of hidden-service directories.
+
+\begin{figure}
+\centering
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/sim-cells.pdf}
+\end{subfigure}%
+\begin{subfigure}{.5\textwidth}
+\centering
+\includegraphics[width=\textwidth]{graphics/sim-onions.pdf}
+\end{subfigure}%
+\caption{Median and confidence interval of simulated extrapolations.}
+\label{fig:sim}
+\end{figure}
+
+\section{Open questions}
+
+\begin{itemize}
+\item The Laplace noise added by a single relay may range from $-\infty$
+to $\infty$ and therefore possibly derail statistics for the entire day.
+Maybe, as part of removing implausible statistics, we should calculate the
+ratio between reported value and calculated probability (see
+Figure~\ref{fig:corr-probs-by-relay}) and exclude any outliers before the
+extrapolation step.
+\item The ribbon in Figure~\ref{fig:extrapolated-network-totals} implies a
+confidence interval of some sort, but it's really only the standard error
+of the local regression algorithm added by the graphing software.
+We should instead calculate the confidence interval of our extrapolation,
+similar to the simulation, and graph that.
+But how do we calculate that?
+\end{itemize}
+
+\end{document}
+
diff --git a/2015/extrapolating-hidserv-stats/tortechrep.cls b/2015/extrapolating-hidserv-stats/tortechrep.cls
new file mode 120000
index 0000000..4c24db2
--- /dev/null
+++ b/2015/extrapolating-hidserv-stats/tortechrep.cls
@@ -0,0 +1 @@
+../../tortechrep.cls
\ No newline at end of file