commit 6511b8b34d30f63c370e369458149b054b423934 Author: Karsten Loesing karsten.loesing@gmx.net Date: Sat Jan 17 19:21:49 2015 +0100
Add hidserv-stats extrapolation tech report draft. --- 2015/extrapolating-hidserv-stats/.gitignore | 2 + .../extrapolating-hidserv-stats.tex | 424 ++++++++++++++++++++ 2015/extrapolating-hidserv-stats/tortechrep.cls | 1 + 3 files changed, 427 insertions(+)
diff --git a/2015/extrapolating-hidserv-stats/.gitignore b/2015/extrapolating-hidserv-stats/.gitignore new file mode 100644 index 0000000..75d2c50 --- /dev/null +++ b/2015/extrapolating-hidserv-stats/.gitignore @@ -0,0 +1,2 @@ +extrapolating-hidserv-stats.pdf + diff --git a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex new file mode 100644 index 0000000..60507a0 --- /dev/null +++ b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex @@ -0,0 +1,424 @@ +\documentclass{tortechrep} +\usepackage{graphicx} +\usepackage{subcaption} +\usepackage{url} +\begin{document} + +\title{Extrapolating network totals from hidden-service statistics} + +\author{yet unnamed authors} + +\reportid{DRAFT} +\date{to be published in January 2015} + +\maketitle + +\begin{abstract} +Starting on December 19, 2014, we added two new statistics to the Tor +software that shall give us some first insights into hidden-service usage. +The first is the number of .onion addresses observed by a hidden-service +directory, and the second is the number of cells on rendezvous circuits +observed by a rendezvous point. +Each relay that opts in to reporting these statistics publishes these two +numbers for 24-hour intervals of operation. +In the following, we explain our approach for extrapolating network totals +from these statistics. +The goal is to learn how many unique .onion addresses exist in the network +and what amount of traffic can be attributed to hidden-service usage. +We show that we can extrapolate network totals from hidden-service +statistics with reasonable accuracy as long as at least 1% of relays +report these statistics. +\end{abstract} + +\section{Parsing reported statistics} + +There are two types of documents produced by Tor relays that we consider +in our analysis. +The first are extra-info descriptors that contain hidden-service +statistics if a relay opts in to reporting them. +The second are consensuses that indicate what fraction of hidden-service +descriptors a hidden-service directory has observed and what fraction of +rendezvous circuits a relay has handled. + +% SAMPLE: +% fingerprint F528DED21EACD2E4E9301EC0AABD370EDCAD2C47 +% stats_start 2014-12-31 16:17:33 +% stats_end 2015-01-01 16:17:33 +% hidserv_rend_relayed_cells 152599040 +% hidserv_dir_onions_seen 84 +% prob_rend_point 0.01509326 +% frac_hsdesc 0.00069757 + +\begin{figure}[b] +\begin{verbatim} +extra-info ryroConoha F528DED21EACD2E4E9301EC0AABD370EDCAD2C47 +[...] +hidserv-stats-end 2015-01-01 16:17:33 (86400 s) +hidserv-rend-relayed-cells 152599508 delta_f=2048 epsilon=0.30 bin_size=1024 +hidserv-dir-onions-seen 91 delta_f=8 epsilon=0.30 bin_size=8 +\end{verbatim} +\caption{Sample hidden-service statistics contained in an extra-info +descriptor.} +\label{fig:extrainfo} +\end{figure} + +We start by describing how we're parsing and processing hidden-service +statistics from extra-info descriptors. +Figure~\ref{fig:extrainfo} shows a sample of hidden-service statistics as +contained in extra-info descriptors. +The relevant parts for this analysis are: + +\begin{itemize} +\item The \verb+extra-info+ line tells us which relay reported these +statistics, which we need to know to derive what fraction of +hidden-service activity this relay has observed. +\item The \verb+hidserv-stats-end+ line tells us when the statistics +interval ended, and, together with the interval length, when it started. +\item The \verb+hidserv-rend-relayed-cells+ line tells us the number of +cells that the relay handled on rendezvous circuits, and it tells us how +this number has been obfuscated by the relay. +The value for \verb+bin_size+ is the bin size used for rounding up the +originally observed cell number, and the values for \verb+delta_f+ and +\verb+epsilon+ are inputs for the additive noise following a Laplace +distribution. +\item And finally, the \verb+hidserv-dir-onions-seen+ line tells us the +number of .onion addresses that the relay observed in published +hidden-service descriptors in its role as hidden-service directory. +\end{itemize} + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/num-reported-stats.pdf} +\caption{Number of relays reporting hidden-service statistics.} +\label{fig:num-reported-stats} +\end{figure} + +Figure~\ref{fig:num-reported-stats} shows the number of statistics +reported by day. + +\section{Removing previously added noise} + +When processing hidden-service statistics, we need to handle the fact that +they have been obfuscated by relays. +As first step, we're attempting to remove the additive Laplace-distributed +noise by rounding up or down to the nearest multiple of \verb+bin_size+. +The idea is that it's most likely that noise was added to the closest +right side of a bin than to the right side of another bin. +In step two, we're subtracting half of \verb+bin_size+, because the relay +added between 0 and \verb+bin_size+$ - 1$ to the originally observed +value. + +Following these steps, the statistics reported in +Figure~\ref{fig:extrainfo} are processed to 152599040~cells and 84~.onion +addresses. +For the subsequent analysis we're also converting cells/day to +bytes/second by multiplying cell numbers with 512~bytes/cell, dividing by +86400~seconds/day, and dividing by 2 to account for the fact that +statistics include cells in both incoming and outgoing direction. +As a result we obtain 452~KB/s in the given sample. + +Figure~\ref{fig:stats-by-day} shows parsed values after removing +previously added noise. +Negative values are the result of relays adding negative +Laplace-distributed noise values to very small observed values. +We will describe an attempt to remove such values shortly. + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/stats-by-day.pdf} +\caption{Hidden-service statistics reported by single relays after +removing noise that was added by relays. +Negative values are the result of relays adding negative +Laplace-distributed noise values to very small observed values.} +\label{fig:stats-by-day} +\end{figure} + +\section{Deriving network fractions from consensuses} + +The second document type that we consider in our analysis are consensuses. +Not all hidden-service directories observe the same number of +hidden-service descriptors, and the probability of chosing a relay as +rendezvous point is even less uniformly distributed. +Fortunately, we can derive what fraction of descriptors a directory was +responsible for and what fraction of rendezvous circuits a relay has +handled. + +\begin{figure} +\begin{verbatim} +valid-after 2015-01-01 12:00:00 +[...] +r adc 9PodlaV/b092ts4wpVLjq5d6mOs [...] +s Fast HSDir Running Stable V2Dir Valid +[...] +r AlongProxy 9QtBPk7CqqYP7WHNJJfgSVpJcyo [...] +s Fast HSDir Running Stable V2Dir Valid +[...] +r horizons3 9ScwmKcR+EXl0aJPnTj5O4ag8iA [...] +s Fast HSDir Running Stable V2Dir Valid +[...] +r ryroConoha 9Sje0h6s0uTpMB7Aqr03DtytLEc [...] +s Fast HSDir Running Stable V2Dir Valid +v Tor 0.2.6.1-alpha-dev +w Bandwidth=117000 +p reject 1-65535 +[...] +bandwidth-weights [...] Wmd=0 Wme=0 Wmg=3701 Wmm=10000 +\end{verbatim} +\caption{Sample consensus entries of relay \texttt{ryroConoha} that +reports hidden-service statistics and of the three hidden-service +directories preceding it.} +\label{fig:consensusentry} +\end{figure} + +Figure~\ref{fig:consensusentry} shows the consensus entry of the relay +that submitted the sample hidden-service statistics mentioned above. + +The first fraction that we can derive from this entry is the fraction of +descriptor space that this relay was responsible for in its role as +hidden-service directory. +The Tor Rendezvous +Specification\footnote{\url{https://gitweb.torproject.org/torspec.git/tree/rend-spec.txt%7D%7D +contains the following definition that is relevant here: + +\begin{quote} +\textit{A hidden service directory is deemed responsible for a descriptor +ID if it has the HSDir flag and its identity digest is one of the first +three identity digests of HSDir relays following the descriptor ID in a +circular list.} +\end{quote} + +In the sample consensus entry, we'd extract the base64-encoded fingerprint +of the statistics-reporting relay, \verb+9Sje0h6...+, and the fingerprint +of the hidden-service directory that precedes the relay by three +positions, \verb+9PodlaV...+, and compute what fraction of descriptor +space that is, in this case $0.07%$. + +The second fraction that we compute is the probability of a relay to be +selected as rendezvous point. +Clients select only relays with the \verb+Fast+ and in some cases the +\verb+Stable+ flag, and they weigh relays differently based on their +bandwidth and depending on whether they have the \verb+Exit+ and/or +\verb+Guard+ flags. +(Clients further require relays to have the \verb+Stable+ flag if they +attempt to establish a long-running connection, e.g., to a hidden SSH +server, but in the following analysis, we assume that most clients +establish connections that don't need to last for long, e.g., to a hidden +webserver.) +Clients weigh the bandwidth value contained in the consensus with the +value of \verb+Wmg+, \verb+Wme+, \verb+Wmd+, or \verb+Wmm+, depending on +whether the relay has only the \verb+Guard+ flag, only the \verb+Exit+ +flag, both such flags, or neither of them. + +Our sample relay has the \verb+Fast+ flag, a bandwidth value of 117,000, +and neither \verb+Guard+ nor \verb+Exit+ flag. +Its probability for being selected as rendezvous point is calculated as +$117000 \times 10000/10000$ divided by the sum of all such weights in the +consensus, in this case $1.42%$ + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/probs-by-relay.pdf} +\caption{Calculated probabilities for observing hidden-service activity.} +\label{fig:probs-by-relay} +\end{figure} + +Figure~\ref{fig:probs-by-relay} shows calculated probabilities of +observing hidden-service activities of relays reporting hidden-service +statistics. +That figure shows that most relays have roughly the same (small) +probability for observing a hidden-service descriptor with only few +outliers. +The probability for being selected as rendezvous point is much smaller for +most relays, with only the outliers having a realistic chance of being +selected. + +\section{Removing implausible statistics} + +A relay that opts in to gathering hidden-service statistics reports them +even if it couldn't plausibly have observed them. +In particular, a relay that did not have the \verb+HSDir+ flag could not +have observed a single .onion address, and a relay with the \verb+Exit+ +flag could not have been selected as rendezvous point as long as +\verb+Wmd+ and \verb+Wme+ are zero. +Figure~\ref{fig:zero} shows distributions of reported statistics of relays +with calculated probabilities of exactly zero. +These reported values approximately follow the plotted Laplace +distributions with $\mu=0$ and $b=2048/0.3$ or $b=8/0.3$ as defined for +the respective statistics. +We can assume that the vast majority of these reported values are just +noise. +In the following analysis, we exclude relays with calculated probabilities +of exactly 0. + +\begin{figure} +\centering +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/zero-prob-cells.pdf} +\end{subfigure}% +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/zero-prob-onions.pdf} +\end{subfigure} +\caption{Statistics reported by relays with calculated probabilities of +observing these statistics of zero. +The blue lines show Laplace distributions with $\mu=0$ and $b=2048/0.3$ or +$b=8/0.3$ as defined for the respective statistics.} +\label{fig:zero} +\end{figure} + +\section{Extrapolating hidden-service traffic in the network} + +We start the extrapolation of network totals with reported cells on +rendezvous circuits. +We do this by summing up all observations per day and dividing by the +total fraction of observations made by all reporting relays. +The underlying assumption of this approach is that reported statistics +grow linearly with calculated fractions. +Figure~\ref{fig:corr-probs-by-relay}~(left) shows that this is roughly +the case. +Figure~\ref{fig:corr-probs-by-day}~(left) shows total reported +statistics and calculated probabilities per day, and +Figure~\ref{fig:extrapolated-network-totals}~(bottom) shows extrapolated +network totals based on daily sums. + +\begin{figure} +\centering +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/corr-probs-cells-by-relay.pdf} +\end{subfigure}% +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/corr-probs-onions-by-relay.pdf} +\end{subfigure}% +\caption{Correlation between reported hidden-service activity and +calculated probability for observing such activity.} +\label{fig:corr-probs-by-relay} +\end{figure} + +\begin{figure} +\centering +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/corr-probs-cells-by-day.pdf} +\end{subfigure}% +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/corr-probs-onions-by-day.pdf} +\end{subfigure}% +\caption{Correlation between the sum of all reports per day and the sum of +calculated probabilities for observing such activity per day.} +\label{fig:corr-probs-by-day} +\end{figure} + +\begin{figure} +\centering +\includegraphics[width=\textwidth]{graphics/extrapolated-network-totals.pdf} +\caption{Extrapolated network totals.} +\label{fig:extrapolated-network-totals} +\end{figure} + +\section{Estimating unique .onion addresses in the network} + +Estimating the number of .onion addresses in the network is slightly more +difficult. +The reason is that a .onion address is not only known to a single relay, +but to a couple of relays, all of which include that .onion address in +their statistics. +We need to subtract out the multiple counting of .onion addresses to come +up with a network-wide number of unique .onion addresses. + +As an approximation, we assume that a hidden service publishes its +descriptor to \emph{twelve} directories over a 24-hour period: +the service stores \emph{two} replicas per descriptor using different +descriptor identifiers, both descriptor replicas get stored to +\emph{three} different hidden-service directories each, and the service +changes descriptor identifiers once every 24 hours which leads to +\emph{two} different descriptor identifiers per replica. + +To be clear, this approximation is not entirely accurate. +For example, the two replicas or the descriptors with changed descriptor +identifiers could have been stored to the same directory. +As another example, hidden service directories might have joined or left +the network and other directories might have become responsible for +storing a descriptor which also include that .onion address in their +statistics. +However, for the subsequent analysis, we assume that neither of these +cases affects results substantially. + +Similar to the analysis of hidden-service traffic, we want to compute the +fraction of hidden-service activity that a directory observes, where +hidden-service activity means publication of a hidden-service descriptor. +We define this fraction as the part of descriptor space that the directory +is responsible for, divided by \emph{three}, because each descriptor +published to this descriptor is also published to two other directories. +Figure~\ref{fig:corr-probs-by-relay}~(right) shows the correlation of +reported .onion addresses and fraction of hidden-service activity. + +We can now extrapolate reported unique .onion addresses to network totals: +we sum up all reported statistics for a given day, divide by the fraction +of hidden-service activity that we received statistics for on that day, +and divide the result by twelve, following the assumption from above that +each service publishes its descriptor to twelve hidden-service +directories. +Figure~\ref{fig:corr-probs-by-day}~(right) and +\ref{fig:extrapolated-network-totals}~(top) show results. + +\section{Simulating extrapolation methods} + +We conducted two simulations to demonstrate that the extrapolation methods +used here deliver approximately correct results. +In the first simulation we created a network of 3000 middle relays with +consensus weights following an exponential distribution. +We then randomly selected relays as rendezvous points and assigned them, +in total, $10^9$ cells containing hidden-service traffic. +Each relay obfuscated its real cell count and reported obfuscated +statistics. +Finally, we picked different fractions of reported statistics and +extrapolated total cell counts in the network based on these. +Figure~\ref{fig:sim}~(left) shows the median and the 95%~confidence +interval for the extrapolation. +As long as we included at least 1% of relays by consensus weight in the +extrapolation, network totals did not deviate by more than 10% in +positive or negative direction. + +We also conducted a second simulation with 3000 hidden-service directories +and 40000 hidden services. +Similar to the first simulation, Figure~\ref{fig:sim}~(right) shows that +our extrapolation is roughly accurate if we include statistics from at +least 1% of hidden-service directories. + +\begin{figure} +\centering +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/sim-cells.pdf} +\end{subfigure}% +\begin{subfigure}{.5\textwidth} +\centering +\includegraphics[width=\textwidth]{graphics/sim-onions.pdf} +\end{subfigure}% +\caption{Median and confidence interval of simulated extrapolations.} +\label{fig:sim} +\end{figure} + +\section{Open questions} + +\begin{itemize} +\item The Laplace noise added by a single relay may range from $-\infty$ +to $\infty$ and therefore possibly derail statistics for the entire day. +Maybe, as part of removing implausible statistics, we should calculate the +ratio between reported value and calculated probability (see +Figure~\ref{fig:corr-probs-by-relay}) and exclude any outliers before the +extrapolation step. +\item The ribbon in Figure~\ref{fig:extrapolated-network-totals} implies a +confidence interval of some sort, but it's really only the standard error +of the local regression algorithm added by the graphing software. +We should instead calculate the confidence interval of our extrapolation, +similar to the simulation, and graph that. +But how do we calculate that? +\end{itemize} + +\end{document} + diff --git a/2015/extrapolating-hidserv-stats/tortechrep.cls b/2015/extrapolating-hidserv-stats/tortechrep.cls new file mode 120000 index 0000000..4c24db2 --- /dev/null +++ b/2015/extrapolating-hidserv-stats/tortechrep.cls @@ -0,0 +1 @@ +../../tortechrep.cls \ No newline at end of file