commit 0a077f97a1cfb4a1467e8044debec45b32c03825
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Thu Jan 22 17:45:28 2015 +0100

    Incorporate feedback from Aaron.
---
 .../extrapolating-hidserv-stats.tex | 38 ++++++++++++++++----
 1 file changed, 31 insertions(+), 7 deletions(-)
diff --git a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex
index 60507a0..053a081 100644
--- a/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex
+++ b/2015/extrapolating-hidserv-stats/extrapolating-hidserv-stats.tex
@@ -122,6 +122,14 @@ previously added noise.
 Negative values are the result of relays adding negative
 Laplace-distributed noise values to very small observed values.
 We will describe an attempt to remove such values shortly.
+\footnote{A plausible step three in the previously described process could
+have been to round negative values to 0, because that represents the most
+likely rounded value before Laplace noise was added.
+However, removing negative values would add bias to the result, because it
+would only remove negative noise without being able to detect and remove
+positive noise.
+That's why we'd rather want to remove implausible values based on other
+criteria.}
 
 \begin{figure}
 \centering
@@ -267,6 +275,20 @@ $b=8/0.3$ as defined for the respective statistics.}
 \label{fig:zero}
 \end{figure}
 
+Another cause for implausible statistics could be very large positive or
+negative noise added by the Laplace distribution.
+In theory, a single relay, with non-zero probability of observing
+hidden-service activity, could have added noise from $-\infty$ to
+$\infty$, which could derail statistics for the entire day.
+These extreme values could be removed by calculating an interval of
+plausible values for each relay, based on the probability of observing
+hidden-service activity, and discarding values outside that interval.
+Another option for avoiding these extreme values would be to cap noise
+added at relays by adapting concepts from ($\epsilon,\delta)$-differential
+privacy to the noise-generating code used by relays.%
+\footnote{Whether or not either of these approaches is necessary depends
+on whether or not our extrapolation method can handle outliers.}
+
 \section{Extrapolating hidden-service traffic in the network}
 
 We start the extrapolation of network totals with reported cells on
@@ -353,6 +375,8 @@ hidden-service activity means publication of a hidden-service descriptor.
 We define this fraction as the part of descriptor space that the directory
 is responsible for, divided by \emph{three}, because each descriptor
 published to this descriptor is also published to two other directories.
+Note that, without dividing the fraction of a relay's descriptor space by
+three, fractions would not add up to 100%.
 Figure~\ref{fig:corr-probs-by-relay}~(right) shows the correlation of
 reported .onion addresses and fraction of hidden-service activity.
 
@@ -406,18 +430,18 @@ least 1% of hidden-service directories.
 \section{Open questions}
 
 \begin{itemize}
-\item The Laplace noise added by a single relay may range from $-\infty$
-to $\infty$ and therefore possibly derail statistics for the entire day.
-Maybe, as part of removing implausible statistics, we should calculate the
-ratio between reported value and calculated probability (see
-Figure~\ref{fig:corr-probs-by-relay}) and exclude any outliers before the
-extrapolation step.
+\item Maybe we should switch back to the first extrapolation method, where
+we're extrapolating from single observations, and then take the weighted
+mean as best extrapolation result.
+This has some advantages for handling outliers.
+We'll want to run new simulations using this method.
 \item The ribbon in Figure~\ref{fig:extrapolated-network-totals} implies a
 confidence interval of some sort, but it's really only the standard error
 of the local regression algorithm added by the graphing software.
 We should instead calculate the confidence interval of our extrapolation,
 similar to the simulation, and graph that.
-But how do we calculate that?
+One option might be to run simulations as part of the extrapolation
+process.
 \end{itemize}
 
 \end{document}
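
The first open question in the patch above proposes extrapolating from single observations and then taking the weighted mean. The following is only a minimal sketch of that idea, under assumptions the patch does not state: each relay contributes a pair of its reported value and its calculated fraction of hidden-service activity, and the weights are those fractions. The function name `weighted_mean_extrapolation` and the sample numbers are hypothetical.

```python
def weighted_mean_extrapolation(observations):
    """Extrapolate a network total from (reported, fraction) pairs.

    Each observation is extrapolated on its own as reported / fraction,
    then the extrapolations are averaged with the fractions as weights.
    Using the fractions as weights is an assumption; the patch does not
    say which weights to use.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for reported, fraction in observations:
        if fraction <= 0.0:
            # A relay with no chance of observing activity carries no
            # information about the network total, so it is skipped.
            continue
        single_extrapolation = reported / fraction
        weighted_sum += fraction * single_extrapolation
        total_weight += fraction
    if total_weight == 0.0:
        raise ValueError("no observation with a positive fraction")
    return weighted_sum / total_weight


# Hypothetical sample: three relays; the second reports a negative value,
# as can happen after Laplace noise is added to a small observed value.
sample = [(120.0, 0.01), (-8.0, 0.001), (260.0, 0.02)]
estimate = weighted_mean_extrapolation(sample)
```

With fractions as weights, the weighted mean reduces to the sum of reported values divided by the sum of fractions. A relay with a tiny fraction and large noise therefore cannot dominate the result the way it would in an unweighted mean of single extrapolations.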
tor-commits@lists.torproject.org