# [tor-commits] [tech-reports/master] describing how to publish data about distributions

karsten at torproject.org karsten at torproject.org
Wed Jun 17 18:48:07 UTC 2015

commit 351b64d35dec5df7691db4596f995cf63e433fc3
Author: A. Johnson <aaron.m.johnson at nrl.navy.mil>
Date:   Sat Dec 27 18:44:13 2014 -0500

describing how to publish data about distributions
---
2015/hidden-service-stats/Makefile                 |   18 +++
2015/hidden-service-stats/hidden-service-stats.bib |    7 +
2015/hidden-service-stats/hidden-service-stats.tex |  157 ++++++++++++++++----
3 files changed, 156 insertions(+), 26 deletions(-)

diff --git a/2015/hidden-service-stats/Makefile b/2015/hidden-service-stats/Makefile
new file mode 100644
index 0000000..83a4dfd
--- /dev/null
+++ b/2015/hidden-service-stats/Makefile
@@ -0,0 +1,18 @@
+.PHONY : main
+
+MAIN=hidden-service-stats
+LATEX=pdflatex
+
+all: paper tidy
+
+paper: $(MAIN).tex
+	$(LATEX) $(MAIN)
+	bibtex $(MAIN)
+	$(LATEX) $(MAIN)
+	$(LATEX) $(MAIN)
+
+tidy:
+	rm -f *.dvi *.aux *.log *.nav *.snm *.toc *.out *.vrb *.bbl *.blg
+
+clean:
+	rm -f *.dvi *.aux *.log *.nav *.snm *.toc *.out *.vrb *.bbl *.blg $(MAIN).ps $(MAIN).pdf
diff --git a/2015/hidden-service-stats/hidden-service-stats.bib b/2015/hidden-service-stats/hidden-service-stats.bib
new file mode 100644
--- /dev/null
+++ b/2015/hidden-service-stats/hidden-service-stats.bib
@@ -0,0 +1,7 @@
+@inproceedings{dwork-tcc2006,
+ author = {Dwork, Cynthia and McSherry, Frank and Nissim, Kobbi and Smith, Adam},
+ title = {Calibrating Noise to Sensitivity in Private Data Analysis},
+ booktitle = {Proceedings of the Third Conference on Theory of Cryptography},
+ series = {TCC'06},
+ year = {2006}
+}
\ No newline at end of file
diff --git a/2015/hidden-service-stats/hidden-service-stats.tex b/2015/hidden-service-stats/hidden-service-stats.tex
index d4b586e..1400268 100644
--- a/2015/hidden-service-stats/hidden-service-stats.tex
+++ b/2015/hidden-service-stats/hidden-service-stats.tex
@@ -3,6 +3,7 @@
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{longtable}
+\usepackage{paralist}

\begin{document}

@@ -617,6 +618,7 @@ statistic doesn't really provide any insight.

\subsubsection{Time between last and first published descriptor with same
identifier}
+\label{subsubsec:time_first_last_descriptor_update}

\textbf{Details:}
%
@@ -629,7 +631,7 @@ There is an upper bound on this statistic at 24 hours, because that's when
descriptor identifiers change.

\subsubsection{Number of introduction points contained in descriptors
-(3.1.4.)}
+(3.1.4.)} \label{subsubsec:num_ips_in_descriptors}

\textbf{Details:}
%
@@ -683,7 +685,7 @@ This doesn't seem like a problem that is solvable with simple obfuscation
of stats, and I suggest we don't do this statistic at all.

\subsubsection{Number of descriptor fetch requests by service identity
-(3.2.2.)}
+(3.2.2.)} \label{subsubsec:num_descriptor_fetches_per_hs}

\textbf{Details:}
%
@@ -752,7 +754,7 @@ requests a very popular hidden service gets.
% [dgoulet]: No they don't, I confirmed in the code.

\subsubsection{Number of introductions received by established
-introduction point (1.2.2.)}
+introduction point (1.2.2.)} \label{subsubsec:num_intros_per_circ}

\textbf{Details:}
%
@@ -804,7 +806,7 @@ given client or server.

\subsubsection{Number of cells sent over rendezvous circuits in either
-direction (2.3.2.)}
+direction (2.3.2.)} \label{subsubsection:num_cells_rend_circ}

\textbf{Details:}
%
@@ -835,7 +837,7 @@ period should be okay to measure, any statistics going further than that
need closer analysis.

\subsubsection{Time from first client data to tearing down circuit
-(2.3.3.)}
+(2.3.3.)} \label{subsubsec:time_client_data_to_teardown}

\textbf{Details:}
%
@@ -880,6 +882,7 @@ before client or service sent a single data cell.

\subsubsection{Time from establishing introduction point to receiving
first client introduction (1.2.4.)}
+\label{subsubsec:time_ip_est_to_introduce1}

\textbf{Details:}
%
@@ -905,7 +908,7 @@ This may not be very useful, but is listed here for completeness.
No obvious risks.

\subsubsection{Time from establishing a rendezvous point to receiving the
-server rendezvous (2.2.2.)}
+server rendezvous (2.2.2.)} \label{subsubsec:time_rp_to_rend1}

\textbf{Details:}
%
@@ -928,6 +931,7 @@ way to measure effectiveness of improvements in the deployed network.
Again, there are at least no obvious risks from gathering this statistic.

\subsubsection{Time from server rendezvous to first client data (2.3.1.)}
+\label{subsubsec:time_rend1_to_data}

\textbf{Details:}
%
@@ -1010,7 +1014,7 @@ A relay reports the number of published descriptors that it is not
responsible for.

\subsubsection{Number of descriptor fetch requests for non-existent
-descriptor (3.2.3.)}
+descriptor (3.2.3.)} \label{subsubsec:num_fetch_nonexistent}

\textbf{Details:}
%
@@ -1081,13 +1085,13 @@ The benefit gained from this statistic is not huge though.
%
No obvious risks.

-\section{Obfuscation methodology}
+\section{Obfuscation methodology} \label{sec:obfuscation}
The published statistics shouldn't reveal private information to an
adversary when combined with plausible background knowledge. We will use
techniques to provide uncertainty about any specific hidden service,
client, or connection, while maintaining good accuracy in the aggregate
statistics. These techniques include
-\begin{itemize}
+\begin{compactitem}
\item Releasing aggregate statistics over time, such as total counts or
averages in a given period
\item Adding noise (i.e. random inaccuracy)
@@ -1097,12 +1101,25 @@ averages in a given period
doesn't reveal information about ongoing activity
\item Using cryptographic techniques to hide the source of information,
such as anonymizing reports from individual relays
-\end{itemize}
+\end{compactitem}
+
+We will be adding noise in a way that provides differential
+privacy~\cite{dwork-tcc2006} for
+``single'' actions. What constitutes a single action will depend on the
+specific statistic. For example, when publishing the number of unique
+descriptors seen at each HSDir, a single action could be publishing a
+descriptor to
+six relays. To obtain differential privacy, we will add noise using the
+Laplace distribution, which has a probability density function of
+$\textrm{Lap}(b) = e^{-|x|/b}/(2b)$. We will choose $b$ such that
+altering a single action will change the probability of the total output
+by a factor of at most $e^{\epsilon}$. Thus, the smaller $\epsilon$ is,
+the more privacy is provided.

We can expect that the adversary may know things such as
-\begin{itemize}
+\begin{compactitem}
\item The addresses of a large number of publicly-available services
(e.g. by crawling the Web)
\item A minimum amount of traffic received by a given hidden service
@@ -1112,7 +1129,7 @@ We can expect that the adversary may know things such as
periodically)
\item Roughly the number of client connections and amount of client
traffic (possibly leaked by the service itself, e.g. a web forum)
-\end{itemize}
+\end{compactitem}

\subsection{Counts}

@@ -1121,18 +1138,106 @@ For many statistics, it would be very helpful to understand the
distribution of values. For example, such information about descriptor
fetches could reveal if most hidden services are never used or if
there are a few hidden services that constitute most HS activity.
-Releasing information about the distribution of statistics could be useful
-for the following statistics:
-\begin{itemize}
-\item Time from circuit extension to circuit purpose change
-(Sec.~\ref{subsubsec:time_circ_ext_to_purpose_change})
-\item Time from circuit purpose change to tearing down circuit
-(Sec.~\ref{subsubsec:time_circ_purpose_change_to_teardown}
-\item Time from establishing introduction point to tearing down
-circuit (Sec.~\ref{subsubsec:time_intro_to_teardown})
-\item Number of descriptor updates per service
-\end{itemize}
+Table~\ref{table:dist_stats} lists the statistics for which it could
+be useful to release information about a distribution.
+\begin{table}
+\caption{Statistics with interesting distributions}
+\label{table:dist_stats}
+\begin{tabular}{|l|l|}
+\hline
+\textbf{Description} & \textbf{Section}\\
+\hline
+Time from circuit extension to circuit purpose change &
+\ref{subsubsec:time_circ_ext_to_purpose_change}\\
+\hline
+Time from circuit purpose change to tearing down circuit &
+\ref{subsubsec:time_circ_purpose_change_to_teardown}\\
+\hline
+Time from establishing introduction point to tearing down circuit &
+\ref{subsubsec:time_intro_to_teardown}\\
+\hline
+Number of descriptor updates per service & \\
+\hline
+Time between last and first published descriptor with same identifier &
+\ref{subsubsec:time_first_last_descriptor_update}\\
+\hline
+Number of introduction points contained in descriptors &
+\ref{subsubsec:num_ips_in_descriptors}\\
+\hline
+Number of descriptor fetch requests by service identity &
+\ref{subsubsec:num_descriptor_fetches_per_hs}\\
+\hline
+Number of introductions received by established introduction point &
+\ref{subsubsec:num_intros_per_circ}\\
+\hline
+Number of cells sent over rendezvous circuits in either direction &
+\ref{subsubsection:num_cells_rend_circ}\\
+\hline
+Time from first client data to tearing down circuit &
+\ref{subsubsec:time_client_data_to_teardown}\\
+\hline
+Time from establishing introduction point to receiving first client
+introduction & \ref{subsubsec:time_ip_est_to_introduce1}\\
+\hline
+Time from establishing a rendezvous point to receiving the
+server rendezvous & \ref{subsubsec:time_rp_to_rend1}\\
+\hline
+Time from server rendezvous to first client data &
+\ref{subsubsec:time_rend1_to_data}\\
+\hline
+Number of descriptor fetch requests for non-existent descriptor &
+\ref{subsubsec:num_fetch_nonexistent}\\
+\hline
+\end{tabular}
+\end{table}
+
+Following are several potential ways to publish information about a
+distribution:
+\begin{compactitem}
+\item Publish a histogram of possible values (e.g. the number of values in
+$[0,1)$, $[1,10)$, $[10, 100)$, and $[100, \infty)$).
+\item Publish a subset of percentile values (e.g. quartiles).
+\item Publish standard summary statistics (e.g. mean, variance,
+skew, and kurtosis).
+\end{compactitem}
+To protect individual privacy when releasing these kinds of data,
+we would again like to protect activity over time and also provide
+particularly-strong protection for a ``single'' activity. This is quite
+straightforward to do for publishing
+histograms, simply by applying the techniques that we developed for counts
+to each
+count in the histogram. Thus we suggest using histograms in this way to
+report distribution data, as follows:
+\begin{compactenum}
+\item Choose a finite number of \emph{buckets} that cover the possible
+values of the statistic (we use the term ``buckets'' to distinguish
+these from bins that will limit the granularity of each bucket). Each
+extra bucket will result in a certain additional amount of noise being
+added, but including more values in a bucket (i.e. increasing its width)
+reduces its accuracy. Therefore, these should be balanced while also
+choosing buckets that capture the most useful distinctions
+for the statistic under consideration (e.g. deciding between relative and
+absolute accuracy).
+\item For each bucket, the count of values in that bucket should be
+rounded to a chosen granularity $\delta$ (e.g. to the nearest multiple of
+10). For simplicity, it is recommended that bins are not split over
+multiple buckets (e.g. there should not be buckets for values 0 and 1 if
+bin granularity is at least 2). A rounded value is used because over time
+the effects of fresh noise can be factored out (e.g. by taking the mean
+of a sequence of published values if the statistic stays the same over
+that time).
+\item Fresh Laplace noise with distribution
+$\textrm{Lap}(2\delta/\epsilon)$ should be added to the center of the bin
+of each bucket, where $\epsilon$ is the privacy parameter discussed at the
+beginning of Sec.~\ref{sec:obfuscation}. $\delta$ appears in $b$ because
+a single input to the histogram could cause the bucket center to change
+by at most $\delta$ (e.g. if the rounding threshold is just crossed).
+The value $2$ appears in $b$ because modifying a single entry in the
+histogram can change two values: the value of the bucket it was changed
+from and the value of the bucket it was changed to.
+\item The noisy bin center of each bucket is published.
+\end{compactenum}
+

\section{Recommendation}
\label{sec:recommendation}
@@ -1147,5 +1252,5 @@ looking at the code.
an objective way, ideally using the stated evaluation criteria.
\end{itemize}

-\end{document}
-
+\bibliography{hidden-service-stats}
+\end{document}
\ No newline at end of file
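
Editor's sketch (not part of the commit): the four histogram steps that the
patch adds to the "Obfuscation methodology" section can be illustrated in
Python roughly as follows. The function names, the `edges` list used to
describe buckets, and the half-open bucket convention are assumptions made
for illustration only. The rounding granularity delta, the privacy parameter
epsilon, and the noise scale b = 2*delta/epsilon are taken from the text.

```python
import bisect
import math
import random


def bucket_counts(values, edges):
    """Step 1: count values per bucket. Bucket i covers
    [edges[i], edges[i+1]); the last bucket is unbounded above.
    (This bucket convention is an assumption of the sketch.)"""
    counts = [0] * len(edges)
    for v in values:
        i = bisect.bisect_right(edges, v) - 1
        if i < 0:
            raise ValueError(f"value {v} is below the first bucket edge")
        counts[i] += 1
    return counts


def round_to_bin_center(count, delta):
    """Step 2: round a bucket count to the nearest multiple of delta."""
    return delta * math.floor(count / delta + 0.5)


def laplace_noise(b, rng):
    """Draw from Lap(b). The difference of two independent exponential
    variables with mean b is Laplace-distributed with scale b."""
    return rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)


def noisy_histogram(values, edges, delta, epsilon, rng=None):
    """Steps 3 and 4: add fresh Lap(2*delta/epsilon) noise to each
    rounded bucket count and return the noisy values for publication."""
    rng = rng or random.Random()
    b = 2.0 * delta / epsilon
    return [round_to_bin_center(c, delta) + laplace_noise(b, rng)
            for c in bucket_counts(values, edges)]
```

A call such as noisy_histogram(observations, [0, 1, 10, 100], delta=10,
epsilon=0.3) would correspond to the four example buckets mentioned in the
text; the delta and epsilon values here are placeholders, not recommendations.
Because the noise is drawn fresh on every call, repeated publications of an
unchanged statistic average toward the rounded count rather than the exact
one, which is the reason the text gives for rounding before adding noise.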