[tor-commits] [metrics-tasks/master] Move #2718 report sources to tech-reports.git.

Wed Jul 25 11:08:09 UTC 2012

commit fdd7a1d37813657166a66dc110952fe3e1e8ae2d
Author: Karsten Loesing <karsten.loesing at gmx.net>
Date:   Thu Jul 12 13:33:27 2012 +0200

    Move #2718 report sources to tech-reports.git.
---
 task-2718/detector.tex |  169 ------------------------------------------------
 1 files changed, 0 insertions(+), 169 deletions(-)

diff --git a/task-2718/detector.tex b/task-2718/detector.tex
deleted file mode 100644
index 10822d5..0000000
--- a/task-2718/detector.tex
+++ /dev/null
@@ -1,169 +0,0 @@
-\documentclass{article}
-\begin{document}
-\author{George Danezis\\{\tt gdane at microsoft.com}}
-\title{An anomaly-based censorship-detection\\system for Tor}
-\date{September 9, 2011}
-\maketitle
-
-\section{Introduction}
-
-The Tor project is currently the most widely used anonymity and censorship
-resistance system worldwide.
-As a result, national governments occasionally or regularly block access
-to its facilities for relaying traffic.
-Major blocking might be easy to detect, but blocking from smaller
-jurisdictions, with fewer users, could take some time to detect.
-Yet, early detection may be key to deploying countermeasures.
-We have designed an ``early warning'' system that looks for anomalies in
-the volumes of connections from users in different jurisdictions and flags
-potential censorship events.
-Special care has been taken to ensure the detector is robust to
-manipulations and noise that could be used to block without raising an
-alert.
-
-The detector works on the aggregate number of users connecting to a
-fraction of directory servers per day.
-That set of statistics is gathered and provided by the Tor project in a
-sanitised form to minimise the potential for harm to active users.
-The data collection has been historically patchy, introducing wild
-variations over time that are not due to censorship.
-The detector is based on a simple model of the number of users per day per
-jurisdiction.
-That model is used to assess whether the number of users we observe is
-typical, too high, or too low.
-In a nutshell, the prediction on any day is based on the activity of
-previous days locally as well as worldwide.
-
-\section{The model intuition}
-
-The detector models the number of connections from every jurisdiction
-based on the number of connections in the past as well as a model of
-``natural'' variation or evolution of the number of connections.
-More concretely, consider that at time $t_i$ we have observed $C_{ij}$
-connections from country $j$.
-Since we are concerned with abnormal increases or falls in the volume of
-connections, we compare this with the number of connections we observed at
-a past time $t_{i-1}$ denoted as $C_{(i-1)j}$ from the same country $j$.
-The ratio $R_{ij} = C_{ij} / C_{(i-1)j}$ summarises the change in the
-number of users.
-Inferring whether the ratio $R_{ij}$ is within an expected or unexpected
-range allows us to detect potential censorship events. 
-
-We consider that a ratio $R_{ij}$ within a jurisdiction $j$ is ``normal''
-if it follows the trends we are observing in other jurisdictions.
-Therefore for every time $t_i$ we use the ratios $R_{ij}$ of many
-countries to understand the global trends of usage of Tor, and we compare
-specific countries' ratios to this model.
-If they are broadly within the global trends, we assume no censorship is
-taking place; otherwise we raise an alarm.
-
-\section{The model details}
-
-We model each data point $C_{ij}$ of the number of users connected at time
-$t_i$ from country $j$ as a single sample of a Poisson process with a rate
-$\lambda_{ij}$ modelling the underlying number of users.
-The Poisson process allows us to take into account that in jurisdictions
-with very few users we will naturally have some days of relatively low or
-high usage---just because a handful of users may or may not use Tor in a
-day.
-Even observing zero users from such jurisdictions on some days may not be
-a significant event. 
-
-We are trying to detect whether the relative change in the rate
-$\lambda_{ij}$ between a previous time $t_{i-1}$ and time $t_i$ is normal
-or abnormal for jurisdiction $j$ compared with other jurisdictions.
-This is $\lambda_{ij} / \lambda_{(i-1)j}$ which for jurisdictions with a
-high number of users is very close to $C_{ij} / C_{(i-1)j} = R_{ij}$.
-We model $R_{ij}$'s from all jurisdictions as following a Normal
-distribution $N(m,v)$ with a certain mean ($m$) and variance ($v$) to be
-inferred.
-This is of course a modelling assumption.
-We use a Normal distribution because, for a given mean and variance, it is
-the maximum-entropy distribution, that is, the one with most uncertainty:
-as a result the model has higher variance than the real world, ensuring
-that it gives fewer false alarms of censorship.
-
-The parameters of $N(m,v)$ are inferred directly as point estimates from
-the readings in a set of jurisdictions.
-Then a given country's ratio $R_{ij}$ is compared with that distribution:
-an alarm is raised if the ratio falls in either tail, that is, above an
-upper threshold or below a lower threshold.
-
-\section{The model robustness}
-
-At every stage of detection we follow special steps to ensure the
-detection is robust to manipulation by jurisdictions interested in
-censoring fast without being detected.
-First, the parameter estimation for $N(m,v)$ is hardened: we only use the
-largest jurisdictions to model ratios and within those we remove any
-outliers that fall outside four inter-quartile ranges of the median.
-This ensures that a jurisdiction with a very high or very low ratio does
-not influence the model of ratios (and can be subsequently detected as
-abnormal).
-
-Since we chose jurisdictions with many users to build the model of ratios,
-we can approximate the rates $\lambda_{ij}$ by the actual observed number
-of users $C_{ij}$.
-On the other hand, when we try to detect whether a jurisdiction has a
-typical rate we cannot make this assumption.
-The rate of a Poisson variable $\lambda_{ij}$ can be inferred from a single
-sample $C_{ij}$ using a Gamma prior, in which case its posterior follows a
-Gamma distribution.
-In practice (because we are using a single sample) this in turn can be
-approximated using a Poisson distribution with parameter $C_{ij}$.
-Using this observation we extract a range of possible rates for each
-jurisdiction based on $C_{ij}$, namely $\lambda_{ij_{min}}$ and
-$\lambda_{ij_{max}}$.
-Then we test whether that full range is within the typical range of the
-distribution---if not, we raise an alarm.
-
-\section{The parameters}
-
-The deployed model considers a time interval of seven days to model
-connection rates (i.e., $t_i - t_{i-1} = 7$ days).
-The key reason for a weekly model is our observation that some
-jurisdictions exhibit weekly patterns.
-A `previous day' model would then raise alarms every time weekly patterns
-emerge.
-We use the 50 largest jurisdictions to build our models of typical ratios
-of traffic over time---as expected most of them are in countries where no
-mass censorship has been reported.
-This strengthens the model as describing ``normal'' Tor connection
-patterns.
-
-We consider that a ratio of connections is typical if it falls within the
-central 99.99~\% interval of the Normal distribution $N(m,v)$ modelling ratios.
-This ensures that the expected rate of false alarms is about $1 / 10000$,
-and therefore only a handful a week (given the large number of
-jurisdictions).
-Similarly, we infer the range of the rate of usage from each jurisdiction
-(given $C_{ij}$) to be the central 99.99~\% interval of a Poisson
-distribution with parameter $C_{ij}$.
-This full range must be within the typical range of ratios to avoid
-raising an alarm.
-
-\section{Further work}
-
-The detector uses time series of user connections to directory servers to
-detect censorship.
-Any censorship method that does not influence these numbers would as a
-result not be detected.
-This includes active attacks: a censor could substitute genuine requests
-with requests from adversary-controlled machines to keep numbers within
-the typical ranges.
-
-A better model, making use of multiple previous readings, may improve the
-accuracy of detection.
-In particular, when a censorship event occurs there is a structural
-change, and a model that predicts user loads from readings taken before
-the event will fail.
-This is not a critical problem, as these ``false positives'' are
-concentrated after real censorship events, but the effect may be confusing
-to a reader.
-On the other hand, a jurisdiction can still censor by limiting the pace of
-censorship so that changes stay within the typical range for the period.
-Therefore adapting the detector to run on longer periods would be
-necessary to detect such attacks.
-
-\end{document}
-