George Kadianakis desnacked@riseup.net writes:
Sam Burnett sam.burnett@gatech.edu writes:
Hi,
I'd like to help improve the Tor Censorship Detector. I've read some background material and think I understand the basics of George Danezis' detection algorithm [1, 2].
Is anyone still working on this? Two tickets from a year ago talk about experimenting with various detection algorithms and turning one of them into a standalone utility [3, 4]. Has anything happened since then?
My background: I'm a graduate student at Georgia Tech studying network censorship circumvention and measurement. Although I've met Tor developers on various occasions, I haven't directly contributed to the project; I'd like to change that.
Hi there,
this project has been pretty stale lately indeed.
Some random thoughts:
George Danezis started designing another censorship anomaly detection algorithm, but he never implemented it.
He did not like how the first detection algorithm triggered multiple times after a censorship event was reported (since all subsequent days after censorship deviated from the week before, and it used a 7-day delta). Furthermore, he did not like that the algorithm only considered clients from a week ago; maybe a day ago would be a better model in some cases.
I think I have an email with his ideas on the new model; I will ask George whether I can send it to you.
With permission from George, I'm posting his ideas for the second version of the censorship detector. He told me I should mention that the results of the second version were not much better than those of the first version.
Inlining:
Subject: v2 model for censorship detection
Date: Sat, 24 Sep 2011 19:16:44 +0000
From: George Danezis gdane@microsoft.com
Hello Karsten (and Roger),
Following our discussions about the current detector, data feeds, and estimation of clients, I tried to improve our existing model using the data feeds already available. We can see how this new model addresses the issues of quality of detection, and then think carefully about whether improving the client estimation from the directory authorities will improve it (since counting connections seems more complex than I had originally understood - thanks for the report explaining the intricacies).
In a nutshell, I used some of the principles we have used before, namely that the current number of clients is related to the number of clients at some point in the past, and that deviation in a particular country from global trends is a noteworthy event. Yet our previous model was simple-minded: (a) it only considered a single, globally fixed previous point in time (which we had set to be a week earlier to accommodate periodic effects from South Korea and Taiwan); (b) once censorship was detected, multiple events were reported (about 7), since all subsequent days after censorship deviated from the week before. The new model addresses these two issues: it tries to fit the data to the best available model (one day before, or one week before), and once a censorship event is detected it uses a censorship model for the estimation of new data points. Under the hood it is a Bayesian model, which means that we get probabilities of censorship events, not all-or-nothing alarms. These can be used along with your preference for false positives / false negatives to tune the reporting (a point that Roger made in previous emails).
The detector is based on the following model:
- At each point in time a country sees clients connecting at a certain hidden rate "r". The observation "o" you see is distributed according to a Poisson distribution with this parameter "r". This means that countries with few clients do not report censorship when none of them connect in a certain day - this is part of normal behaviour, and the detector should not be upset by jurisdictions with few clients.
- At any point in time we use one of several models, each model relating the current rate "r" to the rate at a previous time ("r-1" or "r-7"; more previous times can be configured). Which model we use is determined by a parameter "t" which changes between time periods. We assume that these are rather stable, so the probability of changing the model from one time to the next is called "Beta", and is set rather low (<= 0.01). We restrict rates to be positive numbers (negative rates make no sense).
- At each point in time we also record whether a censorship event has just occurred, by a Boolean variable "c". This is set according to a prior that expresses the fact that censorship is rare ("alpha" = Pr[c = True] < 0.01). The censorship event is independent of previous censorship events.
- We observe the relative change in volumes of traffic from a selected number of large countries - about 50 - from which we do not expect censorship. Using this data we build, for each of our models and at all times, a model of the percentage change of traffic from one time to the next. So in practice we now have a distribution that tells us what we expect the rate "r" to be, given the rate the previous day or the week before. These are simply normal distributions over the percent change, with a mean and standard deviation estimated directly from the data series.
- The variance of the change observed depends on whether we are under conditions of censorship. We multiply the standard deviation of the normal models by 4 for non-censorship conditions, and by a factor of 10 for censorship conditions. This variable is called "m". Basically, that says: if we are under honest conditions the variance is small, but if we are under censorship we have little idea what happens.
- Given the current censorship flag "c" and the current model "t", we decide whether we are under censorship conditions by ORing all the previous "t" censorship flags. This ensures that we do not get those trains of events: once we are under censorship, any model that is based on data before the censorship event will see its variance increase like crazy without reporting new events.
- Given a previous rate "r_p" and a sample "v" from this normal distribution model, with the appropriate modification "m" to the variance to account for censorship, the new rate is r = r_p * (1 + v). The previous rate is chosen according to "t" (which model we are using, the day-before model or the week-before model). A small sketch of this generative process follows below.
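To make the generative process above concrete, here is a minimal sketch in Python (numpy) of one simulated time step. The variable names r, o, t, c and m follow the description above; the concrete parameter values, the helper name step(), and the argument layout are illustrative assumptions on my part, not the code of the attached script.

    import numpy as np

    rng = np.random.default_rng(0)

    ALPHA = 0.01   # prior probability of a censorship event ("alpha")
    BETA = 0.01    # probability of switching models between times ("Beta")
    LAGS = [1, 7]  # candidate models: compare to one day ago or one week ago

    def step(rates, model, censored_recently, pct_mean, pct_std):
        """Sample one time step of the generative model (illustrative only).

        rates            : past hidden rates "r", most recent last
        model            : index into LAGS, the current model selection "t"
        censored_recently: True if a censorship event lies within the current lag
        pct_mean, pct_std: per-model normal parameters of the percent change,
                           estimated from the ~50 reference countries
        """
        # The model selection "t" changes only rarely (probability Beta).
        if rng.random() < BETA:
            model = int(rng.integers(len(LAGS)))

        # Censorship events are rare and independent of previous events.
        censor = rng.random() < ALPHA

        # The standard deviation is scaled by "m": 4 normally, 10 under censorship.
        m = 10.0 if (censor or censored_recently) else 4.0

        # New rate: the previous rate (a day or a week ago) times (1 + percent
        # change), restricted to non-negative values.
        lag = LAGS[model]
        r_prev = rates[-lag]
        v = rng.normal(pct_mean[model], m * pct_std[model])
        r = max(r_prev * (1.0 + v), 0.0)

        # The observation "o" is Poisson-distributed with parameter "r".
        o = rng.poisson(r)
        return r, o, model, censor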
The attached figure summarises the process using a standard graphical model. It depicts the variables at a particular time, and how they relate to other times and model parameters.
All the above is nice, but of course out of all the variables only the observations "o" are known, along with the parameters of the normal distributions relating rates (not observed) between time periods (a week or a day). We set the priors "alpha" and "beta", and a fixed "m", as well. All other parameters (the rates "r", the censorship flags "c", and the model selections "t") must be inferred for all times.
We infer those parameters using sequential Monte Carlo techniques (also known as particle filters). In a nutshell: you create a population of, say, 100 particles with a reasonable (but not necessarily correct) set of parameters for time 0. At time 1, and subsequent times, you sample new parameters for these particles according to the distributions above: the model selection flag "t" changes with some probability, the censorship flag "c" is activated with some probability, and the rate is the result of sampling the normal distribution according to the model and the censorship conditions so far. This can be done by direct sampling (even though for technical convergence reasons I use importance sampling for "t" and "c"). After a new set of candidate particles is generated in this manner, they are weighted according to the likelihood that they generated the observed data point "o". The final set of particles for the new time is re-sampled according to these weights. The procedure is repeated for the next time, given the particles of the previous time.
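For readers unfamiliar with particle filters, the propagate / weight / resample step looks roughly like the following sketch. Again, this is an assumption-laden illustration rather than the attached script: particles are represented here as dictionaries with a rate field "r", and propagate stands in for the per-particle sampling step described above.

    import numpy as np
    from scipy.stats import poisson

    rng = np.random.default_rng(0)

    def update_particles(particles, observation, propagate):
        """One sequential Monte Carlo step: propagate, weight, resample."""
        # 1. Propose a new state for every particle by sampling the model's
        #    dynamics (model switch, censorship flag, new rate).
        proposed = [propagate(p) for p in particles]

        # 2. Weight each particle by the likelihood of the observed count "o",
        #    i.e. the Poisson probability of the observation given its rate "r".
        weights = np.array([poisson.pmf(observation, p["r"]) for p in proposed])
        if weights.sum() == 0:          # guard against numerical underflow
            weights = np.ones(len(proposed))
        weights = weights / weights.sum()

        # 3. Resample the population according to these weights.
        idx = rng.choice(len(proposed), size=len(proposed), p=weights)
        return [proposed[i] for i in idx]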
Once a set of particles is available for all time periods, you can simply compute the fraction of particles with a censorship event, or more generally under censorship conditions (= a censorship event in the previous "t" times), to get the probability of censorship. I also recommend you plot the fraction of particles under censorship conditions, to prevent one censorship event masking another happening shortly after. Furthermore, some instances of censorship are not acute enough in a single time period to raise an alarm, but evidence of them amasses as time goes by, through a few censorship particles being selected to overwhelmingly explain future events.
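The reported probability is then just a fraction over the particle population, along the lines of the snippet below (the per-particle "under_censorship" flag is an assumed field, matching the OR of recent censorship flags described earlier):

    def censorship_probability(particles):
        """Fraction of particles currently under censorship conditions,
        i.e. with a censorship event within the window of their model "t"."""
        return sum(p["under_censorship"] for p in particles) / len(particles)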
The sampling seems to work well, BUT it takes longer than the previous detector to provide results. The speed depends on the number of particles used for the estimation, but the fewer the particles, the coarser and noisier the results. I would recommend around 100 particles for quick diagnostics, and over 1000 particles for long-term stats or jurisdictions that are likely to exhibit censorship. Of course, we might also get speed-ups by carefully profiling the code, or by re-writing it in C++ (I make extensive use of dynamic data structures for fast prototyping; simplifying those would help).
I attach 6 selected annotated graphs for China, Burma, Libya, and Egypt, where we should observe censorship, as well as Germany and South Korea, where we should not observe censorship (but Korea is weird, as we have discussed). Triangles and percentages indicate censorship events, while pink zones denote a high number of particles being under censorship conditions (20% and 50%, according to the intensity of pink).
I also attach the Python script and data used to generate the graphs. You will need scipy, numpy and matplotlib to run it, as for the previous script. The function "plot_country" should give an idea of how to use it, and the function "get_events" processes the particles to return the statistics used to plot the censorship events.
Let me know what you think,
George