[tor-bugs] #2718 [Metrics]: Analyze Tor usage data for ways to automatically detect country-wide blockings

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Thu Mar 24 16:17:08 UTC 2011


#2718: Analyze Tor usage data for ways to automatically detect country-wide
blockings
---------------------+------------------------------------------------------
 Reporter:  karsten  |          Owner:          
     Type:  task     |         Status:  assigned
 Priority:  normal   |      Milestone:          
Component:  Metrics  |        Version:          
 Keywords:           |         Parent:          
   Points:           |   Actualpoints:          
---------------------+------------------------------------------------------

Comment(by karsten):

 And here's my reply to George:

 Replying to [comment:2 karsten]:
 > - [Wide Variation] It seems that the Tor network is not very stable
 generally -- numbers of users fluctuate from one day to another, sometimes
 with no obvious reason and across jurisdictions. This may be due to relays
 coming and going or the capacity of the network going up or down for other
 reasons. For this reason it is not reliable to look at the numbers of
 users from each jurisdiction and infer if they are lower than normal --
 since normal varies widely.

 Right, the data we have has a huge variation.  The good news is that we're
 going to have a higher fraction of relays reporting usage data in the near
 future.  I hope that will improve our data quality.

 Speaking of input data to your algorithm: how much does it care about
 absolute numbers, and would it be able to process raw
 observations made by relays and/or bridges?  These raw observations would
 tell you what fraction of requests or unique IP addresses were seen at a
 single relay or bridge coming from a given country.  For example:

 {{{
   dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800,[...]

   bridge-ips sa=48,us=40,de=32,ir=32,[...]
 }}}

 If you want to have a look, I can provide you with CSV-formatted data and
 tell you more details.
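 To make the format above concrete, here is a minimal sketch (in Python,
 since the prototype is Python) of how one of those descriptor lines could
 be parsed into per-country counts and fractions.  The function names are
 mine, purely for illustration, not part of any existing metrics tool:

 ```python
 def parse_country_counts(line):
     """Parse a descriptor line like
    'dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800'
    into a dict mapping country codes to request counts."""
     _, _, counts = line.strip().partition(" ")
     result = {}
     for entry in counts.split(","):
         country, _, value = entry.partition("=")
         result[country] = int(value)
     return result

 def country_fractions(counts):
     """Fraction of requests seen from each country at this
    single relay or bridge."""
     total = sum(counts.values())
     return {cc: n / total for cc, n in counts.items()}

 counts = parse_country_counts("dirreq-v3-reqs us=2368,de=1680,kr=1048,fr=800")
 fracs = country_fractions(counts)
 ```

 Aggregating these per-relay fractions across many relays is then a
 separate (and harder) estimation step.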

 > - [Difficult to study each series separately] [...]
 > - [Modelling inter-domain trends] [...]
 > - [Small-sample random variation] [...]
 > - [Full Model] [...]

 The assumptions made here all sound reasonable to me.  I guess we have to
 start somewhere and watch if the censorship detector results "make sense"
 to us.

 > - [Estimation delay window] One parameter of the model is the length of
 > the time periods. In other words: are we trying to model from today's
 > numbers of users what is going on tomorrow, or what is going on next
 > week? The previous day gives nice tight predictions, BUT some
 > jurisdictions show a really freakish weekly pattern -- thus I settled
 > for a 7 day window. This means that the value for today is used to
 > predict the value of the same day a week in the future.

 I wonder if we can use a 1-day window and a 7-day window at the same time.
 The 7-day window is giving us a few (likely false) alerts that a 1-day
 window wouldn't.  There's probably some influence from the day before and
 some influence from the week before.
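 As a rough illustration of what I mean by combining both windows, a
 blended predictor could weight yesterday's value against the value from a
 week ago and flag large deviations.  The weights and threshold here are
 made-up numbers for the sketch, not fitted parameters from George's model:

 ```python
 def predict_today(series, w_day=0.3, w_week=0.7):
     """Blend the 1-day lag and the 7-day lag into one prediction.
    series is a list of daily user counts, oldest first,
    with at least 7 entries.  Weights are illustrative only."""
     return w_day * series[-1] + w_week * series[-7]

 def is_anomalous(series, today, threshold=0.5):
     """Flag today's count if it deviates from the blended
    prediction by more than a relative threshold."""
     predicted = predict_today(series)
     return abs(today - predicted) / predicted > threshold
 ```

 A real detector would of course model the variance instead of using a
 fixed relative threshold, but the two-lag structure is the point here.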

 > - [Freakish weekly patterns] [...]

 I have no idea what's going on.  This is something that Roger might have
 an answer for.

 > - [Blind spots] [...]
 > - [Validation] [...]
 > - [Early warnings] [...]

 The assumptions and conclusions made here make sense to me, too.

 > - [Code] All of the above is implemented using an ugly 300-line python
 > script with dependencies on scipy, numpy and matplotlib. I am cleaning
 > it up and will be happy to pass it on once it is stable and pretty.

 Yes, please.  I'm interested in the code, even if it's dirty Python.  If
 the code is at least somewhat readable, there's no need to clean it up.
 I'm fine running Python code with whatever dependencies are necessary to
 get this started.  Once we have a good idea what's going on, I might
 rewrite the relevant parts in Java, R, and ggplot2 for better integration
 with our existing codebase and to facilitate maintenance.  But right now,
 dirty Python is perfectly fine for a prototype phase.

 Do you mind if we put your code in Tor's Git repository for metrics code
 here?

   https://gitweb.torproject.org/metrics-tasks.git/tree/HEAD:/task-2718

 > - [Model refinement] This is a first, rough-and-ready model that I plan
 > on refining further: (a) automatically select the time-window (b) learn
 > the traffic self-similarity (c) offer a full Bayesian model + a particle
 > filter based sampler for whether an unexpected event is occurring. I
 > would be most happy for any feedback on this initial model -- what is
 > useful, what is useless, do you want more / less sensitivity, do you
 > know of events not detected, other sources of information for
 > prediction, etc.

 The comments here are my first thoughts.  I might have more thoughts when
 I see the code.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2718#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

