[tor-bugs] #6180 [Ooni]: Detecting censorship in HTTP pages

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Wed Oct 31 14:09:09 UTC 2012


#6180: Detecting censorship in HTTP pages
----------------------------+-----------------------------------------------
 Reporter:  hellais         |          Owner:  hellais     
     Type:  task            |         Status:  needs_review
 Priority:  normal          |      Milestone:              
Component:  Ooni            |        Version:              
 Keywords:  SponsorH201206  |         Parent:              
   Points:                  |   Actualpoints:              
----------------------------+-----------------------------------------------

Comment(by isis):

 > We also talked about having clients tell the backend what it got as a
 > response and having the backend figure out if such a page should be a
 > block page or the correct result.

 This is similar to what Bismark does: they have the client test node call
 back to a server through an SSH tunnel and log in to a restricted shell,
 where it sets up a recovery tunnel and does a mysqldump. There was also
 a script to email the person whose router is running the tests if no
 updates had been made in a while.

 Obviously we'd need to deal with several privacy issues, but if we wind up
 being allowed to run HSs on Mlab nodes, then we could possibly have the
 HTTP comparison done through that.
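 The comparison the backend would do could be as simple as measuring how
 similar the client-reported body is to a control fetch. A minimal sketch
 using Python's stdlib difflib; the function name and the 0.8 threshold
 are invented for illustration, not part of any existing design:

```python
import difflib

def looks_like_block_page(expected_body, reported_body, threshold=0.8):
    """Compare the page a client reports against the control fetch.

    Returns True when the similarity ratio falls below the threshold,
    suggesting the client may have received a block page instead of
    the real content. The 0.8 threshold is an arbitrary placeholder.
    """
    ratio = difflib.SequenceMatcher(None, expected_body, reported_body).ratio()
    return ratio < threshold

# A substantially different body should trip the check:
print(looks_like_block_page("<html>real content</html>",
                            "<html>This page is blocked by order of ...</html>"))
```

 A real deployment would need fuzzier matching (pages legitimately vary by
 locale, ads, timestamps), which is where the machine-learning discussion
 below comes in.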

 I have done a bit of research into support vector machines and of course
 have studied Bayesian inference, but I'm not a machine learning expert. I
 do know from the experience of spending two years training a
 lexicographical fully-recurrent backpropagating neural network that
 training is about as much fun as punching yourself in the face. And,
 though I have not worked with them, and it is also a fast-progressing
 field, I believe that SVMs have trouble with fitting when the training
 and data sets are large, because the radial basis function doesn't center
 on the data points correctly. There is also another thing which is much,
 much simpler and easier to train, called a Relevance Vector Machine,
 which is basically just the covariance between the training and
 experimental sets, applied against a Gaussian distribution over a
 multidimensional space which represents "the test field"; defining the
 test field in an optimized fashion is where the kernel trick comes in.
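 For reference, the radial basis function kernel mentioned above measures
 similarity between two feature vectors: it is 1.0 for identical points
 and decays toward 0 with distance. A minimal sketch in plain Python; the
 gamma value is an arbitrary choice for illustration:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * ||x - y||^2).

    x and y are equal-length sequences of floats. The gamma parameter
    controls how quickly similarity falls off with distance; 0.5 here
    is an arbitrary placeholder, not a recommended setting.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0]))  # squared distance 25 -> exp(-12.5)
```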

 I do not know. I think that if there exists a feasible machine learning
 algorithm for determining whether a page has been changed (if that even
 happens), or for giving us a regex set describing the block pages, then
 the censors would use it to find the pages.
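 To make the "regex set" idea concrete, matching against known block-page
 signatures could look like the sketch below. The patterns here are
 invented examples; real signatures would have to be collected from
 observed censored responses:

```python
import re

# Hypothetical regexes for telltale block-page phrases; these are
# illustrative placeholders, not signatures from any real censor.
BLOCK_PATTERNS = [
    re.compile(r"access\s+denied", re.IGNORECASE),
    re.compile(r"this\s+(site|page)\s+(is|has been)\s+blocked", re.IGNORECASE),
]

def matches_block_signature(body):
    """Return True if any known block-page pattern appears in the body."""
    return any(p.search(body) for p in BLOCK_PATTERNS)

print(matches_block_signature("<h1>This site is blocked</h1>"))  # True
print(matches_block_signature("<h1>Welcome</h1>"))               # False
```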

 That said, I looked into libraries for hacking on this. There is a thing
 called OrangePy which looks pretty good, and I've played with PyBrain
 before and it wasn't too bad.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6180#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list