We had a mini hackfest on OONI at Noisebridge, and this thread summarizes what we discussed.
mct: feel free to add things from your notes that are not detailed here.
The problem we focused most of our attention on was how to detect censorship of HTTP traffic when the user is presented with a block page. The reason for focusing on this is that censorship over HTTP is the most prevalent form.
# Detecting censorship in web sites
Detecting that the presented web page is not the expected site, but rather a censorship block page, turns out to be a non-trivial task, especially when you are dealing with web pages that have dynamic content. This problem is detailed in this trac ticket: https://trac.torproject.org/projects/tor/ticket/6180.
We ended up dividing the possible approaches into two macro categories: statistical and heuristic. [1]
One of the properties we would like our classifier to have is that a user should be able to detect that the presented page is the block page just by having a copy of OONI and some data that can be given to them over a non-censored channel (sneakernet, for example).
## DOMClass, an eigenvalue-based statistical classifier
What we ended up implementing is a classifier that considers the DOM structure of the web page. We can easily build a database of DOM structure features for the websites we are interested in analyzing and ship it to users.
This is how the algorithm works:
This classifier uses the DOM structure of a website to determine how similar two pages are. The procedure we use is the following:

* First we parse the DOM tree of the web page and build a list of tag parent-child relationships (ex. <html><a><b></b></a><c></c></html> => (html, a), (a, b), (html, c)).
* We then use this information to build a matrix (M) where m[i][j] = P(of transitioning from tag[i] to tag[j]). If the transition from tag[i] to tag[j] never occurs, P() = 0. Note: M is a square matrix that is number_of_tags wide.
* We then calculate the eigenvectors (v_i) and eigenvalues (e) of M.
* The correlation between page A and page B is given by the formula: correlation = dot_product(e_A, e_B), where e_A and e_B are respectively the eigenvalues of the probability matrices of page A and page B.
You will note that the matrix we are computing eigenvalues for somewhat resembles a Markov model of transitions from one DOM element to another.
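To make the procedure concrete, here is a minimal sketch in Python, assuming bs4 and numpy are available. The function names are illustrative and this is not the actual DOMClass code; in particular, since M is generally not symmetric its eigenvalues can be complex, and the summary above does not say how to handle that (the sketch takes their magnitudes) or how to compare eigenvalue vectors of different lengths (the sketch zero-pads the shorter one).

```
from bs4 import BeautifulSoup
import numpy as np

def tag_pairs(html):
    # Walk the DOM and record every (parent, child) tag transition.
    soup = BeautifulSoup(html, "html.parser")
    return [(tag.parent.name, tag.name)
            for tag in soup.find_all(True)
            if tag.parent.name != "[document]"]

def transition_matrix(pairs):
    # m[i][j] = P(of transitioning from tag[i] to tag[j]); rows for
    # tags that never transition anywhere stay all-zero.
    tags = sorted({t for pair in pairs for t in pair})
    index = {t: i for i, t in enumerate(tags)}
    M = np.zeros((len(tags), len(tags)))
    for parent, child in pairs:
        M[index[parent], index[child]] += 1
    totals = M.sum(axis=1, keepdims=True)
    return np.divide(M, totals, out=np.zeros_like(M), where=totals != 0)

def fingerprint(html):
    # The page fingerprint: eigenvalue magnitudes of M, largest first.
    eig = np.linalg.eigvals(transition_matrix(tag_pairs(html)))
    return np.sort(np.abs(eig))[::-1]

def correlation(e_a, e_b):
    # correlation = dot_product(e_A, e_B), zero-padding the shorter
    # vector so pages with different tag sets can still be compared.
    n = max(len(e_a), len(e_b))
    return float(np.dot(np.pad(e_a, (0, n - len(e_a))),
                        np.pad(e_b, (0, n - len(e_b)))))
```

In use, the fingerprints for the sites of interest would be computed ahead of time and shipped to the user, who then only compares the page they actually received against the expected fingerprint (THRESHOLD is a hypothetical cutoff that would need to be tuned):

```
expected = fingerprint(known_good_html)  # shipped with OONI
observed = fingerprint(fetched_html)     # what the user received
blocked = correlation(expected, observed) < THRESHOLD
```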
This algorithm appears to work pretty well even on highly dynamic web sites (such as the homepage of news.google.com).
Problems with this algorithm:
* It does not take into account the global position of DOM elements or how deeply they are nested. For example, <a><b><a><b></b></a></b></a> and <a><b><a></a></b></a><a><b></b></a> both yield the transition list (a,b), (b,a), (a,b), so the classifier treats them as identical, as the snippet below demonstrates.
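Using the tag_pairs helper from the sketch above:

```
# Both fragments reduce to the same transition list, so DOMClass
# cannot tell them apart despite the different nesting.
deep = "<a><b><a><b></b></a></b></a>"
flat = "<a><b><a></a></b></a><a><b></b></a>"
print(tag_pairs(deep))  # [('a', 'b'), ('b', 'a'), ('a', 'b')]
print(tag_pairs(flat))  # [('a', 'b'), ('b', 'a'), ('a', 'b')]
```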
Looking into other possible solutions to this problem, I took a look at algorithms that are used to compute graph similarity.
This problem could be solved by using an algorithm that calculates the maximum common subgraph, but that problem is NP-hard. [2] I am unsure whether the computational effort of algorithms along these lines is worthwhile.
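For a rough idea of what this direction would involve, here is a hedged sketch using networkx's subgraph isomorphism matcher (the decision version of the problem in [2], not the maximum common subgraph itself). dom_graph and the two placeholder pages are illustrative, and this is not something we implemented:

```
import networkx as nx
from networkx.algorithms import isomorphism
from bs4 import BeautifulSoup

def dom_graph(html):
    # Directed graph of the DOM; each node carries its tag name.
    G = nx.DiGraph()
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):
        G.add_node(id(tag), tag=tag.name)
        if tag.parent.name != "[document]":
            G.add_edge(id(tag.parent), id(tag))
    return G

page_a = "<html><a><b></b></a><c></c></html>"  # placeholder documents
page_b = "<html><a><b></b></a></html>"
matcher = isomorphism.DiGraphMatcher(
    dom_graph(page_a), dom_graph(page_b),
    node_match=lambda n1, n2: n1["tag"] == n2["tag"])
print(matcher.subgraph_is_isomorphic())  # True: page_b embeds in page_a
```

Even this decision version is NP-complete in general, so running it on real, large DOMs is likely impractical.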
I can't seem to find the notes on the rest of the discussion; perhaps mct can fill that in from his notes.
- Art.
[1] https://trac.torproject.org/projects/tor/ticket/6180#comment:3
[2] http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem, http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem