We had a mini hackfest at noisebridge on OONI and this thread is to
summarize what we discussed.
mct: feel free to add things from your notes that are not detailed
here.
The problem we focused most of our attention on was how to detect
censorship of HTTP when the user is presented with a block page. The
reason for focusing on this is that censorship over HTTP is the most
prevalent form.
# Detecting censorship in web sites
Detecting that the presented web page is not the expected site, but a
censorship block page, turns out to be a non-trivial task, especially
when you are dealing with web pages that have dynamic content. This
problem is detailed in this trac ticket:
https://trac.torproject.org/projects/tor/ticket/6180.
We ended up dividing the possible approaches into two macro
categories: statistical and heuristic [1].
One of the properties we would like our classifier to have is that a
user should be able to detect that the presented page is the block
page by just having a copy of OONI and some data that can be given to
them over a non-censored channel (sneakernet, for example).
## DOMClass, an eigenvalue-based statistical classifier
What we ended up implementing is a classifier that considers the DOM
structure of the web page. We can easily build a database of DOM
structure features for the websites we are interested in analyzing
and ship it to users.
This is how the algorithm works. The classifier uses the DOM
structure of a website to determine how similar two sites are. The
procedure we use is the following:

* First we parse the DOM tree of the web page and build a list of tag
parent-child relationships (e.g. <html><a><b></b></a><c></c></html>
=> (html, a), (a, b), (html, c)).
* We then use this information to build a matrix M where m[i][j] =
P(transitioning from tag[i] to tag[j]). If tag[i] does not exist,
P() = 0. Note that M is a square matrix whose dimension is the number
of tags.
* We then calculate the eigenvectors (v_i) and eigenvalues (e) of M.
* The correlation between page A and page B is given by this formula:
correlation = dot_product(e_A, e_B), where e_A and e_B are
respectively the eigenvalues of probability matrix A and probability
matrix B.
You will note that the matrix we are computing eigenvalues for
somewhat resembles a Markov model of transitions from DOM element X
to DOM element Y.
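Here is a minimal, runnable Python sketch of this procedure (not the
actual OONI code). It assumes a fixed alphabet of HTML tags shared by
both pages so that the matrices have the same dimensions, and it uses
the standard library's HTMLParser in place of a real DOM parser; all
the names (TransitionCollector, transition_matrix, dom_fingerprint,
correlation) are mine:

```python
from html.parser import HTMLParser

import numpy as np

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}

class TransitionCollector(HTMLParser):
    """Walk the DOM and record (parent, child) tag relationships."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.pairs = []   # (parent, child) relationships

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.pairs.append((self.stack[-1], tag))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def transition_matrix(html, tags):
    """Build M where m[i][j] = P(child is tags[j] | parent is tags[i]),
    estimated from the observed (parent, child) counts."""
    collector = TransitionCollector()
    collector.feed(html)
    index = {t: i for i, t in enumerate(tags)}
    m = np.zeros((len(tags), len(tags)))
    for parent, child in collector.pairs:
        if parent in index and child in index:
            m[index[parent], index[child]] += 1
    row_sums = m.sum(axis=1, keepdims=True)
    # Rows whose tag never appears as a parent stay all-zero (P() = 0).
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums != 0)

def dom_fingerprint(html, tags):
    """Absolute eigenvalues of the transition matrix, sorted in
    descending order and normalized to unit length."""
    eigenvalues = np.absolute(np.linalg.eigvals(transition_matrix(html, tags)))
    fingerprint = np.sort(eigenvalues)[::-1]
    norm = np.linalg.norm(fingerprint)
    return fingerprint / norm if norm else fingerprint

def correlation(fp_a, fp_b):
    """correlation = dot_product(e_A, e_B)"""
    return float(np.dot(fp_a, fp_b))
```

The descending sort and unit-length normalization are my additions:
they make the dot product independent of eigenvalue ordering and
bound the correlation to [0, 1].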
This algorithm appears to work pretty well even on highly dynamic
web sites (such as the homepage of news.google.com).
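As a hypothetical end-to-end usage of the sketch above (the
filenames, tag list, and 0.9 threshold are placeholders I made up,
not tested values):

```python
import json

import numpy as np

# A common tag alphabet has to be agreed on when the database is built.
TAGS = ["html", "head", "body", "div", "span", "a", "p", "img",
        "table", "tr", "td"]

# Database side: fingerprint a known-good copy of the page and ship it
# to users together with OONI (or over sneakernet).
good = dom_fingerprint(open("news.google.com.html").read(), TAGS)
json.dump({"news.google.com": good.tolist()}, open("fingerprints.json", "w"))

# User side: fingerprint what the network actually returned and compare.
db = json.load(open("fingerprints.json"))
seen = dom_fingerprint(open("as_seen_by_user.html").read(), TAGS)
if correlation(np.array(db["news.google.com"]), seen) < 0.9:
    print("DOM structure differs significantly; possibly a block page")
```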
Problems with this algorithm:

* It does not take into account the global position of DOM elements
or how deeply nested they are: <a><b><a><b></b></a></b></a>, which
yields (a,b), (b,a), (a,b), is equivalent to
<a><b><a></a></b></a><a><b></b></a>, which also yields (a,b), (b,a),
(a,b) (demonstrated below).
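You can reproduce this with the TransitionCollector from the sketch
above; both structures produce the same multiset of pairs:

```python
nested = "<a><b><a><b></b></a></b></a>"
flat = "<a><b><a></a></b></a><a><b></b></a>"

for html in (nested, flat):
    c = TransitionCollector()
    c.feed(html)
    # Both print [('a', 'b'), ('a', 'b'), ('b', 'a')]: the pair list
    # is blind to nesting depth and global position.
    print(sorted(c.pairs))
```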
Looking into other possible solutions to this problem, I took a look
at algorithms that are used to compute graph similarity.
This problem could be solved by using an algorithm that calculates
the maximum common subgraph, but that problem is NP-hard. [2]
I am unsure whether the computational effort of algorithms along
these lines is worthwhile.
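To make the cost concrete, here is what a brute-force maximum common
(induced) subgraph computation could look like with networkx; this is
only an illustration of why the NP-hardness bites, not something I am
proposing to ship, since the subset enumeration is exponential in the
number of DOM nodes:

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def max_common_subgraph_size(a, b):
    """Size of the largest induced subgraph of `a` that also embeds
    in `b`. Worst case enumerates O(2^|a|) node subsets."""
    for size in range(len(a), 0, -1):
        for nodes in combinations(a.nodes, size):
            candidate = a.subgraph(nodes)
            if GraphMatcher(b, candidate).subgraph_is_isomorphic():
                return size
    return 0
```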
I can't seem to find the notes on the rest of the discussion; perhaps
mct will fill in the rest.
- Art.
[1] https://trac.torproject.org/projects/tor/ticket/6180#comment:3
[2] http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem