[tor-dev] [ooniprobe-dev] Summary of OONI hackfest

Arturo Filastò art at torproject.org
Wed Aug 22 07:01:21 UTC 2012


We had a mini hackfest at noisebridge on OONI and this thread is to 
summarize what we discussed.

mct: feel free to add things from your notes that are not detailed here.

One of the problems that we mostly focused our attention on was how to 
detect censorship on HTTP when the user is presented with a block page. 
The reason for focusing on this is that censorship over HTTP is the most 
prevalent form.

# Detecting censorship in web sites

Detecting that the presented web page is not the expected site, but is a 
censorship page turns out to be a non trivial task, especially when you 
are dealing with web pages that have dynamic content. This problem is 
detailed in this trac ticket: 
https://trac.torproject.org/projects/tor/ticket/6180.

We ended up dividing the possible approaches into two macro categories: 
Statistical and Heuristic [1]

One of the properties we would like our classifier to have is that that 
a user should be able to detect that the presented page is the block 
page by just having a copy of OONI and some data that can be given to 
them over a non censored channel (sneakernet for example).

## DOMClass, a eigenvalue based statistical classifier

What we ended up implementing is a classifier that considers the DOM 
structure of the web page. We can easily build such a database of the 
DOM structure features of the websites we are interested in analyzing 
and ship these to users.

This is how the algorithm works:

       This classifier uses the DOM structure of a website to determine 
how similar
       the two sites are.
       The procedure we use is the following:
           * First we parse all the DOM tree of the web page and we 
build a list of
             TAG parent child relationships (ex. 
<html><a><b></b></a><c></c></html> =>
             (html, a), (a, b), (html, c)).

           * We then use this information to build a matrix (M) where 
m[i][j] = P(of
             transitioning from tag[i] to tag[j]). If tag[i] does not 
exists P() = 0.
             Note: M is a square matrix that is number_of_tags wide.

           * We then calculate the eigenvectors (v_i) and eigenvalues 
(e) of M.

           * The corelation between page A and B is given via this formula:
             correlation = dot_product(e_A, e_B), where e_A and e_B are
             resepectively the eigenvalues for the probability matrix A 
and the
             probability matrix B.

You will note that the Matrix we are computing eigenvalues for somewhat 
resembles a Markov model for transitions between DOM element X to Y.

This algorithm appears to work pretty well even on highly dynamic web 
sites (such as the homepage of news.google.com).

Problems with this algorithm:

* It does not take into account the global position of DOM elements or 
how deeply nested they are. (<a><b><a><b></b></a></b></a> 
(a,b),(b,a),(a,b) is equivalent to <a><b><a></a></b></a><a><b></b></a> 
(a,b), (b,a), (a,b))

Looking into other possible solutions to this problem I took a look at 
algorithms that are used to compute graph similarity.

This problem could be solved by using an algorithm that calculates the 
maximum isomorphic subgraph, but this problem is NP-hard. [2]
I am unsure if the computation effort to use algorithms along these 
lines is worth while.

I can't seem to find the notes on the rest of the discussion, perhaps 
mct will integrate with this discussion.

- Art.

[1] https://trac.torproject.org/projects/tor/ticket/6180#comment:3
[2] 
<http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem>http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20120822/a4a0780d/attachment.html>


More information about the tor-dev mailing list