<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body bgcolor="#FFFFFF" text="#000000">
We had a mini hackfest at noisebridge on OONI and this thread is to
summarize what we discussed.<br>
<br>
mct: feel free to add things from your notes that are not detailed
here.<br>
<br>
One of the problems that we mostly focused our attention on was how
to detect censorship on HTTP when the user is presented with a block
page. The reason for focusing on this is that censorship over HTTP
is the most prevalent form.<br>
<br>
# Detecting censorship in web sites<br>
<br>
Detecting that the presented web page is not the expected site, but
is a censorship page turns out to be a non trivial task, especially
when you are dealing with web pages that have dynamic content. This
problem is detailed in this trac ticket:
<a class="moz-txt-link-freetext" href="https://trac.torproject.org/projects/tor/ticket/6180">https://trac.torproject.org/projects/tor/ticket/6180</a>.<br>
<br>
We ended up dividing the possible approaches into two macro
categories: Statistical and Heuristic [1]<br>
<br>
One of the properties we would like our classifier to have is that
that a user should be able to detect that the presented page is the
block page by just having a copy of OONI and some data that can be
given to them over a non censored channel (sneakernet for example).<br>
<br>
## DOMClass, a eigenvalue based statistical classifier<br>
<br>
What we ended up implementing is a classifier that considers the DOM
structure of the web page. We can easily build such a database of
the DOM structure features of the websites we are interested in
analyzing and ship these to users. <br>
<br>
This is how the algorithm works:<br>
<br>
This classifier uses the DOM structure of a website to
determine how similar<br>
the two sites are.<br>
The procedure we use is the following:<br>
* First we parse all the DOM tree of the web page and we
build a list of<br>
TAG parent child relationships (ex.
<html><a><b></b></a><c></c></html>
=><br>
(html, a), (a, b), (html, c)).<br>
<br>
* We then use this information to build a matrix (M) where
m[i][j] = P(of<br>
transitioning from tag[i] to tag[j]). If tag[i] does not
exists P() = 0.<br>
Note: M is a square matrix that is number_of_tags wide.<br>
<br>
* We then calculate the eigenvectors (v_i) and eigenvalues
(e) of M.<br>
<br>
* The corelation between page A and B is given via this
formula:<br>
correlation = dot_product(e_A, e_B), where e_A and e_B
are<br>
resepectively the eigenvalues for the probability matrix
A and the<br>
probability matrix B.<br>
<br>
You will note that the Matrix we are computing eigenvalues for
somewhat resembles a Markov model for transitions between DOM
element X to Y.<br>
<br>
This algorithm appears to work pretty well even on highly dynamic
web sites (such as the homepage of news.google.com).<br>
<br>
Problems with this algorithm:<br>
<br>
* It does not take into account the global position of DOM elements
or how deeply nested they are.
(<a><b><a><b></b></a></b></a>
(a,b),(b,a),(a,b) is equivalent to
<a><b><a></a></b></a><a><b></b></a>
(a,b), (b,a), (a,b))<br>
<br>
Looking into other possible solutions to this problem I took a look
at algorithms that are used to compute graph similarity.<br>
<br>
This problem could be solved by using an algorithm that calculates
the maximum isomorphic subgraph, but this problem is NP-hard. [2]<br>
I am unsure if the computation effort to use algorithms along these
lines is worth while.<br>
<br>
I can't seem to find the notes on the rest of the discussion,
perhaps mct will integrate with this discussion.<br>
<br>
- Art.<br>
<br>
[1] <a class="moz-txt-link-freetext" href="https://trac.torproject.org/projects/tor/ticket/6180#comment:3">https://trac.torproject.org/projects/tor/ticket/6180#comment:3</a><br>
[2]
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
<a href="http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem">
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</a><a
href="http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem">http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem</a>
<meta http-equiv="content-type" content="text/html;
charset=ISO-8859-1">
</body>
</html>