<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    We had a mini hackfest at noisebridge on OONI and this thread is to

    summarize what we discussed.<br>

    <br>

    mct: feel free to add things from your notes that are not detailed

    here.<br>

    <br>

    One of the problems that we mostly focused our attention on was how

    to detect censorship on HTTP when the user is presented with a block

    page. The reason for focusing on this is that censorship over HTTP

    is the most prevalent form.<br>

    <br>

    # Detecting censorship in web sites<br>

    <br>

    Detecting that the presented web page is not the expected site, but

    is a censorship page turns out to be a non trivial task, especially

    when you are dealing with web pages that have dynamic content. This

    problem is detailed in this trac ticket:

    <a class="moz-txt-link-freetext" href="https://trac.torproject.org/projects/tor/ticket/6180">https://trac.torproject.org/projects/tor/ticket/6180</a>.<br>

    <br>

    We ended up dividing the possible approaches into two macro

    categories: Statistical and Heuristic [1]<br>

    <br>

    One of the properties we would like our classifier to have is that

    that a user should be able to detect that the presented page is the

    block page by just having a copy of OONI and some data that can be

    given to them over a non censored channel (sneakernet for example).<br>

    <br>

    ## DOMClass, a eigenvalue based statistical classifier<br>

    <br>

    What we ended up implementing is a classifier that considers the DOM

    structure of the web page. We can easily build such a database of

    the DOM structure features of the websites we are interested in

    analyzing and ship these to users. <br>

    <br>

    This is how the algorithm works:<br>

    <br>

          This classifier uses the DOM structure of a website to

    determine how similar<br>

          the two sites are.<br>

          The procedure we use is the following:<br>

              * First we parse all the DOM tree of the web page and we

    build a list of<br>

                TAG parent child relationships (ex.

    <html><a><b></b></a><c></c></html>

    =><br>

                (html, a), (a, b), (html, c)).<br>

      <br>

              * We then use this information to build a matrix (M) where

    m[i][j] = P(of<br>

                transitioning from tag[i] to tag[j]). If tag[i] does not

    exists P() = 0.<br>

                Note: M is a square matrix that is number_of_tags wide.<br>

      <br>

              * We then calculate the eigenvectors (v_i) and eigenvalues

    (e) of M.<br>

      <br>

              * The corelation between page A and B is given via this

    formula:<br>

                correlation = dot_product(e_A, e_B), where e_A and e_B

    are<br>

                resepectively the eigenvalues for the probability matrix

    A and the<br>

                probability matrix B.<br>

    <br>

    You will note that the Matrix we are computing eigenvalues for

    somewhat resembles a Markov model for transitions between DOM

    element X to Y.<br>

    <br>

    This algorithm appears to work pretty well even on highly dynamic

    web sites (such as the homepage of news.google.com).<br>

    <br>

    Problems with this algorithm:<br>

    <br>

    * It does not take into account the global position of DOM elements

    or how deeply nested they are.

    (<a><b><a><b></b></a></b></a>

    (a,b),(b,a),(a,b) is equivalent to

    <a><b><a></a></b></a><a><b></b></a>

    (a,b), (b,a), (a,b))<br>

    <br>

    Looking into other possible solutions to this problem I took a look

    at algorithms that are used to compute graph similarity.<br>

    <br>

    This problem could be solved by using an algorithm that calculates

    the maximum isomorphic subgraph, but this problem is NP-hard. [2]<br>

    I am unsure if the computation effort to use algorithms along these

    lines is worth while.<br>

    <br>

    I can't seem to find the notes on the rest of the discussion,

    perhaps mct will integrate with this discussion.<br>

    <br>

    - Art.<br>

    <br>

    [1] <a class="moz-txt-link-freetext" href="https://trac.torproject.org/projects/tor/ticket/6180#comment:3">https://trac.torproject.org/projects/tor/ticket/6180#comment:3</a><br>

    [2]

    <meta http-equiv="content-type" content="text/html;

      charset=ISO-8859-1">

    <a href="http://en.wikipedia.org/wiki/Subgraph_isomorphism_problem">

      <meta http-equiv="content-type" content="text/html;

        charset=ISO-8859-1">

    </a><a

href="http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem">http://en.wikipedia.org/wiki/Maximum_common_subgraph_isomorphism_problem</a>

    <meta http-equiv="content-type" content="text/html;

      charset=ISO-8859-1">

  </body>

</html>