[ooni-dev] Volunteering some developer time

Arturo Filastò art at torproject.org
Fri Feb 13 15:18:11 UTC 2015


On 2/11/15 1:34 AM, Kevin Murray wrote:
> Hi,
> 

Hi Kevin,

Thanks for your interest in OONI!

> I'd love to get involved with the OONI project. I'm doing a PhD in
> high-performance/scientific computing, working with large experimental
> datasets. I have experience coding in C, python and R, and a modest
> understanding of statistics. I have also contributed code (mostly test cases)
> to little-t tor.
> 

These are all very useful skills, especially in light of what we
perceive as the highest-priority next steps (i.e. analytics and
visualization of the data collected by ooniprobe).

Do you have experience working with MongoDB or similar NoSQL databases?

> Is there a particular part of the OONI infrastructure that would like a
> volunteer? If possible, it would be great to have a longer-term project,
> working with a mentor or similar, though I know everyone is very busy. I'm
> happy to work on any part of OONI.

I would be very happy to mentor you through working on the ooni-pipeline
(https://github.com/thetorproject/ooni-pipeline), which is currently
where most of the development effort is focused.

The next steps on that front I believe are:

1) Refactor the data structure of the reports and measurements stored
in the MongoDB database.

We have learned the hard way that MongoDB does not behave like a
traditional relational database, in the sense that JOIN operations are
not particularly efficient. For this reason I think that instead of
splitting the report header and the measurements into 2 different
collections we should just put everything inside of 1. This one
collection will have all of the report fields plus a "measurements"
list that contains all the measurements (which were previously stored
inside of another collection that referenced the report entry).
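
To give an idea, a denormalized report document could look roughly
like this (sketched with pymongo; all the field names here are
illustrative, not the final schema):

    from pymongo import MongoClient

    db = MongoClient()["ooni"]  # database name is illustrative

    # Rough sketch of a denormalized report document: the measurements
    # that previously lived in their own collection are embedded as a
    # list inside the report itself.
    db.reports.insert({
        "report_id": "2015-02-01T000000Z-IT-AS137-http_requests",
        "test_name": "http_requests",
        "probe_cc": "IT",
        "probe_asn": "AS137",
        "measurements": [
            {"input": "http://example.com/", "anomaly": False},
            {"input": "http://example.org/", "anomaly": True},
        ],
    })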

This task is actually already implemented here:
https://github.com/TheTorProject/ooni-pipeline/commit/3be900736472a15b33e67a7f4cba8d6e9912571e#diff-efac5c63947a6db82050ce94f7bf283cR47

and I have run the import task on the new pipeline.

What now needs to change is the frontend and the HTTP API to the
database, which can be found here:
https://github.com/hellais/ooni-app

In particular what needs to change is:
https://github.com/hellais/ooni-app/blob/master/app/controllers/reports.server.controller.js#L14

and

https://github.com/hellais/ooni-app/blob/master/app/controllers/reports.server.controller.js#L64
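
In terms of database access the change boils down to something like
the following (expressed with pymongo for brevity since the pipeline
is in python, though the controller itself is node.js; the names are
illustrative):

    from pymongo import MongoClient

    db = MongoClient()["ooni"]
    rid = "some-report-id"  # illustrative

    # Before: two collections, joined in the application layer.
    # report = db.reports.find_one({"report_id": rid})
    # measurements = list(db.measurements.find({"report_id": rid}))

    # After: a single query returns the report together with its
    # embedded measurements.
    report = db.reports.find_one({"report_id": rid})
    measurements = report["measurements"]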


2) Come up with queries that will give us all the reports that are to be
considered "interesting".

Depending on the type of OONI test, some elements of the result are a
symptom of a network anomaly that can be a sign of internet censorship.
We should develop a set of MongoDB queries that give us, for every
test, the measurements that contain "interesting" results.
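
For example, assuming the measurements carry a flag along the lines of
"body_length_match" (an assumption on my part, the actual fields will
depend on the schema from step 1), such a query could be as simple as:

    from pymongo import MongoClient

    db = MongoClient()["ooni"]

    # Hypothetical query: http_requests measurements whose experiment
    # body length diverged from the control. Field names are
    # assumptions pending the schema refactoring from step 1.
    interesting = db.reports.find({
        "test_name": "http_requests",
        "measurements": {"$elemMatch": {"body_length_match": False}},
    })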

If you look at the "entry_filter" method of oonireader you can see
what the signs of network anomalies are for some of the most common
ooni-probe measurements:
https://github.com/TheTorProject/ooni-reader/blob/master/oonireader/nettests.py

These should be refactored to either be part of the ooni-pipeline or,
even better, become a set of MongoDB queries to be run against the
database (bonus points if we can come up with a smart way of caching
the results of these queries).
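
For the caching bonus, a naive approach (just a sketch, none of this
exists yet) would be to key cached results on a hash of the query
document plus a TTL:

    import hashlib
    import json
    import time

    CACHE_TTL = 3600  # seconds; illustrative

    def cached_query(db, query):
        # Key the cache entry on a stable hash of the query document.
        key = hashlib.sha1(
            json.dumps(query, sort_keys=True).encode()).hexdigest()
        hit = db.query_cache.find_one({"_id": key})
        if hit and time.time() - hit["ts"] < CACHE_TTL:
            return hit["results"]
        # Cache miss (or stale entry): run the query and store the
        # results. Note this ignores the 16MB document size limit,
        # which a real implementation would have to deal with.
        results = list(db.reports.find(query))
        db.query_cache.save(
            {"_id": key, "ts": time.time(), "results": results})
        return results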

3) Devise a methodology for removing the dependency on Tor in the HTTP
Requests test, while still having control measurements close enough to
the probing time.

Currently when you run the HTTP Requests test the probe will perform
an HTTP request over the local network and one via Tor. The two
results are compared to assess whether the expected result (the one
over Tor) matches the experiment result (the one over the local
network).
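
To make the comparison concrete, a simplified version of the idea is
something like this (a gross simplification: the tolerance value is
made up and the real comparison involves more than body lengths):

    def body_lengths_match(control_len, experiment_len, tolerance=0.7):
        # Flag an anomaly when the body fetched over the local network
        # is much smaller or larger than the one fetched over Tor.
        if max(control_len, experiment_len) == 0:
            return True
        return (min(control_len, experiment_len) /
                float(max(control_len, experiment_len))) > tolerance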

This presents a variety of different problems: some sites will block the
Tor network, some operators live in a country where Tor is blocked, etc.

For this reason I think we should expose an HTTP service that allows
an ooniprobe to request some metadata associated with a certain
website (for example which HTTP headers it returns and how long its
body is) or to obtain the full payload of the response.
If another operator requests the same site within a close enough time
range we should not issue the request again, but just serve a cached
copy of it.
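
A minimal sketch of what such a service could look like (using flask
and requests purely for illustration; the endpoint and cache policy
are made up):

    import time

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    CACHE = {}      # url -> (timestamp, metadata)
    MAX_AGE = 3600  # serve a cached copy if fetched within the hour

    @app.route("/metadata")
    def metadata():
        url = request.args.get("url")
        cached = CACHE.get(url)
        if cached and time.time() - cached[0] < MAX_AGE:
            # Another probe asked for this site recently: do not
            # issue the request again, serve the cached copy.
            return jsonify(cached[1])
        resp = requests.get(url)
        meta = {
            "headers": dict(resp.headers),
            "body_length": len(resp.content),
        }
        CACHE[url] = (time.time(), meta)
        return jsonify(meta)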

An alternative would be to just query all of the sites that we want
users to probe at a fixed interval (say 1 hour) and do all the
analysis post-submission (so we would look at the control measurement
that is closest in time to the experiment measurement).
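
Matching post-submission would then just be a matter of picking the
control measurement closest in time to the experiment, e.g.:

    def closest_control(controls, experiment_time):
        # controls: list of (timestamp, measurement) pairs fetched at
        # a fixed interval; pick the one taken closest in time to the
        # experiment measurement.
        return min(controls, key=lambda c: abs(c[0] - experiment_time))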

These are the first things that come to mind, though feel free to also
look at the open trac tickets for ooni-probe and tell me if there is
something in particular that sparks your interest:
https://trac.torproject.org/projects/tor/query?status=accepted&status=assigned&status=needs_information&status=needs_review&status=needs_revision&status=new&status=reopened&component=Ooni&max=200&col=id&col=summary&col=component&col=status&col=type&col=priority&col=milestone&order=priority

In particular you may find this one interesting:
https://trac.torproject.org/projects/tor/ticket/13731

I would also be very happy to discuss other possible options further
with you, either on IRC (#ooni on irc.oftc.net) or via another channel
of communication of your choice.

Have fun!

~ Arturo

