[tor-dev] OONI hackfest summary
art at torproject.org
Tue Nov 4 10:26:57 UTC 2014
From October 24th to 26th the OONI team gathered in Berlin for a
hackfest. Around 20 people ended up showing up, and although most of
them were seasoned Oonitarians, some fairly new people joined us who I
hope will become part of the growing OONI community.
The scope of the hackfest was data analytics and visualization, with a
special focus on the Tor bridge reachability study we are currently
conducting.
# Bridge reachability study
The goal of this study is to answer some questions concerning the
blocking of Tor bridges and pluggable transport enabled bridges in
China, Iran, Russia and Ukraine (our test vantage points).
To establish a baseline, and so eliminate the cases in which a bridge
is marked as blocked while it is in fact just offline, we also measure
from a control vantage point located in the Netherlands.
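The baseline logic can be sketched roughly as follows (the function
name and labels are illustrative, not part of the actual OONI
codebase):

```python
def classify_bridge(test_ok, control_ok):
    """Combine a measurement from a test vantage point with the
    control measurement taken from the Netherlands.

    test_ok / control_ok are booleans: did the bridge work from
    that vantage point? The returned labels are illustrative.
    """
    if not control_ok:
        # The bridge does not work even from the uncensored control,
        # so it is most likely just offline.
        return "offline"
    if not test_ok:
        # Works from the control but not from the test vantage point:
        # a strong hint that it is being blocked there.
        return "blocked"
    return "reachable"
```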
For every test vantage point we perform two types of measurements:
* A bridge reachability measurement that attempts to build a Tor
circuit using the bridge in question
* A TCP connect measurement that simply does a TCP connect to the
bridge IP and port
We run both measurements so that we can further debug why the blocking
is happening, whether it is due to a TCP RST, direct IP blocking, or a
tor malfunction.
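As a rough sketch, the TCP connect measurement boils down to something
like the following (the result field names are illustrative, not the
actual ooni-probe report schema):

```python
import socket

def tcp_connect(host, port, timeout=10):
    """Attempt a plain TCP connection to a bridge's IP and port and
    record whether it succeeded, and if not, why it failed."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
        return {"status": "success", "failure": None}
    except socket.timeout:
        return {"status": "failure", "failure": "generic_timeout_error"}
    except ConnectionRefusedError:
        return {"status": "failure", "failure": "connection_refused"}
    except OSError as exc:
        return {"status": "failure", "failure": str(exc)}
```

Recording the failure reason (rather than a bare boolean) is what lets
us distinguish, say, a RST injection from a silently dropped SYN.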
So far this study has been running for a little less than 1 month.
# OONI data pipeline
In order to produce the aggregate data needed to build visualizations
we have built a data pipeline.
This consists of a series of operations performed on the raw reports
to strip out sensitive information and place the collected data into a
database.
The nice thing is that the data pipeline we have designed is not
specific to this study: it can, and in the future will, be expanded to
export the data needed to visualize the other types of measurements
done by OONI as well.
The data pipeline is comprised of 3 steps (or states, depending on how
you want to look at it).
When the data is submitted to an OONI collector it is synchronized
with the aggregator machine.
This is a central machine responsible for running all the data
processing tasks, storing the collected data in a database and hosting
a public interface to the sanitised reports. Since all the steps are
independent from one another, they do not all have to run on the same
machine; the setup may also be more distributed.
Once the data is on the aggregator machine it is said to be in the RAW
state. The sanitise task is then run on the RAW data to remove
sensitive information and strip out some superfluous information. A
RAW copy of every report is also stored in a private compressed
archive for future reference.
Once the data is sanitised it is said to be in the SANITISED state. At
this point an import task is run on the data to place it inside a
database. The SANITISED reports are then placed in a directory that is
publicly exposed to the internet, so that people can also download a
copy of the sanitised reports.
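A minimal sketch of what the sanitise step does, assuming hypothetical
key names rather than the real report schema:

```python
# Keys assumed sensitive or superfluous for this sketch; the real
# sanitise task operates on the actual OONI report schema.
SENSITIVE_KEYS = {"probe_ip", "bridge_address"}
SUPERFLUOUS_KEYS = {"test_helpers"}

def sanitise_report(report):
    """Return a copy of a raw report dict with sensitive and
    superfluous keys stripped out; the RAW original is untouched."""
    return {
        k: v for k, v in report.items()
        if k not in SENSITIVE_KEYS | SUPERFLUOUS_KEYS
    }
```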
At this point it is possible to run any export task that performs
queries on the database and produces as output some documents to be
used in the data visualizations (think JSON, CSV, etc.).
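An export task could look something like the sketch below; the table
and column names are assumptions for illustration, not our actual
schema:

```python
import csv
import json
import sqlite3

# Hypothetical column names for this sketch.
COLUMNS = ["country", "transport", "bridge_hash", "success", "measured_at"]

def export_reachability(db_path, json_path, csv_path):
    """Query the measurements table and emit the rows as both JSON
    and CSV documents for the visualization team."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT country, transport, bridge_hash, success, measured_at "
        "FROM measurements ORDER BY measured_at"
    ).fetchall()
    conn.close()
    records = [dict(zip(COLUMNS, row)) for row in rows]
    with open(json_path, "w") as f:
        json.dump(records, f)
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records)
    return records
```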
# The OONI hackfest
The first day of the hackfest was spent going over the scope of the
project we would be working on in the following days, as well as
working in groups interested in tackling the design of one particular
aspect of it.
Sticky notes were plentiful and helped us have a clear vision of what
lay ahead of us.
By the end of the first day we had a clear picture of the set of tasks
needed to achieve our goals and which teams would be responsible for
doing what.
The second day was almost entirely dedicated to hacking, and everybody
had a task to complete by the end of the day. Some people even
completed their initially assigned task early and came back asking for
more!
By the end of the second day we had a real data set to hand over to
the visualization team, so they could start producing some pretty
graphs based on real data.
We decided that the first visualization should be kept as simple as
possible and be something we could also use to debug the data we had
collected. It should tell us which bridges were working when, and it
should present the information in a way that highlights the country
involved and the pluggable transport type.
A prototype of it can be seen here:
The code for this visualization can be found here:
# Next steps
* Write scripts for generating the bridge_db.json document based on
the data that is given to us by the bridge db team
* Align the dates in the visual timeline
* Better tokenising for bridges, so that bridges that have the same
fingerprint but a different transport are grouped properly
* Finish setting up the docker containers for the steps of the data
pipeline
* Set up a disaster recovery and backup procedure
* Set up monitoring of the probes
* Add support for obfs4
* Set an upper bound in the comparison with the control in the bridge
reachability test
* Make sure that the control measurement is for the specific bridge
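The bridge tokenising item above could, for example, group records
like this (the (fingerprint, transport) record shape is a simplifying
assumption for illustration):

```python
from collections import defaultdict

def group_by_fingerprint(bridges):
    """Group bridge records so that entries sharing a fingerprint but
    differing in transport end up under one key.

    `bridges` is an iterable of (fingerprint, transport) pairs.
    """
    groups = defaultdict(set)
    for fingerprint, transport in bridges:
        groups[fingerprint].add(transport)
    return dict(groups)
```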
Questions and comments should be directed to the ooni-dev mailing list
or to the #ooni channel on irc.oftc.net.