[tor-dev] Some ideas on the visualization of OONI data

kudrom kudrom at riseup.net
Thu Oct 16 17:59:00 UTC 2014


Hi all
For the last couple of days I've been thinking about the visualization
of the bridge reachability data and how it relates to the currently
deployed ooni [7] system. Here are my conclusions:

== Variables ==
I think that the statistical variables for the bridge reachability
reports are:
- Success of the nettest (yes, no, errors)
- The PT of the bridge (obfs3, obfs2, fte, vanilla)
- The pool from where the bridge has been extracted (private, tbb,
BridgeDB https, BridgeDB email)
- The country "of the ooni-probe"
With these variables I believe we can answer a lot of questions about
how much censorship is taking place, where, and how.
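
As a rough sketch of the kind of question these variables let us answer,
the snippet below tallies the success rate per group of variables. The
record layout (a dict with "success", "transport", "pool" and "probe_cc"
keys) and the function name are my own assumptions for illustration, not
the actual report format.

```python
from collections import Counter


def success_rate_by(records, *keys):
    """Return {group: fraction of successful nettests} grouped by `keys`.

    Assumption: each record is a dict whose "success" value is one of
    "yes", "no" or "errors"; anything other than "yes" counts as a failure.
    """
    totals = Counter()
    successes = Counter()
    for record in records:
        group = tuple(record[key] for key in keys)
        totals[group] += 1
        if record["success"] == "yes":
            successes[group] += 1
    return {group: successes[group] / totals[group] for group in totals}


# Hypothetical sample records, only to show the call.
sample = [
    {"success": "no", "transport": "obfs2", "pool": "tbb", "probe_cc": "CN"},
    {"success": "yes", "transport": "obfs2", "pool": "tbb", "probe_cc": "CN"},
    {"success": "yes", "transport": "obfs3", "pool": "private", "probe_cc": "CN"},
]
rates = success_rate_by(sample, "probe_cc", "transport")
```

Grouping by ("probe_cc", "pool") instead would answer the "where" and
"which pool" questions with the same function.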

But there's something left: the timing. George sent an email [0] in
which he proposes a timeline of events [5] for every bridge, which would
allow us to diagnose with much more precision how and why a bridge is
being censored. To build that diagram we should first define the events
that will be shown in the timeline. I think those events are the values
of the pool variable and whether the bridge is being blocked in a given
country.
With the events defined I think we can define another variable:
- Time deltas between bridge events.
So, for example, what this variable will answer is: how many {days,
hours...} does it take China to block a bridge that is published in
BridgeDB? Is China blocking new bridges at the same speed as Iran? How
many days does it take China to block a private bridge?
There are some ambiguities related to the deltas. For example, if a
bridge is sometimes blocked and sometimes not in a country, which delta
should we compute?
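
To make the delta variable concrete, here is a minimal sketch. It picks
one possible answer to the ambiguity above, namely the time from the
first publication in a pool to the first observed block in a country;
that choice is only an assumption, and so are the event labels
("published", "blocked") and all the names in the code.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple, Optional


class BridgeEvent(NamedTuple):
    timestamp: datetime
    fingerprint: str
    kind: str    # hypothetical labels: "published" or "blocked"
    detail: str  # pool name for "published", country code for "blocked"


def first_block_delta(events: Iterable[BridgeEvent], fingerprint: str,
                      pool: str, country: str) -> Optional[timedelta]:
    """Time from first publication in `pool` to first block in `country`.

    Returns None if the bridge was never published in that pool or was
    never seen blocked in that country after being published.
    """
    mine = [e for e in events if e.fingerprint == fingerprint]
    published = [e.timestamp for e in mine
                 if e.kind == "published" and e.detail == pool]
    if not published:
        return None
    start = min(published)
    blocked = [e.timestamp for e in mine
               if e.kind == "blocked" and e.detail == country
               and e.timestamp >= start]
    if not blocked:
        return None
    return min(blocked) - start


# Hypothetical events, only to show the call.
events = [
    BridgeEvent(datetime(2014, 10, 1), "AAAA", "published", "BridgeDB https"),
    BridgeEvent(datetime(2014, 10, 4), "AAAA", "blocked", "CN"),
    BridgeEvent(datetime(2014, 10, 6), "AAAA", "blocked", "CN"),
]
delta = first_block_delta(events, "AAAA", "BridgeDB https", "CN")
```

Other definitions (last block, or the share of time blocked) would need a
different function; the point is only that the choice has to be explicit.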

Finally, in the etherpad [1] Tor's bootstrap is suggested as a
variable; I don't understand why. Is it meant to detect some form of
censorship? Can anyone explain a little more?

== Data schema ==
In the last email Ruben, Laurier and Pascal "strongly recommended
importing the reports into a database". I strongly agree.
We should provide a service to query the values of the previous
variables plus the timestamp of the nettest and the fingerprint of the
bridge.
With this database the inconsistencies between the data formats of the
reports would be erased and working with the data would be much easier.
I think that we should also provide a way to export the query results
to csv/json to allow other people to dig into the data.
I also believe that we could use mongodb for one reason: we can
distribute it very easily. I explain why in the Future section.
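
A small sketch of what a normalized record and the csv/json export could
look like, using only the Python standard library. The field names are
my own guesses for the variables listed earlier plus the timestamp and
the fingerprint; they are not the real report keys, and no database is
involved here.

```python
import csv
import io
import json

# Hypothetical field names for the variables discussed in this email.
FIELDS = ["timestamp", "fingerprint", "success", "transport", "pool",
          "probe_cc"]


def normalize(raw):
    """Keep only the agreed fields so every record has the same shape.

    Missing fields become None; unknown fields are dropped.
    """
    return {field: raw.get(field) for field in FIELDS}


def to_json(records):
    """Serialize the normalized records as a JSON array."""
    return json.dumps([normalize(r) for r in records], sort_keys=True)


def to_csv(records):
    """Serialize the normalized records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(normalize(record))
    return buffer.getvalue()
```

Normalizing before export is what hides the differences between report
formats from whoever downloads the data.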

== Biased data ==
Can a malicious ooni-probe bias the data? For example, if it runs some
tests in bursts, its reports are all going to be the same and the
general picture could be skewed. Any more ideas?

== Geo Data ==
In the etherpad [1] it's suggested to increase the granularity of the
geo data to detect geographical patterns, but it seems [2] that at least
in China there are no such patterns, so maybe we should discard the idea
altogether.

== Playing with data ==
So until now I've talked about data. Now I want to address how to
present it.
I think we should provide a way to play with the data to allow a more
thoughtful and precise diagnosis of censorship.
What I was thinking is to make the visualization more interactive by
letting the user render new diagrams as she thinks about the data.
The idea is to allow the user to go from more general to more concrete
data patterns. So imagine that the user loads the visualization's page.
First she sees a global heat map of censorship as measured by the bridge
reachability test. She is Chinese, so she clicks on her country and a
histogram like [3] for China is stacked below the global map. She then
clicks on obfs2 and a diagram like [4] is also stacked at the bottom,
showing only the success variable for the obfs2 PT. Then she clicks on
the True value of the success variable and all the bridges that have
been reached by all the nettest executions in that period of time in
China are shown. Finally she selects one bridge and its timeline [5]
plus its link to atlas [6] are provided.
This is only one particular scenario; the core idea is to let the user
draw conclusions at whatever level of detail she wants.
The user starts with the most general view of the data and applies
restrictions to the datapoints to dig deeper. Going from general to
specific, she can form hypotheses that she later discards or confirms
with the extra information displayed in the next diagram.
There are some usability problems with the selection of diagram+variable
and with the diverse set of users that will use the system, but I'd be
very glad to think about them if you like the idea.
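
The general-to-specific flow described above can be sketched as a chain
of restrictions, where each click narrows the previous selection and the
next diagram is drawn from what is left. The class, its method and the
record keys are hypothetical names chosen for this example; nothing here
is tied to a particular visualization library.

```python
class Selection:
    """A set of records plus the restrictions that produced it."""

    def __init__(self, records, filters=None):
        self.records = list(records)
        self.filters = dict(filters or {})

    def narrow(self, **restriction):
        """Return a new Selection that also satisfies `restriction`.

        The original Selection is left untouched, so the diagrams
        stacked earlier can keep showing their own data.
        """
        matching = [r for r in self.records
                    if all(r.get(k) == v for k, v in restriction.items())]
        return Selection(matching, {**self.filters, **restriction})


# Hypothetical records, following the scenario in the text.
everything = Selection([
    {"probe_cc": "CN", "transport": "obfs2", "success": "yes", "fingerprint": "AAAA"},
    {"probe_cc": "CN", "transport": "obfs2", "success": "no", "fingerprint": "BBBB"},
    {"probe_cc": "CN", "transport": "obfs3", "success": "yes", "fingerprint": "CCCC"},
    {"probe_cc": "IR", "transport": "obfs2", "success": "yes", "fingerprint": "DDDD"},
])
china = everything.narrow(probe_cc="CN")       # click on the country
china_obfs2 = china.narrow(transport="obfs2")  # click on the PT
reached = china_obfs2.narrow(success="yes")    # click on the success value
```

Keeping every intermediate selection around is what would let the page
stack one diagram per step instead of replacing the previous one.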

== Users ==
I think there are three sets of users:
1- Tor users who are interested in the censorship performed in their
country and how to avoid it.
2- Journalists who want to write something about censorship but aren't
that tech savvy.
3- Researchers who want updated and detailed data about censorship.

I believe we can provide a system that satisfies all three of them if we
succeed with the idea in the previous section.

== Future ==
So, why do I think that we should index the data with mongodb? Because I
think that this data repository should be provided as a new ooni-backend
API related to the current collector.
Right now a collector can store reports from any ooni-probe instance
that chooses to submit to it, and its API is completely separate from
the bouncer API, which overall is a wise design decision because you can
deploy ooni-backend to work only as a collector. So it's not
unreasonable to think that we can have several collectors collecting
different reports, because the backend is designed to do so; therefore
we need the data repository to be distributed. And mongodb is good at
this.
If we build the database for the bridge reachability nettests, I think
that we should design it so that in the future it can index all nettest
reports, and therefore generalize the design, implementation and
deployment of all the work that we are going to do for bridge
reachability.
That way an analyst can query the distributed database with a proper
client that connects to the data repository ooni-backend API.


So to sum up, I started talking about the bridge reachability
visualization problem and finished with a much broader vision that
intends to integrate the ongoing bridge reachability efforts to
improve ooni as a whole.
I hope the email is not too long.
ciao

[0] https://lists.torproject.org/pipermail/tor-dev/2014-October/007585.html
[1] https://pad.riseup.net/p/bridgereachability
[2] https://blog.torproject.org/blog/closer-look-great-firewall-china
[3] http://ooniviz.chokepointproject.net/transports.htm
[4] http://ooniviz.chokepointproject.net/successes.htm
[5] https://people.torproject.org/~asn/bridget_vis/tbb_blocked_timeline.jpg
[6] https://atlas.torproject.org
[7] https://ooni.torproject.org/

