I am working on a project with Sheharbano Khattak, Sadia Afroz, Mobin
Javed, Srikanth Sundaresan, Vern Paxson, Steven Murdoch, and Damon McCoy
to measure how often web sites treat Tor users differently (by serving
them a block page or a captcha, for example). We used OONI reports for
part of the project. This post is about running our code and some
general tips about working with OONI data. I hope it can be of some use
to the ADINA15 participants :)
The source code I'm talking about is here:
git clone https://www.bamsoftware.com/git/ooni-tor-blocks.git
One of its outputs is here, a big poster showing the web sites with the
highest blocking rates against Tor users:
https://people.torproject.org/~dcf/graphs/tor-blocker-poster-20150914.pdf
I am attaching the README of our code. One of OONI's tests does URL
downloads with Tor and without Tor. The code processes OONI reports,
compares the Tor and non-Tor HTTP responses, and notes whenever Tor
appears to be blocked while non-Tor appears to be unblocked.
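As a rough illustration of that comparison (a sketch only, not the actual logic in our code), the per-site tally could look like the following. The function name, the tuple layout, and the choice to ignore pairs where the non-Tor request also looks blocked are all assumptions made for the example.

```python
from collections import defaultdict

def tor_block_rates(observations):
    """Compute a per-site Tor blocking rate.

    observations is an iterable of (site, tor_blocked, nontor_blocked)
    tuples, one per paired Tor/non-Tor download (hypothetical layout).
    Pairs whose non-Tor request also looks blocked are skipped, on the
    assumption that they say nothing specific about Tor.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [Tor-only blocks, usable pairs]
    for site, tor_blocked, nontor_blocked in observations:
        if nontor_blocked:
            continue  # no clean non-Tor baseline for this pair
        counts[site][1] += 1
        if tor_blocked:
            counts[site][0] += 1
    return {site: blocked / total for site, (blocked, total) in counts.items()}
```

Sorting the resulting dictionary by value, largest first, would give a ranking of the kind shown on the poster.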
Our code is focused on the task of finding Tor blocks, but parts of it
are generic and will be useful to others who are working with OONI data.
The ooni-report-urls program gives you the URLs of every OONI report
published at api.ooni.io. The ooni.py Python module provides an iterator
over OONI YAML files that deals with encoding errors and compression.
The classify.py Python module is able to identify many common types of
block pages (e.g. CloudFlare, Akamai).
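To give an idea of what "deals with encoding errors and compression" can mean in practice, here is a minimal sketch. It is not the code from ooni.py; the open_report name, the reliance on a ".gz" suffix, and the decision to replace undecodable bytes with U+FFFD are assumptions for illustration.

```python
import gzip

def open_report(path):
    """Open a report file for reading as text, whether gzipped or not.

    Compression is detected from the ".gz" suffix (an assumption).
    Bytes that are not valid UTF-8 are replaced with U+FFFD rather than
    raising UnicodeDecodeError partway through a large file.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8", errors="replace")
    return open(path, "r", encoding="utf-8", errors="replace")
```

The returned object can be iterated line by line or handed to a YAML parser like any other text file.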
Now some notes and lessons learned about working with OONI data.
The repository of all OONI reports is here:
http://api.ooni.io/
The web site doesn't make it obvious, but there is a JSON index of all
reports, so you can download many of them in bulk (thanks to Arturo for
pointing this out). Our source code contains a program called
ooni-report-urls that extracts the URLs from the JSON file so you can
pipe them to Wget or whatever. (Check before you start downloading,
because there are a lot of files and some of them are big!)
wget -O ooni-reports.json http://api.ooni.io/api/reports
./ooni-report-urls ooni-reports.json | sort | uniq > ooni-report-urls.txt
The choice of YAML parser matters a great deal: it can make something
like a 30× difference in performance. See here:
https://bugs.torproject.org/13720
The yaml.safe_load_all(f) function is slow.
yaml.load_all(f, Loader=yaml.CSafeLoader) is what you want to use
instead. yaml.CSafeLoader differs slightly in its handling of certain
invalid Unicode escapes that can appear in OONI's representation of HTTP
bodies, for example separately encoded UTF-16 surrogates:
"\uD83D\uDD07". ooni.py has a way to skip over records like that (there
aren't very many of them). With yaml.CSafeLoader, findblocks takes about
2 hours to process 2.5 years of http_requests reports (about 33 GB
compressed).
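Here is a minimal sketch of one way to combine the fast loader with skipping of unparseable records. It is not necessarily how ooni.py does it: the function names are made up, and splitting the stream on bare "---" and "..." lines is a simplification that assumes document markers always appear alone on a line.

```python
import yaml

# Prefer the libyaml-backed loader; fall back to the slow pure-Python
# one if PyYAML was built without libyaml.
Loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)

def split_documents(f):
    """Yield the text of each YAML document in the text file f.

    Documents are delimited by lines consisting only of "---" or "...".
    """
    chunk = []
    for line in f:
        if line.rstrip("\r\n") in ("---", "..."):
            if chunk:
                yield "".join(chunk)
            chunk = []
        else:
            chunk.append(line)
    if chunk:
        yield "".join(chunk)

def iter_records(f):
    """Yield parsed documents from f, skipping any that fail to parse."""
    for text in split_documents(f):
        try:
            doc = yaml.load(text, Loader=Loader)
        except yaml.YAMLError:
            continue  # skip this record but keep going with the rest
        if doc is not None:
            yield doc
```

Parsing each document separately is what makes skipping possible: once yaml.load_all raises in the middle of a stream, its generator cannot be resumed.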
There are some inconsistencies and format differences in some OONI
reports, particularly very early ones. For example, the test_name field
of reports is not always the same for the same test. We were looking for
http_requests tests, and we had to match all of the following
test_names:
http_requests
http_requests_test
tor_http_requests_test
HTTP Requests Test
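In Python, matching those names can be as simple as a set lookup. This is just a sketch; the helper name is made up, and it assumes the report header has already been parsed into a dict.

```python
# All the test_name values we found for the http_requests test.
HTTP_REQUESTS_TEST_NAMES = frozenset([
    "http_requests",
    "http_requests_test",
    "tor_http_requests_test",
    "HTTP Requests Test",
])

def is_http_requests_report(header):
    """Return True if a parsed report header belongs to http_requests."""
    return header.get("test_name") in HTTP_REQUESTS_TEST_NAMES
```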
In addition, the YAML format is occasionally different. In http_requests
reports, for example, the way of indicating that Tor is in use for a
request can be any of:
tor: true
tor: {is_tor: true}
tor: {exit_ip: 109.163.234.5, exit_name: hessel2, is_tor: true}
In some requests, the special URL scheme "shttp" is what indicates a
Tor request; e.g. "shttp://example.com/". The ooni.py script fixes up
some of these issues, but only for the http_requests test. For other
tests, you'll have to figure it out on your own.
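To make the variants concrete, here is a sketch of a check that accepts all of the forms above. It is not the code from ooni.py; the function name is made up, and it assumes the request has been parsed into a dict with "tor" and "url" keys (the "url" key name in particular is an assumption).

```python
def is_tor_request(request):
    """Return True if a parsed request record appears to have used Tor."""
    tor = request.get("tor")
    if isinstance(tor, dict):
        # tor: {is_tor: true} and tor: {exit_ip: ..., is_tor: true}
        if tor.get("is_tor") is True:
            return True
    elif tor is True:
        # tor: true
        return True
    # Otherwise fall back to the special "shttp" URL scheme.
    url = request.get("url") or ""
    return url.startswith("shttp://")
```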
A very early version of this processing code appeared here:
https://lists.torproject.org/pipermail/ooni-dev/2015-June/000288.html
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
Hi Arturo,
- ---
Currently for our data processing needs we have begun to bucket
reports by date (every date corresponds to when a certain report has
been submitted to the collector). What I would like to know is of the
two following options what would be most convenient to you for
accessing the data.
The options are:
OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or
plain)
ex.
- https://ooni.torproject.org/reports/json/2016-01-01.json
- https://ooni.torproject.org/reports/json/2016-01-02.json
- https://ooni.torproject.org/reports/json/2016-01-03.json
etc.
OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them
inside of a directory with the timestamp of when it was collected
ex.
- - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z
NL-AS3265-http_requests-v1-probe.json.gz
- - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z
US-AS3265-dns_consistency-v1-probe.json.gz
etc.
Since we are internally using the daily batches for doing the
processing and analysis of reports unless there is an explicit request
to publish them on a test run basis we will probably end up going for
option A, so don’t be shy to reply :)
- ---
I agree with David that it will be easier to access specific
ooni-probe test results using option (B) (i.e. the current solution).
What benefits did you identify when considering switching to option (A)?
A few reasons to stick with option (B) include:
- - Retaining the ability to run ooni-pipeline on a subset of reports
associated with a given time period by filtering by date prefix, and
substrings within key names;
- - Retaining the ability to distribute small units of work easily among
subprocesses;
- - Retaining the idempotent nature of ooni-pipeline, and the luigi
framework - switching from lots of small files to a single large file
for a given day will invariably increase the time required to recover
from failures (i.e. if a small dnst-based test fails to normalise,
you'll have to renormalise everything as opposed to a single test);
- - Developers will not have to download hundreds of megabytes of data
in order to access a traceroute test result that is only a few
kilobytes in size; and
- - It's generally easier to work with smaller files than it is to work
with big files.
Cheers,
Tyler
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB
(tyler(a)tylerfisher.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAEBCAAGBQJWn1+qAAoJEGMeb8NGhvDr6hUP/R6XcXEejwT8DYuKLoVBpujs
CqXtIj88A5JYhtt1npRF/a0peNihzFRbYuQpAUX/D1EdPa1UHDwCuqp1hO642xIJ
WePgWHWIS7qzYK/i5LbMXC+oWmfAA0J25SawmjyNWclK+NCgIwQ1k7kleyFP7Ul5
PFjJKLCcuqJkQl1hnZlW7YhgLYZAf2QHOD1cJauLM5aDCNBDUSgfIP+/P/xfFLq3
XqLGFBfNrMXaWmOfDGLR7tV4mS3R4M5L7rL66AiomQULdld4cuLonAht4CWLDuhV
MKyKrURixRqgTUoing59OcjgOcGEVQD5P5NaMuVruU1hFbHW0wr430/mEq8pdDW9
BQHZh/VZ/f2xz4rjWiE8Mfl3mgmGfbFiT6WMKQTRY3vr5mwbmefg0/IneJ1eHtIo
A/XMX579DQt3V19tMa7rO4TjpdBKIWwJ8/6mwwaw9QrS/I2pmlg8AscLU0oQtMqc
3CcWELOdoV7uIPBVg3TfiL+RLDSxzIJIp0k6IM19tkwZAxGLmD+cZvDo+dxME3fd
Y7+fYxovuQrt4vvhaPFU15EDzFWHMoMqlNUSOeC0FuIhpbYbM0Dqn1qL4EScJ+PA
+p/rtMTZLiiIJLtYuZCRiZjbaMvqsfZAmPEZ5ZgSShJKsB4jhV4/5LsmHedYCHeW
S/IdJ81xUzTfxMUZBqgI
=kztT
-----END PGP SIGNATURE-----
Hello Oonitarians,
This is a reminder that today there will be the weekly OONI gathering.
It will happen as usual on the #ooni channel on irc.oftc.net at 17:00
UTC (18:00 CET, 12:00 EST, 09:00 PST).
You can join via the web from: https://kiwiirc.com/client/irc.oftc.net/ooni (Note: sometimes Tor is blocked by OFTC, but it should mask your IP if you trust that stuff).
Everybody is welcome to join us and bring their questions and feedback.
See you later,
~ Arturo