These programs are for mining historical network tests of the Open Observatory of Network Interference (OONI) to find instances of discrimination against Tor exit relays by web servers. They use the OONI http_requests test, which requests a variety of URLs both with and without Tor. We are looking for cases where the Tor request receives a block page (for example a 403), and the non-Tor request does not.

There is sample output in the file findblocks.csv.xz and in the graphs directory.

Contact David Fifield regarding this code.

All code is in the public domain.


== Usage summary

Download the index of OONI reports.
	wget -O ooni-reports.json http://api.ooni.io/api/reports

Extract a list of report URLs using the included ooni-report-urls program. We are only interested in the http_requests report type.
	./ooni-report-urls -t http_requests ooni-reports.json | sort | uniq > ooni-report-urls.txt

Download the reports. There are a lot of them and they are pretty big (3084 reports weighing over 30 GB altogether as of 2015-08-27). You might want to start with just a subset (the most recent 100 reports, for example).
	wget -P reports -c -i ooni-report-urls.txt

Run the findblocks program on the downloaded reports. For each URL in each report, findblocks outputs a CSV line that includes the classification (block/nonblock) of the Tor and non-Tor request. It additionally saves pages it considers to be block pages to a directory so you can inspect and manually classify them.
	./findblocks --save-blocks blocks reports/*.yaml.gz | tee findblocks.csv

As an example of what you can do with findblocks.csv, make some graphs:
	Rscript graphs.R
	Rscript poster.R


== Details

The OONI reports are gzip-compressed YAML files (.yaml.gz). Each file contains a sequence of YAML documents. The Python module ooni.py deals with processing these files. The function ooni_open_file handles gzip decompression and returns an iterator over individual YAML documents.
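
The idea of iterating over the documents in a compressed report can be sketched as follows. This is not the code in ooni.py; it is a minimal Python 3 illustration using only the standard library. The function name iter_yaml_documents is hypothetical, and the sketch assumes that each document starts with a line consisting of exactly "---". It yields the raw text of each document; a real implementation would hand that text to a YAML parser.

```python
import gzip


def iter_yaml_documents(path):
    """Yield the raw text of each YAML document in a .yaml.gz file.

    Hypothetical helper, not the ooni_open_file function from ooni.py.
    Assumes every document begins with a line that is exactly "---".
    """
    current = []
    # errors="replace" keeps reading even if the file has undecodable bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.rstrip("\r\n") == "---":
                # A document marker: emit what we have collected so far,
                # skipping empty chunks (such as before the first marker).
                if any(l.strip() for l in current):
                    yield "".join(current)
                current = []
            else:
                current.append(line)
    # Emit the final document, which has no marker after it.
    if any(l.strip() for l in current):
        yield "".join(current)
```

Because the function is a generator, only one document is held in memory at a time, which matters when the reports total tens of gigabytes.
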
A few documents have invalid Unicode sequences (separately encoded UTF-16 surrogates), like this example:
	title=\"Volume control\">\uD83D\uDD07
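
To illustrate what "separately encoded UTF-16 surrogates" means, here is one way such a pair could be joined into the single character it stands for. This is a Python 3 sketch under our own assumptions, not necessarily what ooni.py does; the function name join_surrogates is hypothetical.

```python
def join_surrogates(text):
    """Combine separately encoded UTF-16 surrogate pairs into one character.

    Hypothetical helper for illustration only.
    """
    # "surrogatepass" lets surrogate code points through the encoder
    # unchanged, so the UTF-16 decoder then sees an ordinary surrogate pair.
    # Anything the decoder still rejects is replaced instead of raising.
    raw = text.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "replace")


# The pair from the example above decodes to the single code point U+1F507.
fixed = join_surrogates("\ud83d\udd07")
```

Text that contains no surrogates passes through unchanged.
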
Attention Required! | CloudFlare" indicates a CloudFlare block page that would get the classification string "403-CLOUDFLARE". For details on the known block page classifications, see the file sample-blocks/README.

findblocks can optionally save all the responses it thinks are blocks (use the --save-blocks option). We originally did this so we could analyze all the block pages and look for patterns for classification. If you look in the saved blocks directory, you'll see subdirectories named like these:
	400-OTHER
	403-AKAMAI
	403-AMAZON-CLOUDFRONT
	403-CLOUDFLARE
	403-CRAIGSLIST
	403-EZINEARTICLES
	403-OTHER
	403-YELP
	404-OTHER
	410-MYSPACE
	500-OTHER
	503-AMAZON
	503-OTHER
The "OTHER" directories contain all the pages that weren't recognized by classify.py. The saved files are named after the OONI report ID and the URL, for example:
	403-CLOUDFLARE/2014-10-01jdgdoplqurezgrgrlqgvmpwqtylreduntfumkijm-http-sears.com%2F
The files contain the reconstructed header and body of the original response. The OONI reports don't contain all the available information; for example we substitute "xxx" for the status message in the first line because OONI doesn't record it. The conversion from OONI's Unicode strings back to bytes is not necessarily lossless either.


== Files

ooni-report-urls
	This program processes a JSON index of OONI report filenames and turns them into a list of URLs. You can limit the reports to a specific type or specific countries.

findblocks
	This program processes http_requests reports and writes a CSV file that summarizes the blocked/non-blocked status of each download.

ooni-dump
	This program dumps all the responses from an http_requests report. It is like "findblocks --save-blocks" but it also saves nonblocks.

classify.py
	A Python module that classifies HTTP responses.

ooni.py
	A Python module for reading OONI YAML files.

graphs.R
poster.R
	These programs process the output of findblocks (findblocks.csv) and make some graphs.
	Run them with "Rscript graphs.R" or "Rscript poster.R".

ooni-report-filenames.txt
ooni-report-urls.txt
ooni-test-urls.txt
	These are the input files that were used in building the sample graphs.

graphs/
findblocks.csv
findblocks.csv.xz
	These are sample outputs.

sample-blocks/
	These are examples of pages that classify.py considers to be block pages. The file sample-blocks/README has further details and justification for all the block page classifications (in case you wonder what 501-CONVIO means, for example).

sample-nonblocks/
	These are pages with 400- or 500-series status codes that nevertheless should not be considered block pages.

test.py
	This test program verifies that classify.py gets the correct classification for all the sample pages in sample-blocks and sample-nonblocks.
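
As a rough illustration of the classification strings described under Details, the sketch below maps a status code and response body to a string such as "403-CLOUDFLARE" or "503-OTHER". It is not the logic in classify.py: the function name classify_sketch is hypothetical, the only rule included is the CloudFlare one mentioned above, and the choice to match the marker text anywhere in the body is our own assumption. It also ignores the nonblock exceptions collected in sample-nonblocks.

```python
# Marker text taken from the CloudFlare example in the Details section.
CLOUDFLARE_MARKER = "Attention Required! | CloudFlare"


def classify_sketch(status, body):
    """Return a classification like "403-CLOUDFLARE", or None for a nonblock.

    Hypothetical simplification; the real rules are in classify.py and
    sample-blocks/README.
    """
    if status < 400:
        # Assumption: only 400- and 500-series responses can be block pages.
        return None
    if status == 403 and CLOUDFLARE_MARKER in body:
        return "403-CLOUDFLARE"
    # Anything unrecognized falls into the per-status "OTHER" bucket.
    return "%d-OTHER" % status
```

A check in the spirit of test.py would run a function like this over every sample page and compare the result with the name of the directory the page is stored in.
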