[ooni-dev] Mining OONI reports to find server-side Tor blocking (e.g. CloudFlare captchas)

Fri Jun 19 17:03:38 UTC 2015

I want to search OONI reports for cases where the server blocks Tor
exits (things like the CloudFlare 403 captcha). The http_requests test
is well suited to that because it fetches each web page twice, once over
Tor and once without.

The attached script is my first-draft attempt at finding block pages.
Its output is at the end of this message. You can see it finds a lot of
CloudFlare captchas and other blocks.

First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I
ran the script, which does this:
	Skip the first YAML document, because it's a header.
	For all other documents:
		Skip it if it has non-None control_failure or
		experiment_failure--there are a few of these.

		Look for exactly two non-failed requests, one with is_tor:false
		and one with is_tor:true. Skip it if it lacks these.

		Classify the blocked status of the is_tor:false and is_tor:true
		responses. 400-series and 500-series status codes are classified
		as blocked and all others are unblocked.

		Print an output line if the blocked status of is_tor:false does
		not match the blocked status of is_tor:true.
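
The comparison rule in the steps above can be tried in isolation, without
a report file. The following is a minimal sketch, not the attached script:
it works on bare status codes only, and the names is_blocked and differs
are made up for illustration. The example.com entry is hypothetical; the
other two pairs come from the output at the end of this message.

```python
def is_blocked(code):
    """Treat 4xx and 5xx status codes as blocked, everything else as unblocked."""
    return code // 100 in (4, 5)


def differs(nontor_code, tor_code):
    """True when exactly one of the two responses looks blocked."""
    return is_blocked(nontor_code) != is_blocked(tor_code)


# (non-Tor, Tor) status code pairs.
pairs = {
    "adf.ly": (200, 403),
    "amazon.de": (503, 301),
    "example.com": (200, 200),  # hypothetical: same status both ways, not reported
}

for domain, (nontor_code, tor_code) in sorted(pairs.items()):
    if differs(nontor_code, tor_code):
        print("%d\t%d\t%s" % (nontor_code, tor_code, domain))
```

Note that a pair such as (403, 503) would not be reported, because both
sides count as blocked under this rule.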

I have a few questions.

Is this a reasonable way to process reports? Is there a more standard
way to process YAMLOO files?

I know there are many reports at https://ooni.torproject.org/reports/.
Is that all of them? I think I heard from Arturo that some reports are
not online because of storage issues.

What's the best way for me to get the reports for processing? Just
download all *http_requests* files from the web server?
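
If selecting by filename pattern turns out to be the answer, the
selection step could look like the sketch below. It is only an
illustration: the helper name select_http_requests and the listing are
invented, and nothing is assumed here about the actual naming or
directory layout on the server.

```python
import fnmatch


def select_http_requests(names):
    """Keep only the names that match the *http_requests* pattern."""
    return fnmatch.filter(names, "*http_requests*")


# Hypothetical directory listing; the first name follows the pattern in the
# usage comment of the attached script.
listing = [
    "report-http_requests-XXXX.yamloo",
    "report-dns_consistency-XXXX.yamloo",
    "index.html",
]

print(select_http_requests(listing))
```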

Here is the output of the script. 403-CLOUDFLARE is the famous
"Attention Required!" captcha page. I investigated some of the others
manually and they are mostly custom block pages or generic web server
403s. (There are also a couple of CloudFlare pages that have a different
form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a
block page over Tor.

I'm not sure what's up with the non-Tor 503s from Amazon. They just look
like localized internal service error pages ("ist ein technischer Fehler
aufgetreten", "une erreur de système interne a été décelée"). The one
for blog.com is a generic Nginx "Bad Gateway" page.

non-Tor		Tor		domain
302		403-OTHER	yandex.ru
302		403-OTHER	craigslist.org
301		403-CLOUDFLARE	thepiratebay.se
503-OTHER	301		amazon.de
200		403-CLOUDFLARE	adf.ly
301		403-OTHER	squidoo.com
301		410-OTHER	myspace.com
303		503-OTHER	yelp.com
302		403-CLOUDFLARE	typepad.com
503-OTHER	301		amazon.fr
301		403-CLOUDFLARE	digitalpoint.com
301		403-CLOUDFLARE	extratorrent.com
200		403-OTHER	ezinearticles.com
200		403-OTHER	hubpages.com
200		403-OTHER	2ch.net
200		403-OTHER	hdfcbank.com
302		403-CLOUDFLARE	meetup.com
302		403-CLOUDFLARE	1channel.ch
200		403-CLOUDFLARE	multiply.com
301		403-CLOUDFLARE	clixsense.com
301		403-OTHER	zillow.com
301		403-CLOUDFLARE	odesk.com
301		403-CLOUDFLARE	elance.com
301		403-CLOUDFLARE	youm7.com
200		403-CLOUDFLARE	jquery.com
200		403-CLOUDFLARE	sergey-mavrodi.com
301		403-CLOUDFLARE	templatemonster.com
302		403-CLOUDFLARE	4tube.com
301		403-CLOUDFLARE	mp3skull.com
301		403-CLOUDFLARE	porntube.com
200		403-OTHER	tutsplus.com
200		403-CLOUDFLARE	bitshare.com
301		403-OTHER	sears.com
200		403-CLOUDFLARE	zwaar.net
502-OTHER	200		blog.com
302		403-CLOUDFLARE	myegy.com
301		400-OTHER	mercadolibre.com.ve
302		403-OTHER	jabong.com
301		403-CLOUDFLARE	free-tv-video-online.me
302		403-CLOUDFLARE	traidnt.net
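
The output above can be tallied per block class with a few lines of
Python. This is a sketch under assumptions, not part of the attached
script: the function name tally is made up, and the sample holds only
five rows copied from the table, so its counts describe the sample and
not the whole scan.

```python
import collections


def tally(lines):
    """Count Tor-side classes for rows where only the Tor response is blocked."""
    counts = collections.Counter()
    for line in lines:
        nontor_class, tor_class, domain = line.split()
        # A class such as "403-CLOUDFLARE" marks a block; a bare code such as
        # "301" does not.
        if "-" in tor_class and "-" not in nontor_class:
            counts[tor_class] += 1
    return counts


sample = """\
302\t403-OTHER\tyandex.ru
301\t403-CLOUDFLARE\tthepiratebay.se
503-OTHER\t301\tamazon.de
200\t403-CLOUDFLARE\tadf.ly
301\t410-OTHER\tmyspace.com
"""

counts = tally(sample.splitlines())
for tor_class, n in sorted(counts.items()):
    print("%s\t%d" % (tor_class, n))
```

The amazon.de row is skipped because there the block is on the non-Tor
side.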
-------------- next part --------------
#!/usr/bin/env python

# Reads an OONI http_requests report and shows URLs that have known block pages.
#
# First, make an OONI report:
#   ooniprobe -i /usr/share/ooni/decks/complete_no_root.deck
# Then,
#   ./findblocks report-http_requests-XXXX.yamloo

import getopt
import sys
import yaml

from bs4 import BeautifulSoup

# Return (is_block, description) tuple. 4?? and 5?? status codes are considered
# blocks.
def classify_blockpage(response):
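    # Name the parser explicitly so results do not depend on which parsers are
    # installed, and fall back to an empty document if the body is missing.
    soup = BeautifulSoup(response["body"] or "", "html.parser")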
    title = soup.title
    code = response["code"]
    if code == 403 and title is not None and title.get_text() == u"Attention Required! | CloudFlare":
        return True, "403-CLOUDFLARE"
    if code // 100 == 4 or code // 100 == 5:
        return True, "%d-OTHER" % code
    return False, "%d" % code

# Return a (nontor, tor) pair if, among the requests that did not fail, exactly
# one is nontor and exactly one is tor; otherwise raise ValueError.
def split_requests(requests):
    nontor = None
    tor = None
    for request in requests:
        if request.get("failure") is not None:
            continue
        if not request["request"]["tor"]["is_tor"]:
            if nontor is not None:
                raise ValueError("more than one is_tor:false request")
            nontor = request
        else:
            if tor is not None:
                raise ValueError("more than one is_tor:true request")
            tor = request
    if nontor is None:
        raise ValueError("no is_tor:false request")
    if tor is None:
        raise ValueError("no is_tor:true request")
    return nontor, tor

def process_file(f):
    yamloo = yaml.safe_load_all(f)

    # First YAML doc in YAMLOO file is a header with a different format.
    header = next(yamloo)
    # Sanity check: make sure a header key is in there.
    assert "input_hashes" in header

    for doc in yamloo:
        if doc["control_failure"] is not None:
            sys.stderr.write("%s: control_failure=%s\n" % (doc["input"], doc["control_failure"]))
            continue
        if doc["experiment_failure"] is not None:
            sys.stderr.write("%s: experiment_failure=%s\n" % (doc["input"], doc["experiment_failure"]))
            continue

        try:
            nontor, tor = split_requests(doc["requests"])
        except ValueError as e:
            sys.stderr.write("%s: %s\n" % (doc["input"], str(e)))
            continue

        nontor_isblocked, nontor_class = classify_blockpage(nontor["response"])
        tor_isblocked, tor_class = classify_blockpage(tor["response"])

        if nontor_isblocked != tor_isblocked:
            # Single-argument print(...) behaves the same on Python 2 and 3.
            print("%s\t%s\t%s" % (nontor_class, tor_class, doc["input"]))
            sys.stdout.flush()

def process_filename(filename):
    with open(filename) as f:
        return process_file(f)

opts, args = getopt.gnu_getopt(sys.argv[1:], "")

for filename in args:
    process_filename(filename)