[ooni-dev] Mining OONI reports to find server-side Tor blocking (e.g. CloudFlare captchas)
David Fifield
david at bamsoftware.com
Fri Jun 19 17:03:38 UTC 2015
I want to search OONI reports for cases of Tor exits being blocked by
the server (things like the CloudFlare 403 captcha). The http_requests
test is great for that because it fetches a bunch of web pages with Tor
and without.
The attached script is my first-draft attempt at finding block pages.
Its output is at the end of this message. You can see it finds a lot of
CloudFlare captchas and other blocks.
First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I
ran the script, which does this:
  - Skip the first YAML document, because it's a header.
  - For all other documents:
      - Skip it if it has non-None control_failure or
        experiment_failure--there are a few of these.
      - Look for exactly two non-failed requests, one with is_tor:false
        and one with is_tor:true. Skip it if it lacks these.
      - Classify the blocked status of the is_tor:false and is_tor:true
        responses. 400-series and 500-series status codes are classified
        as blocked and all others are unblocked.
      - Print an output line if the blocked status of is_tor:false does
        not match the blocked status of is_tor:true.
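
The last two steps can be tried out separately from the report format. This is only a minimal sketch of the comparison rule, using bare status codes in place of full responses; the names is_blocked and differs and the status pairs are made up for illustration and are not part of the attached script.

```python
# Sketch of the comparison rule only; status codes stand in for full responses.

def is_blocked(code):
    """Treat any 4xx or 5xx status code as a block."""
    return code // 100 in (4, 5)


def differs(nontor_code, tor_code):
    """True when exactly one of the two fetches looks blocked."""
    return is_blocked(nontor_code) != is_blocked(tor_code)


# Hypothetical (non-Tor, Tor) status pairs, not taken from a real report.
pairs = [(200, 403), (301, 403), (503, 301), (200, 200), (404, 404)]
flagged = [pair for pair in pairs if differs(*pair)]
print(flagged)
```

A pair like (404, 404) is not flagged, because both sides count as blocked and so their status matches.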
I have a few questions.
Is this a reasonable way to process reports? Is there a more standard
way to do e.g. YAMLOO processing?
I know there are many reports at https://ooni.torproject.org/reports/.
Is that all of them? I think I heard from Arturo that some reports are
not online because of storage issues.
What's the best way for me to get the reports for processing? Just
download all *http_requests* files from the web server?
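
If downloading by file name is the way to go, the selection could look something like the sketch below. It assumes the listing is already available as a list of file names; the listing shown is hypothetical (I don't know the server's actual naming scheme), and select_http_requests is a name made up for this example.

```python
import fnmatch


def select_http_requests(names):
    """Keep only names that match the *http_requests* shell pattern."""
    return [name for name in names if fnmatch.fnmatch(name, "*http_requests*")]


# Hypothetical listing; real file names on the server may differ.
listing = [
    "report-http_requests-XXXX.yamloo",
    "report-dns_consistency-XXXX.yamloo",
    "index.html",
]
selected = select_http_requests(listing)
print(selected)
```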
Here is the output of the script. 403-CLOUDFLARE is the famous
"Attention Required!" captcha page. I investigated some of the others
manually and they are mostly custom block pages or generic web server
403s. (There are also a couple of CloudFlare pages that have a different
form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a
block page over Tor.
I'm not sure what's up with the non-Tor 503s from Amazon. They just look
like localized internal service error pages ("ist ein technischer Fehler
aufgetreten", "une erreur de système interne a été décelée"). The one
for blog.com is a generic Nginx "Bad Gateway" page.
non-Tor    Tor             domain
302        403-OTHER       yandex.ru
302        403-OTHER       craigslist.org
301        403-CLOUDFLARE  thepiratebay.se
503-OTHER  301             amazon.de
200        403-CLOUDFLARE  adf.ly
301        403-OTHER       squidoo.com
301        410-OTHER       myspace.com
303        503-OTHER       yelp.com
302        403-CLOUDFLARE  typepad.com
503-OTHER  301             amazon.fr
301        403-CLOUDFLARE  digitalpoint.com
301        403-CLOUDFLARE  extratorrent.com
200        403-OTHER       ezinearticles.com
200        403-OTHER       hubpages.com
200        403-OTHER       2ch.net
200        403-OTHER       hdfcbank.com
302        403-CLOUDFLARE  meetup.com
302        403-CLOUDFLARE  1channel.ch
200        403-CLOUDFLARE  multiply.com
301        403-CLOUDFLARE  clixsense.com
301        403-OTHER       zillow.com
301        403-CLOUDFLARE  odesk.com
301        403-CLOUDFLARE  elance.com
301        403-CLOUDFLARE  youm7.com
200        403-CLOUDFLARE  jquery.com
200        403-CLOUDFLARE  sergey-mavrodi.com
301        403-CLOUDFLARE  templatemonster.com
302        403-CLOUDFLARE  4tube.com
301        403-CLOUDFLARE  mp3skull.com
301        403-CLOUDFLARE  porntube.com
200        403-OTHER       tutsplus.com
200        403-CLOUDFLARE  bitshare.com
301        403-OTHER       sears.com
200        403-CLOUDFLARE  zwaar.net
502-OTHER  200             blog.com
302        403-CLOUDFLARE  myegy.com
301        400-OTHER       mercadolibre.com.ve
302        403-OTHER       jabong.com
301        403-CLOUDFLARE  free-tv-video-online.me
302        403-CLOUDFLARE  traidnt.net
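
To summarize output like this, the Tor-side classifications can be tallied. The following is a rough sketch that assumes the tab-separated lines the script prints (non-Tor class, Tor class, domain); tally is a name made up here, and the sample is just four rows copied from the listing above.

```python
from collections import Counter


def tally(lines):
    """Count Tor-side classifications from 'non-Tor<TAB>Tor<TAB>domain' lines."""
    counts = Counter()
    for line in lines:
        nontor_class, tor_class, domain = line.rstrip("\n").split("\t")
        counts[tor_class] += 1
    return counts


# Four rows from the listing above, in the script's tab-separated form.
sample = [
    "302\t403-OTHER\tyandex.ru",
    "301\t403-CLOUDFLARE\tthepiratebay.se",
    "503-OTHER\t301\tamazon.de",
    "200\t403-CLOUDFLARE\tadf.ly",
]
counts = tally(sample)
for tor_class, n in sorted(counts.items()):
    print("%s\t%d" % (tor_class, n))
```

Rows like amazon.de, where only the non-Tor side is blocked, end up under a plain status code (here 301) and are easy to separate from the Tor-side blocks.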
-------------- next part --------------
#!/usr/bin/env python

# Reads an OONI http_requests report and shows URLs that have known block pages.
#
# First, make an OONI report:
#   ooniprobe -i /usr/share/ooni/decks/complete_no_root.deck
# Then,
#   ./findblocks report-http_requests-XXXX.yamloo

import getopt
import sys
import yaml

from bs4 import BeautifulSoup

# Return (is_block, description) tuple. 4?? and 5?? status codes are considered
# blocks.
def classify_blockpage(response):
    soup = BeautifulSoup(response["body"])
    title = soup.title
    code = response["code"]
    if code == 403 and title is not None and title.get_text() == u"Attention Required! | CloudFlare":
        return True, "403-CLOUDFLARE"
    if code // 100 == 4 or code // 100 == 5:
        return True, "%d-OTHER" % code
    return False, "%d" % code

# Return a (nontor, tor) pair if there are exactly two requests and one is
# nontor and one is tor, or else raise an exception.
def split_requests(requests):
    nontor = None
    tor = None
    for request in requests:
        if request.get("failure") is not None:
            continue
        if not request["request"]["tor"]["is_tor"]:
            if nontor is not None:
                raise ValueError("more than one is_tor:false request")
            nontor = request
        else:
            if tor is not None:
                raise ValueError("more than one is_tor:true request")
            tor = request
    if nontor is None:
        raise ValueError("no is_tor:false request")
    if tor is None:
        raise ValueError("no is_tor:true request")
    return nontor, tor

def process_file(f):
    yamloo = yaml.safe_load_all(f)
    # First YAML doc in YAMLOO file is a header with a different format.
    header = next(yamloo)
    # Sanity check: make sure a header key is in there.
    assert "input_hashes" in header
    for doc in yamloo:
        if doc["control_failure"] is not None:
            print >> sys.stderr, "%s: control_failure=%s" % (doc["input"], doc["control_failure"])
            continue
        if doc["experiment_failure"] is not None:
            print >> sys.stderr, "%s: experiment_failure=%s" % (doc["input"], doc["experiment_failure"])
            continue
        try:
            nontor, tor = split_requests(doc["requests"])
        except ValueError, e:
            print >> sys.stderr, "%s: %s" % (doc["input"], str(e))
            continue
        nontor_isblocked, nontor_class = classify_blockpage(nontor["response"])
        tor_isblocked, tor_class = classify_blockpage(tor["response"])
        if nontor_isblocked != tor_isblocked:
            print "%s\t%s\t%s" % (nontor_class, tor_class, doc["input"])
            sys.stdout.flush()

def process_filename(filename):
    with open(filename) as f:
        return process_file(f)

opts, args = getopt.gnu_getopt(sys.argv[1:], "")
for filename in args:
    process_filename(filename)