[ooni-dev] Mining OONI reports to find server-side Tor blocking (e.g. CloudFlare captchas)

Arturo Filastò art at torproject.org
Mon Jun 22 10:12:50 UTC 2015


On Jun 19, 2015, at 7:03 PM, David Fifield <david at bamsoftware.com> wrote:
> 
> I want to search OONI reports for cases of Tor exits being blocked by
> the server (things like the CloudFlare 403 captcha). The http_requests
> test is great for that because it fetches a bunch of web pages with Tor
> and without.
> 
> The attached script is my first-draft attempt at finding block pages.
> Its output is at the end of this message. You can see it finds a lot of
> CloudFlare captchas and other blocks.
> 
> First, I ran ooniprobe to get a report-http_requests.yamloo file. Then I
> ran the script, which does this:
> 	Skip the first YAML document, because it's a header.
> 	For all other documents:
> 		Skip it if it has non-None control_failure or
> 		experiment_failure--there are a few of these.
> 
> 		Look for exactly two non-failed requests, one with is_tor:false
> 		and one with is_tor:true. Skip it if it lacks these.
> 
> 		Classify the blocked status of the is_tor:false and is_tor:true
> 		responses. 400-series and 500-series status codes are classified
> 		as blocked and all others are unblocked.
> 
> 		Print an output line if the blocked status of is_tor:false does
> 		not match the blocked status of is_tor:true.
> 
> I have a few questions.
> 
> Is this a reasonable way to process reports? Is there a more standard
> way to do e.g. YAMLOO processing?
> 

If you are looking for robust standalone code for parsing YAML reports you should look at:
https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/helpers/report.py

It also skips over badly formatted report entries and already takes care of separating the header from the entries.

If it becomes useful to do so we may eventually refactor that code out of the pipeline, though my dream is that people will not have to parse YAML reports themselves, but can rely on the data pipeline for getting the information they want in JSON (more on that below).
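
As a rough sketch of the procedure quoted above, the comparison step could look like the following once the entries have been parsed into dicts. The field layout is an assumption, not something stated in this thread: each entry is taken to have a "requests" list whose items carry request["tor"]["is_tor"], response["code"] and an optional "failure" key. The function names are hypothetical, so check everything against a real report before relying on it.

```python
def is_blocked(code):
    """Treat 400-series and 500-series status codes as blocked."""
    return 400 <= code < 600


def find_mismatch(entry):
    """Return (non_tor_code, tor_code) when only one side is blocked, else None.

    `entry` is one parsed report entry (a dict). The key names used below
    are assumptions about the http_requests report layout.
    """
    # Skip entries where the control or the experiment failed outright.
    if entry.get("control_failure") is not None:
        return None
    if entry.get("experiment_failure") is not None:
        return None

    # Keep only requests that did not fail and that have a response.
    good = [r for r in entry.get("requests", [])
            if r.get("failure") is None and r.get("response")]
    if len(good) != 2:
        return None

    # Index the two status codes by whether the request went over Tor.
    by_tor = {bool(r["request"]["tor"]["is_tor"]): r["response"]["code"]
              for r in good}
    if set(by_tor) != {False, True}:
        return None

    # Report only entries where the blocked status differs.
    if is_blocked(by_tor[False]) == is_blocked(by_tor[True]):
        return None
    return by_tor[False], by_tor[True]
```

Calling find_mismatch() on every entry after the header and printing the non-None results would give the kind of two-column comparison shown in the quoted output.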

> I know there are many reports at https://ooni.torproject.org/reports/.
> Is that all of them? I think I heard from Arturo that some reports are
> not online because of storage issues.
> 

Currently the torproject mirror is not the most up-to-date repository of reports, because it does not yet sync with the EC2-based pipeline.

You can find the most up-to-date reports (which are published daily) here:

http://api.ooni.io/

If you open the web console you will see a series of HTTP requests being made to the backend. With similar requests you can hopefully obtain the IDs of the specific tests you need and then download them.

> What's the best way for me to get the reports for processing? Just
> download all *http_requests* files from the web server?
> 


With this query you will get all the tests named “http_requests”:

http://api.ooni.io/api/reports?filter=%7B%22test_name%22%3A%20%22http_requests%22%7D

Each dict in the returned list also contains an attribute called “report_filename”, which you can use to download the actual YAML report via:

http://api.ooni.io/reportFiles/$DATE/$REPORT_FILENAME.gz

Note: don’t forget to fill in $DATE (the date in ISO format, YYYY-MM-DD) and to add the .gz extension.
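
A small sketch of how the two URLs above could be built in Python, using only the standard library. The URL patterns are the ones given in this message; the helper names and the example filename are hypothetical, and since the message does not say which field supplies $DATE, the date is simply passed in by the caller. Fetching and decompressing the files is left out.

```python
import json
from urllib.parse import quote

API_BASE = "http://api.ooni.io"


def reports_query_url(test_name):
    """Build the /api/reports query URL that filters on a test name."""
    # The filter is a JSON object, percent-encoded into the query string.
    filt = json.dumps({"test_name": test_name})
    return API_BASE + "/api/reports?filter=" + quote(filt)


def report_file_url(date, report_filename):
    """Build the download URL for one report.

    `date` must be an ISO date string (YYYY-MM-DD); the .gz extension
    is appended here so callers can pass report_filename unchanged.
    """
    return "{}/reportFiles/{}/{}.gz".format(API_BASE, date, report_filename)


if __name__ == "__main__":
    print(reports_query_url("http_requests"))
    # The filename below is a made-up placeholder, not a real report.
    print(report_file_url("2015-06-19", "example-report.yamloo"))
```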


So this is what you can do today by writing a small amount of code, without having to depend on us.

However, I think this test would be quite useful to us too, to identify the various false positives we have in our reports, so I would like to add this intelligence to our database.

The best way to add support for processing this sort of information is writing a batch Spark task that will look for these report entries and add them to our database.

We have currently implemented only one such filter, but will soon add support for the basic heuristics of the other tests too.

You can see how this is done here: 
https://github.com/TheTorProject/ooni-pipeline-ng/blob/master/pipeline/batch/spark_apps.py#L98

Basically in the find_interesting method you get passed an RDD (entries) that you can run various querying and filtering operations on: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.

To add support for spotting captchas I would add two new classes:

HTTPRequestsCaptchasFind(FindInterestingReports)

and

HTTPRequestsCaptchasToDB(InterestingToDB)

If you do this and it gets merged, then we can run this on an ephemeral Hadoop cluster and/or set it up to run automatically every day.

> Here is the output of the script. 403-CLOUDFLARE is the famous
> "Attention Required!" captcha page. I investigated some of the others
> manually and they are mostly custom block pages or generic web server
> 403s. (There are also a couple of CloudFlare pages that have a different
> form.) Overall, almost 4% of the 1000 URLs scanned by ooniprobe served a
> block page over Tor.
> 
> I'm not sure what's up with the non-Tor 503s from Amazon. They just look
> like localized internal service error pages ("ist ein technischer Fehler
> aufgetreten", "une erreur de système interne a été décelée"). The one
> for blog.com is a generic Nginx "Bad Gateway" page.
> 
> non-Tor		Tor		domain
> 302		403-OTHER	yandex.ru
> 302		403-OTHER	craigslist.org
> 301		403-CLOUDFLARE	thepiratebay.se
> 503-OTHER	301		amazon.de
> 200		403-CLOUDFLARE	adf.ly
> 301		403-OTHER	squidoo.com
> 301		410-OTHER	myspace.com
> 303		503-OTHER	yelp.com
> 302		403-CLOUDFLARE	typepad.com
> 503-OTHER	301		amazon.fr
> 301		403-CLOUDFLARE	digitalpoint.com
> 301		403-CLOUDFLARE	extratorrent.com
> 200		403-OTHER	ezinearticles.com
> 200		403-OTHER	hubpages.com
> 200		403-OTHER	2ch.net
> 200		403-OTHER	hdfcbank.com
> 302		403-CLOUDFLARE	meetup.com
> 302		403-CLOUDFLARE	1channel.ch
> 200		403-CLOUDFLARE	multiply.com
> 301		403-CLOUDFLARE	clixsense.com
> 301		403-OTHER	zillow.com
> 301		403-CLOUDFLARE	odesk.com
> 301		403-CLOUDFLARE	elance.com
> 301		403-CLOUDFLARE	youm7.com
> 200		403-CLOUDFLARE	jquery.com
> 200		403-CLOUDFLARE	sergey-mavrodi.com
> 301		403-CLOUDFLARE	templatemonster.com
> 302		403-CLOUDFLARE	4tube.com
> 301		403-CLOUDFLARE	mp3skull.com
> 301		403-CLOUDFLARE	porntube.com
> 200		403-OTHER	tutsplus.com
> 200		403-CLOUDFLARE	bitshare.com
> 301		403-OTHER	sears.com
> 200		403-CLOUDFLARE	zwaar.net
> 502-OTHER	200		blog.com
> 302		403-CLOUDFLARE	myegy.com
> 301		400-OTHER	mercadolibre.com.ve
> 302		403-OTHER	jabong.com
> 301		403-CLOUDFLARE	free-tv-video-online.me
> 302		403-CLOUDFLARE	traidnt.net

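The labels in the quoted table could be produced by something like the sketch below. It is an assumption that CloudFlare captchas are recognised by the "Attention Required!" text in the response body; the script that generated the table may work differently, and as noted in the quoted message this would miss CloudFlare pages that have a different form. The function names are hypothetical and the body is assumed to be a text string.

```python
# Marker text of the CloudFlare captcha page, as named in the quoted message.
CLOUDFLARE_MARKER = "Attention Required!"


def is_blocked(code):
    """Treat 400-series and 500-series status codes as blocked."""
    return 400 <= code < 600


def label(code, body):
    """Return a table label such as '302', '403-CLOUDFLARE' or '403-OTHER'."""
    if not is_blocked(code):
        # Unblocked responses are shown as the bare status code.
        return str(code)
    if body and CLOUDFLARE_MARKER in body:
        return "{}-CLOUDFLARE".format(code)
    return "{}-OTHER".format(code)


def format_row(non_tor_label, tor_label, domain):
    """Join one output row with tabs, in the column order of the table."""
    return "\t".join([non_tor_label, tor_label, domain])
```

Carrying the report_id and the exit IP through to format_row() as extra columns would be a small change once those values are read from each report.
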

Do you have an output that also includes the report_ids and exit IP?

I believe this data would be of great use to us too.

~ Arturo


