[ooni-dev] Compressing reports?

David Fifield david at bamsoftware.com
Sat Mar 19 05:16:53 UTC 2016

I just downloaded all the http_requests reports from
https://measurements.ooni.torproject.org/. It took quite a long time and
I wonder if we can make things more efficient by compressing the reports
on the server.

This is the command I ran to download the reports:
	wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive https://measurements.ooni.torproject.org/
This resulted in 309 GB and 6387 files.

If I compress the files with xz,
	xz -v *.json
they only take up 29 GB (9%).

Processing xz-compressed files is pretty easy, as long as you don't have
to seek. Just do something like this:
	import json
	import subprocess

	def open_xz(filename):
	    # Stream decompressed data from an external xz process.
	    p = subprocess.Popen(["xz", "-dc", filename], stdout=subprocess.PIPE, bufsize=-1)
	    return p.stdout

	for line in open_xz("report.json.xz"):
	    doc = json.loads(line)
Of course you can do the same thing with gzip.
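
For anyone who would rather not spawn a subprocess, here is a rough sketch of the same idea using Python 3's standard lzma and gzip modules. It assumes the reports hold one JSON document per line, as in the loop above, and that compressed files carry a .xz or .gz suffix. The names open_compressed and iter_docs are made up for illustration and are not part of any OONI tooling.

```python
import gzip
import json
import lzma


def open_compressed(filename):
    """Open a .xz, .gz, or uncompressed file for reading as text.

    The format is chosen from the file name suffix (an assumption
    of this sketch, not something the report format guarantees).
    """
    if filename.endswith(".xz"):
        return lzma.open(filename, "rt", encoding="utf-8")
    if filename.endswith(".gz"):
        return gzip.open(filename, "rt", encoding="utf-8")
    return open(filename, "r", encoding="utf-8")


def iter_docs(filename):
    """Yield one decoded JSON document per non-empty line."""
    with open_compressed(filename) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)
```

Usage would look like "for doc in iter_docs("report.json.xz"): ...". Like the subprocess version, this reads the file sequentially and does not help if you need to seek.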

