
I just downloaded all the http_requests reports from https://measurements.ooni.torproject.org/. It took quite a long time, and I wonder if we can make things more efficient by compressing the reports on the server.

This is the command I ran to download the reports:

    wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive https://measurements.ooni.torproject.org/

This resulted in 309 GB across 6387 files. If I compress the files with xz,

    xz -v *.json

they only take up 29 GB (9% of the original size).

Processing xz-compressed files is pretty easy, as long as you don't have to seek. Just do something like this:

    import json
    import subprocess

    def open_xz(filename):
        # Decompress to stdout and return a file-like object that can be
        # iterated over line by line.
        p = subprocess.Popen(["xz", "-dc", filename],
                             stdout=subprocess.PIPE, bufsize=-1)
        return p.stdout

    # xz renames compressed files to *.json.xz, so open that name.
    for line in open_xz("report.json.xz"):
        doc = json.loads(line)
        ...

Of course you can do the same thing with gzip.
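For completeness, here is a minimal sketch of the gzip variant, assuming the reports were compressed with gzip instead (the filename report.json.gz is hypothetical). It uses Python's standard gzip module, so no child process is needed:

    import gzip
    import json

    # gzip.open in text mode returns a file-like object that can be
    # iterated line by line, just like the xz subprocess pipe above.
    with gzip.open("report.json.gz", "rt") as f:
        for line in f:
            doc = json.loads(line)  # hypothetical filename: report.json.gz
            ...

One thing the subprocess approach buys you is that decompression runs in a separate process, so it can proceed in parallel with the JSON parsing in Python.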