<div dir="ltr">Would you be okay with monthly archives of all tests, or would you want the archives separated by test type?<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Fri, Mar 18, 2016 at 10:16 PM, David Fifield <span dir="ltr"><<a href="mailto:david@bamsoftware.com" target="_blank">david@bamsoftware.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">I just downloaded all the http_requests reports from<br>
<a href="https://measurements.ooni.torproject.org/" rel="noreferrer" target="_blank">https://measurements.ooni.torproject.org/</a>. It took quite a long time and<br>
I wonder if we can make things more efficient by compressing the reports<br>
on the server.<br>
<br>
This is the command I ran to download the reports:<br>
        wget -c -r -l 2 -np --no-directories -A '*http_requests*' --no-http-keep-alive <a href="https://measurements.ooni.torproject.org/" rel="noreferrer" target="_blank">https://measurements.ooni.torproject.org/</a><br>
This resulted in 309 GB and 6387 files.<br>
<br>
If I compress the files with xz,<br>
        xz -v *.json<br>
they only take up 29 GB (9%).<br>
<br>
Processing xz-compressed files is pretty easy, as long as you don't have<br>
to seek. Just do something like this:<br>
        import json<br>
        import subprocess<br>
<br>
        def open_xz(filename):<br>
            # Stream decompressed output from xz; no temporary file needed.<br>
            p = subprocess.Popen(["xz", "-dc", filename], stdout=subprocess.PIPE, bufsize=-1)<br>
            return p.stdout<br>
<br>
        for line in open_xz("report.json.xz"):<br>
            doc = json.loads(line)<br>
            ...<br>
Of course you can do the same thing with gzip.<br>
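For gzip you don't even need a subprocess: Python's built-in gzip module streams line by line on its own. A minimal self-contained sketch (the file path and sample records here are made up for illustration, not real OONI data):<br>
<br>
```python
import gzip
import json
import os
import tempfile

# Hypothetical stand-ins for two lines of an OONI report.
records = [{"test_name": "http_requests", "id": 1},
           {"test_name": "http_requests", "id": 2}]

path = os.path.join(tempfile.mkdtemp(), "report.json.gz")

# Write one JSON document per line, gzip-compressed.
with gzip.open(path, "wt") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Stream it back: gzip.open in text mode yields one decoded line at a
# time, so memory use stays flat regardless of report size.
with gzip.open(path, "rt") as f:
    docs = [json.loads(line) for line in f]

assert docs == records
```
<br>
As with the xz version, this only supports sequential reads, which is all the line-per-document report format requires.<br>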
_______________________________________________<br>
ooni-dev mailing list<br>
<a href="mailto:ooni-dev@lists.torproject.org">ooni-dev@lists.torproject.org</a><br>
<a href="https://lists.torproject.org/cgi-bin/mailman/listinfo/ooni-dev" rel="noreferrer" target="_blank">https://lists.torproject.org/cgi-bin/mailman/listinfo/ooni-dev</a><br>
</blockquote></div><br></div>