-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA256
Hi Arturo,
- ---
Currently for our data processing needs we have begun to bucket reports by date (every date corresponds to when a certain report has been submitted to the collector). What I would like to know is of the two following options what would be most convenient to you for accessing the data.
The options are:
OPTION A: Have 1 JSON stream for every day of measurements (either gzipped or plain)
ex.
  - https://ooni.torproject.org/reports/json/2016-01-01.json
  - https://ooni.torproject.org/reports/json/2016-01-02.json
  - https://ooni.torproject.org/reports/json/2016-01-03.json
  etc.
OPTION B: Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected
ex.
- - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z NL-AS3265-http_requests-v1-probe.json.gz
- - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z US-AS3265-dns_consistency-v1-probe.json.gz
etc.
Since we are internally using the daily batches for doing the processing and analysis of reports unless there is an explicit request to publish them on a test run basis we will probably end up going for option A, so don’t be shy to reply :)
- ---
I agree with David that it will be easier to access specific ooni-probe test results using option (B) (i.e. the current solution).
What benefits did you identify when considering a switch to option (A)?
A few reasons to stick with option (B) include:
- - Retaining the ability to run ooni-pipeline on a subset of reports associated with a given time period by filtering on date prefixes and substrings within key names;
- - Retaining the ability to distribute small units of work easily among subprocesses;
- - Retaining the idempotent nature of ooni-pipeline and the luigi framework: switching from lots of small files to a single large file for a given day will invariably increase the time required to recover from failures (e.g. if a small dnst-based test fails to normalise, you'll have to renormalise everything as opposed to a single test);
- - Developers will not have to download hundreds of megabytes of data in order to access a traceroute test result that is only a few kilobytes in size; and
- - It's generally easier to work with smaller files than it is to work with big files.
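To illustrate the first point, here is a rough sketch of the kind of filtering that per-test-run files make cheap. The helper name and the key names are made up for illustration (modelled loosely on the option (B) layout) and are not taken from ooni-pipeline:

```python
# Hypothetical helper, not part of ooni-pipeline: select report keys
# by date prefix and by a substring within the key name.
def select_reports(keys, date_prefix, substring=""):
    """Return the keys that start with date_prefix and contain substring."""
    return [
        key for key in keys
        if key.startswith(date_prefix) and substring in key
    ]


# Illustrative key names only; real keys may be laid out differently.
keys = [
    "2016-01-01/20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz",
    "2016-01-01/20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz",
    "2016-01-02/20160102T101500Z-NL-AS3265-dns_consistency-v1-probe.json.gz",
]

# Only the dns_consistency reports submitted on 2016-01-01.
selected = select_reports(keys, "2016-01-01/", "dns_consistency")
print(selected)
```

With a single daily file, the same selection would mean reading and parsing the whole day's stream before discarding most of it.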
Cheers,
Tyler
GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB (tyler@tylerfisher.org)