
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Arturo,

- ---
Currently for our data processing needs we have begun to bucket reports by date (every date corresponds to when a certain report has been submitted to the collector).

What I would like to know is of the two following options what would be most convenient to you for accessing the data.

The options are:

OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or plain)

ex.
  - https://ooni.torproject.org/reports/json/2016-01-01.json
  - https://ooni.torproject.org/reports/json/2016-01-02.json
  - https://ooni.torproject.org/reports/json/2016-01-03.json
etc.

OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them inside of a directory with the timestamp of when it was collected

ex.
  - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z NL-AS3265-http_requests-v1-probe.json.gz
  - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z US-AS3265-dns_consistency-v1-probe.json.gz
etc.

Since we are internally using the daily batches for doing the processing and analysis of reports unless there is an explicit request to publish them on a test run basis we will probably end up going for option A, so don’t be shy to reply :)
- ---

I agree with David that it will be easier to access specific ooni-probe test results using option (B) (i.e. the current solution). What benefits did you identify when considering a switch to option (A)?

A few reasons to stick with option (B) include:

  - Retaining the ability to run ooni-pipeline on a subset of reports associated with a given time period by filtering on the date prefix and on substrings within key names;
  - Retaining the ability to distribute small units of work easily among subprocesses;
  - Retaining the idempotent nature of ooni-pipeline and the luigi framework: switching from lots of small files to a single large file for a given day will invariably increase the time required to recover from failures (e.g. if a small dnst-based test fails to normalise, you'll have to renormalise everything for that day as opposed to a single test);
  - Developers will not have to download hundreds of megabytes of data in order to access a traceroute test result that is only a few kilobytes in size; and
  - It's generally easier to work with smaller files than it is to work with big files.

Cheers,
Tyler

GPG fingerprint: 8931 45DF 609B EE2E BC32 5E71 631E 6FC3 4686 F0EB (tyler@tylerfisher.org)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJWn1+qAAoJEGMeb8NGhvDr6hUP/R6XcXEejwT8DYuKLoVBpujs
CqXtIj88A5JYhtt1npRF/a0peNihzFRbYuQpAUX/D1EdPa1UHDwCuqp1hO642xIJ
WePgWHWIS7qzYK/i5LbMXC+oWmfAA0J25SawmjyNWclK+NCgIwQ1k7kleyFP7Ul5
PFjJKLCcuqJkQl1hnZlW7YhgLYZAf2QHOD1cJauLM5aDCNBDUSgfIP+/P/xfFLq3
XqLGFBfNrMXaWmOfDGLR7tV4mS3R4M5L7rL66AiomQULdld4cuLonAht4CWLDuhV
MKyKrURixRqgTUoing59OcjgOcGEVQD5P5NaMuVruU1hFbHW0wr430/mEq8pdDW9
BQHZh/VZ/f2xz4rjWiE8Mfl3mgmGfbFiT6WMKQTRY3vr5mwbmefg0/IneJ1eHtIo
A/XMX579DQt3V19tMa7rO4TjpdBKIWwJ8/6mwwaw9QrS/I2pmlg8AscLU0oQtMqc
3CcWELOdoV7uIPBVg3TfiL+RLDSxzIJIp0k6IM19tkwZAxGLmD+cZvDo+dxME3fd
Y7+fYxovuQrt4vvhaPFU15EDzFWHMoMqlNUSOeC0FuIhpbYbM0Dqn1qL4EScJ+PA
+p/rtMTZLiiIJLtYuZCRiZjbaMvqsfZAmPEZ5ZgSShJKsB4jhV4/5LsmHedYCHeW
S/IdJ81xUzTfxMUZBqgI
=kztT
-----END PGP SIGNATURE-----
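
As a rough illustration of the first reason listed above (selecting a subset of reports by date prefix and by substrings within key names), here is a minimal Python sketch. It is not taken from ooni-pipeline: the function name `select_reports` and the example keys are hypothetical, and the hyphen-joined file names are an assumption loosely modelled on the option (B) examples.

```python
def select_reports(keys, date_prefix, substrings=()):
    """Keep keys that start with date_prefix and contain every given substring."""
    return [
        key
        for key in keys
        if key.startswith(date_prefix) and all(s in key for s in substrings)
    ]


# Hypothetical key names (assumed layout, one file per ooni-probe test run).
EXAMPLE_KEYS = [
    "reports/json/2016-01-01/20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz",
    "reports/json/2016-01-01/20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz",
    "reports/json/2016-01-02/20160102T101500Z-NL-AS3265-dns_consistency-v1-probe.json.gz",
]

# All dns_consistency reports collected on 2016-01-01.
selected = select_reports(
    EXAMPLE_KEYS, "reports/json/2016-01-01/", ["dns_consistency"]
)
```

With one file per day, as in option (A), the same selection would presumably require reading the whole daily stream and filtering its records, since the file name carries only the date.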