[ooni-dev] Request for feedback on resuming of publishing of the reports

Tyler Fisher apt.get.apps at gmail.com
Wed Jan 20 10:21:44 UTC 2016



Hi Arturo,

---
Currently, for our data processing needs, we have begun to bucket
reports by date (each date corresponds to the day on which a report
was submitted to the collector). What I would like to know is which
of the following two options would be most convenient for you when
accessing the data.

The options are:

OPTION A:
Have 1 JSON stream for every day of measurements (either gzipped or
plain)

ex.
 - https://ooni.torproject.org/reports/json/2016-01-01.json
 - https://ooni.torproject.org/reports/json/2016-01-02.json
 - https://ooni.torproject.org/reports/json/2016-01-03.json
etc.

OPTION B:
Have 1 JSON stream for every ooni-probe test run and publish them
inside of a directory with the timestamp of when it was collected

ex.
 - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz
 - https://ooni.torproject.org/reports/json/2016-01-01/20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz
etc.

Since we are internally using the daily batches for processing and
analysing the reports, unless there is an explicit request to publish
them on a per-test-run basis we will probably end up going with
option A, so don’t be shy to reply :)
---
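
For context, here is a minimal sketch of what consuming a daily batch
under option (A) might look like (Python; the exact URL, the .json.gz
suffix and the one-JSON-document-per-line framing are all assumptions
based on the examples above). It makes the trade-off concrete: you
pull down the whole day even when you only want a single report.

    import gzip
    import json
    import urllib.request

    # Hypothetical option A URL, following the naming quoted above.
    URL = "https://ooni.torproject.org/reports/json/2016-01-01.json.gz"

    with urllib.request.urlopen(URL) as resp:
        # Assumes the daily stream is gzipped, line-delimited JSON.
        with gzip.GzipFile(fileobj=resp) as stream:
            for line in stream:
                measurement = json.loads(line)
                # Even one small traceroute result costs the full
                # day's download under this layout.
                print(measurement.get("test_name"))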

I agree with David that it will be easier to access specific
ooni-probe test results using option (B) (i.e. the current solution).
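
For example, here is a rough sketch (Python; the file names below are
the illustrative ones from the quoted example, and the regex simply
mirrors that naming pattern) of how the metadata encoded in each
option (B) file name lets you select exactly the reports you want
from a directory listing before downloading anything:

    import re

    # Pattern mirroring the option B names quoted above:
    # <timestamp>-<country>-<ASN>-<test name>-v<version>-probe.json[.gz]
    FILENAME_RE = re.compile(
        r"(?P<ts>\d{8}T\d{6}Z)-(?P<cc>[A-Z]{2})-(?P<asn>AS\d+)-"
        r"(?P<test>[a-z_]+)-v(?P<ver>\d+)-probe\.json(\.gz)?$"
    )

    def match_reports(filenames, test_name, date_prefix=""):
        """Yield only the reports for one test, optionally one day."""
        for name in filenames:
            m = FILENAME_RE.match(name)
            if m and m.group("test") == test_name \
                 and m.group("ts").startswith(date_prefix):
                yield name

    listing = [
        "20160101T204732Z-NL-AS3265-http_requests-v1-probe.json.gz",
        "20160101T204732Z-US-AS3265-dns_consistency-v1-probe.json.gz",
    ]
    # Fetch just the one dns_consistency report for 2016-01-01.
    print(list(match_reports(listing, "dns_consistency", "20160101")))

Under option (A), the same selection means downloading and scanning
the whole day's stream.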

What benefits did you identify when considering a switch to option (A)?

A few reasons to stick with option (B) include:

- Retaining the ability to run ooni-pipeline on a subset of reports
associated with a given time period by filtering by date prefix and
substrings within key names (as in the file-name sketch above);
- Retaining the ability to distribute small units of work easily among
subprocesses;
- Retaining the idempotent nature of ooni-pipeline and the luigi
framework - switching from lots of small files to a single large file
for a given day will invariably increase the time required to recover
from failures (i.e. if a small dnst-based test fails to normalise,
you'll have to renormalise everything as opposed to a single test;
see the sketch after this list);
- Developers will not have to download hundreds of megabytes of data
in order to access a traceroute test result that is only a few
kilobytes in size; and
- It's generally easier to work with smaller files than it is to work
with big files.
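
To make the idempotency point concrete, here is a minimal luigi
sketch (hypothetical task and path names - this is not the actual
ooni-pipeline code): each report file maps to one task with one
output target, so after a crash luigi only re-runs the tasks whose
outputs are missing.

    import glob
    import json
    import luigi

    class NormaliseReport(luigi.Task):
        """Normalise a single per-test-run report file (option B)."""
        input_path = luigi.Parameter()

        def output(self):
            # One output per report; its existence marks this unit
            # done, so a re-run after a failure skips it entirely.
            return luigi.LocalTarget(self.input_path + ".normalised")

        def run(self):
            # Assumes line-delimited JSON inside each report file.
            with open(self.input_path) as src, \
                    self.output().open("w") as dst:
                for line in src:
                    entry = json.loads(line)
                    # ... per-entry normalisation would go here ...
                    dst.write(json.dumps(entry) + "\n")

    class NormaliseDay(luigi.WrapperTask):
        """Fan out over every report collected on a given day."""
        date = luigi.DateParameter()

        def requires(self):
            # Hypothetical local mirror of the option B layout.
            pattern = "reports/%s/*-probe.json" % self.date.isoformat()
            return [NormaliseReport(input_path=p)
                    for p in glob.glob(pattern)]

With option (A), the equivalent task's output would be the whole
normalised day, so one malformed entry forces renormalising
everything.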

Cheers,
Tyler

GPG fingerprint: 8931 45DF 609B EE2E BC32  5E71 631E 6FC3 4686 F0EB
(tyler at tylerfisher.org)

