[ooni-dev] Minor measurements API feature request: order_by=index
art at torproject.org
Sat Jul 1 07:31:17 UTC 2017
Thanks for the feedback.
More replies inline:
On June 27, 2017 at 2:45:56 AM, David Fifield (david at bamsoftware.com) wrote:
Currently you can do queries with order_by=test_start_time,
order_by=probe_cc, etc., but you cannot do order_by=index.
"error_message": "Invalid order_by"
As I understand it, the difference between index and test_start_time is
that index is always increasing over time (newly uploaded reports always
get a higher index than existing reports), while newly uploaded reports
can have a test_start_time that is in the past (if the probe was not
able to upload for a time, for example).
Yes, this is correct, and it’s actually not uncommon.
The ability to order_by=index would allow a slight robustness
enhancement in ooni-sync, in the case when a new report is uploaded
while ooni-sync is running. Currently ooni-sync always does
That is, starting with the oldest reports, get a page of 1000 reports at
a time. The issue is what happens when a report from the past is
uploaded while ooni-sync is downloading. In this case ooni-sync will not
notice the new report right away. Here is an example with made-up
indexes and dates:
ooni-sync starts downloading page 0 from index=5000 (2016-01-01) to index=5999 (2016-03-31)
new report with index=9999 (2016-02-01) appears, gets inserted into page 0
ooni-sync finishes downloading page 0
ooni-sync starts downloading page 1 from index=5999 (2016-03-31) to index=6998 (2016-04-05)
ooni-sync finishes downloading page 1
In this example, ooni-sync never downloads the report with index=9999.
Also, it sees index=5999 twice, because index=9999 pushed index=5999
from page 0 to page 1.
An order_by=index option would prevent newly uploaded reports from
unaligning the pages like that (at least when order=asc is used).
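The example above can be sketched as a small simulation. This is only an illustration with made-up data: the page size of 3, the dictionaries, and the helper names `fetch_page` and `sync` are assumptions, not part of ooni-sync or the measurements API. It models offset-based paging where a late report is uploaded between two page requests.

```python
def fetch_page(reports, key, page, page_size):
    """Return one page of reports sorted ascending by `key` (offset paging)."""
    ordered = sorted(reports, key=lambda r: r[key])
    return ordered[page * page_size:(page + 1) * page_size]


def sync(reports, key, late_report, page_size=3):
    """Download two pages; `late_report` is uploaded between the two requests."""
    reports = list(reports)
    seen = [r["index"] for r in fetch_page(reports, key, 0, page_size)]
    reports.append(late_report)  # a new upload arrives mid-sync
    seen += [r["index"] for r in fetch_page(reports, key, 1, page_size)]
    return seen


# Made-up reports: index grows with upload order.
existing = [
    {"index": 1, "test_start_time": "2016-01-01"},
    {"index": 2, "test_start_time": "2016-02-15"},
    {"index": 3, "test_start_time": "2016-03-31"},
    {"index": 4, "test_start_time": "2016-04-02"},
    {"index": 5, "test_start_time": "2016-04-05"},
]
# Uploaded late: highest index, but a test_start_time in the past.
late = {"index": 6, "test_start_time": "2016-02-01"}

by_time = sync(existing, "test_start_time", late)
by_index = sync(existing, "index", late)

print(by_time)   # index 3 appears twice, index 6 is missed
print(by_index)  # every index appears exactly once
```

Ordered by test_start_time, the late report lands in page 0 and pushes the last entry of page 0 onto page 1. Ordered by index, it is appended to the end and the pages stay aligned.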
I understand the issue you are describing, and in fact we already support order_by=“index”, though it should probably be documented.
One thing to keep in mind, though, is that while we are ok with guaranteeing that index is an ever-increasing number, it’s not useful as a unique identifier of a report.
That is to say, we can fairly easily ensure that index is always increasing, but it’s much harder to ensure that it maps one to one with reports, so you need to keep track of whether or not you have already downloaded a particular report (which I believe ooni-sync does).
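A minimal sketch of that kind of bookkeeping follows. The `report_id` field and the `select_new` helper are hypothetical names used only for illustration; the point is that the "already downloaded" check is keyed on something other than index, since the same report may show up under more than one index.

```python
def select_new(listing, downloaded):
    """Return entries not downloaded yet and record them in `downloaded`."""
    fresh = []
    for entry in listing:
        key = entry["report_id"]  # assumed stable identifier, NOT the index
        if key not in downloaded:
            downloaded.add(key)
            fresh.append(entry)
    return fresh


downloaded = set()

first = select_new(
    [{"index": 10, "report_id": "a"}, {"index": 11, "report_id": "b"}],
    downloaded,
)
# Report "b" shows up again under a different index; only "c" is new.
second = select_new(
    [{"index": 12, "report_id": "b"}, {"index": 13, "report_id": "c"}],
    downloaded,
)

print([e["report_id"] for e in first])
print([e["report_id"] for e in second])
```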
Another thing to keep in mind is that as we roll over to the new measurements API we are going to be resetting indexes, and we shall start counting again from `max(previous_measurements_api_index)`. This means that once the new ooni-measurements API is deployed, you will say since_index=$last_index_you_saw and you will actually get back every measurement ever collected.
Or, put differently: in the new pipeline all new indexes will be offset by the highest index that we saw previously.
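To make the arithmetic concrete, here is a rough sketch of one reading of the rollover. The numbers are made up, and two things are assumptions: that every measurement (old and new) is re-indexed above the old maximum, and that since_index filters with a strict "greater than" comparison. The `since_index` function below is a stand-in for the query parameter, not the real implementation.

```python
old_indexes = [1, 2, 3, 4, 5]   # made-up indexes served by the old API
offset = max(old_indexes)       # max(previous_measurements_api_index)

# Assumption: the new pipeline assigns fresh indexes to all measurements,
# counting upward from the old maximum.
total_measurements = 7
new_indexes = [offset + n for n in range(1, total_measurements + 1)]


def since_index(indexes, last_seen):
    """Stand-in for the since_index filter: keep indexes above last_seen."""
    return [i for i in indexes if i > last_seen]


last_seen = 4  # the highest index a client saw before the rollover
result = since_index(new_indexes, last_seen)

print(new_indexes)
print(result == new_indexes)  # the client gets everything back
```

Because every new index is above the old maximum, any last_seen value recorded against the old API selects the whole data set, which is why a client would re-download everything unless it also tracks which reports it already has.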
The reason why we are going to have to do this is that otherwise we will have to map the current indexes to the new ones (the new database is backed by the pipeline).
If you anticipate this creating unexpected issues, please let us know and we can maybe find some other solution.
The reasons why this is minor minor minor and hardly worth mentioning:
* index=9999 will get downloaded the next time you run ooni-sync
* it can't cause ooni-sync to skip any already uploaded reports (it
would, with order=desc, but that's why ooni-sync uses order=asc)
* ooni-sync will see but won't actually download index=5999 twice
* newly uploaded reports are likely to be on the last page anyway
Sounds good, let me know if the existing feature will make ooni-sync work better and if what I said above is going to create issues for you.