Currently you can do queries with order_by=test_start_time, order_by=probe_cc, etc., but you cannot do order_by=index.
https://measurements.ooni.torproject.org/api/v1/files?limit=1&order_by=i...
{ "error_code": 400, "error_message": "Invalid order_by" }
As I understand it, the difference between index and test_start_time is that index is always increasing over time (newly uploaded reports always get a higher index than existing reports), while newly uploaded reports can have a test_start_time that is in the past (if the probe was not able to upload for a time, for example).
The ability to order_by=index would allow a slight robustness enhancement in ooni-sync, in the case when a new report is uploaded while ooni-sync is running. Currently ooni-sync always does
order=asc&order_by=test_start_time&limit=1000
That is, starting with the oldest reports, get a page of 1000 reports at a time. The issue is what happens when a report from the past is uploaded while ooni-sync is downloading. In this case ooni-sync will not notice the new report right away. Here is an example with made-up indexes and dates:
* ooni-sync starts downloading page 0 from index=5000 (2016-01-01) to index=5999 (2016-03-31)
* new report with index=9999 (2016-02-01) appears, gets inserted into page 0
* ooni-sync finishes downloading page 0
* ooni-sync starts downloading page 1 from index=5999 (2016-03-31) to index=6998 (2016-04-05)
* ooni-sync finishes downloading page 1
In this example, ooni-sync never downloads the report with index=9999. Also, it sees index=5999 twice, because index=9999 pushed index=5999 from page 0 to page 1.
An order_by=index option would prevent newly uploaded reports from misaligning the pages like that (at least when order=asc is used).
The reasons why this is minor minor minor and hardly worth mentioning:
* index=9999 will get downloaded the next time you run ooni-sync
* it can't cause ooni-sync to skip any already uploaded reports (it would, with order=desc, but that's why ooni-sync uses order=asc)
* ooni-sync will see but won't actually download index=5999 twice
* newly uploaded reports are likely to be on the last page anyway
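The page-shifting scenario can be reproduced with a small in-memory simulation. This is only a sketch: `fetch_page` and `sync` are hypothetical helpers, not part of ooni-sync or the measurements API, and the data is made up and scaled down to pages of 3 reports.

```python
# Each report is a made-up (index, test_start_time) tuple.
def fetch_page(reports, order_by, offset, limit):
    """Return one page of reports, sorted ascending by the chosen key."""
    key = {"index": 0, "test_start_time": 1}[order_by]
    return sorted(reports, key=lambda r: r[key])[offset:offset + limit]


def sync(order_by, limit=3):
    """Page through all reports; a late report is uploaded after page 0."""
    reports = [(5000, "2016-01-01"), (5001, "2016-02-15"), (5002, "2016-03-31"),
               (5003, "2016-04-01"), (5004, "2016-04-03"), (5005, "2016-04-05")]
    seen, offset, uploaded = [], 0, False
    while True:
        page = fetch_page(reports, order_by, offset, limit)
        if not page:
            break
        seen += [r[0] for r in page]
        offset += limit
        if not uploaded:
            # Newest index, but a test_start_time in the past.
            reports.append((9999, "2016-02-01"))
            uploaded = True
    return seen


print(sync("test_start_time"))  # 5002 appears twice, 9999 never appears
print(sync("index"))            # every index appears exactly once
```

With ascending order by index, the late report lands after everything already paged through, so earlier pages never shift.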
Hi David,
Thanks for the feedback.
More replies inline:
On June 27, 2017 at 2:45:56 AM, David Fifield (david@bamsoftware.com) wrote:
> Currently you can do queries with order_by=test_start_time, order_by=probe_cc, etc., but you cannot do order_by=index.
> https://measurements.ooni.torproject.org/api/v1/files?limit=1&order_by=i...
> { "error_code": 400, "error_message": "Invalid order_by" }
> As I understand it, the difference between index and test_start_time is that index is always increasing over time (newly uploaded reports always get a higher index than existing reports), while newly uploaded reports can have a test_start_time that is in the past (if the probe was not able to upload for a time, for example).
Yes, this is correct, and it’s actually not uncommon.
> The ability to order_by=index would allow a slight robustness enhancement in ooni-sync, in the case when a new report is uploaded while ooni-sync is running. Currently ooni-sync always does
> order=asc&order_by=test_start_time&limit=1000
> That is, starting with the oldest reports, get a page of 1000 reports at a time. The issue is what happens when a report from the past is uploaded while ooni-sync is downloading. In this case ooni-sync will not notice the new report right away. Here is an example with made-up indexes and dates:
> * ooni-sync starts downloading page 0 from index=5000 (2016-01-01) to index=5999 (2016-03-31)
> * new report with index=9999 (2016-02-01) appears, gets inserted into page 0
> * ooni-sync finishes downloading page 0
> * ooni-sync starts downloading page 1 from index=5999 (2016-03-31) to index=6998 (2016-04-05)
> * ooni-sync finishes downloading page 1
> In this example, ooni-sync never downloads the report with index=9999. Also, it sees index=5999 twice, because index=9999 pushed index=5999 from page 0 to page 1.
> An order_by=index option would prevent newly uploaded reports from misaligning the pages like that (at least when order=asc is used).
I perfectly understand the issue you are talking about and in fact we actually already support order_by=“index”, though it should probably be documented.
One thing to keep in mind, though, is that while we are ok with guaranteeing that index is an ever-increasing number, it’s not usable as a unique identifier for a report.
That is to say, we can fairly easily ensure that index is always increasing, but it’s much harder to ensure that it maps one-to-one with reports, so you need to keep track of whether or not you have already downloaded a particular report (which I believe ooni-sync does).
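Because index may not map one-to-one with reports, a client can track what it has downloaded by some other identity. The following is a minimal sketch of that idea, not a description of how ooni-sync actually does it: the `filename` field, the sample entries, and the `select_new` helper are all assumptions made for illustration.

```python
def select_new(entries, downloaded):
    """Return the entries not downloaded yet and record them as downloaded."""
    fresh = []
    for entry in entries:
        name = entry["filename"]  # hypothetical field used as the report's identity
        if name in downloaded:
            continue
        downloaded.add(name)
        fresh.append(entry)
    return fresh


downloaded = set()
page = [
    {"index": 10, "filename": "a.json"},
    {"index": 11, "filename": "b.json"},
    {"index": 12, "filename": "a.json"},  # same report listed under another index
]
new = select_new(page, downloaded)
print([e["index"] for e in new])  # [10, 11]
```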
Another thing to keep in mind is that when we roll over to the new measurements API we are going to reset the indexes and start counting again from `max(previous_measurements_api_index)`. This means that once the new ooni-measurements API is deployed, a query with since_index=$last_index_you_saw will actually return every measurement ever collected.
Or, put differently: in the new pipeline, all new indexes will be offset by the highest index that we saw previously.
The reason we have to do this is that otherwise we would have to map the current indexes to the new ones (the new database is backed by the pipeline).
If you anticipate this creating unexpected issues, please let us know and we can maybe find some other solution.
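The effect of the planned index offset can be illustrated with made-up numbers. This sketch rests on two assumptions that the thread does not spell out: that since_index behaves as a strict "greater than" filter, and that every existing report receives a new index above the old maximum. The `since` helper is hypothetical.

```python
def since(indexes, since_index):
    """Hypothetical model of since_index: keep indexes strictly greater."""
    return [i for i in indexes if i > since_index]


old_indexes = [1, 2, 3, 4, 5]   # made-up indexes in the current API
offset = max(old_indexes)       # max(previous_measurements_api_index)

# The same five reports, renumbered by the new pipeline above the offset.
new_indexes = [offset + n for n in range(1, len(old_indexes) + 1)]

last_seen = max(old_indexes)    # the client had already fetched everything
print(since(new_indexes, last_seen))  # [6, 7, 8, 9, 10]: all reports again
```

Under these assumptions, a client that relied only on since_index would be handed every report again after the rollover, which is why it also needs to remember what it has already downloaded.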
> The reasons why this is minor minor minor and hardly worth mentioning:
> * index=9999 will get downloaded the next time you run ooni-sync
> * it can't cause ooni-sync to skip any already uploaded reports (it would, with order=desc, but that's why ooni-sync uses order=asc)
> * ooni-sync will see but won't actually download index=5999 twice
> * newly uploaded reports are likely to be on the last page anyway
Sounds good, let me know if the existing feature will make ooni-sync work better and if what I said above is going to create issues for you.
~ Arturo
On Sat, Jul 01, 2017 at 09:31:17AM +0200, Arturo Filastò wrote:
I perfectly understand the issue you are talking about and in fact we actually already support order_by=“index”, though it should probably be documented.
How do I use it? This request gives me status code 400.
https://measurements.ooni.torproject.org/api/v1/files?limit=10&probe_cc=...
{ "error_code": 400, "error_message": "Invalid order_by" }
Hum, in theory that should work too. I suspect it’s an issue with unicode vs str matching not working as expected.
In any case, in the meantime this should work (using idx instead of index):
https://measurements.ooni.torproject.org/api/v1/files?limit=10&probe_cc=...
~ Arturo
On Sat, Jul 01, 2017 at 10:08:09PM +0200, Arturo Filastò wrote:
> On Sat, Jul 01, 2017 at 09:31:17AM +0200, Arturo Filastò wrote:
> > I perfectly understand the issue you are talking about and in fact we actually
> > already support order_by=“index”, though it should probably be documented.
> How do I use it? This request gives me status code 400.
> https://measurements.ooni.torproject.org/api/v1/files?limit=10&probe_cc=US&order_by=index
> { "error_code": 400, "error_message": "Invalid order_by" }
> Hum, in theory that should work too. I suspect it’s an issue with unicode vs str matching not working as expected.
> In any case, in the meantime this should work (using idx instead of index):
> https://measurements.ooni.torproject.org/api/v1/files?limit=10&probe_cc=...
Cool, thanks for the information. order_by=idx is working for me.
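For reference, a query URL using the idx workaround can be assembled as follows. This is a sketch only: `files_url` is a hypothetical helper, the parameter values are examples taken from the requests quoted in this thread, and combining order=asc with order_by=idx is an assumption. No request is actually sent.

```python
from urllib.parse import urlencode

BASE_URL = "https://measurements.ooni.torproject.org/api/v1/files"


def files_url(**params):
    """Build a files-listing URL from keyword parameters (hypothetical helper)."""
    return BASE_URL + "?" + urlencode(params)


# order_by=idx is the spelling reported to work; order_by=index returned 400.
url = files_url(limit=10, probe_cc="US", order="asc", order_by="idx")
print(url)
```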