Parts of the ooni-pipeline are using datetime.fromtimestamp instead of datetime.utcfromtimestamp. It means the timestamp will be parsed differently according to the time zone in which the code is run:
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time', 0)).strftime("%Y-%m-%d %H:%M:%S")
https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf... start_time = datetime.fromtimestamp(header.get('start_time', 0))
Now, it appears that there's no actual harm done by this bug, because it looks like the timestamps were read and written by someone in UTC+1, so the conversion evens out in the final JSON. But it could go wrong if run by someone in a different time zone.
I noticed this because when I process the new JSON files, I get timestamps that are an hour different than I got parsing the YAML files. To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"), which gives the correct answer because of the reason in the previous paragraph. My old YAML times were off by an hour, because I used datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if it were UTC.
Here's some related information I found while researching the above. It looks like individual test results are *also* using local timestamps, or have their clocks set wrong, or something. Because when I compare the "test_start_time" field to the "Date" header (which is always supposed to be UTC) in a response to an http_requests test, I get some widely divergent values (I saw up to around 9 hours).
For example consider https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS29... It has a difference of about 6 hours: { "test_start_time":"2015-01-01 07:00:29", "test_keys": { "requests": [ { "response": { "headers": { "Date": "Thu, 01 Jan 2015 01:00:35 GMT" ... } } } ] }, probe_cc: "BE" } probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't know where the 6-hour discrepancy is coming from. Note also that the timestamp in the file name is not the same: 060029.
Here's a summary of the http_requests reports from 2015-01-01. Most of them match, but some are off by 6 or 1 hour:
CC: CH test_start_time: 2015-01-01 03:00:28 Date: Thu, 01 Jan 2015 02:59:14 GMT +0 CC: BE test_start_time: 2015-01-01 07:00:29 Date: Thu, 01 Jan 2015 01:00:35 GMT +6 CC: FR test_start_time: 2015-01-01 09:41:08 Date: Thu, 01 Jan 2015 09:41:14 GMT +0 CC: FR test_start_time: 2015-01-01 10:47:20 Date: Thu, 01 Jan 2015 10:47:23 GMT +0 CC: NL test_start_time: 2015-01-01 12:42:16 Date: Thu, 01 Jan 2015 12:42:21 GMT +0 CC: NL test_start_time: 2015-01-01 15:41:32 Date: Thu, 01 Jan 2015 15:41:44 GMT +0 CC: CZ test_start_time: 2015-01-01 15:46:27 Date: Thu, 01 Jan 2015 15:46:34 GMT +0 CC: NL test_start_time: 2015-01-01 17:19:39 Date: Thu, 01 Jan 2015 17:19:43 GMT +0 CC: HK test_start_time: 2015-01-01 18:03:57 Date: Thu, 01 Jan 2015 17:04:10 GMT +1 CC: NL test_start_time: 2015-01-01 21:47:32 Date: Thu, 01 Jan 2015 21:47:37 GMT +0
I used this script:
import json import sys for line in sys.stdin: doc = json.loads(line) try: probe_cc = doc["probe_cc"] test_start_time = doc["test_start_time"] for request in doc["test_keys"]["requests"]: response_time = request["response"]["headers"]["Date"] print "CC: %s test_start_time: %s Date: %s" % (probe_cc, test_start_time, response_time) except KeyError: pass