[ooni-dev] ooni-pipeline using local time zone when parsing timestamps

David Fifield david at bamsoftware.com
Thu Mar 17 06:04:34 UTC 2016


Parts of the ooni-pipeline are using datetime.fromtimestamp instead of
datetime.utcfromtimestamp. It means the timestamp will be parsed
differently according to the time zone in which the code is run:

https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf5b5c0148904e888c/pipeline/batch/daily_workflow.py#L521
        entry['test_start_time'] = datetime.fromtimestamp(entry.pop('start_time',
                                        0)).strftime("%Y-%m-%d %H:%M:%S")

https://github.com/TheTorProject/ooni-pipeline/blob/355ac1780f1f05eefb9ea3bf5b5c0148904e888c/pipeline/batch/daily_workflow.py#L583
        start_time = datetime.fromtimestamp(header.get('start_time', 0))

Now, it appears that there's no actual harm done by this bug, because it
looks like the timestamps were read and written by someone in UTC+1, so
the conversion evens out in the final JSON. But it could go wrong if run
by someone in a different time zone.

I noticed this because when I process the new JSON files, I get
timestamps that are an hour different than I got parsing the YAML files.
To parse the JSON times, I use datetime.datetime.strptime(..., "%Y-%m-%d %H:%M:%S"),
which gives the correct answer because of the reason in the previous
paragraph. My old YAML times were off by an hour, because I used
datetime.datetime.utcfromtimestamp(...), parsing a UTC+1 timestamp as if
it were UTC.


Here's some related information I found while researching the above. It
looks like individual test results are *also* using local timestamps, or
have their clocks set wrong, or something. Because when I compare the
"test_start_time" field to the "Date" header (which is always supposed
to be UTC) in a response to an http_requests test, I get some widely
divergent values (I saw up to around 9 hours).

For example consider
https://measurements.ooni.torproject.org/2015-01-01/20150101T060029Z-BE-AS29182-http_requests-no_report_id-0.1.0-probe.json
It has a difference of about 6 hours:
	{
		"test_start_time":"2015-01-01 07:00:29",
		"test_keys": {
			"requests": [
				{
					"response": {
						"headers": {
							"Date": "Thu, 01 Jan 2015 01:00:35 GMT"
							...
						}
					}
				}
			]
		},
		probe_cc: "BE"
	}
probe_cc is "BE", Belgium, which is supposed to be UTC+1, so I don't
know where the 6-hour discrepancy is coming from. Note also that the
timestamp in the file name is not the same: 060029.

Here's a summary of the http_requests reports from 2015-01-01. Most of
them match, but some are off by 6 or 1 hour:

CC: CH  test_start_time: 2015-01-01 03:00:28  Date: Thu, 01 Jan 2015 02:59:14 GMT  +0
CC: BE  test_start_time: 2015-01-01 07:00:29  Date: Thu, 01 Jan 2015 01:00:35 GMT  +6
CC: FR  test_start_time: 2015-01-01 09:41:08  Date: Thu, 01 Jan 2015 09:41:14 GMT  +0
CC: FR  test_start_time: 2015-01-01 10:47:20  Date: Thu, 01 Jan 2015 10:47:23 GMT  +0
CC: NL  test_start_time: 2015-01-01 12:42:16  Date: Thu, 01 Jan 2015 12:42:21 GMT  +0
CC: NL  test_start_time: 2015-01-01 15:41:32  Date: Thu, 01 Jan 2015 15:41:44 GMT  +0
CC: CZ  test_start_time: 2015-01-01 15:46:27  Date: Thu, 01 Jan 2015 15:46:34 GMT  +0
CC: NL  test_start_time: 2015-01-01 17:19:39  Date: Thu, 01 Jan 2015 17:19:43 GMT  +0
CC: HK  test_start_time: 2015-01-01 18:03:57  Date: Thu, 01 Jan 2015 17:04:10 GMT  +1
CC: NL  test_start_time: 2015-01-01 21:47:32  Date: Thu, 01 Jan 2015 21:47:37 GMT  +0

I used this script:

import json
import sys
for line in sys.stdin:
    doc = json.loads(line)
    try:
        probe_cc = doc["probe_cc"]
        test_start_time = doc["test_start_time"]
        for request in doc["test_keys"]["requests"]:
            response_time = request["response"]["headers"]["Date"]
            print "CC: %s  test_start_time: %s  Date: %s" % (probe_cc, test_start_time, response_time)
    except KeyError:
        pass


More information about the ooni-dev mailing list