[tor-commits] [spec/master] Update to reflect normalisations performed in pipeline the specs of the base and the HTTP data formats

art at torproject.org art at torproject.org
Mon May 9 17:57:26 UTC 2016


commit 2e519c8da7946467e3615f05d0da1a12be920fe8
Author: Arturo Filastò <arturo at filasto.net>
Date:   Wed Jan 27 12:04:53 2016 +0100

    Update to reflect normalisations performed in pipeline the specs of the base and the HTTP data formats
---
 data-formats/df-000-base.md  | 166 +++++++++++-----------
 data-formats/df-001-httpt.md | 328 ++++++++++++++++++++++++++++---------------
 2 files changed, 301 insertions(+), 193 deletions(-)

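The base format in this patch describes published reports as a series of JSON documents separated by newline characters. A minimal sketch of reading such a report, one measurement per line, could look like the following. The helper name `iter_measurements` and the two-entry sample are hypothetical and only illustrate the layout; they are not part of ooniprobe or the pipeline.

```python
import json


def iter_measurements(lines):
    """Yield one dict per non-empty line of a newline-delimited JSON report."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between documents
        yield json.loads(line)


# Hypothetical two-entry report using field names from the spec in this patch.
sample = (
    '{"test_name": "http_requests", "probe_cc": "CH", "test_keys": {}}\n'
    '{"test_name": "http_requests", "probe_cc": "ZZ", "test_keys": {}}\n'
)
entries = list(iter_measurements(sample.splitlines()))
```

Because each line is parsed independently, a large report can be streamed from an open file object without loading it whole.
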
diff --git a/data-formats/df-000-base.md b/data-formats/df-000-base.md
index d0d5378..c1431e3 100644
--- a/data-formats/df-000-base.md
+++ b/data-formats/df-000-base.md
@@ -14,113 +14,116 @@ In this directory shall go only the data format specifications of Test
 templates. The Test specific data formats should go in the specification of the
 test.
 
-All data produced from ooniprobe tests is in YAML formatted.
+Data produced by the ooniprobe client can be either in YAML or JSON format.
 
-Every test that is interested in reporting with ooniprobe MUST use such data
-format.
+YAML is used when writing reports to the user's filesystem, while JSON is used
+as a format for the published processed reports.
 
 # Base test data format
 
 This specification is of the basic data format common to all ooniprobe test
 outputs.
 
-Data Format Version: df-000-base-001
+Every entry contains the following common fields. The `test_keys` key instead
+contains all the keys that are specific to the test in question.
+
+The JSON data format is made up of a series of JSON documents separated by
+newline characters.
+
+Data Format Version: 0.2.0
 
 ## Specification
 
-    ###########################################
-    # OONI Probe Report for HTTP Requests test
-    # Wed Jan 30 21:03:56 2013
-    ###########################################
-    ---
-    options:
-      A dict containing the keys and values of options passed to the test.
+```
+{
+    "input": "If the test takes an input this will contain the value of it"
+        " these can be for example URLs, hostnames, IPs, etc.",
 
-    probe_asn:
-      The AS Number of the probe (prefixed by AS, ex. AS1234) or null if includeasn is set to false.
+    "input_hashes": "A list of the SHA256 hashes, encoded as hex, "
+        "of the inputs to this test.",
 
-    probe_cc:
-      The two letter country code of the probe or null if inlcudecc is set to false.
+    "id": "This is an identifier of this particular measurement",
 
-    probe_ip:
-      The IPv4 address of the probe or null if includeip is set to false.
+    "bucket_date": "A date in the format of %Y-%m-%d that indicates "
+        "when the report was processed by the data pipeline",
 
-    software_name:
-      The name of the software that has generated such report (ex. ooniprobe).
+    "data_format_version": "0.1.0|0.2.0",
 
-    software_version:
-      The version of the software that has generated such report (ex. 0.0.10).
+    "report_filename": "{bucket_date}/{timestamp as '%Y%m%dT%h%M%sZ'}-{probe_cc}-{probe_asn}-{test_name}-{report_id}-{data_format_version}-{probe|backend}.json",
 
-    start_time:
-      The time at which the test was started in seconds since epoch.
+    "options": ["A list of options passed to the test as command line arguments"],
 
-    test_name:
-      The name of the test that such report is for (ex. HTTP Requests).
+    "probe_asn": "The AS Number of the probe (prefixed by AS, ex. AS1234) "
+        "or AS0 if includeasn is set to false.",
 
-    test_version:
-      The version of the test that such report is for (ex. 0.0.10).
+    "probe_cc": "The two letter country code of the probe or ZZ if "
+        "includecountry is set to false.",
 
-    data_format_version:
-      The version string of the data format being used by the test (ex. httpt-000)
-    
-    report_id:
-      A 64 character mixed case string that is generated by the client used to identify the report.
+    "probe_ip": "The IPv4 address of the probe or 127.0.0.1 if "
+        "includeip is set to false.",
 
-    test_helpers:
-      A dictionary with as keys the names of the options and values the addresses of the test helpers used
-    ...
+    "probe_city": "The name of the city of the probe or null if "
+        "includecity is set to false.",
 
-# Example output
+    "report_id": "20140130T111423Z_ELNkuajQzUWfktBupbfZUxseQDczEvEaIhtciykhoLSuiNiCCV",
 
-    ###########################################
-    # OONI Probe Report for HTTP Invalid Request Line test
-    # Mon Jan 28 21:33:59 2013
-    ###########################################
-    ---
-    options:
-      collector: null
-      help: 0
-      logfile: null
-      parallelism: '10'
-      pcapfile: null
-      reportfile: null
-      resume: 0
-      subargs: [-b, 'http://93.95.227.200']
-      test: nettests/manipulation/http_invalid_request_line.py
-    probe_asn: null
-    probe_cc: null
-    probe_ip: null
-    software_name: ooniprobe
-    software_version: 0.0.10
-    start_time: 1359401639.0
-    test_name: HTTP Invalid Request Line
-    test_version: 0.1.3
-    test_helpers: {backend: "http://93.95.227.200"}
-    report_id: xxxxxxxXXXxXXXxxxxxxxxxxxxXXXxXXXxsxxxXXXxXXXxxxXXXxXXXxxxXXXxX
-    ...
-
-
-# Report Entry data format
-
-Every iteration over an input given to a test will produce a Report Entry.
-
-A Report Entry is a YAML Stream as specified here:
-http://www.yaml.org/spec/1.2/spec.html#id2801681
-
-Here are specified the keys that will always be present inside of every report
-entry.
+    "software_name": "The name of the software that has generated "
+        "such report (ex. ooniprobe)",
 
-## Specification
+    "software_version": "The version of the software used to generate this report",
+
+    "backend_version": "The version of the backend that collected this measurement",
+
+    "test_helpers": null,
 
-input:
-  The item we this specific test instance is referring to. null in case no
-  input is being iterated over.
+    "test_name": "The name of the test that generated "
+        "this measurement (ex. http_requests)",
 
-test_runtime:
-  `float` the runtime of the test
+    "test_version": "",
 
-test_start_time:
-  `float` seconds since epoch from the starting of the test.
+    "test_runtime": null,
+
+    "test_start_time": "Timestamp of when the measurement was performed in "
+        "UTC time coordinates (ex. 2015-08-24 12:02:23)",
+
+    "test_keys": {
+        "The keys that are specific to the test"
+    }
+}
+```
+
+# Example output
+
+```
+{
+    "bucket_date": "2015-11-22",
+    "data_format_version": "0.2.0",
+    "id": "07873c37-9441-47e3-93b8-94db10444c64",
+    "input": "http://example.com/",
+    "options": [
+        "-f",
+        "37e60e13536f6afe47a830bfb6b371b5cf65da66d7ad65137344679b24fdccd1"
+    ],
+    "probe_asn": "AS0",
+    "probe_cc": "CH",
+    "probe_ip": "127.0.0.1",
+    "report_filename": "2015-11-22/20151122T103202Z-CH-AS0-http_requests-XsQk40qrhgvJEdbXAUFzYjbbGCBuEsc1UV5RAAFXo4hysiUo3qyTfo4NTr7MjiwN-0.1.0-probe.json",
+    "report_id": "XsQk40qrhgvJEdbXAUFzYjbbGCBuEsc1UV5RAAFXo4hysiUo3qyTfo4NTr7MjiwN",
+    "software_name": "ooniprobe",
+    "software_version": "1.3.1",
+    "backend_version": "1.1.4",
+    "test_helpers": {},
+    "input_hashes": [
+        "37e60e13536f6afe47a830bfb6b371b5cf65da66d7ad65137344679b24fdccd1"
+    ],
+    "test_name": "http_requests",
+    "test_runtime": 0.1842639446,
+    "test_start_time": "2015-11-22 10:32:02",
+    "test_version": "0.2.4",
+    "test_keys": {
+    }
+}
+```
 
 # Error strings
 
@@ -164,4 +167,3 @@ error_string:
 * This will be the error message if the task has timed out: `task_timed_out`
 
 * Every other failure: 'unknown_failure %s' % str(failure.value)
-
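
The `report_filename` pattern in the base format above can be sketched as a small function. This is an illustration under assumptions, not the pipeline's code: it assumes the timestamp comes from `test_start_time`, and it reads the `%h` and `%s` in the pattern as hour and second (`%H`, `%S`), which is what the example output suggests. The function name and the `filename_version` parameter are hypothetical; the latter exists because the example filename carries `0.1.0` while the entry's `data_format_version` is `0.2.0`.

```python
from datetime import datetime


def build_report_filename(entry, source="probe", filename_version=None):
    """Assemble a report_filename from the common fields of one entry."""
    # Assumption: the timestamp is derived from test_start_time (UTC).
    started = datetime.strptime(entry["test_start_time"], "%Y-%m-%d %H:%M:%S")
    timestamp = started.strftime("%Y%m%dT%H%M%SZ")
    version = filename_version or entry["data_format_version"]
    return "{}/{}-{}-{}-{}-{}-{}-{}.json".format(
        entry["bucket_date"], timestamp, entry["probe_cc"],
        entry["probe_asn"], entry["test_name"], entry["report_id"],
        version, source,
    )


# Values copied from the example output in this patch.
example = {
    "bucket_date": "2015-11-22",
    "data_format_version": "0.2.0",
    "probe_asn": "AS0",
    "probe_cc": "CH",
    "report_id": "XsQk40qrhgvJEdbXAUFzYjbbGCBuEsc1UV5RAAFXo4hysiUo3qyTfo4NTr7MjiwN",
    "test_name": "http_requests",
    "test_start_time": "2015-11-22 10:32:02",
}
filename = build_report_filename(example, filename_version="0.1.0")
```
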
diff --git a/data-formats/df-001-httpt.md b/data-formats/df-001-httpt.md
index ed8fb63..8479a6d 100644
--- a/data-formats/df-001-httpt.md
+++ b/data-formats/df-001-httpt.md
@@ -1,6 +1,6 @@
 # HTTPTest template data format
 
-Data Format Version: df-001-httpt-000
+Data Format Version: 0.2.0
 
 This is the specification of the data format that every test that is
 based on ooni.templates.httpt.HTTPTest shall be using.
@@ -10,116 +10,222 @@ data format.
 
 ## Specification
 
-    ---
-    requests:
-      - request:
-          headers:
-            `dict` the headers of the request
-          body:
-            `string` the body of the response
-
-          url:
-            `string` the URL of the request being made (if prefixed with 's' it means
-            the request was made via the Tor SOCKS proxy)
-
-        method:
-            `string` the HTTP method being used
-
-        response:
-          headers:
-            `dict` the headers of the response
-          body:
-            `string` the body of the response
-
-          code:
-            `int` the response status code
-
-        failure:
-            `string` (optional) this will be set if an error was returned.
-            For a list of error messages see the Error strings section of
-            df-000-base.md.
-
-      - request:
-            etc. etc.
-
-    socksproxy:
-      null if no socks proxy was used for this request or an IP port
-      combination (as a string) if a SOCKS proxy was used.
-
-    agent:
-      either 'agent' if 30X redirects should not be followed or 'redirect' if
-      they should be followed.
-
-    ...
+```
+"agent": "agent|redirect depending on whether the client "
+    "will ignore 30X redirects or follow them.",
+
+"socksproxy": "null | IP:PORT of the socksproxy to be used to "
+    "perform the experiment requests on",
+
+"requests": [
+    {
+        "failure": "This will contain an error string for why the "
+            "request failed or null if no failure occurred",
+
+        "request": {
+            "body": "If the request of the client contains some payload it "
+                "will be in here, otherwise it is set to null",
+
+            "headers": {
+                "Header-Name": "Header-Value"
+            },
+
+            "method": "GET|POST|PUT",
+            "tor": {
+                "exit_ip": "The address of the Tor exit used for the request or "
+                    "null if Tor was not used or the test was run with an older version of ooniprobe.",
+
+                "exit_name": "The name of the Tor exit used for the request or "
+                    "null if Tor was not used or the test was run with an older version of ooniprobe.",
+
+                "is_tor": "true|false depending on whether "
+                    "this request was done over Tor."
+            },
+            "url": "The URL of the request that has been performed."
+        },
+        "response": {
+            "body": "The body of the response or null if no response was found. If the response is binary "
+                "then this will be a dictionary containing the format in which the binary data is encoded and "
+                "the encoded data (ex. {"format": "base64", "data": "AQI="}). "
+                "Currently the only type of format supported is base64.",
+
+            "headers": {
+                "Header-Name": "Header-Value"
+            }
+        },
+        "response_length": null
+    }
+]
+```
 
 ## Example output
 
-    input: http://google.com/
-    agent: agent
-    requests:
-    - request:
-        body: null
-        headers:
-        - - User-Agent
-          - - &id001 [Mozilla/4.0 (compatible; MSIE 5.0; Windows NT 5.1; .NET CLR 1.1.4322),
-              'Internet Explorer 5, Windows XP']
-        method: GET
-        url: http://google.com/
-      response:
-        body: ''
-        code: 301
-        headers:
-        - - Content-Length
-          - ['219']
-        - - X-XSS-Protection
-          - [1; mode=block]
-        - - Expires
-          - ['Tue, 29 Jan 2013 14:29:19 GMT']
-        - - Server
-          - [gws]
-        - - Connection
-          - [close]
-        - - Location
-          - ['http://www.google.com/']
-        - - Cache-Control
-          - ['public, max-age=2592000']
-        - - Date
-          - ['Sun, 30 Dec 2012 14:29:19 GMT']
-        - - X-Frame-Options
-          - [SAMEORIGIN]
-        - - Content-Type
-          - [text/html; charset=UTF-8]
-    - request:
-        body: null
-        headers:
-        - - User-Agent
-          - - *id001
-        method: GET
-        url: shttp://google.com/
-      response:
-        body: ''
-        code: 301
-        headers:
-        - - Content-Length
-          - ['219']
-        - - X-XSS-Protection
-          - [1; mode=block]
-        - - Expires
-          - ['Tue, 29 Jan 2013 14:29:20 GMT']
-        - - Server
-          - [gws]
-        - - Connection
-          - [close]
-        - - Location
-          - ['http://www.google.com/']
-        - - Cache-Control
-          - ['public, max-age=2592000']
-        - - Date
-          - ['Sun, 30 Dec 2012 14:29:20 GMT']
-        - - X-Frame-Options
-          - [SAMEORIGIN]
-        - - Content-Type
-          - [text/html; charset=UTF-8]
-    socksproxy: null
-
-
+```
+{
+    "bucket_date": "2015-11-22",
+    "data_format_version": "0.2.0",
+    "id": "07873c37-9441-47e3-93b8-94db10444c64",
+    "input": "http://googleusercontent.com/",
+    "options": [
+        "-f",
+        "37e60e13536f6afe47a830bfb6b371b5cf65da66d7ad65137344679b24fdccd1"
+    ],
+    "probe_asn": "AS0",
+    "probe_cc": "CH",
+    "probe_ip": "127.0.0.1",
+    "report_filename": "2015-11-22/20151122T103202Z-CH-AS0-http_requests-XsQk40qrhgvJEdbXAUFzYjbbGCBuEsc1UV5RAAFXo4hysiUo3qyTfo4NTr7MjiwN-0.1.0-probe.json",
+    "report_id": "XsQk40qrhgvJEdbXAUFzYjbbGCBuEsc1UV5RAAFXo4hysiUo3qyTfo4NTr7MjiwN",
+    "software_name": "ooniprobe",
+    "software_version": "1.3.1",
+    "test_helpers": {},
+    "backend_version": "1.1.4",
+    "input_hashes": [
+        "37e60e13536f6afe47a830bfb6b371b5cf65da66d7ad65137344679b24fdccd1"
+    ],
+    "probe_city": null,
+    "test_name": "http_requests",
+    "test_runtime": 0.1842639446,
+    "test_start_time": "2015-11-22 10:32:02",
+    "test_version": "0.2.4",
+    "test_keys": {
+        "agent": "agent",
+        "body_length_match": null,
+        "body_proportion": null,
+        "control_failure": "socks_host_unreachable",
+        "experiment_failure": "dns_lookup_error",
+        "factor": 0.8,
+        "headers_diff": null,
+        "headers_match": null,
+        "requests": [
+            {
+                "failure": "dns_lookup_error",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2) Gecko/20100115 Firefox/3.6"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": false,
+                        "exit_name": false,
+                        "is_tor": false
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            },
+            {
+                "failure": "socks_host_unreachable",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": null,
+                        "exit_name": null,
+                        "is_tor": true
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            },
+            {
+                "failure": "dns_lookup_error",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2) Gecko/20100115 Firefox/3.6"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": null,
+                        "exit_name": null,
+                        "is_tor": false
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            },
+            {
+                "failure": "dns_lookup_error",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2) Gecko/20100115 Firefox/3.6"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": null,
+                        "exit_name": null,
+                        "is_tor": false
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            },
+            {
+                "failure": "socks_host_unreachable",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2) Gecko/20100115 Firefox/3.6"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": null,
+                        "exit_name": null,
+                        "is_tor": true
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            },
+            {
+                "failure": "socks_host_unreachable",
+                "request": {
+                    "body": null,
+                    "headers": {
+                        "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2) Gecko/20100115 Firefox/3.6"
+                    },
+                    "method": "GET",
+                    "tor": {
+                        "exit_ip": null,
+                        "exit_name": null,
+                        "is_tor": true
+                    },
+                    "url": "http://googleusercontent.com/"
+                },
+                "response": {
+                    "body": null,
+                    "headers": {}
+                },
+                "response_length": null
+            }
+        ],
+        "socksproxy": null,
+        "start_time": 1448184722.0
+    }
+}
+```
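
The HTTP format above says a response `body` is either a string, `null`, or, for binary responses, a dictionary holding the encoding format and the encoded data, with base64 as the only format currently supported. A minimal sketch of how a consumer might normalise these three cases follows. The function name `decode_body` is hypothetical, and raising on an unknown format is a choice made for this example rather than something the spec requires.

```python
import base64


def decode_body(body):
    """Return text bodies unchanged, binary bodies as bytes, and None as None."""
    if body is None:
        return None  # no response was found
    if isinstance(body, dict):
        # Binary body: {"format": "base64", "data": "..."}
        if body.get("format") != "base64":
            raise ValueError("unsupported body format: %r" % body.get("format"))
        return base64.b64decode(body["data"])
    return body  # plain string body


# The binary example given in the spec.
decoded = decode_body({"format": "base64", "data": "AQI="})
```
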
