[tor-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Sep 5 20:10:34 UTC 2017


#23243: write a spec for web-server-access log descriptors
-------------------------------------+------------------------------
 Reporter:  iwakeh                   |          Owner:  metrics-team
     Type:  enhancement              |         Status:  needs_review
 Priority:  Medium                   |      Milestone:
Component:  Metrics/Metrics website  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:                           |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+------------------------------

Comment (by iwakeh):

 The actual date (i.e., the system date) only matters for publishing the
 logs.  All other date checks are relative to the date on which the
 (original) log file was finalized.  I introduced the term 'reference date'
 for this.
 The diff:
 {{{
 --- webstats-spec.3.txt
 +++ webstats-spec.4.txt
 @@ -33,7 +33,7 @@

  Tor's webservers are configured to rotate logs at least once per day,
 which does not necessarily happen at 00:00:00 UTC. As a result, log files
 may contain requests from up to two UTC days and several log files may
 contain requests that have been started on the same UTC day.

 -All access log files written by Tor's webservers follow the naming
 convention <hostname>.torproject.org-access.log-YYYYMMDD.
 +All access log files written by Tor's webservers follow the naming
 convention <hostname>.torproject.org-access.log-YYYYMMDD, where 'YYYYMMDD'
 is the date of the rotation and finalization of the log file.  This date
 will be referred to as 'reference date' in the following sections.

  # Sanitizing steps

 @@ -41,16 +41,16 @@

  ## Discarding non-matching files

 -As first safeguard against publishing log files that are too sensitive,
 we discard all files not matching the naming convention for access logs.
 This is to prevent, for example, error logs from slipping through.
 +As first safeguard against publishing log files that are too sensitive,
 we discard all files not matching the naming convention for access logs.
 This is to prevent, for example, error logs from slipping through.  In
 addition, the log file's name is supposed to contain the reference date,
 which is used to determine the validity of log lines.  If the log file's
 name doesn't end in a date string of the format 'YYYYMMDD' the entire file
 is discarded.

  ## Discarding non-matching lines

 -Log files are expected to contain exactly 1 request per line. We process
 these files line by line and discard any lines not matching the following
 criteria:
 +Log files are expected to contain exactly 1 request per line.  We process
 these files line by line and discard any lines not matching the following
 criteria:

   - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
 %b") or a compatible format like one of Tor's privacy formats. It is
 acceptable if lines start with a format that is compatible to the Common
 Log Format and continue with additional fields. Those additional fields
 will later be discarded, but the line will not be discarded because of
 them.
   - The request IP address starts with "0.0.0.", followed by any number
 between 0 and 255.
 - - The time the request was received does not lie in the future.
 - - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past. (Bulk imports of
 archived logs are exempt from this requirement.)
 + - The time the request was received does not lie in the future of the
 reference date.
 + - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past of the reference date.
   - The request protocol is HTTP.
   - The request method is either GET or HEAD.
   - The final status of the request is neither 400 ("Bad Request") nor 404
 ("Not Found").
 @@ -83,7 +83,7 @@

  <virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-
 YYYYMMDD[.xz]

 -Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed. Sanitized log files are published and
 not further modified in the future. (Again, bulk imports of archived logs
 are exempt from this.)
 +Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed.

  As last and certainly not least important sanitizing step, all rewritten
 log lines are sorted alphabetically, so that request order cannot be
 inferred from sanitized log files.
 }}}

 Also, I don't see the need to state that the files won't be changed in
 the future.  That doesn't seem to belong in a spec.  Besides, we might
 want to re-sanitize these files if a privacy issue suddenly turns up in
 fields that seem benign now (as with bridge descriptors, for example).
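
 For illustration, the reference-date handling described in the diff above
 could look roughly like the following Python sketch.  It is not part of the
 spec or of any existing sanitizer code: the function names, the regular
 expression, and the reading of 'future of the reference date' as a
 comparison of UTC calendar dates are all assumptions made for this example.

```python
import re
from datetime import date, datetime, timedelta, timezone
from typing import Optional

# Assumed pattern for <hostname>.torproject.org-access.log-YYYYMMDD;
# the spec text only gives the naming convention, not a regex.
ACCESS_LOG_NAME = re.compile(r"^.+\.torproject\.org-access\.log-([0-9]{8})$")


def reference_date(file_name: str) -> Optional[date]:
    """Return the reference date from the file name, or None to discard the file."""
    match = ACCESS_LOG_NAME.match(file_name)
    if match is None:
        return None  # not an access log, or no YYYYMMDD suffix
    try:
        return datetime.strptime(match.group(1), "%Y%m%d").date()
    except ValueError:
        return None  # eight digits that do not form a calendar date


def request_date_is_valid(request_time: datetime, ref: date) -> bool:
    """Check a request time against the reference date.

    request_time must be timezone-aware.  Assumption: both rules are
    evaluated on UTC calendar dates, so a request is kept if its UTC date
    is the reference date or the day before it.
    """
    request_day = request_time.astimezone(timezone.utc).date()
    return ref - timedelta(days=1) <= request_day <= ref
```

 A caller would first derive the reference date from the file name, discard
 the file if that yields None, and then apply request_date_is_valid() to the
 parsed %t field of every line.  Comparing whole UTC dates rather than exact
 timestamps is a choice of this sketch; the spec wording may warrant a
 stricter check.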

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:13>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online