[metrics-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Sep 5 19:14:20 UTC 2017


#23243: write a spec for web-server-access log descriptors
-------------------------------------+------------------------------
 Reporter:  iwakeh                   |          Owner:  metrics-team
     Type:  enhancement              |         Status:  needs_review
 Priority:  Medium                   |      Milestone:
Component:  Metrics/Metrics website  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:                           |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+------------------------------

Comment (by karsten):

 Okay, I tried to specify that, but please
 [https://trac.torproject.org/projects/tor/attachment/ticket/23243/webstats-spec.3.txt review carefully].
 The part that made this a bit more complex is that there are actually two
 places where we need to look at dates/times: 1) when deciding whether to
 discard lines that are too old or too new, and 2) when deciding when to
 publish a sanitized file and never touch it again. Maybe I
 overcomplicated this, so if you see a way to simplify what I wrote,
 please say so!

 Here's the diff, in case that helps with reviewing:

 {{{
 diff --git a/webstats-spec.txt b/webstats-spec.txt
 index 7e46449..48c0287 100644
 --- a/webstats-spec.txt
 +++ b/webstats-spec.txt
 @@ -3,7 +3,6 @@ Tor webserver logs

  Next steps:
   - Replace webserver with web server which seems to be Less Bad English
 (karsten).
 - - Find out what exact delay we'll need for publishing sanitized logs
 (iwakeh?)
   - Turn this document into XML (karsten)
   - Code the decisions (iwakeh)
   - Try out the code on actual logs (iwakeh; karsten can make more logs
 available)
 @@ -30,6 +29,8 @@ LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t
 \"%r\" %>s %b \"%{Referer}i\"

  The main difference to Apache's Common Log Format is that request IP
 addresses are removed and the field is instead used to encode whether the
 request came in via http:// (0.0.0.0), via https:// (0.0.0.1), or via the
 site's onion service (0.0.0.2).

 +Tor's webservers are configured to use UTC as timezone, which is also
 highly recommended when rewriting request times to "00:00:00" in order for
 the subsequent sanitizing steps to work correctly. Alternatively, if the
 system timezone is not set to UTC, webservers should keep request times
 unchanged and let them be handled by the subsequent sanitizing steps.
 +
  Tor's webservers are configured to rotate logs at least once per day,
 which does not necessarily happen at 00:00:00 UTC. As a result, log files
 may contain requests from up to two UTC days and several log files may
 contain requests that have been started on the same UTC day.

  All access log files written by Tor's webservers follow the naming
 convention <hostname>.torproject.org-access.log-YYYYMMDD.
 @@ -48,6 +49,8 @@ Log files are expected to contain exactly 1 request per
 line. We process these f

   - Lines begin with Apache's Common Log Format ("%h %l %u %t \"%r\" %>s
 %b") or a compatible format like one of Tor's privacy formats. It is
 acceptable if lines start with a format that is compatible to the Common
 Log Format and continue with additional fields. Those additional fields
 will later be discarded, but the line will not be discarded because of
 them.
   - The request IP address starts with "0.0.0.", followed by any number
 between 0 and 255.
 + - The time the request was received does not lie in the future.
 + - The date the request was received, after converting the request time
 to UTC, does not lie more than 1 day in the past. (Bulk imports of
 archived logs are exempt from this requirement.)
   - The request protocol is HTTP.
   - The request method is either GET or HEAD.
   - The final status of the request is neither 400 ("Bad Request") nor 404
 ("Not Found").
 @@ -80,9 +83,7 @@ Sanitized log files may additionally be sorted into
 directories by virtual host

  <virtual-host>/YYYY/MM/<virtual-host>-<physical-host>-access.log-
 YYYYMMDD[.xz]

 -Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until all log files containing requests
 from that date are guaranteed to be processed. After this delay, the
 sanitized log files are published and not further modified.
 -
 -XXX What's the delay? End of UTC day + 24 hours? Check current script!
 +Due to the fact that the date when a log file was rotated and the start
 date of contained requests may not always overlap, we need to delay
 publishing sanitized log files until the start date of requests in UTC
 plus 2 days. After this delay, all log files containing requests from that
 date are assumed to be processed. Sanitized log files are published and
 not further modified in the future. (Again, bulk imports of archived logs
 are exempt from this.)

  As last and certainly not least important sanitizing step, all rewritten
 log lines are sorted alphabetically, so that request order cannot be
 inferred from sanitized log files.
 }}}
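
 To make the two date/time decisions in the diff easier to review, here is
 a minimal Python sketch of how they might be coded. It is not taken from
 the spec or from any existing script. The names parse_request_time,
 is_acceptable, may_publish and the bulk_import flag are hypothetical. It
 assumes that "not more than 1 day in the past" means the request's UTC
 date is today or yesterday, and that "start date plus 2 days" means
 publishing is allowed once the current UTC date reaches that date.

```python
from datetime import date, datetime, timedelta, timezone

# Apache's %t field, e.g. "[05/Sep/2017:00:00:00 +0000]".
# Note: %b relies on English month names (the default C locale).
APACHE_TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S %z]"


def parse_request_time(field: str) -> datetime:
    """Parse an Apache %t field into a timezone-aware datetime."""
    return datetime.strptime(field, APACHE_TIME_FORMAT)


def is_acceptable(request_time: datetime, now: datetime,
                  bulk_import: bool = False) -> bool:
    """Decision 1: keep or discard a line based on its request time.

    Assumed reading of the spec text: a line is discarded if its request
    time lies in the future, or if its UTC date is more than 1 day before
    the current UTC date. Bulk imports skip only the "too old" check.
    """
    if request_time > now:
        return False
    if bulk_import:
        return True
    request_date = request_time.astimezone(timezone.utc).date()
    today = now.astimezone(timezone.utc).date()
    return today - request_date <= timedelta(days=1)


def may_publish(request_date: date, now: datetime) -> bool:
    """Decision 2: whether the sanitized file for request_date is final.

    Assumed reading of the spec text: the file may be published once the
    current UTC date is at least the request date plus 2 days.
    """
    today = now.astimezone(timezone.utc).date()
    return today >= request_date + timedelta(days=2)


if __name__ == "__main__":
    now = datetime(2017, 9, 5, 19, 14, 20, tzinfo=timezone.utc)
    line_time = parse_request_time("[04/Sep/2017:00:00:00 +0000]")
    print(is_acceptable(line_time, now))
    print(may_publish(line_time.date(), now))
```

 Under this reading, a request dated 2017-09-04 is still accepted on
 2017-09-05, while the sanitized file for 2017-09-04 is not published
 before 2017-09-06.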

 If you think it's good, I'll continue with the remaining next steps.
 Thanks!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

