[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Oct 23 18:03:19 UTC 2017


#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------
Changes (by iwakeh):

 * status:  merge_ready => needs_revision


Comment:

 The spec might need to be extended:

 The implementation of the CollecTor webstats module triggered more
 questions about the way original logs are supplied.  One piece of
 information that is so far only supplied indirectly is the cue for when a
 log is finished.  In detail:

 * Functionality for bulk imports of log files is necessary.  Thus, the
 implementation can no longer rely on the system date to decide when a log
 day is complete.  (This distinguishes the reference date as defined in the
 spec from the 'log for a day', which means that all log lines for a given
 date are available.)
 * Implicit assumption: input log files can be empty or not contain any
 valid lines as long as their naming pattern matches the rules.
 * The current spec allows only for one input log per reference date (per
 virtual plus physical host).
 * Log lines for a particular log day could be spread over two successive
 log files (as defined in the current spec).
 * Implicit cue: all log lines are available for a certain reference date
 when the log for the reference date and its successor are available.  This
 also means a log for a day without an immediate successor is not complete,
 i.e. won't be processed.  The cue in the form of the successor could be
 given as an empty successor log file.  This cue has to be supplied from
 outside and cannot be determined by the implementation.
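
 The implicit cue above can be sketched in a few lines of Python.  This is
 only an illustration, not the CollecTor code: the function name
 `complete_log_days` is hypothetical, and it assumes the reference dates of
 all available input logs for one virtual plus physical host have already
 been extracted from the file names.

```python
from datetime import date, timedelta


def complete_log_days(available):
    """Return the reference dates whose immediate successor is also available.

    `available` holds the reference dates of all input logs found for one
    virtual plus physical host (hypothetical input; empty files count too).
    """
    present = set(available)
    one_day = timedelta(days=1)
    # A day without an immediate successor is not complete and is skipped.
    return sorted(d for d in present if d + one_day in present)
```

 With logs for 2017-10-05, 2017-10-06 and 2017-10-08, only 2017-10-05 would
 be processed: 2017-10-06 lacks its successor, and 2017-10-08 is the latest
 log.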


 Related is another question from #22428 comment:36

 > Here's another, related question: what happens if a web server rotates
 logs more often than once per day? At least that's something that we write
 in the specification. I'm not sure how this would work with file names, so
 maybe we in fact require that logs are rotated exactly once per day, and
 we just didn't write that in the specification yet. However, it seems
 rather restrictive to prescribe exact log rotation intervals in order to
 sanitize logs subsequently. Maybe we should be less restrictive here.

 It doesn't really matter if the log lines for a certain day are spread
 over two or more input files.  Currently, only one input file per
 reference date is possible (the first wins).
 More input files could be supplied by extending the input log name pattern
 with a dash followed by an integer, e.g.,
 `scrubbed.torproject.org-access.log-20171006-77.gz`.  In such a case it
 should be required that
 * counting starts with one (arbitrary).
 * there are no gaps, i.e., if there is a file with 3, there have to be
 files with 2 and 1 for the same virtual, physical host, and date
 combination.
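
 A rough sketch of how the suggested name pattern and the two requirements
 could be checked.  The regular expression is derived only from the example
 name above and is an assumption, as is treating a file without a number as
 number 1.  The helper names `parse_name` and `gap_free` are hypothetical,
 and the physical host is left out because it is not part of the example
 name (the sketch assumes all names come from one physical host).

```python
import re
from collections import defaultdict

# Assumed pattern: <virtual-host>-access.log-YYYYMMDD[-N].gz
NAME = re.compile(
    r"^(?P<vhost>.+)-access\.log-(?P<date>\d{8})(?:-(?P<seq>[1-9]\d*))?\.gz$"
)


def parse_name(filename):
    """Split a file name into (virtual host, date string, number) or None."""
    m = NAME.match(filename)
    if m is None:
        return None
    # Assumption: a name without a number counts as number 1.
    seq = int(m.group("seq")) if m.group("seq") else 1
    return m.group("vhost"), m.group("date"), seq


def gap_free(filenames):
    """Map each (virtual host, date) to True if its numbers are 1..n without gaps."""
    groups = defaultdict(set)
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None:
            vhost, day, seq = parsed
            groups[(vhost, day)].add(seq)
    return {
        key: seqs == set(range(1, len(seqs) + 1))
        for key, seqs in groups.items()
    }
```

 Names that do not match the pattern are ignored here; whether they should
 instead be reported is left open.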

 Again, a cue is needed for when the log day is complete.  As above, this
 could be the input file with number 1 for the immediate successor by
 reference date.  This cue, too, could be an empty file.
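
 For numbered input files, the cue might be checked as follows.  Again, this
 is just a sketch under assumptions: `is_day_complete` is a hypothetical
 name, and the (reference date, number) pairs are assumed to have been
 collected already for one virtual host, physical host combination.

```python
from datetime import date, timedelta


def is_day_complete(day, files):
    """Check the cue for `day` given a set of (reference date, number) pairs."""
    numbers = {n for d, n in files if d == day}
    # The day's own files must be numbered 1..n without gaps.
    has_no_gaps = bool(numbers) and numbers == set(range(1, len(numbers) + 1))
    # The cue: file number 1 of the immediate successor exists (may be empty).
    successor_started = (day + timedelta(days=1), 1) in files
    return has_no_gaps and successor_started
```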

 Remarks:
 The way the cue is given is arbitrary, but the current implementation
 suggestion already works with the method described above.
 The naming pattern is just an arbitrary suggestion.  So improvements are
 welcome.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:45>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

