[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Mon Oct 30 17:05:05 UTC 2017

#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:50 karsten]:
 > While reviewing our discussion above I discovered another weakness in
 our specification: our naming convention for sanitized log files does not
 take into account that host names may include dashes. For example, there
 are virtual hosts like `cdn-fastly-backend.torproject.org` and physical
 hosts like `oo-hetzner-03.torproject.org`, which we would combine to
 `cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz`.
 Where does the virtual host name end and where
 does the physical host name begin?

 Yikes!  This is really bad, but it is good that we noticed it now.
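
 To make the ambiguity concrete, here is a minimal sketch that lists every
 split the current naming convention permits for the example above.  The
 helper `candidate_splits` is hypothetical and only illustrates the
 problem; it is not part of any existing implementation.

```python
def candidate_splits(file_name, suffix="-access.log-"):
    """Return every (virtual_host, physical_host) pair the name allows."""
    # Everything before "-access.log-" is "<virtual-host>-<physical-host>".
    prefix = file_name.split(suffix)[0]
    # Any dash in the prefix could be the separator between the two names.
    return [(prefix[:i], prefix[i + 1:])
            for i, ch in enumerate(prefix) if ch == "-"]


name = ("cdn-fastly-backend.torproject.org-"
        "oo-hetzner-03.torproject.org-access.log-20171030.xz")
splits = candidate_splits(name)
for virtual, physical in splits:
    print(virtual, "|", physical)
```

 All five candidates are syntactically plausible host name pairs, so a
 parser cannot pick the intended one without extra knowledge such as a
 list of known hosts.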

 >
 > We might consider changing the naming convention to something like
 `<virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz]`, but even for
 that we might find host names producing file names that cannot be parsed
 unambiguously. Maybe we'll have to return to putting physical host names
 in parent directory names and only virtual names in file names. Hmm.

 How much influence do we have on the naming of input files?
 Who decides about input file naming and structuring?

 For the output files, we could use other separators, maybe the underscore?
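
 As a rough sketch of the underscore idea: assuming host names only
 contain letters, digits, dots, and hyphens, an underscore can never occur
 inside a host name, so the parts can be recovered unambiguously.  The
 name layout `<virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]` and
 the functions below are assumptions for illustration, not an agreed
 format.

```python
import re

# Hypothetical output naming scheme, not part of the specification:
# <virtual-host>_<physical-host>_access.log_YYYYMMDD[.xz]
NAME_PATTERN = re.compile(
    r"^(?P<virtual>[A-Za-z0-9.-]+)_(?P<physical>[A-Za-z0-9.-]+)"
    r"_access\.log_(?P<date>\d{8})(?:\.xz)?$")


def build_name(virtual, physical, date, compressed=True):
    """Assemble a file name from its three parts."""
    name = f"{virtual}_{physical}_access.log_{date}"
    return name + ".xz" if compressed else name


def parse_name(file_name):
    """Split a file name back into (virtual host, physical host, date)."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match.group("virtual"), match.group("physical"), match.group("date")


parts = parse_name(build_name("cdn-fastly-backend.torproject.org",
                              "oo-hetzner-03.torproject.org", "20171030"))
print(parts)
```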

 >
 > But going back to the discussion above, I don't think we can make
 assumptions that would allow us to implement your suggestions 1 and 2
 above. Yet I don't see how suggestion 3 would be more error prone. It's up
 to us to design something that is robust, so we'll have to go through the
 possible edge cases and be prepared to handle them.

 I just want to point out that many of the questions and assumptions I
 raised are not based on my preferences, but only on the fact that
 implementation and design need unambiguous information.  Thus, a valid
 answer to 1 to 3 above is that the assumptions of 1 and 2 do not hold.
 That's perfectly fine.

 >
 > And talking about assumptions, I feel like our one-log-per-day
 assumption is unnecessarily strong. I do agree that the naming requirement
 only permits one log file per physical host, virtual host, and date. But
 it seems like it should be up to the web server operators to decide to
 rotate logs only once per day or more often to keep log files small.

 As above:
 How much influence do we have on the structuring of logs?

 Again, I pointed out above (comment:45) that the implementation could
 accommodate this easily.  But to start with, there has to be a valid
 description of the log file names to expect.

 >
 > Here's an idea to reduce the number of edge cases: how about we simply
 ignore the date in the input log file name and only rely on the date given
 in each of the contained log lines?
 >
 > As stated earlier, we could process everything in `in/webstats/` and
 write everything to `out/` and `recent/` except the first and last
 encountered UTC days. In theory, we could even drop the log rotation
 requirement entirely.

 I'm not really worried about the edge cases.

 Initially, the date in the log name was used to ignore log lines that
 actually belong to older logs.  This was introduced because of the current
 way of processing in the shell-python implementation.

 A second requirement is that log files shouldn't change once published.
 This makes accepting the last UTC date more difficult.  We need to know
 when a log file is ready to be published, and so far we have used the
 log's reference date for that purpose.
 Again, I don't have a preference, but the topic needs to be resolved
 before implementation.
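
 For reference, the quoted idea of relying only on the dates inside the
 log lines might look roughly like the sketch below.  It assumes Apache
 style timestamps such as `[30/Oct/2017:00:00:00 +0000]`, English month
 abbreviations (the default C locale), and it reads "first and last
 encountered UTC days" as the earliest and latest dates seen.  The
 function names are made up for this example.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumed timestamp layout: [30/Oct/2017:00:00:00 +0000]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def utc_date(line):
    """Return the UTC date of one log line, or None if it cannot be parsed."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return stamp.astimezone(timezone.utc).date()


def complete_days(lines):
    """Group lines by UTC date and withhold the earliest and latest dates."""
    by_date = defaultdict(list)
    for line in lines:
        day = utc_date(line)
        if day is not None:
            by_date[day].append(line)
    days = sorted(by_date)
    # The first and last days may be incomplete, so they are not returned.
    return {day: by_date[day] for day in days[1:-1]}
```

 Note that this only decides which days look complete within one batch of
 input; it does not by itself guarantee that a later input file cannot
 add lines to a day that has already been published.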

 >
 > This approach should work just fine for bulk processing. And for running
 several times per day we could keep a state file to avoid re-processing
 input files that we already processed before, by storing file name, last-
 modified time, and last contained UTC date. Worst case if we lose that
 state file is that we'll read everything in `in/webstats/` once again.
 >
 > Do you see any conceptual weaknesses in ignoring the date in input
 files? Do you want to give this implementation a try? Otherwise I'd be
 willing to write some proof-of-concept code to see whether this can work.

 How is the "when is a log ready for publication" question solved here?
 Thinking about the process, or introducing a performance measure like
 the state file, is taking the third step before the first.
 Or are you suggesting that we change the requirement of not altering
 sanitized logs once they're published?
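
 Just to pin down what such a state file would contain, here is a minimal
 sketch that stores file name, last-modified time, and last contained UTC
 date as JSON.  The JSON layout and all function names are assumptions for
 illustration.  It only avoids re-processing unchanged input files and
 does not answer the publication question above.

```python
import json
import os


def load_state(path):
    """Read the state file; a missing file means everything is re-read."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def needs_processing(state, log_path):
    """True if the file is new or its last-modified time has changed."""
    entry = state.get(log_path)
    return entry is None or entry["mtime"] != os.path.getmtime(log_path)


def record_processed(state, log_path, last_utc_date):
    """Remember the file's last-modified time and last contained UTC date."""
    state[log_path] = {
        "mtime": os.path.getmtime(log_path),
        "last_utc_date": last_utc_date,
    }


def save_state(state, path):
    """Write to a temporary file first, then replace the old state file."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp, path)
```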

 Conclusion:
 The implementation is not difficult either way.  Currently, the
 specification (on which the implementation needs to rely) is a moving
 target.  I think we should make sure that our specification is correct and
 answers all questions.  Once there is a solid specification, the
 implementation follows easily.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:51>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online