[tor-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Oct 30 15:18:33 UTC 2017


#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by karsten):

 While reviewing our discussion above I discovered another weakness in our
 specification: our naming convention for sanitized log files does not take
 into account that host names may include dashes. For example, there are
 virtual hosts like `cdn-fastly-backend.torproject.org` and physical hosts
 like `oo-hetzner-03.torproject.org`, which we would combine to
 `cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org-access.log-20171030.xz`.
 Where does the virtual host name end and where does the physical host name
 begin?
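
 To make the ambiguity concrete, here is a minimal sketch that lists every
 way such a file name could be split into a virtual and a physical host
 name. The function name `candidate_splits` and the suffix pattern are
 assumptions made for illustration; they are not part of the specification.

```python
import re

# Assumed suffix of a sanitized log file name: -access.log-YYYYMMDD[.xz]
SUFFIX = re.compile(r"-access\.log-(\d{8})(\.xz)?$")


def candidate_splits(file_name):
    """Return all (virtual, physical) pairs obtained by splitting at a dash."""
    match = SUFFIX.search(file_name)
    if match is None:
        return []
    hosts = file_name[:match.start()]
    splits = []
    for index, char in enumerate(hosts):
        # Every inner dash is a possible boundary between the two host names.
        if char == "-" and 0 < index < len(hosts) - 1:
            splits.append((hosts[:index], hosts[index + 1:]))
    return splits


name = ("cdn-fastly-backend.torproject.org-oo-hetzner-03.torproject.org"
        "-access.log-20171030.xz")
for virtual, physical in candidate_splits(name):
    print(virtual, "|", physical)
```

 For the example above this yields five candidate splits. Even a heuristic
 that requires a dot in both parts still leaves three of them, so the file
 name alone does not tell us which one was intended.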

 We might consider changing the naming convention to something like
 `<virtual-host>-access.log-<physical-host>-YYYYMMDD[.xz]`, but even for
 that we might find host names producing file names that cannot be parsed
 unambiguously. Maybe we'll have to return to putting physical host names
 in parent directory names and only virtual host names in file names. Hmm.
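
 As a rough sketch of that last option: if the physical host name is the
 parent directory and only the virtual host name is in the file name,
 parsing no longer depends on where dashes occur. The directory layout
 `<physical-host>/<virtual-host>-access.log-YYYYMMDD[.xz]` and the function
 name `parse_path` are assumptions made for illustration.

```python
import re
from pathlib import PurePosixPath

# Assumed file name pattern: <virtual-host>-access.log-YYYYMMDD[.xz]
FILE_NAME = re.compile(
    r"^(?P<virtual>.+)-access\.log-(?P<date>\d{8})(?:\.xz)?$")


def parse_path(path):
    """Return (physical host, virtual host, date) or None if not matching."""
    path = PurePosixPath(path)
    match = FILE_NAME.match(path.name)
    if match is None:
        return None
    # The physical host is taken from the directory, not from the file name.
    return path.parent.name, match.group("virtual"), match.group("date")


print(parse_path(
    "in/webstats/oo-hetzner-03.torproject.org/"
    "cdn-fastly-backend.torproject.org-access.log-20171030.xz"))
```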

 But going back to the discussion above, I don't think we can make the
 assumptions that would allow us to implement your suggestions 1 and 2.
 Yet I don't see how suggestion 3 would be more error-prone. It's up to us
 to design something that is robust, so we'll have to go through the
 possible edge cases and be prepared to handle them.

 And talking about assumptions, I feel like our one-log-per-day assumption
 is unnecessarily strong. I do agree that the naming requirement only
 permits one log file per physical host, virtual host, and date. But it
 seems like it should be up to the web server operators to decide whether
 to rotate logs once per day or more often to keep log files small.

 Here's an idea to reduce the number of edge cases: how about we simply
 ignore the date in the input log file name and only rely on the date given
 in each of the contained log lines?

 As stated earlier, we could process everything in `in/webstats/` and write
 everything to `out/` and `recent/` except the first and last encountered
 UTC days, which may only be partially covered by the input. In theory, we
 could even drop the log rotation requirement entirely.
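
 A minimal sketch of this idea, assuming log lines carry an Apache-style
 timestamp in square brackets (for example `[30/Oct/2017:00:00:00 +0000]`)
 and reading "first and last encountered UTC days" as the earliest and
 latest dates found. The function names `utc_date` and `complete_days` are
 hypothetical.

```python
import re
from collections import defaultdict
from datetime import datetime, timezone

# Assumed timestamp format: [dd/Mon/yyyy:HH:MM:SS +zzzz]
TIMESTAMP = re.compile(r"\[([^\]]+)\]")


def utc_date(line):
    """Return the UTC date of a log line, or None if it cannot be parsed."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    try:
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    except ValueError:
        return None
    return stamp.astimezone(timezone.utc).date()


def complete_days(lines):
    """Group lines by UTC date and drop the earliest and latest date."""
    by_day = defaultdict(list)
    for line in lines:
        day = utc_date(line)
        if day is not None:
            by_day[day].append(line)
    days = sorted(by_day)
    # With fewer than three distinct days nothing is considered complete.
    return {day: by_day[day] for day in days[1:-1]}
```

 Because grouping only looks at the lines themselves, it does not matter
 how many input files a day is spread over or what dates their names carry.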

 This approach should work just fine for bulk processing. And for running
 several times per day we could keep a state file to avoid re-processing
 input files that we already processed before, by storing file name,
 last-modified time, and last contained UTC date. The worst case, if we
 lose that state file, is that we'll read everything in `in/webstats/` once
 again.
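
 Such a state file could be sketched as follows. The JSON format, the field
 names `mtime` and `last_utc_date`, and all function names are assumptions
 made for illustration, not an agreed design.

```python
import json
import os


def load_state(state_path):
    """Load the state file; a missing or broken file means an empty state."""
    try:
        with open(state_path, encoding="utf-8") as handle:
            return json.load(handle)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def needs_processing(state, log_path):
    """True if the file is unknown or was modified since it was recorded."""
    entry = state.get(log_path)
    return entry is None or entry["mtime"] != os.path.getmtime(log_path)


def record(state, log_path, last_utc_date):
    """Remember last-modified time and last contained UTC date of a file."""
    state[log_path] = {
        "mtime": os.path.getmtime(log_path),
        "last_utc_date": last_utc_date,
    }


def save_state(state, state_path):
    """Write the state back to disk as JSON."""
    with open(state_path, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2, sort_keys=True)
```

 Losing the state file simply makes `load_state` return an empty state, so
 every input file is read once more and nothing else breaks.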

 Do you see any conceptual weaknesses in ignoring the date in input files?
 Do you want to give this implementation a try? Otherwise I'd be willing to
 write some proof-of-concept code to see whether this can work.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:50>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

