[metrics-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Oct 24 12:15:55 UTC 2017


#23243: Write a specification for Tor web server logs
-----------------------------+--------------------------------
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+--------------------------------

Comment (by iwakeh):

 Replying to [comment:46 karsten]:
 > I'm not sure if we can resolve these questions by hard thinking.

 Well, we need to work on thoughtful decision making.  There are not that
 many questions above except yours:
 > ... what happens if a web server rotates logs more often than once per
 day? At least that's something that we write in the specification. I'm not
 sure how this would work with file names, so maybe we in fact require that
 logs are rotated exactly once per day, and we just didn't write that in
 the specification yet. However, it seems rather restrictive to prescribe
 exact log rotation intervals in order to sanitize logs subsequently. Maybe
 we should be less restrictive here.

 The current webstat code and the spec require one log per day.  So, if
 someone decides to rotate logs more often than that, the spec and code
 will have to be adapted.  Thus, it seems this question refers to a
 hypothetical change (afaik).  In comment:45 I point out that this is a
 small implementation issue, because rotated logs usually add a number or
 a time or both to the log file name.  Either way is easily adapted.

 >  - Would it help to know the log and log rotation configuration used on
 the various Tor web servers?

 Unless you have reason to think that current logging procedures are going
 to change, or have already changed, from the one-log-per-day scheme, this
 is not necessary.

 >  - Would it help to have access to the current host that sanitizes web
 server logs?

 I think there are no questions regarding the current process.

 >  - Does the existing code for sanitizing web server logs contain any
 more hints on the input data?

 We put all the information from the current code into the spec and the
 current implementation suggestion.  The old code also uses a 'cue' (as
 mentioned in comment:45):  `sanitize.py` returns the sanitized log file
 name for the day before the processed log file, which is the cue for the
 calling shell script that this file is now complete and can be published.
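
 To illustrate the cue described above, here is a minimal sketch.  It is
 not the actual `sanitize.py` code; the function name
 `completed_log_name`, the site name, and the file name pattern are
 hypothetical and only show the "day before the processed log" rule.

```python
from datetime import date, timedelta


def completed_log_name(site: str, processed_day: date) -> str:
    """Return the sanitized log file name that counts as complete.

    Assumption: sanitized logs are named '<site>-access.log-YYYYMMDD'.
    The real naming scheme may differ.
    """
    # Processing the log of `processed_day` is the cue that the
    # sanitized log of the previous day is complete.
    previous_day = processed_day - timedelta(days=1)
    return f"{site}-access.log-{previous_day:%Y%m%d}"


if __name__ == "__main__":
    # A calling script could publish the file named here.
    print(completed_log_name("example.torproject.org", date(2017, 10, 24)))
```

 If no log for the following day is ever processed, this function is
 never called for that day, which is why a missing cue means an
 unpublished input log day.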



 Both the old and the suggested new version of webstats need an outside
 cue; without this cue, an input log day would not be published.

 Now, focussing on the new implementation of the webstats module for
 CollecTor, there are several ways of preventing log file loss:

 1. Make sure by outside means that there is no day without a log (e.g. by
 providing an empty file for that day using 'touch').  This would work
 without additional implementation for CollecTor, and it works for bulk
 imports as well as daily processing.  As a result, there will be a
 sanitized log for each day offered by CollecTor; some might be empty.
 2. For bulk processing, a property could signal CollecTor to use all logs
 without insisting on an uninterrupted chain.  This still requires outside
 measures for making sure no log lines are lost and might result in days
 without any logs, unless CollecTor creates empty ones.
 3. Devise a mechanism that enables more automated processing of an
 interrupted chain of logs.  This seems error-prone and will result in many
 edge cases.
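
 Option 1 can be done entirely outside CollecTor.  The following is a
 rough sketch of the 'touch' approach, assuming input logs live in one
 directory and are named `access.log-YYYYMMDD`; the function name
 `touch_missing_days` and the file name pattern are hypothetical.

```python
from datetime import date, timedelta
from pathlib import Path


def touch_missing_days(log_dir: Path, first: date, last: date,
                       pattern: str = "access.log-%Y%m%d") -> list:
    """Create an empty log file for every day in [first, last] without one.

    Returns the names of the files that were created.  The file name
    pattern is an assumption, not the real server configuration.
    """
    created = []
    day = first
    while day <= last:
        path = log_dir / day.strftime(pattern)
        if not path.exists():
            path.touch()  # same effect as the 'touch' command
            created.append(path.name)
        day += timedelta(days=1)
    return created
```

 Existing logs are left untouched, so running this before each import
 would be harmless when the chain is already complete.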

 I think 1. is the easiest in terms of operation, i.e., providing input
 logs, and implementation (it's there already).  In addition, the
 uninterrupted chain of (possibly empty) sanitized logs is also easy to
 verify and understand.  An empty file could result from no log line being
 valid or no log being available for that day.
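
 As an illustration of how simple such a verification could be, here is a
 small sketch that lists the days missing from a set of log file names.
 The name `missing_days` and the `access.log-YYYYMMDD` pattern are
 assumptions made for this example only.

```python
from datetime import date, datetime, timedelta


def missing_days(names, pattern="access.log-%Y%m%d"):
    """Return the days between the first and last log that have no file."""
    # Parse the day out of each file name (the pattern is an assumption).
    days = sorted(datetime.strptime(n, pattern).date() for n in names)
    if not days:
        return []
    present = set(days)
    gaps = []
    day = days[0]
    while day <= days[-1]:
        if day not in present:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps
```

 An empty result means the chain is uninterrupted.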

 So, in order to move forward, one of the above methods needs to be chosen
 (or a new one made up).
 The other question about smaller log rotation intervals is only relevant
 if that is put into practice.  If so, it should be a straightforward task
 to adapt the code.

 Hope this makes some sense. Is there anything else missing here?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:47>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

