[tor-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Nov 2 14:04:35 UTC 2017

#23243: Write a specification for Tor web server logs
 Reporter:  iwakeh           |          Owner:  metrics-team
     Type:  enhancement      |         Status:  needs_revision
 Priority:  Medium           |      Milestone:
Component:  Metrics/Website  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:  metrics-2017     |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:

Comment (by karsten):

 Replying to [comment:53 iwakeh]:
 > Replying to [comment:52 karsten]:
 > > We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
 anymore! We would simply leave them unchanged, containing just the
 requests we processed earlier.
 > How to find out that 2017-11-04 is already there:  only the lookup in
 'out' and 'recent' could tell.

 Correct, plus maybe a 'state' file with previously sanitized logs
 (implementation detail). But how else would we find out that we already
 sanitized and published a log file? That's something only we can know.

 > > But is this a bug we should be able to handle? It seems like a bug in
 the log-copying script combined with bad timing. During normal operation
 and in the bulk-import case this should not happen.
 > During a bulk import it might be harder to guarantee the correct order.
 Hmm, but that should be manageable somehow ...

 Oh! Now I understand what you mean by bulk import. Like all log files for
 August, then all log files for September, etc.? Yes, that would be
 problematic. I'd say that one would have to supply some days from the
 previous and the next interval and discard any output files for days
 outside of the currently processed interval. That could work. But it's
 error-prone. And it's not exactly our use case, because we only have logs
 from the past 2 weeks.

 But regardless of bulk import or not, I think we'll have to parse input
 log files twice; once to find out which dates are contained and another
 time to sanitize them. But, implementation detail.
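 A minimal sketch of the two passes, in Python for illustration only. It
 assumes Apache-style timestamps such as [02/Nov/2017:14:04:35 +0000]; the
 function names are hypothetical and not part of the spec, and the actual
 sanitizing step is left out.

```python
import re
from datetime import date, datetime

# Assumption: each log line carries an Apache-style timestamp,
# e.g. [02/Nov/2017:14:04:35 +0000]. Only the date part is captured.
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")


def line_date(line):
    """Return the date of a log line, or None if no timestamp is found."""
    match = TIMESTAMP.search(line)
    if match is None:
        return None
    # %b relies on English month names (the default C locale).
    return datetime.strptime(match.group(1), "%d/%b/%Y").date()


def contained_dates(lines):
    """First pass: collect the set of dates present in the input."""
    return {d for d in map(line_date, lines) if d is not None}


def lines_by_date(lines, wanted):
    """Second pass: group lines by date, keeping only the wanted dates."""
    grouped = {}
    for line in lines:
        d = line_date(line)
        if d in wanted:
            grouped.setdefault(d, []).append(line)
    return grouped
```

 Because the input is read twice, 'lines' would have to be a list or a
 file that can be reopened, not a one-shot stream.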

 > > Note that if you think that cutting off the first and last days is not
 enough, we could easily change that to cutting off the first and last two
 days. Or the first and the last two. Or first and last three. Whatever we
 think works best.
 > That cut-off time could be kept variable and be adjusted later, true.

 Good idea.

 > Summary:
 > * Only hostnames are inferred from the logs' names and paths.
 > * The 'reference date' used in the current spec is dropped.
 > * Only the log line dates covered in one run become the reference
 interval, of which a certain amount at beginning and end is not processed
 (aka: cut-off time).
 > * Sanitized files for dates that are already available in 'out' are
 //not// overwritten or amended, and the corresponding log lines are ignored.
 > * Gaps in import logs cannot be filled in later.
 > * File provision for (bulk) imports has to ensure proper order.
 > * Use a placeholder for sanitized log file names (starting with
 underscore, but easily changeable).
 > Does this seem solid?
 > Shall I amend the spec with these changes?

 It seems solid! Yes, please! Thanks!
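
 To make the summarized rules concrete, here is a rough sketch in Python
 (illustration only). The function name, the 'published' set standing in
 for a lookup in 'out' and 'recent', and the default cut-off of one day
 are all assumptions, not part of the spec.

```python
from datetime import date, timedelta


def dates_to_publish(seen, published, cutoff_days=1):
    """Pick the dates for which sanitized files should be written.

    seen        -- dates found in the log lines of the current run
    published   -- dates that already have a sanitized file (hypothetical
                   stand-in for looking into 'out' and 'recent')
    cutoff_days -- days dropped at each end of the reference interval
    """
    if not seen:
        return []
    # The reference interval is spanned by the dates covered in this run;
    # the cut-off removes possibly incomplete days at both ends.
    first = min(seen) + timedelta(days=cutoff_days)
    last = max(seen) - timedelta(days=cutoff_days)
    # Dates that were already published are never overwritten or amended.
    return sorted(
        d for d in seen if first <= d <= last and d not in published
    )
```

 For example, a run covering 2017-11-01 through 2017-11-05 with 2017-11-03
 already published would write files only for 2017-11-02 and 2017-11-04.
 Keeping cutoff_days a parameter matches the idea of adjusting the cut-off
 later.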

Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:54>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
