[tor-bugs] #23243 [Metrics/Website]: Write a specification for Tor web server logs
Tor Bug Tracker & Wiki
blackhole at torproject.org
Thu Nov 2 14:04:35 UTC 2017
#23243: Write a specification for Tor web server logs
Reporter: iwakeh | Owner: metrics-team
Type: enhancement | Status: needs_revision
Priority: Medium | Milestone:
Component: Metrics/Website | Version:
Severity: Normal | Resolution:
Keywords: metrics-2017 | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
Comment (by karsten):
Replying to [comment:53 iwakeh]:
> Replying to [comment:52 karsten]:
> > We ''couldn't'' update the output files for 2017-11-03 and 2017-11-04
anymore! We would simply leave them unchanged, containing just the
requests we processed earlier.
> How to find out that 2017-11-04 is already there: only the lookup in
'out' and 'recent' could tell.
Correct, plus maybe a 'state' file with previously sanitized logs
(implementation detail). But how else would we find out that we already
sanitized and published a log file? That's something only we can know.
> > But is this a bug we should be able to handle? It seems like a bug in
the log-copying script combined with bad timing. During normal operation
and in the bulk-import case this should not happen.
> During a bulk import it might be harder to guarantee the correct order.
Hmm, but that should be manageable somehow ...
Oh! Now I understand what you mean by bulk import. Like all log files for
August, then all log files for September, etc.? Yes, that would be
problematic. I'd say that one would have to supply some days from the
previous and the next interval and discard any output files for days
outside of the currently processed interval. That could work. But it's
error-prone. And it's not exactly our use case, because we only have logs
from the past 2 weeks.
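A minimal sketch of that discard step for interval-wise bulk imports. The function name and the mapping from day to sanitized lines are assumptions for illustration, not part of the spec:

```python
from datetime import date


def discard_outside_interval(output_by_day, first_day, last_day):
    """Keep output only for days inside [first_day, last_day].

    output_by_day maps a date to that day's sanitized lines. Days supplied
    from the previous and next interval help complete the edge days of the
    current interval, but their own output is dropped here.
    """
    return {day: lines for day, lines in output_by_day.items()
            if first_day <= day <= last_day}


# Hypothetical September import, with one extra day supplied on each side.
output = {
    date(2017, 8, 31): ["a"],
    date(2017, 9, 1): ["b"],
    date(2017, 9, 30): ["c"],
    date(2017, 10, 1): ["d"],
}
september = discard_outside_interval(output, date(2017, 9, 1), date(2017, 9, 30))
```

Here `september` retains only the two days inside the interval; the entries for 2017-08-31 and 2017-10-01 are discarded.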
But regardless of bulk import or not, I think we'll have to parse input
log files twice; once to find out which dates are contained and another
time to sanitize them. But, implementation detail.
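Roughly, the two passes could look like the sketch below. It assumes Apache-style timestamps such as `[03/Nov/2017:00:00:00 +0000]` in each line; all function names are hypothetical, and the actual sanitizing step is left as an identity placeholder:

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}
TIMESTAMP = re.compile(r"\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):")


def line_date(line):
    """Return the date of a log line, or None if it cannot be parsed."""
    match = TIMESTAMP.search(line)
    if match is None or match.group(2) not in MONTHS:
        return None
    try:
        return date(int(match.group(3)), MONTHS[match.group(2)],
                    int(match.group(1)))
    except ValueError:  # e.g. day 00 or 32
        return None


def contained_dates(lines):
    """First pass: find out which dates are contained in the input."""
    return sorted({d for d in map(line_date, lines) if d is not None})


def sanitize_by_date(lines, wanted, sanitize_line=lambda line: line):
    """Second pass: sanitize only lines whose date is in `wanted`."""
    output = {}
    for line in lines:
        day = line_date(line)
        if day in wanted:
            output.setdefault(day, []).append(sanitize_line(line))
    return output
```

The first pass only collects dates, so the decision about which days to write can be made before any line is sanitized.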
> > Note that if you think that cutting off the first and last days is not
enough, we could easily change that to cutting off the first and last two
days. Or the first and the last two. Or first and last three. Whatever we
think works best.
> That cut-off time could be kept variable and be adjusted later, true.
> * Only hostnames are inferred from the logs' names and paths.
> * The 'reference date' used in the current spec is dropped.
> * Only the log line dates covered in one run become the reference
interval, of which a certain amount at beginning and end is not processed
(aka: cut-off time).
> * Sanitized files for dates that are already available in 'out' are
//not// overwritten or amended, and the corresponding log lines are ignored.
> * Gaps in import logs cannot be filled in later.
> * File provision for (bulk) imports has to ensure proper order.
> * Use a placeholder for sanitized log file names (starting with
underscore, but easily changeable).
> Does this seem solid?
> Shall I amend the spec with these changes?
It seems solid! Yes, please! Thanks!
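As a rough sketch of how the cut-off and the "never overwrite" rule from the list above could combine: the function name is hypothetical, and it assumes the cut-off counts distinct dates found in the current run rather than calendar days.

```python
from datetime import date


def dates_to_publish(contained, published, cutoff=1):
    """Return the dates for which new sanitized files may be written.

    contained: dates found in the log lines of the current run
    published: dates that already have a sanitized file in 'out'
    cutoff:    number of dates dropped at each end of the interval
    """
    days = sorted(set(contained))
    inner = days[cutoff:len(days) - cutoff] if cutoff > 0 else days
    # Already published dates are neither overwritten nor amended.
    return [day for day in inner if day not in published]


run = [date(2017, 11, day) for day in range(1, 6)]
todo = dates_to_publish(run, published={date(2017, 11, 3)}, cutoff=1)
```

With logs covering 2017-11-01 through 2017-11-05, a cut-off of one date, and 2017-11-03 already in 'out', only 2017-11-02 and 2017-11-04 remain to be written.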
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:54>