[tor-bugs] #22428 [Metrics/CollecTor]: add webstats module to collector
Tor Bug Tracker & Wiki
blackhole at torproject.org
Wed Sep 20 15:19:08 UTC 2017
#22428: add webstats module to collector
Reporter: iwakeh | Owner: iwakeh
Type: enhancement | Status: needs_information
Priority: High | Milestone: CollecTor 1.4.0
Component: Metrics/CollecTor | Version:
Severity: Normal | Resolution:
Keywords: | Actual Points:
Parent ID: | Points:
Reviewer: | Sponsor:
Comment (by karsten):
Replying to [comment:26 iwakeh]:
> Replying to [comment:25 karsten]:
> > I'm not too deep into this topic right now, so handle the following
comment with care.
> Will do :-) but this is really asked from a meta perspective and the
main difference to all other descriptors is that logfiles should only be
published after their completion (i.e., two days after their dates).
Well, discussing these things, even from a meta perspective, requires
(sometimes: deep) thinking. Just saying, keep in mind that you're much,
much deeper into this topic than I am at this time.
> > I wonder if we can avoid having that directory for temporary log files
that cannot be published yet. It seems like a possible source for trouble
when processing breaks at some point and we need to fix that, with half of
a log file being written to the temporary directory and the rest still
being in the import directory.
> I would want to have a separation here, b/c log files from the import
directory are not sanitized and should not be published. On the other
hand, log descriptors in the temporary location are possibly not yet
complete and already published log descriptors should not be altered
(e.g., by appending and resorting them).
I think I agree on all statements above.
> If anything breaks, the incompletely written files in the working dir
ought to be removed.
Wait, is that correct? What if the sanitizing process aborts at a random
point, possibly due to the host losing power in the middle of a run? Would
the operator simply delete everything from the working directory in that
case?
I'm pointing this out because I was in that situation with the old
webstats code. The code seemed plausible while writing it. But that
changed a few months later when something broke and I had to figure out
how to recover. I'd like to avoid such situations with this new code.
> I would also want to avoid the stats file.
Yes, if we can avoid it, let's avoid it. Maybe we can use one just to
avoid re-processing files unnecessarily. So, if it's empty or gone,
nothing bad happens.
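To illustrate the "empty or gone, nothing bad happens" property, here is a
minimal sketch in Python (not CollecTor's Java). The function names and
the idea of storing one input file name per line are assumptions made for
illustration only, not an existing CollecTor format.

```python
from pathlib import Path


def load_processed(state_file: Path) -> set[str]:
    """Return the names of input files that were already fully processed.

    A missing or empty state file simply yields an empty set, so the worst
    case is that files from in/ are processed again unnecessarily.
    """
    try:
        text = state_file.read_text(encoding="utf-8")
    except FileNotFoundError:
        return set()
    return {line.strip() for line in text.splitlines() if line.strip()}


def save_processed(state_file: Path, names: set[str]) -> None:
    """Write one file name per line (hypothetical format)."""
    state_file.parent.mkdir(parents=True, exist_ok=True)
    content = "".join(f"{name}\n" for name in sorted(names))
    state_file.write_text(content, encoding="utf-8")
```

Because the state is only an optimization here, an operator could delete
the file during a repair without losing anything but processing time.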
> > Maybe we can simplify that by keeping a text file in `stats/` where we
keep some state which files we already read or wrote. And we only write a
file to `out/` and `recent/` when it's ready for publication. Not sure if
this will solve all cases, but it seems potentially easier to understand
for future operators of this service (including ourselves when we don't
remember these design discussions anymore).
> The explanation could be more elaborate and maybe the property renamed
to WebstatsSanitizingPath or some better name?
> The stats file option could be misleading. For example, if another local
re-import leads to overwriting a sanitized not yet published log.
Not sure I understand what you mean here.
> A temporary sanitizing-working directory makes clear that only CollecTor
touches files in there
... except when the operator needs to repair something and has to touch
these files, too.
> and stuff from 'in' could be removed after a processing round.
No, we shouldn't remove anything from `in/`. We didn't put files there, so
we shouldn't remove them, either.
> That ought to be easier for operation: "don't touch the sanitizing-
working directory" and treat the input directory as with other modules?
A few thoughts on how this might work using a stats file:
- We read files from `in/`.
- We write fully sanitized files to `out/` and `recent/`, but only if
we're certain that we won't have more data later on that would require us
to update files there, because we wouldn't do that.
- If there's a file in `in/` that contains lines that we couldn't put
into a file in `out/` and `recent/`, we will simply process that
`in/`-file again next time.
- We might want to use a file in `stats/` to remember which files in
`in/` are already completely processed, so that we can skip them.
 - We never delete anything from `in/`; we leave that to the script that
 also places files there.
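The steps above could be sketched roughly as follows, in Python rather
than CollecTor's Java. Everything here is hypothetical: the name
`plan_round`, the in-memory representation of `in/` as a dict, and the
two-day delay as the test for "certain that we won't have more data".
Sanitizing itself is left out.

```python
from collections import defaultdict
from datetime import date, timedelta


def plan_round(inputs, processed, published, today, delay_days=2):
    """Decide what one processing round would publish.

    inputs:    dict mapping in/ file name -> list of (log_date, line)
    processed: set of in/ file names already completely processed
    published: set of log dates that already have a file in out/
    Returns (lines to publish per date, updated set of processed names).
    Nothing is ever removed from inputs, mirroring "never delete from in/".
    """
    cutoff = today - timedelta(days=delay_days)
    to_publish = defaultdict(list)
    done = set(processed)
    for name, records in sorted(inputs.items()):
        if name in processed:
            continue  # skip files remembered in stats/
        complete = True
        for log_date, line in records:
            if log_date in published:
                continue  # published files are never updated
            if log_date <= cutoff:
                to_publish[log_date].append(line)
            else:
                complete = False  # too recent, retry this file next round
        if complete:
            done.add(name)
    return dict(to_publish), done
```

Note that in this sketch lines arriving after their date was published are
dropped, which is one possible reading of "we wouldn't do that".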
Please note that I'm not sure yet whether this will work. It just seems
like something that is relatively easy to operate, in particular when
> > Regarding `WebstatsReferenceDate`, it would be good to explain in the
comments when this value needs to be changed, and to what value. The
comment alone should be sufficient to know how to use the property,
without further looking at the code.
> I was thinking that it might be useful to be able to have partial
imports of older logs, hmm. This might be trickier than just documenting
the property. Example: adding all July 2017 logs to `in/` and setting the
reference date to 20170801 means that only logs up to (and including)
20170730 are published in that round.
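To make the July 2017 example concrete, a minimal Python sketch of the
date arithmetic follows. The function name and the `delay_days` parameter
are assumptions for illustration; only the two-day delay and the
`YYYYMMDD` notation come from this discussion.

```python
from datetime import date, datetime, timedelta


def parse_reference(value: str) -> date:
    """Parse a reference date written as YYYYMMDD, e.g. "20170801"."""
    return datetime.strptime(value, "%Y%m%d").date()


def publishable_dates(log_dates, reference: date, delay_days: int = 2):
    """Return the sorted log dates old enough to publish.

    A log is considered complete delay_days after its date, so the latest
    publishable date is the reference date minus delay_days.
    """
    cutoff = reference - timedelta(days=delay_days)
    return sorted(d for d in set(log_dates) if d <= cutoff)


july = [date(2017, 7, day) for day in range(1, 32)]
ready = publishable_dates(july, parse_reference("20170801"))
# ready ends with 2017-07-30; the log for 2017-07-31 waits for a later round.
```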
> Let's extend this discussion:
> What other operational scenarios will the webstats module have to be
prepared to deal with, and do these have to be available with the initial
release of webstats?
I think the following scenarios are most common:
 - Initialize by processing log files from the past 2 weeks up to now.
 Similarly, re-process in case of a change to the sanitizing steps.
- Do a periodic run every few hours.
Note that there are no archives of non-sanitized logs that would reach
back more than 2 weeks. That's different with bridge descriptor tarballs.
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:28>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online