[tor-bugs] #22428 [Metrics/CollecTor]: add webstats module to collector

Wed Sep 20 15:19:08 UTC 2017

#22428: add webstats module to collector
-------------------------------+-----------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_information
 Priority:  High               |      Milestone:  CollecTor 1.4.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+-----------------------------------

Comment (by karsten):

 Replying to [comment:26 iwakeh]:
 > Replying to [comment:25 karsten]:
 > > I'm not too deep into this topic right now, so handle the following
 comment with care.
 >
 > Will do :-) but this is really asked from a meta perspective and the
 main difference to all other descriptors is that logfiles should only be
 published after their completion (i.e., two days after their dates).

 Well, discussing these things, even from a meta perspective, requires
 (sometimes: deep) thinking. Just saying, keep in mind that you're much,
 much deeper into this topic than I am at this time.

 > > I wonder if we can avoid having that directory for temporary log files
 that cannot be published yet. It seems like a possible source for trouble
 when processing breaks at some point and we need to fix that, with half of
 a log file being written to the temporary directory and the rest still
 being in the import directory.
 >
 > I would want to have a separation here, b/c log files from the import
 directory are not sanitized and should not be published.  On the other
 hand, log descriptors in the temporary location are possibly not yet
 complete and already published log descriptors should not be altered
 (e.g., by appending and resorting them).

 I think I agree on all statements above.

 > If anything breaks, the incompletely written files in the working dir
 ought to be removed.

 Wait, is that correct? What if the sanitizing process aborts at a random
 point, possibly due to the host losing power in the middle of a run? Would
 the operator simply delete everything from the working directory in that
 case?

 I'm pointing this out because I was in exactly this situation with the
 old webstats code. The code seemed plausible while writing it. But that
 changed a few months later when something broke and I had to figure out
 how to recover. I'd like to avoid such situations with this new code.

 > I would also want to avoid the stats file.

 Yes, if we can avoid it, let's avoid it. Maybe we can use one just to
 avoid re-processing files unnecessarily. So, if it's empty or gone,
 nothing bad happens.

 > > Maybe we can simplify that by keeping a text file in `stats/` where we
 keep some state which files we already read or wrote. And we only write a
 file to `out/` and `recent/` when it's ready for publication. Not sure if
 this will solve all cases, but it seems potentially easier to understand
 for future operators of this service (including ourselves when we don't
 remember these design discussions anymore).
 >
 > The explanation could be more elaborate and maybe the property renamed
 to WebstatsSanitizingPath or some better name?
 > The stats file option could be misleading. For example, if another local
 re-import leads to overwriting a sanitized not yet published log.

 Not sure I understand what you mean here.

 > A temporary sanitizing-working directory makes clear that only CollecTor
 touches files in there

 ... except when the operator needs to repair something and has to touch
 these files, too.

 > and stuff from 'in' could be removed after a processing round.

 No, we shouldn't remove anything from `in/`. We didn't put files there, so
 we shouldn't remove them, either.

 > That ought to be easier for operation: "don't touch the sanitizing-
 working directory" and treat the input directory as with other modules?

 A few thoughts on how this might work using a stats file:
  - We read files from `in/`.
  - We write fully sanitized files to `out/` and `recent/`, but only if
 we're certain that no data will arrive later that would require us to
 update files there, because we never update published files.
  - If there's a file in `in/` that contains lines that we couldn't put
 into a file in `out/` and `recent/`, we will simply process that
 `in/`-file again next time.
  - We might want to use a file in `stats/` to remember which files in
 `in/` are already completely processed, so that we can skip them.
  - We never delete anything from `in/`; we leave that to the script that
 also places files in there.

 Please note that I'm not sure yet whether this will work. It just seems
 like something that is relatively easy to operate, in particular when
 something breaks.
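
 To make the steps above more concrete, here is a rough Python sketch. It
 is not CollecTor code, and everything in it is an assumption made for
 illustration: the state file name `stats/processed-files`, the function
 names, an input format where each line starts with its date as
 `YYYYMMDD`, and the rule that a day is complete two days after its date
 (taken from the reference date example quoted in this thread). Sanitizing
 itself is left out, and writing via a temporary file plus rename is my own
 addition to avoid half-written files.

```python
import os
from collections import defaultdict
from datetime import date, datetime, timedelta
from pathlib import Path

# Assumption from the thread: a log for day D is complete two days after D.
COMPLETION_DELAY = timedelta(days=2)


def is_publishable(log_date, reference_date):
    """Return True if no more lines are expected for log_date."""
    return log_date <= reference_date - COMPLETION_DELAY


def load_state(state_file):
    """Read names of fully processed input files; a missing file is fine."""
    if not state_file.exists():
        return set()
    return set(state_file.read_text().split())


def write_atomically(path, text):
    """Write to a temporary file first, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text)
    os.replace(tmp, path)


def run_once(base, reference_date):
    """Process in/ once and return the names of newly published files."""
    state_file = base / "stats" / "processed-files"
    done = load_state(state_file)
    lines_by_date = defaultdict(list)
    dates_by_file = defaultdict(set)
    for path in sorted((base / "in").iterdir()):
        if path.name in done:
            continue  # skip input files that were completely processed
        for line in path.read_text().splitlines():
            if not line.strip():
                continue
            log_date = datetime.strptime(line[:8], "%Y%m%d").date()
            lines_by_date[log_date].append(line)  # sanitizing would go here
            dates_by_file[path.name].add(log_date)
    published = []
    for log_date, lines in sorted(lines_by_date.items()):
        name = log_date.strftime("%Y%m%d") + ".log"
        if not is_publishable(log_date, reference_date):
            continue  # more data may arrive; try again next time
        if (base / "out" / name).exists():
            continue  # never alter an already published file
        text = "\n".join(sorted(lines)) + "\n"
        # out/ is written last, so a file in out/ implies one in recent/.
        write_atomically(base / "recent" / name, text)
        write_atomically(base / "out" / name, text)
        published.append(name)
    # Only remember input files whose lines have all become publishable.
    for name, dates in dates_by_file.items():
        if all(is_publishable(d, reference_date) for d in dates):
            done.add(name)
    write_atomically(state_file, "\n".join(sorted(done)) + "\n")
    return published
```

 Calling `run_once(Path("/srv/webstats"), date(2017, 8, 1))` (the path is
 made up) would publish logs up to and including 20170730. In this sketch
 the state file is only an optimization: if it is deleted, the next run
 reads all of `in/` again and skips every date that already has a file in
 `out/`. Nothing is ever deleted from `in/`.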

 > > Regarding `WebstatsReferenceDate`, it would be good to explain in the
 comments when this value needs to be changed, and to what value. The
 comment alone should be sufficient to know how to use the property,
 without further looking at the code.
 >
 > I was thinking that it might be useful to be able to have partial
 imports of older logs, hmm.  This might be trickier than just documenting
 the property.  Example: add all July 2017 logs to in and set reference
 date to 20170801 means that only logs up to (incl.) 20170730 are published
 in that round.
 >
 > Let's extend this discussion:
 > What other operation scenarios the webstats module will have to be
 prepared to deal with and do these have to be available with the initial
 release of webstats?

 I think the following scenarios are most common:
  - Initialize by processing log files from the past 2 weeks up to now.
 Similarly, re-process them after a change to the sanitizing steps.
  - Do a periodic run every few hours.

 Note that there are no archives of non-sanitized logs that would reach
 back more than 2 weeks. That's different with bridge descriptor tarballs.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:28>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online