[tor-bugs] #22428 [Metrics/CollecTor]: add webstats module to collector

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Sep 20 10:45:30 UTC 2017


#22428: add webstats module to collector
-------------------------------+-----------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_information
 Priority:  Medium             |      Milestone:  CollecTor 1.4.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+-----------------------------------

Comment (by iwakeh):

 Replying to [comment:25 karsten]:
 > I'm not too deep into this topic right now, so handle the following
 comment with care.

 Will do :-) but this is really asked from a meta perspective.  The main
 difference from all other descriptors is that log files should only be
 published after their completion (i.e., two days after their dates).

 >
 > I wonder if we can avoid having that directory for temporary log files
 that cannot be published yet. It seems like a possible source for trouble
 when processing breaks at some point and we need to fix that, with half of
 a log file being written to the temporary directory and the rest still
 being in the import directory.

 I would want to keep a separation here, because log files in the import
 directory are not sanitized and should not be published.  On the other
 hand, log descriptors in the temporary location are possibly not yet
 complete, and already published log descriptors should not be altered
 (e.g., by appending to and re-sorting them).  If anything breaks, the
 incompletely written files in the working dir ought to be removed.
 I would also want to avoid the stats file.
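
 To illustrate the cleanup rule above, here is a minimal sketch, not
 CollecTor code.  The class name, the method names and the ".part" suffix
 are all made up for illustration.  It assumes that a file in the working
 dir only gets its final name once it is completely written, so anything
 still carrying the suffix after a crash can simply be deleted.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; none of these names exist in CollecTor. */
class WorkingDirSketch {

  /** Assumed marker for files that are still being written. */
  static final String PARTIAL_SUFFIX = ".part";

  /** Writes all lines under a partial name, then renames to the final name. */
  static Path writeCompletely(Path workingDir, String name,
      List<String> lines) throws IOException {
    Files.createDirectories(workingDir);
    Path partial = workingDir.resolve(name + PARTIAL_SUFFIX);
    Files.write(partial, lines, StandardCharsets.UTF_8);
    // The final name only appears after the content is fully on disk.
    return Files.move(partial, workingDir.resolve(name),
        StandardCopyOption.REPLACE_EXISTING);
  }

  /** Deletes leftovers of an interrupted run; returns how many were removed. */
  static int removePartialFiles(Path workingDir) throws IOException {
    int removed = 0;
    try (DirectoryStream<Path> leftovers =
        Files.newDirectoryStream(workingDir, "*" + PARTIAL_SUFFIX)) {
      for (Path leftover : leftovers) {
        Files.delete(leftover);
        removed++;
      }
    }
    return removed;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("sanitizing-working");
    writeCompletely(dir, "example.log", Arrays.asList("line 1", "line 2"));
    System.out.println("removed: " + removePartialFiles(dir));
  }
}
```

 With something like this, a processing round could start by calling
 removePartialFiles on the working dir and never needs to look into a
 separate stats file to find out what was left half-written.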

 >
 > Maybe we can simplify that by keeping a text file in `stats/` where we
 keep some state which files we already read or wrote. And we only write a
 file to `out/` and `recent/` when it's ready for publication. Not sure if
 this will solve all cases, but it seems potentially easier to understand
 for future operators of this service (including ourselves when we don't
 remember these design discussions anymore).

 The explanation could be more elaborate, and maybe the property could be
 renamed to WebstatsSanitizingPath or some better name?
 The stats file option could be misleading, for example, if another local
 re-import leads to overwriting a sanitized but not yet published log.  A
 temporary sanitizing-working directory makes clear that only CollecTor
 touches files in there, and stuff from 'in' could be removed after a
 processing round.  That ought to be easier for operation: "don't touch the
 sanitizing-working directory" and treat the input directory as with other
 modules?

 >
 > Regarding `WebstatsReferenceDate`, it would be good to explain in the
 comments when this value needs to be changed, and to what value. The
 comment alone should be sufficient to know how to use the property,
 without further looking at the code.

 I was thinking that it might be useful to be able to have partial imports
 of older logs, hmm.  This might be trickier than just documenting the
 property.  Example: adding all July 2017 logs to 'in' and setting the
 reference date to 20170801 means that only logs up to (and including)
 20170730 are published in that round.
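
 The arithmetic of that example can be sketched as follows.  This is only
 an illustration under the assumption stated earlier that a log counts as
 complete two days after its date.  The class and method names are
 hypothetical, and the yyyyMMdd string format is assumed from the example
 value, not taken from the actual WebstatsReferenceDate handling.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Hypothetical sketch; not the module's actual implementation. */
class WebstatsCutoffSketch {

  /** Assumption from the discussion: logs are complete after two days. */
  static final int COMPLETION_DAYS = 2;

  /** Parses and formats dates written like 20170801. */
  static final DateTimeFormatter FMT = DateTimeFormatter.BASIC_ISO_DATE;

  /** Latest log date that may be published for a given reference date. */
  static LocalDate lastPublishable(String referenceDate) {
    return LocalDate.parse(referenceDate, FMT).minusDays(COMPLETION_DAYS);
  }

  /** True if a log with this date is old enough to be published. */
  static boolean isPublishable(String logDate, String referenceDate) {
    LocalDate log = LocalDate.parse(logDate, FMT);
    return !log.isAfter(lastPublishable(referenceDate));
  }

  public static void main(String[] args) {
    // Reference date 20170801 gives a cutoff of 20170730.
    System.out.println(lastPublishable("20170801").format(FMT));
    System.out.println(isPublishable("20170730", "20170801")); // true
    System.out.println(isPublishable("20170731", "20170801")); // false
  }
}
```

 So with all July 2017 logs in 'in', the log for 20170731 would stay
 unpublished in that round and only go out once the reference date moves
 on to 20170802 or later.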

 Let's extend this discussion:
 What other operational scenarios will the webstats module have to be
 prepared to deal with, and do these have to be available with the initial
 release of webstats?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:26>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

