[tor-bugs] #22428 [Metrics/CollecTor]: Add webstats module

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Oct 19 14:39:27 UTC 2017


#22428: Add webstats module
-------------------------------+---------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_revision
 Priority:  High               |      Milestone:  CollecTor 1.5.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  metrics-2017       |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------------------
Changes (by karsten):

 * status:  needs_review => needs_revision


Comment:

 Alright, I finished an initial review of commit 086e904 in your
 task-22428-4 branch. I have several trivial or minor findings, but I'd
 like to postpone them until we have resolved one that I consider major:

 I'm unclear whether the sibling approach is robust enough to cover all
 cases and edge cases. Maybe even worse, I'm unclear whether we'd notice if
 we ran into an uncovered edge case, or whether we'd silently skip
 processing and therefore lose data.

 For example, what happens if we sanitize logs from a server that receives
 ''very'' few requests, maybe only a few requests per week? Consider these
 original log files (where I scrubbed the virtual host name):
  - `scrubbed.torproject.org-access.log-20171001.gz` contains requests from
 2017-09-30 and 2017-10-01.
  - `scrubbed.torproject.org-access.log-20171002.gz` contains requests from
 2017-10-01 only.
  - `scrubbed.torproject.org-access.log-20171004.gz` contains requests from
 2017-10-03 only.
  - `scrubbed.torproject.org-access.log-20171006.gz` contains requests from
 2017-10-05 and 2017-10-06.

 Would the existing code produce logs for 2017-10-01, -03, -05, and -06
 with exactly the sanitized log lines from these original log files? (I
 didn't run it; I only read the code and am unclear about this.)
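
 To make the question concrete, here's a minimal sketch of what I'd expect
 to hold regardless of file names: lines are assigned to a date by their own
 timestamp, not by the file they arrived in. This is not the code in the
 branch; the class and method names are made up, and it assumes
 Apache-style timestamps like `[01/Oct/2017:08:00:00 +0000]`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, only for illustration; not part of CollecTor. */
final class LinesByDate {

  // Assumption: each line carries an Apache-style timestamp in brackets.
  // Only the date part (dd/MMM/yyyy) is captured.
  private static final Pattern DATE =
      Pattern.compile("\\[(\\d{2}/\\w{3}/\\d{4}):");

  private static final DateTimeFormatter FORMAT =
      DateTimeFormatter.ofPattern("dd/MMM/yyyy", Locale.US);

  /**
   * Groups log lines by request date, ignoring which input file each
   * line came from. Fails loudly on lines without a timestamp rather
   * than dropping them silently.
   */
  static Map<LocalDate, List<String>> groupByDate(List<String> lines) {
    Map<LocalDate, List<String>> byDate = new TreeMap<>();
    for (String line : lines) {
      Matcher m = DATE.matcher(line);
      if (!m.find()) {
        throw new IllegalArgumentException(
            "No timestamp found in line: " + line);
      }
      LocalDate date = LocalDate.parse(m.group(1), FORMAT);
      byDate.computeIfAbsent(date, d -> new ArrayList<>()).add(line);
    }
    return byDate;
  }
}
```

 Feeding the lines of all four files above into something like
 `groupByDate` would yield one entry per request date, whether or not the
 file names have gaps between them.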

 Here's another, related question: what happens if a web server rotates
 logs more often than once per day? At least that's a case that we mention
 in the specification. I'm not sure how this would work with file names, so
 maybe we in fact require that logs are rotated exactly once per day, and
 we just didn't write that in the specification yet. However, it seems
 rather restrictive to prescribe exact log rotation intervals just to be
 able to sanitize logs afterwards. Maybe we should be less restrictive here.

 Is there a way to make this approach more robust? And is there a way to
 ensure that we'll learn about any broken assumptions as early as possible?
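
 One possible shape for such an early warning, as a rough sketch only: count
 lines per date when reading and again when writing, and report any date
 where fewer lines came out than went in. The names below are hypothetical,
 and the sketch assumes that `linesRead` only counts lines we intend to keep,
 so lines that are discarded on purpose during sanitizing would have to be
 excluded before the comparison.

```java
import java.time.LocalDate;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical consistency check, only for illustration. */
final class ProcessingAudit {

  /**
   * Returns all dates for which fewer lines were written than were read.
   * A non-empty result means input was lost somewhere along the way and
   * should at least produce a warning in the logs.
   *
   * Assumption: linesRead excludes lines that were dropped deliberately.
   */
  static SortedSet<LocalDate> datesWithLostLines(
      Map<LocalDate, Integer> linesRead,
      Map<LocalDate, Integer> linesWritten) {
    SortedSet<LocalDate> lost = new TreeSet<>();
    for (Map.Entry<LocalDate, Integer> entry : linesRead.entrySet()) {
      // A date that was never written counts as zero written lines.
      int written = linesWritten.getOrDefault(entry.getKey(), 0);
      if (written < entry.getValue()) {
        lost.add(entry.getKey());
      }
    }
    return lost;
  }
}
```

 The point is less the exact check than that a broken assumption shows up
 as a warning instead of as quietly missing data.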

 Ah, and do you mind doing another round of JavaDoc editing and variable
 renaming towards finding a middle ground between 2-characters-is-almost-
 verbose and 80-characters-can-fit-in-a-line-so-let-us-not-use-more-
 than-79? As a fixup/squash commit without rebasing, please. :) Thank you!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:36>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

