[metrics-bugs] #22428 [Metrics/CollecTor]: Add webstats module

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Jan 25 09:47:51 UTC 2018


#22428: Add webstats module
-------------------------------+---------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_revision
 Priority:  High               |      Milestone:  CollecTor 1.5.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  metrics-2018       |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------------------

Comment (by iwakeh):

 Replying to [comment:50 karsten]:
 > Alright, I made a few tweaks to your branch
 [https://gitweb.torproject.org/karsten/metrics-db.git/log/?h=task-22428-4
 in my task-22428-4 branch] with the latest commit being 1f1adec.

 Changes look fine, thanks!  I will do further work based on this branch.

 >
 > I also ran a first round of tests. Here's what I found:
 >  - The `recent/` directory contains only 1 log file per physical/virtual
 host from 3 days ago. The reason is that the last 2 days are excluded as
 too recent. But the `recent/` directory is supposed to contain the last 72
 hours of data produced by CollecTor, not data originally published within
 the last 72 hours. The goal is to give CollecTor clients 72 hours to
 obtain newly provided data before they need to fall back to archives. I
 think the other modules only look at last-modified time of provided files
 and delete them after 72 hours. Maybe we should do that here, too.

 This is related to comment:44, which quotes the relevant section of our
 spec (and still waits for comments).
 The spec allows only earliest logs from two days before the current date.
 Providing data for the last three days of collecting results would mean to
 extend the logic to allow data from up to five days ago.  Example for the
 '''intended''' functionality:
    Log files from (omitting the year) 1/11, 1/12, 1/13, 1/14, 1/15, 1/16,
 1/17 should result in having the logs of 1/13, 1/14, 1/15 in 'recent'.

 If yes I'll add a patch commit.

 >  - The regular expression in metrics-lib's `WebServerAccessLogLine` is
 too strict. It does not consider the following log line to be valid, but
 it should: `0.0.0.0 - - [22/Jan/2018:00:00:00 +0000] "GET
 /collector/archive HTTP/1.1" 301 -`. We should probably also compare the
 regular expression to the Apache specification for similar cases where
 we're being too strict.

 If such lines are contained and valid we should make sure it is also
 reflected in the spec.
 I will provide a patch.

 >  - There are subtle differences between files provided by webstats.tp.o
 and the ones produced here. We should probably write a simple (shell)
 script to convert files from webstats.tp.o to the format we'd have
 produced. Things like cutting off columns at the end or cutting off the
 `?` from parameters. We'd run that script once when copying over files.

 I'm objecting against the script solution.
 The easiest way and another way of testing might be to simply import these
 using CollecTor.  Setting log level to debug will tell us what gets
 dropped and provide another good test scenario with a useful result.

 >  - I guess we'll need to extend create-tarballs script to produce
 tarballs containing webstats files. I haven't looked, though.

 Oh, true.

 >
 > Do you want to look into these issues?

 Yep.
 Will you later import the old logs?  I could also do that on corsicum or
 on my machine and upload there.

 >
 > Remaining changes after those above are:
 >  - Run another round of tests with most or even all original log files
 as input.
    ^^^^ karsten's task

   - import the old logs (iwakeh or karsten see above)

 >  - Clean up `CHANGELOG.md`.
 >  - Release metrics-lib 2.2.0 and update `build.xml`.
 >  - Update copyrights to 2018.

 I could rebase your most recent branch on master and then add these
 changes?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22428#comment:51>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the metrics-bugs mailing list