[metrics-bugs] #23243 [Metrics/Metrics website]: write a spec for web-server-access log descriptors

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Aug 15 09:59:31 UTC 2017


#23243: write a spec for web-server-access log descriptors
-------------------------------------+-----------------------------------
 Reporter:  iwakeh                   |          Owner:  metrics-team
     Type:  enhancement              |         Status:  needs_information
 Priority:  Medium                   |      Milestone:
Component:  Metrics/Metrics website  |        Version:
 Severity:  Normal                   |     Resolution:
 Keywords:                           |  Actual Points:
Parent ID:                           |         Points:
 Reviewer:                           |        Sponsor:
-------------------------------------+-----------------------------------

Comment (by karsten):

 Replying to [ticket:23243 iwakeh]:
 > This document should answer the following questions:

 Good idea to start such a document! I'll start filling in information below.

 > * What will the raw input data look like?
 >  - compressed logs

 Very likely, though compression shouldn't be a strict requirement.
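
 A minimal sketch of what "compression is optional" could mean for a reader,
 assuming gzip is the compression in use (the ticket does not say which one)
 and using a hypothetical helper name `read_log_lines`. It checks the gzip
 magic bytes instead of trusting the file extension:

```python
import gzip
from pathlib import Path

# First two bytes of every gzip stream.
GZIP_MAGIC = b"\x1f\x8b"


def read_log_lines(path):
    """Yield lines from a log file that may or may not be gzip-compressed."""
    path = Path(path)
    # Peek at the header rather than relying on a ".gz" suffix.
    with path.open("rb") as raw:
        is_gzip = raw.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    # Assumption: logs are UTF-8; undecodable bytes are replaced, not fatal.
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

 Other compression formats would need their own magic-byte check and opener.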

 >  - varying dates in log-lines despite the file being tagged with a
 single date

 Yes, to a certain degree. We'll have to ask the admins for details, but I
 believe that the date in the file name is set when the log is rotated and
 that the date on each line is when the host started processing the request.
 So some requests may be received before midnight and completed after
 midnight. And depending on when the log is rotated, some requests may be
 started on the day before the rotation and finished after it.

 >  - are there only GET log-lines of 200 responses to be expected?

 No, there might be other methods and other response codes.

 >  - size could be huge (in future)

 Yes.

 >  - exact input format (if possible to define)

 Good question. We should ideally support Apache's Combined Log Format,
 even though we'd currently only receive Tor's privacy* log formats:

 {{{
 LogFormat "0.0.0.0 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacy
 LogFormat "0.0.0.1 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyssl
 LogFormat "0.0.0.2 - %u %{[%d/%b/%Y:00:00:00 %z]}t \"%r\" %>s %b \"%{Referer}i\" \"-\" %{Age}o" privacyhs
 }}}

 And there's already the first contradiction: The `%{Age}o` part is not
 contained in the Combined Log Format:

 {{{
 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
 }}}

 Maybe we require lines to start with the Common Log Format and ignore any
 further fields? Needs discussion.
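
 To make the "Common Log Format prefix, ignore the rest" idea concrete, here
 is a rough sketch. The regular expression, the function name
 `parse_clf_prefix`, and the sample line in the usage note are illustrative
 assumptions, not an agreed format. It assumes `%u` contains no spaces:

```python
import re

# Common Log Format: %h %l %u %t "%r" %>s %b
# Anything after %b (Referer, User-agent, Age, ...) is captured and dropped.
CLF_PREFIX = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>(?:[^"\\]|\\.)*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: (?P<rest>.*))?$'
)


def parse_clf_prefix(line):
    """Return the Common Log Format fields of a line, or None if it doesn't match."""
    match = CLF_PREFIX.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields.pop("rest")  # trailing fields are deliberately ignored
    return fields
```

 With this, a made-up line such as
 `0.0.0.0 - - [15/Aug/2017:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234 "-" "-" -`
 yields the same fields whether or not the three trailing fields are present.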

 >  - meta-data is provided in paths and filenames

 Yep.

 >  - ...
 > * What will sanitized stored (on disk) logs look like?
 >  - cleaned log-lines, define exact format, give examples (as this might
 deviate from the current python sanitation)
 >  - meta-data is provided in paths and filenames
 >  - should files be reassembled, i.e., only log lines of a given date in
 a descriptor for that log date?

 Yes! That's important! Otherwise we'd leak which lines for a given date
 were written before and which after the log was rotated. That narrows a
 request down to a time frame much shorter than 24 hours. We'll have to do
 this.
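
 Reassembling by date could look roughly like the following sketch. It
 assumes the `[%d/%b/%Y:%H:%M:%S %z]` timestamp from the formats above, takes
 the date exactly as logged (no time zone conversion), and treats the first
 `[...]` on a line as the timestamp. `group_lines_by_date` is a hypothetical
 name:

```python
from collections import defaultdict
from datetime import datetime

TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def group_lines_by_date(lines):
    """Group log lines by the date in each line, regardless of source file."""
    by_date = defaultdict(list)
    for line in lines:
        start = line.find("[")
        end = line.find("]", start + 1)
        if start == -1 or end == -1:
            continue  # no timestamp found; skip the line
        try:
            stamp = datetime.strptime(line[start + 1:end], TIME_FORMAT)
        except ValueError:
            continue  # bracketed text was not a timestamp
        by_date[stamp.date()].append(line)
    return dict(by_date)
```

 Feeding the lines of all input files for one host through this before
 writing descriptors would give one bucket per log date, independent of when
 the logs were rotated.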

 >  - should storage (on disk) be in compressed files (opposed to storing
 other descriptors uncompressed)?

 Yes. Configurable by the application, but yes.
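
 As a sketch of "configurable by the application", writing could take a
 flag. It again assumes gzip, and the `.gz` suffix and the name
 `write_sanitized_log` are made up for illustration:

```python
import gzip


def write_sanitized_log(path, lines, compress=True):
    """Write lines to path, gzip-compressed unless the caller opts out."""
    # Assumption: compressed files get a ".gz" suffix appended.
    target = f"{path}.gz" if compress else str(path)
    opener = gzip.open if compress else open
    with opener(target, "wt", encoding="utf-8") as handle:
        for line in lines:
            handle.write(line + "\n")
    return target
```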

 >  - Should such log be stored (on disk) in reasonably sized chunks (once
 a GB size is reached)?

 No, compression should already reduce the size enough so that we'll never
 run into such sizes. Never!

 >  - ...
 >
 > Please add more.

 Looks like a good start! Will add more as more comes to mind.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/23243#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

