[metrics-bugs] #22983 [Metrics/metrics-lib]: add a descriptor interface and implementation for web-logs

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri Jul 21 12:39:27 UTC 2017


#22983: add a descriptor interface and implementation for web-logs
---------------------------------+------------------------------
 Reporter:  iwakeh               |          Owner:  metrics-team
     Type:  enhancement          |         Status:  new
 Priority:  Medium               |      Milestone:
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+------------------------------

Comment (by karsten):

 Replying to [comment:5 iwakeh]:
 > Replying to [comment:4 karsten]:
 >
 > Thanks for the valuable input!
 >
 > > Regarding the name, let's try to find something more descriptive. How
 about `WebServerLog` or even `ApacheHttpServerAccessLog`? Otherwise
 there's the risk of confusion with descriptor types added in the future,
 like a log file written by BridgeDB containing client requests for bridge
 addresses.
 >
 > I see an interface hierarchy here:
 > LogDescriptor as parent for all logs (then we drop 'Descriptor' from the
 names) and have the first extending interface WebServerAccessLog.  Later
 we can add other *Log interfaces like BridgeDbClientLog etc.
 >
 > So for now, I focus on the access-log integration and keep future
 extensions in mind for the design.

 Makes sense.
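
A minimal Java sketch of the hierarchy discussed above, not a settled design. `LogDescriptor`, `WebServerAccessLog`, and `BridgeDbClientLog` are the names from the discussion. Placing `getCompressionType()` on the parent, the `getVirtualHost()` getter, and the `SimpleWebServerAccessLog` class are assumptions made only for illustration.

```java
// Parent interface for all log-type descriptors (name from the discussion).
interface LogDescriptor {
  /** Returns the compression type of the log file, e.g. "gz". */
  String getCompressionType();
}

// First extending interface, for web server access logs.
interface WebServerAccessLog extends LogDescriptor {
  /** Hypothetical getter, only here to show a log-type-specific method. */
  String getVirtualHost();
}

// A later addition could extend the same parent without touching the above.
interface BridgeDbClientLog extends LogDescriptor {
}

// Hypothetical implementation used only to exercise the interfaces.
final class SimpleWebServerAccessLog implements WebServerAccessLog {
  private final String compressionType;
  private final String virtualHost;

  SimpleWebServerAccessLog(String compressionType, String virtualHost) {
    this.compressionType = compressionType;
    this.virtualHost = virtualHost;
  }

  @Override
  public String getCompressionType() {
    return compressionType;
  }

  @Override
  public String getVirtualHost() {
    return virtualHost;
  }
}
```

Applications that only need the common metadata could then accept a `LogDescriptor` and ignore the concrete log type.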

 > > Regarding the suggested interface, I think there's a short term and a
 long term part here.
 > >
 > > In the long term I think that it would be at least twice as useful if
 we read the log contents and added methods to read these parsed contents.
 It's true that this causes some development hassle. But that's why we do
 it once in the library rather than rely on possibly more than one
 application to get it right. And we can still include the raw descriptor
 bytes by storing the compressed bytes and inflating them upon request.
 >
 > Yes, partially I have this in CollecTor anyway for sanitizing the logs.
 I'll add generally useful functionality to the metrics-lib code.
 > Should we have a new package for the implementations like
 `org.torproject.descriptor.logs`?  The log processing and content differs
 from usual descriptors quite a bit.

 As long as we keep all types that are relevant for applications in
 `org.torproject.descriptor`, I don't mind adding new subpackages.
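
The idea of storing the compressed bytes and inflating them upon request could be sketched as below. This assumes gzip compression and uses only `java.util.zip`. The class name `CompressedLogBytes` and its methods `fromRaw()` and `inflate()` are hypothetical and not part of metrics-lib.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical holder that keeps only the gzip-compressed bytes in memory.
final class CompressedLogBytes {
  private final byte[] compressed;

  private CompressedLogBytes(byte[] compressed) {
    this.compressed = compressed;
  }

  /** Compresses the given raw bytes and keeps only the compressed form. */
  static CompressedLogBytes fromRaw(byte[] raw) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(raw);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return new CompressedLogBytes(out.toByteArray());
  }

  /** Inflates the stored bytes on each call instead of caching them. */
  byte[] inflate() {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream gz =
        new GZIPInputStream(new ByteArrayInputStream(compressed))) {
      byte[] buffer = new byte[8192];
      int read;
      while ((read = gz.read(buffer)) != -1) {
        out.write(buffer, 0, read);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toByteArray();
  }
}
```

This trades CPU time on each request for a smaller memory footprint, which matters most for large logs.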

 > > Some comments on the interface:
 > >  - Let's include a subtype `Request` or similar for each line
 contained in the log file, and let's include a method `getRequests()` that
 returns `Iterable<Request>`.
 >
 > There could be a parent interface LogLine that is extended by an
 appropriate interface for each log type, like a Request interface for
 access-logs.
 > I'll think about it and definitely keep the design open for the addition,
 but would give it lower priority right now.

 Long term sounds fine.
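
For the long term, the `LogLine`/`Request` idea with `getRequests()` returning `Iterable<Request>` might look roughly like this. The getters `getLine()`, `getMethod()`, and `getResource()`, the classes `SimpleRequest` and `AccessLogSketch`, and the simplified "METHOD RESOURCE PROTOCOL" input are assumptions for illustration. A real `Request` would cover all fields of the Combined Log Format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Parent interface for a single line of any log type.
interface LogLine {
  String getLine();
}

// Line type for access logs; the getters here are hypothetical.
interface Request extends LogLine {
  String getMethod();

  String getResource();
}

final class SimpleRequest implements Request {
  private final String line;
  private final String method;
  private final String resource;

  // Assumes the simplified form "METHOD RESOURCE PROTOCOL".
  SimpleRequest(String line) {
    String[] parts = line.split(" ");
    if (parts.length != 3) {
      throw new IllegalArgumentException("Unexpected request line: " + line);
    }
    this.line = line;
    this.method = parts[0];
    this.resource = parts[1];
  }

  @Override
  public String getLine() {
    return line;
  }

  @Override
  public String getMethod() {
    return method;
  }

  @Override
  public String getResource() {
    return resource;
  }
}

// Hypothetical log class exposing its parsed lines as Iterable<Request>.
final class AccessLogSketch {
  private final List<Request> requests = new ArrayList<>();

  AccessLogSketch(List<String> lines) {
    for (String line : lines) {
      requests.add(new SimpleRequest(line));
    }
  }

  Iterable<Request> getRequests() {
    return Collections.unmodifiableList(requests);
  }
}
```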

 > >  - Due to the fact that we cannot include a `@type` annotation with a
 version number, `Request` should ideally include getters for all fields
 contained in Apache's Combined Log Format.
 > >  - Ideally, `getLogDate()` would return the date in milliseconds since
 the epoch to be conformant to the rest of metrics-lib, in which case it
 would probably be called `getLogMillis()`.
 >
 > Fine, but we only have the date, not the time, here.  Thus, msec signals a
 precision we don't offer.
 > I don't feel strongly about that.

 Me neither, I just think that it's easier to handle timestamps from
 different data sources if they all use the same format.
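
One possible way to get from a date-only value to `getLogMillis()` is to take the start of that day in UTC. The sketch below assumes the log date arrives as an ISO string like `2017-07-21`. The class `LogDates` and the method `toLogMillis()` are hypothetical names.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper converting a date-only log date to epoch milliseconds.
final class LogDates {
  private LogDates() {
  }

  /** Returns milliseconds since the epoch for 00:00:00 UTC on the given day. */
  static long toLogMillis(String isoDate) {
    LocalDate date = LocalDate.parse(isoDate); // expects yyyy-MM-dd
    return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
  }
}
```

The returned value is always a multiple of 86,400,000, so callers can tell that no time-of-day information is present.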

 > >  - I'm unclear what `getCompressionType()` returns. I think I'd expect
 a `String` that is either `"gz"` or `"xz"`, but not a `byte[]`. Was that
 intended?
 >
 > Correct, this should read `String getCompressionType()`, just a typo.
 Actually, it might turn into an enum.
 >
 > >  - If we read and parse logs, we'll have to change
 `getUnrecognizedLines()` to return any unrecognized lines.
 >
 > Yes, maybe with an upper limit in case a log got mangled?

 Good idea. First 100?
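
Capping the unrecognized lines at the first 100 could be done while parsing, as in this sketch. The class `UnrecognizedLinesCollector` and its `add()` method are hypothetical. Only `getUnrecognizedLines()` and the limit of 100 come from the discussion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical collector that keeps only the first 100 unrecognized lines.
final class UnrecognizedLinesCollector {
  static final int MAX_UNRECOGNIZED_LINES = 100;

  private final List<String> unrecognizedLines = new ArrayList<>();

  /** Stores the line unless the upper limit has already been reached. */
  void add(String line) {
    if (unrecognizedLines.size() < MAX_UNRECOGNIZED_LINES) {
      unrecognizedLines.add(line);
    }
  }

  /** Returns at most the first 100 unrecognized lines, in input order. */
  List<String> getUnrecognizedLines() {
    return Collections.unmodifiableList(unrecognizedLines);
  }
}
```

A mangled log with millions of bad lines would then cost at most 100 stored strings.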

 > > In the short term I can see how we might want to put the `Request`
 part on hold and only return metadata and uncompressed raw descriptor
 contents in this new descriptor type.
 >
 > Fine, as replied above.
 >
 > Do you have a rough estimate of the future log file sizes metrics-lib
 will have to deal with?

 No idea. I think some of the Apache logs are pretty large in uncompressed
 form. But other descriptors have grown a lot over time, too, like votes.
 And when we recently pondered appending all votes collected in a single
 CollecTor sync run, our original expectation of the size turned out to be
 pretty useless. So, 20 times the size?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22983#comment:6>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

