[tor-bugs] #22141 [Metrics/metrics-lib]: Deprecate `DescriptorFile` and add relevant information to `Descriptor`

Wed Jun 14 09:42:09 UTC 2017

#22141: Deprecate `DescriptorFile` and add relevant information to `Descriptor`
---------------------------------+-----------------------------------
 Reporter:  karsten              |          Owner:  karsten
     Type:  enhancement          |         Status:  needs_review
 Priority:  Medium               |      Milestone:  metrics-lib 1.9.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------

Comment (by iwakeh):

 Replying to [comment:13 karsten]:
 > I'm afraid I found two flaws in the design above and the suggestion on
 #22196:
 >
 >  1. We're using the `Iterable<Descriptor>` and `InvalidDescriptor` as
 one possible returned element as channel for all kinds of exceptions
 thrown while reading descriptor files.  But we don't have such a channel
 in `DescriptorCollector#collectDescriptors` which returns, well, `void`.
 That means that the design above doesn't solve anything of
 [https://trac.torproject.org/projects/tor/ticket/16225#comment:7 this
 issue raised on the related ticket #16225].  And ideally we'd handle
 exceptions in the same way in all descriptor sources.

 `DescriptorCollectorImpl` is deprecated:
 {{{
 * @deprecated Replaced by {@link DescriptorIndexCollector} which uses the
  *     remote instance's index.json file as a more robust alternative to
 parsing
  *     the remote instance's directory listings.
 }}}

 Using the `DescriptorIndexCollector` implementation no descriptors are
 parsed during download.
 What did I overlook?

 >  1. The idea of replacing the parse history with a `minLastModified`
 timestamp doesn't handle I/O problems very well.  For example, in the
 current implementation, if there's a network problem with downloading a
 potentially large file from CollecTor, we would not include that in the
 parse history and retry collecting and reading it next time.  But with
 only a timestamp we would simply skip that descriptor file, which is
 pretty bad.
 >
 > New plan:
 >  - We rename `InvalidDescriptor` to `UnparseableDescriptor` and let it
 return the `DescriptorParseException` that made it unparseable in our
 view.  This instance ''might'' be useful to the application, at least by
 containing the raw descriptor `byte[]` or descriptor `File` reference.
 And if the application produced the input descriptor itself, like
 CollecTor, knowing about it being unparseable is more important than for a
 consumer, like Onionoo!

 I assumed that InvalidDescriptor would provide the bytes and other
 descriptor meta-data, the renaming makes sense.

 >  - We leave the parse history in place and postpone the idea of
 stateless, overloaded methods.  Sad.

 Not so sad, actually.  From an API point of view hiding away the history
 bookkeeping reduces complexity.

 >  - We add a method `removeFromHistoryFile(File)` that can be called by
 the application if it wants to reprocess a descriptor file containing an
 `UnparseableDescriptor`.  This would not be necessary for I/O exceptions,
 because those descriptor files would not go into the parse history and
 retried in the next execution anyway.
 >  - We log any exceptions on `warn` level that we caught while collecting
 or reading or parsing descriptors.  The application can't ''do'' anything
 to handle these issues anyway, except for telling the operator that
 something went wrong.  `DescriptorParseException`s thrown while parsing
 invalid/unparseable descriptors are exempt from this and will only be
 logged on info level.

 I think INFO logging to the regular log could bloat logs unnecessarily;
 more below.

 >  - We ''could'' introduce an `ExceptionListener` for exceptions are than
 `DescriptorParseException`, or the "ParseExceptionsLog" that you mentioned
 (even though I'm not entirely certain what that would be).  I don't think
 it's necessary though, because we wouldn't it use that ourselves, and we
 don't know whether it would be used by anyone.

 I think an exception listener is not really necessary.  The 'logging
 channel' suggestion was meant as introducing another Logger as for example
 is used in Onionoo for statistics:
 `LoggerFactory.getLogger("statistics")...`.
 Such a `LoggerFactory.getLogger("ParseExceptionsLog")` for all parsing
 problems would 'channel' these log statements and the final receiver (be
 it log or /dev/null) can be changed via runtime configuration.
 Does that make sense?

 >
 > Before I write more code (or longer comments!), what do you think? :)

 I didn't look at the branch below, yet.  Should I now?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22141#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online