[tor-bugs] #25329 [Metrics/Library]: Enable metrics-lib to process large (> 2G) logfiles

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Feb 21 18:45:58 UTC 2018


#25329: Enable metrics-lib to process large (> 2G) logfiles
---------------------------------+--------------------------
     Reporter:  iwakeh           |      Owner:  metrics-team
         Type:  enhancement      |     Status:  new
     Priority:  Medium           |  Milestone:
    Component:  Metrics/Library  |    Version:
     Severity:  Normal           |   Keywords:
Actual Points:                   |  Parent ID:  #25317
       Points:                   |   Reviewer:
      Sponsor:                   |
---------------------------------+--------------------------
 Metrics-lib receives compressed logs, usually below 600kB in compressed
 size.  Logs of that size can be handled in-memory; this ticket is about
 handling logs that decompress to much larger files (approx. 2G).

 Commons Compress doesn't provide a method for determining the decompressed
 content size (as the xz command line tool does).  The other compression
 types metrics-lib supports offer this option, but using it would also
 require more changes.
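
 Streaming decompression sidesteps the need to know the decompressed size up
 front.  Below is a minimal sketch using Commons Compress; the file name and
 the hard-coded choice of xz are only for illustration, and the xz stream
 additionally needs the XZ for Java dependency on the classpath:

 {{{
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;

public class StreamingDecompressionSketch {
  public static void main(String[] args) throws IOException {
    // Read an xz-compressed log chunk by chunk without ever loading the
    // (possibly multi-gigabyte) decompressed content into memory.
    try (InputStream in = new XZCompressorInputStream(
        new BufferedInputStream(new FileInputStream("example.log.xz")))) {
      byte[] buffer = new byte[8192];
      long decompressedBytes = 0L;
      int read;
      while ((read = in.read(buffer)) != -1) {
        // Process the chunk here instead of buffering it.
        decompressedBytes += read;
      }
      System.out.println("Decompressed size: " + decompressedBytes + " bytes");
    }
  }
}
 }}}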

 Compression can be very effective, so a cut-off based on the compressed
 size would be somewhat arbitrary.  An example for xz compression: a log
 that decompresses to 3G has a compressed input array of length 589492
 (with extreme compression it even shrinks to 405480); on the other hand, a
 log that decompresses to only 64M can have an input array of length 509212.

 Handling larger log files with metrics-lib will require some interface
 changes.  Here is a suggestion:

 {{{
 +import java.io.InputStream;

  public interface LogDescriptor extends Descriptor {

    /**
 -   * Returns the decompressed raw descriptor bytes of the log.
 +   * Returns the compressed raw descriptor bytes of the log.
 +   *
 +   * <p>For access to the log's decompressed bytes
 +   * use method {@code decompressedByteStream}.</p>
 +   *
     * @since 2.2.0
     */
    public byte[] getRawDescriptorBytes();

 +  /**
 +   * Returns the decompressed raw descriptor bytes of the log as a stream.
 +   *
 +   * @since 2.2.0
 +   */
 +  public InputStream decompressedByteStream();

  }
 }}}
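
 For comparison, consuming a log through the proposed method could look like
 the following sketch; the class and method names are made up for this
 example, only {@code decompressedByteStream()} is taken from the suggestion
 above:

 {{{
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class LogDescriptorUsageSketch {

  /** Counts log lines without materializing the decompressed log. */
  static long countLines(LogDescriptor logDescriptor) throws IOException {
    long lines = 0L;
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        logDescriptor.decompressedByteStream(), StandardCharsets.UTF_8))) {
      while (reader.readLine() != null) {
        // Memory use stays bounded regardless of the decompressed size.
        lines++;
      }
    }
    return lines;
  }
}
 }}}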


 I think this might be the easiest approach to understand and use; and of
 course the implementation wouldn't need separate processing paths for large
 and 'normal' logs.  It also avoids having to decide how to determine
 whether a file is large or not.
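
 A possible implementation could wrap the compressed bytes in a Commons
 Compress stream that auto-detects the compression type.  Just a sketch: the
 helper class name and the use of {@code CompressorStreamFactory} are
 assumptions, not part of the suggestion above:

 {{{
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.CompressorException;
import org.apache.commons.compress.compressors.CompressorStreamFactory;

/** Hypothetical helper showing only the core idea; not part of metrics-lib. */
public class LogDecompression {

  /**
   * Wraps the compressed raw bytes of a log in a lazily decompressing
   * stream.  The compression type (xz, gz, bz2, ...) is auto-detected.
   */
  public static InputStream decompressedByteStream(byte[] rawDescriptorBytes)
      throws IOException {
    // ByteArrayInputStream supports mark/reset, which the factory needs
    // for auto-detection of the compression type.
    InputStream compressed = new ByteArrayInputStream(rawDescriptorBytes);
    try {
      return new CompressorStreamFactory().createCompressorInputStream(compressed);
    } catch (CompressorException e) {
      throw new IOException("Cannot decompress log bytes.", e);
    }
  }
}
 }}}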

 Thoughts?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/25329>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

