[tor-bugs] #20395 [Metrics/metrics-lib]: metrics-lib should be able to handle large descriptor files

Wed May 10 16:28:46 UTC 2017

#20395: metrics-lib should be able to handle large descriptor files
---------------------------------+-----------------------------------
 Reporter:  iwakeh               |          Owner:  karsten
     Type:  defect               |         Status:  new
 Priority:  Medium               |      Milestone:  metrics-lib 2.0.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------

Comment (by iwakeh):

 I hope I didn't overlook anything:

 `DescriptorFile#getDescriptors()` and
 `DescriptorParser#parseDescriptors()` don't access files.  They receive
 Descriptor objects or bytes and have to keep the bytes, but these
 methods don't cause an OOM unless their caller passes in too much data.

 The problem lies in the implementation of
 `DescriptorReaderImpl$DescriptorReaderRunnable` (which - as an aside -
 should be a separate class).  There the `readFile` method attempts to read
 an entire file and chokes when encountering a huge file.
 `DescriptorReaderRunnable` should check the file size before opening a
 file in order to handle files according to their size.  The OOM is caused
 by reading the entire file into memory and then creating all the
 Descriptor objects from it, also in memory (possibly copying the raw
 bytes; I didn't verify).  Memory usage could be reduced
 1. by reading only parts of the huge file at a time and also
 2. by not adding the bytes to the descriptor objects and instead simply
 keeping the file path and the position inside the file in memory.
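
 To illustrate item 2, here is a minimal sketch of an object that keeps
 only the file path, offset, and length in memory and reads the raw bytes
 on demand.  The class name `FileBackedDescriptorBytes` and the method
 `readRawBytes()` are made up for this example and are not part of
 metrics-lib.  The sketch also relies on the file staying in place and
 unchanged for as long as the object is in use.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Path;

/**
 * Hypothetical class (not part of metrics-lib): refers to the raw bytes of
 * one descriptor by file position instead of holding them in memory.
 */
final class FileBackedDescriptorBytes {

  private final Path file;    // file containing the descriptor
  private final long offset;  // position of the descriptor's first byte
  private final int length;   // number of bytes belonging to the descriptor

  FileBackedDescriptorBytes(Path file, long offset, int length) {
    if (offset < 0 || length < 0) {
      throw new IllegalArgumentException("offset and length must be >= 0");
    }
    this.file = file;
    this.offset = offset;
    this.length = length;
  }

  /** Reads the raw bytes from disk each time it is called. */
  byte[] readRawBytes() {
    try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
      byte[] bytes = new byte[length];
      raf.seek(offset);
      raf.readFully(bytes);  // throws EOFException if the file is too short
      return bytes;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

 Re-reading on every call trades I/O for memory; whether a small cache on
 top of this is worthwhile would depend on how often callers ask for the
 raw bytes.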

 Assumptions:
 * many Descriptor objects without raw bytes occupy far less memory than
 the Descriptor objects do currently
 * the files containing the descriptors remain available as long as there
 are Descriptor objects referring to them

 A sketch of changes:

 * Introduce descriptors that either hold their bytes in-memory or have a
 file path and in-file position(s) for accessing raw bytes, but don't store
 the bytes.
 * `DescriptorImpl` parses bytes and produces a list of the adapted
 Descriptor objects.
 * `DescriptorReaderRunnable` needs to read a large file one chunk at a
 time, parse enough to determine where the next descriptor begins and
 ends, and also provide the parser with these positions in the file.
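
 The last item could look roughly like the following sketch, which scans
 a file through a fixed-size buffer and records the begin and end
 positions of each descriptor without keeping the file contents in
 memory.  It assumes that every descriptor starts on a line beginning with
 a known token (for example an `@type ` annotation line); that assumption,
 the class name `DescriptorSpanScanner`, and the method `findSpans` are
 only for illustration and are not existing metrics-lib code.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of metrics-lib). */
final class DescriptorSpanScanner {

  /** Half-open byte range [start, end) of one descriptor within a file. */
  static final class Span {
    final long start;
    final long end;

    Span(long start, long end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Finds descriptor boundaries, assuming each descriptor begins with a
   * line that starts with startToken.  Bytes before the first such line
   * are ignored.  At most bufferSize bytes are buffered at any time.
   */
  static List<Span> findSpans(Path file, byte[] startToken, int bufferSize)
      throws IOException {
    if (startToken.length == 0) {
      throw new IllegalArgumentException("startToken must not be empty");
    }
    List<Span> spans = new ArrayList<>();
    long position = 0;       // number of bytes consumed so far
    long lineStart = 0;      // offset of the first byte of the current line
    long currentStart = -1;  // start of the descriptor being scanned, or -1
    int matched = 0;         // token bytes matched on this line, -1 = no match
    try (InputStream in =
        new BufferedInputStream(Files.newInputStream(file), bufferSize)) {
      int b;
      while ((b = in.read()) != -1) {
        if (matched >= 0 && matched < startToken.length) {
          if ((byte) b == startToken[matched]) {
            matched++;
            if (matched == startToken.length) {
              // A new descriptor begins at lineStart; close the previous one.
              if (currentStart >= 0) {
                spans.add(new Span(currentStart, lineStart));
              }
              currentStart = lineStart;
            }
          } else {
            matched = -1;  // this line cannot start a descriptor
          }
        }
        position++;
        if (b == '\n') {
          lineStart = position;
          matched = 0;
        }
      }
    }
    if (currentStart >= 0) {
      spans.add(new Span(currentStart, position));  // last one ends at EOF
    }
    return spans;
  }
}
```

 Each resulting span could then be handed to the parser together with the
 file path, so that a descriptor object only needs to remember where its
 bytes are located.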

 This stays very close to the current implementation.  The details need
 some more work, and it might be necessary to change more.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online