[tor-bugs] #20395 [Metrics/metrics-lib]: metrics-lib should be able to handle large descriptor files

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed May 10 20:28:59 UTC 2017


#20395: metrics-lib should be able to handle large descriptor files
---------------------------------+-----------------------------------
 Reporter:  iwakeh               |          Owner:  karsten
     Type:  defect               |         Status:  new
 Priority:  Medium               |      Milestone:  metrics-lib 2.0.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------

Comment (by karsten):

 Great ideas above!  And I think we should implement them, because they
 clearly reduce memory consumption.

 Going into more details, your second assumption makes sense to me.  I
 didn't think of that before, but I agree that we can make that assumption.

 However, your first assumption is unfortunately wrong.  I just
 concatenated all votes from May 1, 2017 to a single file with a size of
 0.8G.  I passed that to metrics-lib to read and parse it, which consumed
 4.1G of memory in total for parsed descriptors and the contained raw
 descriptor bytes.  I then modified `DescriptorImpl` to avoid storing raw
 descriptor bytes in memory, which led to memory consumption of 3.3G.  The
 difference is precisely the 0.8G of the original file.  But the 3.3G still
 remain, and that number will grow with the number of descriptors we put in
 a file.  Like, 72 hours of votes from an initial CollecTor sync with all
 votes concatenated to a single file would consume 9.9G, plus a few G more
 while parsing.  So, the suggestion above is certainly an improvement, but
 it still does not scale.

 But I could see us making these suggested improvements anyway, and they'll
 help us going forward.  Some thoughts:
  - We could modify `Descriptor#getRawDescriptorBytes()` to use its file
 reference and start and end position to retrieve the bytes from disk and
 return them to the caller, rather than requiring the user to do that
 themselves.  This would even make the change backward-compatible.
  - We should avoid calling that new `Descriptor#getRawDescriptorBytes()`
 ourselves at all costs while parsing and instead pass the bytes around
 directly.  I'm mentioning this explicitly, because I found uses of that
 method where we could have passed around these bytes as parameters
 instead.
  - We need to be careful to write the reading-files-in-chunks logic in a
 way that detects descriptor starts and ends across chunk boundaries.
 Think of tiny descriptors like microdescriptors.
  - And we should avoid scanning chunks repeatedly when a descriptor covers
 many, many such chunks.  Think of huge descriptors like votes.
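
 The first bullet could look roughly like the following sketch.  Everything
 here is hypothetical: the field names `descriptorFilePath`, `offset`, and
 `length` stand in for whatever `DescriptorImpl` would record at parse time
 instead of the raw bytes themselves.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

/** Sketch: a descriptor that re-reads its raw bytes from disk on demand
 *  instead of holding them in memory for its entire lifetime. */
class LazyDescriptor {

  private final String descriptorFilePath; // file the descriptor came from
  private final long offset;               // start position within that file
  private final int length;                // number of raw descriptor bytes

  LazyDescriptor(String descriptorFilePath, long offset, int length) {
    this.descriptorFilePath = descriptorFilePath;
    this.offset = offset;
    this.length = length;
  }

  /** Retrieves the raw bytes from disk only when the caller asks for them,
   *  which keeps the method signature backward-compatible. */
  public byte[] getRawDescriptorBytes() {
    try (RandomAccessFile file = new RandomAccessFile(descriptorFilePath, "r")) {
      byte[] rawBytes = new byte[length];
      file.seek(offset);
      file.readFully(rawBytes);
      return rawBytes;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

 One open question with this approach: the source file must still exist and
 be unchanged at the time the caller asks for the bytes.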
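
 And the last two bullets might look like this rough sketch (not actual
 metrics-lib code): it splits a stream into descriptors by scanning for a
 per-type start token ("onion-key" here, which happens to begin
 microdescriptors), carrying bytes over between chunks and remembering how
 far it has already scanned, so that a token straddling a chunk boundary is
 still found and earlier chunks are not searched again.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a byte stream into descriptors whose boundaries may fall
 *  anywhere, including in the middle of a start token between two chunks. */
class ChunkedDescriptorSplitter {

  /** Token that starts each descriptor of this type; microdescriptors,
   *  for example, begin with an "onion-key" line. */
  private static final String START_TOKEN = "onion-key";

  static List<String> split(InputStream in, int chunkSize) throws IOException {
    List<String> descriptors = new ArrayList<>();
    StringBuilder buffer = new StringBuilder();
    int searched = 0;        // buffer prefix already scanned for start tokens
    int currentStart = -1;   // start of the descriptor being accumulated
    byte[] chunk = new byte[chunkSize];
    int read;
    while ((read = in.read(chunk)) > 0) {
      buffer.append(new String(chunk, 0, read, StandardCharsets.US_ASCII));
      /* Resume scanning just before the previously searched end, so that a
       * token straddling the chunk boundary is found, but bytes that were
       * fully scanned before are not searched again. */
      int from = Math.max(0, searched - START_TOKEN.length() + 1);
      int pos;
      while ((pos = buffer.indexOf(START_TOKEN, from)) >= 0) {
        if (currentStart >= 0) {
          descriptors.add(buffer.substring(currentStart, pos));
        }
        currentStart = pos;
        from = pos + START_TOKEN.length();
      }
      searched = buffer.length();
      /* A real implementation would also discard the buffer prefix before
       * currentStart here to keep memory bounded. */
    }
    if (currentStart >= 0) {   // last descriptor ends at end of stream
      descriptors.add(buffer.substring(currentStart));
    }
    return descriptors;
  }
}
```

 Even with a tiny chunk size this finds tokens that span two or more
 chunks, which covers the microdescriptor case; the `searched` bookmark
 covers the vote case, where one descriptor spans many chunks.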

 Once we're there, let's talk more about how to avoid keeping potentially
 huge lists of parsed descriptors in memory.
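
 To sketch where that could go (purely hypothetical, not an existing
 metrics-lib API): an iterator that parses and hands out one descriptor at
 a time, so only the current descriptor has to stay in memory rather than a
 list of all of them.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sketch: stream descriptors to the caller one at a time instead of
 *  collecting them all in one potentially huge list first. */
class DescriptorStream implements Iterator<String> {

  private final BufferedReader reader;
  private final String startToken;  // line prefix that begins a descriptor
  private StringBuilder pending;    // descriptor currently being assembled
  private String nextDescriptor;    // next complete descriptor, if any

  DescriptorStream(BufferedReader reader, String startToken) {
    this.reader = reader;
    this.startToken = startToken;
    advance();
  }

  /** Reads ahead until the next descriptor is complete. */
  private void advance() {
    nextDescriptor = null;
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith(startToken)) {
          if (pending != null) {     // previous descriptor is now complete
            nextDescriptor = pending.toString();
          }
          pending = new StringBuilder(line).append('\n');
          if (nextDescriptor != null) {
            return;
          }
        } else if (pending != null) {
          pending.append(line).append('\n');
        }
      }
      if (pending != null) {         // emit the final descriptor at EOF
        nextDescriptor = pending.toString();
        pending = null;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public boolean hasNext() {
    return nextDescriptor != null;
  }

  @Override
  public String next() {
    if (nextDescriptor == null) {
      throw new NoSuchElementException();
    }
    String result = nextDescriptor;
    advance();
    return result;
  }
}
```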

 Do you want to start hacking on your suggestions above?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:8>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

