[tor-bugs] #20395 [Metrics/Library]: Add capability to handle large descriptor files

Tor Bug Tracker & Wiki blackhole at torproject.org
Fri Feb 9 11:33:18 UTC 2018


#20395: Add capability to handle large descriptor files
-----------------------------+------------------------------
 Reporter:  iwakeh           |          Owner:  karsten
     Type:  defect           |         Status:  needs_review
 Priority:  Medium           |      Milestone:
Component:  Metrics/Library  |        Version:
 Severity:  Normal           |     Resolution:
 Keywords:                   |  Actual Points:
Parent ID:                   |         Points:
 Reviewer:                   |        Sponsor:
-----------------------------+------------------------------
Changes (by karsten):

 * status:  accepted => needs_review


Comment:

 I started making some improvements here. Here's my train of thought:
  1. Rather than reading the whole file into memory at the beginning, we
 could read it in chunks and start parsing as soon as we have seen a full
 descriptor. This sounds like a useful improvement, but on its own it
 gains very little: reading the 70M descriptor file I used for testing is
 fast, and it's the parsing that takes long. As long as we need the full
 descriptor file contents in memory anyway, there's no point in reading
 files in chunks. (See also 3. below.)
  2. Rather than parsing all descriptors contained in a given file into a
 list and then throwing all parsed descriptors into the
 `BlockingIterator<Descriptor>`, we could skip the intermediate list and
 add each descriptor as soon as it is parsed. The effect is that the time
 to first descriptor is reduced considerably, whereas the time to last
 descriptor stays the same. I prepared a patch for this. The commit
 message contains more details.
  3. Rather than storing descriptor file contents in a `byte[]`, we could
 go through the file, read descriptor by descriptor, and store a `File`
 reference together with offset and length into the file. The effect would
 be that we avoid keeping the raw descriptor file contents in memory
 at all. We'd still keep parsed contents in memory. A possible downside is
 that the file must not be deleted or moved away while the application
 processes descriptors, which should be safe to require. Still, this is a
 larger change than 2. And it requires 1. That's why I postponed this.
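
 To make idea 3 more concrete, here is a minimal sketch of what such an
 index could look like. It is not part of metrics-lib: `DescriptorIndex`,
 `Span`, and the `startKeyword` parameter are hypothetical names, and the
 sketch assumes that every descriptor begins at a line starting with a
 known keyword. It scans the file once as a stream (which is where 1.
 comes in) and keeps only `File`, offset, and length per descriptor.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of metrics-lib. */
final class DescriptorIndex {

  /** Where one raw descriptor lives: file reference, offset, and length. */
  static final class Span {
    final File file;
    final long offset;
    final int length;

    Span(File file, long offset, int length) {
      this.file = file;
      this.offset = offset;
      this.length = length;
    }

    /** Reads only this descriptor's raw bytes from disk, on demand. */
    byte[] read() throws IOException {
      try (RandomAccessFile raf = new RandomAccessFile(this.file, "r")) {
        byte[] raw = new byte[this.length];
        raf.seek(this.offset);
        raf.readFully(raw);
        return raw;
      }
    }
  }

  /**
   * Streams through the file once and records one Span per descriptor.
   * Assumption: each descriptor starts at a line beginning with
   * startKeyword. Bytes before the first such line are skipped here.
   */
  static List<Span> index(File file, String startKeyword)
      throws IOException {
    byte[] key = startKeyword.getBytes(StandardCharsets.US_ASCII);
    if (key.length == 0) {
      throw new IllegalArgumentException("startKeyword must not be empty");
    }
    List<Long> starts = new ArrayList<>();
    long pos = 0L;    // offset of the byte currently being examined
    int matched = 0;  // keyword bytes matched on this line, -1 on mismatch
    try (InputStream in =
        new BufferedInputStream(new FileInputStream(file))) {
      int b;
      while ((b = in.read()) != -1) {
        if (matched >= 0 && matched < key.length) {
          if (b == key[matched]) {
            matched++;
            if (matched == key.length) {
              starts.add(pos - key.length + 1);
            }
          } else {
            matched = -1;
          }
        }
        if (b == '\n') {
          matched = 0;  // a new line begins, start matching again
        }
        pos++;
      }
    }
    // Each descriptor ends where the next one starts, or at end of file.
    List<Span> spans = new ArrayList<>();
    for (int i = 0; i < starts.size(); i++) {
      long start = starts.get(i);
      long end = i + 1 < starts.size() ? starts.get(i + 1) : pos;
      spans.add(new Span(file, start, (int) (end - start)));
    }
    return spans;
  }
}
```

 A caller would run `DescriptorIndex.index(file, "router ")` once and then
 call `read()` on a span only when that descriptor is about to be parsed.
 As noted in 3., this only works if the file stays in place until all
 spans have been read.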

 Please review
 [https://gitweb.torproject.org/user/karsten/metrics-lib.git/commit/?h=task-20395&id=ef9406c148a477720cdca67c6a2891ecd850f912
 commit ef9406c in my task-20395 branch].
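
 For readers who only want the gist of idea 2, the general shape is
 roughly as follows. This is an illustrative sketch and not the code from
 the commit: `StreamingParse` and `parseInto` are hypothetical names, and
 a plain `java.util.concurrent.BlockingQueue` stands in for the
 `BlockingIterator<Descriptor>`. An empty `Optional` marks the end of
 input.

```java
import java.util.Optional;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

/** Hypothetical sketch for illustration; not the code from the commit. */
final class StreamingParse {

  /**
   * Parses raw descriptors one by one and hands each result to the
   * consumer right away, instead of collecting all results in a list
   * first. An empty Optional tells the consumer that nothing follows.
   * Assumption: the parser never returns null.
   */
  static <R> void parseInto(Iterable<byte[]> rawDescriptors,
      Function<byte[], R> parser, BlockingQueue<Optional<R>> out)
      throws InterruptedException {
    for (byte[] raw : rawDescriptors) {
      // The consumer can take this descriptor while we parse the next.
      out.put(Optional.of(parser.apply(raw)));
    }
    out.put(Optional.empty());
  }
}
```

 The total parsing work is unchanged, which matches the observation that
 only the time to first descriptor improves. With a bounded queue, `put`
 would also keep the parser from running far ahead of a slow consumer.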

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20395#comment:15>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

