[metrics-bugs] #21751 [Metrics/metrics-lib]: Use multiple threads to parse descriptors

Wed Mar 15 15:50:51 UTC 2017

#21751: Use multiple threads to parse descriptors
-------------------------------------+--------------------------
     Reporter:  karsten              |      Owner:  metrics-team
         Type:  enhancement          |     Status:  new
     Priority:  Medium               |  Milestone:
    Component:  Metrics/metrics-lib  |    Version:
     Severity:  Normal               |   Keywords:
Actual Points:                       |  Parent ID:
       Points:                       |   Reviewer:
      Sponsor:                       |
-------------------------------------+--------------------------
 The following idea came up when I looked a bit into #17831 to speed up
 metrics-lib.

 When we read and parse descriptors from disk, a single thread does both
 the reading and the parsing.  It's a daemon thread and not the
 application's main thread, so if the application's thread is busy
 processing parsed descriptors we're at least using two threads.  But we
 could parallelize even more by using separate threads for reading and
 parsing, and even by using multiple threads for reading and/or parsing.
 I'll leave the I/O part to #17831 and focus on the multi-threaded parsing
 part here.

 I wrote a little patch that measures time spent on reading tarball
 contents in `DescriptorReaderImpl#readTarballs()` and then extended that
 by moving descriptor parsing code to a separate class that implements
 `Runnable` and that gets executed by an `ExecutorService`.  I initialized
 that executor with `Executors.newFixedThreadPool(n)` for `n = [2, 4, 8,
 16, 32, 64]`.  I also tried `n = 1`, but ran out of memory due to a major
 issue in my simple patch: it reads ''all'' tarball contents into memory
 when creating `Task` instances, even if they cannot be executed anytime
 soon.  What we should do is block the reader thread when the executor
 already has enough pending tasks.  I'm attaching my patch, but only to
 avoid starting from zero the next time.  It needs more work.
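
 One way to block the reader thread could look like the following minimal
 sketch.  It is not the attached patch: the class name `BoundedParserPool`
 and the `maxPending` parameter are made up for illustration and are not
 part of metrics-lib.  It assumes a `Semaphore` is an acceptable way to cap
 the number of tasks that are queued or running, so that the reader only
 loads more tarball contents once a slot frees up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration; not part of metrics-lib. */
final class BoundedParserPool {

  private final ExecutorService executor;
  private final Semaphore slots;

  /** maxPending caps the tasks that are queued or running at any time. */
  BoundedParserPool(int threads, int maxPending) {
    this.executor = Executors.newFixedThreadPool(threads);
    this.slots = new Semaphore(maxPending);
  }

  /** Blocks the calling (reader) thread until a slot is free. */
  void submit(Runnable parseTask) {
    // Uninterruptible for brevity; real code may prefer acquire().
    slots.acquireUninterruptibly();
    try {
      executor.execute(() -> {
        try {
          parseTask.run();
        } finally {
          slots.release();  // free the slot even if parsing throws
        }
      });
    } catch (RuntimeException e) {
      slots.release();  // task was rejected, so it never ran
      throw e;
    }
  }

  /** Stops accepting tasks; returns true if all finished in time. */
  boolean shutdownAndWait(long timeout, TimeUnit unit) {
    executor.shutdown();
    try {
      return executor.awaitTermination(timeout, unit);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    BoundedParserPool pool = new BoundedParserPool(8, 16);
    AtomicInteger parsed = new AtomicInteger();
    for (int i = 0; i < 1000; i++) {
      // Stand-in for "read one tarball entry, then parse it".
      pool.submit(() -> parsed.incrementAndGet());
    }
    pool.shutdownAndWait(1, TimeUnit.MINUTES);
    System.out.println("parsed " + parsed.get() + " entries");
  }
}
```

 With `maxPending` set to a small multiple of the thread count, memory use
 would be bounded by the number of pending tasks rather than by the size of
 the tarball.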

 || '''separate parser threads''' || '''read `.tar` file (seconds)''' || '''parse `.tar` file (seconds)''' || '''read `.tar.xz` file (seconds)''' || '''parse `.tar.xz` file (seconds)''' ||
 || none (current code) || 35 || 159 || 9 || 162 ||
 || 2 || 36 || 42 || 8 || 126 ||
 || 4 || 41 || 13 || 7 || 96 ||
 || 8 || 42 || 11 || 6 || 35 ||
 || 16 || 41 || 11 || 10 || 28 ||
 || 32 || 45 || 13 || 7 || 34 ||
 || 64 || 41 || 13 || 6 || 38 ||

 These results show that 4 threads speed up the parse time for `.tar` files
 by a '''factor of 12''', after which there's no visible improvement, and 8
 threads speed up the parse time for `.tar.xz` files by a '''factor of 4.6'''.
 Just from these numbers I'd suggest using 8 threads by default and making
 this number configurable for the application.  But: needs more work.
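
 The speedup factors follow from dividing the single-threaded parse time by
 the multi-threaded one.  The small sketch below recomputes them from the
 table; the class name `ParseSpeedup` is made up for illustration, and the
 numbers are copied from the measurements above.

```java
final class ParseSpeedup {

  // Parse times in seconds, copied from the table above.
  static final double TAR_BASELINE = 159;
  static final double TAR_XZ_BASELINE = 162;
  static final int[] THREADS = {2, 4, 8, 16, 32, 64};
  static final double[] TAR = {42, 13, 11, 11, 13, 13};
  static final double[] TAR_XZ = {126, 96, 35, 28, 34, 38};

  /** Speedup factor: single-threaded time over multi-threaded time. */
  static double speedup(double baselineSeconds, double parallelSeconds) {
    return baselineSeconds / parallelSeconds;
  }

  public static void main(String[] args) {
    // Print one line per thread count with both speedup factors.
    for (int i = 0; i < THREADS.length; i++) {
      System.out.printf("%2d threads: .tar x%.1f, .tar.xz x%.1f%n",
          THREADS[i],
          speedup(TAR_BASELINE, TAR[i]),
          speedup(TAR_XZ_BASELINE, TAR_XZ[i]));
    }
  }
}
```

 For 4 threads on the `.tar` file this is 159 / 13, roughly 12.2, and for 8
 threads on the `.tar.xz` file it is 162 / 35, roughly 4.6.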

 My recommendation would be to look more into making parsing multi-threaded
 and save #17831 for later.  It seems like parsing is the lower-hanging
 fruit.

 Note that reading the same tarball in extracted form using the current
 code took 271 seconds.  In that case the lower-hanging fruit might be I/O
 improvements, not multi-threaded parsing.  But my hope is that not many
 applications extract tarballs containing over 800,000 files and read them
 using `DescriptorReader`, especially not if they could as well read the
 tarball directly.

 Suggestions welcome!  Otherwise I might pick this up again and move it
 forward whenever there's time.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21751>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online