[tor-bugs] #19778 [Metrics/CollecTor]: Bridge descriptor sanitizer runs out of memory after 13.5 days

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Jul 28 19:18:34 UTC 2016


#19778: Bridge descriptor sanitizer runs out of memory after 13.5 days
-----------------------------------+-----------------------------
     Reporter:  karsten            |      Owner:
         Type:  defect             |     Status:  new
     Priority:  High               |  Milestone:  CollecTor 1.1.0
    Component:  Metrics/CollecTor  |    Version:
     Severity:  Normal             |   Keywords:
Actual Points:                     |  Parent ID:
       Points:                     |   Reviewer:
      Sponsor:                     |
-----------------------------------+-----------------------------
 I'm currently reprocessing the bridge descriptor archive for #19317.  The
 process, started with `-Xmx6g` on a machine with 8G RAM, ran out of memory
 after 13.5 days.  I uploaded the custom log with additional debug lines
 for the currently processed tarball here:
 https://people.torproject.org/~karsten/volatile/collector-bridgedescs.log.xz (556K).

 While writing tests for #19755, I noticed a possible explanation, though
 I don't have facts to prove it: `BridgeSnapshotReader` contains a
 `Set<String> descriptorImportHistory` that stores SHA-1 digests of files
 and single descriptors in order to skip duplicates as early as possible.
 Its effect can be seen in log lines like the following, which comes from
 reprocessing 1 day of tarballs:

 {{{
 2016-07-28 11:54:31,206 DEBUG o.t.c.b.BridgeSnapshotReader:215 Finished
 importing files in directory in/bridge-descriptors/.  In total, we parsed
 87 files (skipped 9) containing 24 statuses, 33984 server descriptors
 (skipped 168368), and 29618 extra-info descriptors (skipped 50027).
 }}}
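
 For reference, the pattern in question looks roughly like this
 (simplified sketch, not the actual code; everything except
 `descriptorImportHistory` is illustrative):

 {{{
 import java.security.MessageDigest;
 import java.util.HashSet;
 import java.util.Set;

 public class DedupSketch {

   /* Grows by one entry per distinct file and per distinct descriptor
    * and is never cleared while the sanitizer runs, which is my guess
    * at where the memory goes after 13.5 days. */
   private Set<String> descriptorImportHistory = new HashSet<>();

   boolean seenBefore(byte[] rawDescriptorBytes) throws Exception {
     MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
     byte[] digest = sha1.digest(rawDescriptorBytes);
     StringBuilder hex = new StringBuilder();
     for (byte b : digest) {
       hex.append(String.format("%02x", b));
     }
     /* add() returns false if the digest was already in the set, i.e.,
      * the file or descriptor is a duplicate and can be skipped. */
     return !descriptorImportHistory.add(hex.toString());
   }
 }
 }}}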

 I don't know a good way to confirm this theory other than running the
 process once again for a few days and logging the size of that set.  I
 also tried attaching `jvisualvm` last time, but for some reason it
 detached and froze after 90 hours.

 Possible fixes:
  - Use some kind of least-recently-used (or maybe least-recently-inserted,
 if that's easier to implement) cache that allows us to skip duplicates in
 tarballs written on the same day or so; there's a sketch of this after
 the list.  There's no harm in reprocessing a duplicate; it just takes
 more time than skipping it.  Needs some testing to get the size right,
 though the log above suggests that 100k entries might be enough.
  - Avoid keeping a set altogether and instead start the sanitizing
 process for each descriptor until we know enough about it to check
 whether we wrote it before.  That would mean computing the SHA-1 digest
 and parsing until reaching the publication time.  In early tests this
 increased processing time by a factor of 1.2 or 1.3, and even more
 processing time is not exactly what I'm looking for.
  - Are there other options, ideally ones that are easy to implement and
 maintain?
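
 As a starting point for the first option, here is a minimal sketch of a
 bounded, least-recently-inserted cache based on `LinkedHashMap` (class
 name and size limit are made up for illustration, not part of the
 current code):

 {{{
 import java.util.LinkedHashMap;
 import java.util.Map;

 public class DigestCache extends LinkedHashMap<String, Boolean> {

   /* From the numbers above, 100k entries might already be enough to
    * catch same-day duplicates; needs testing. */
   private static final int MAX_ENTRIES = 100000;

   /* Evict the oldest entry once the cache exceeds MAX_ENTRIES, so
    * memory stays bounded no matter how many tarballs we process.
    * Passing accessOrder=true to the LinkedHashMap constructor would
    * turn this into a least-recently-used cache instead. */
   @Override
   protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
     return size() > MAX_ENTRIES;
   }

   /* Remembers the digest and returns true if it was seen recently. */
   public boolean addAndCheck(String sha1Hex) {
     return put(sha1Hex, Boolean.TRUE) != null;
   }
 }
 }}}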

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/19778>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

