[metrics-bugs] #33502 [Metrics/CollecTor]: Do not let appended descriptor files grow too large

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Mar 2 15:12:56 UTC 2020

#33502: Do not let appended descriptor files grow too large
     Reporter:  karsten            |      Owner:  karsten
         Type:  enhancement        |     Status:  assigned
     Priority:  Medium             |  Milestone:
    Component:  Metrics/CollecTor  |    Version:
     Severity:  Normal             |   Keywords:
Actual Points:                     |  Parent ID:
       Points:                     |   Reviewer:
      Sponsor:                     |
 I revisited #20395 last week. The issue is that metrics-lib cannot handle
 large descriptor files, because it first reads the entire file into memory
 before splitting it into single descriptors and parsing them. While it
 would be possible to parse large descriptor files after making some major
 code changes (using `FileChannel` and doing lazy parsing), I don't think
 that we have to do that. After all, we're writing these large descriptor
 files ourselves in CollecTor, and it's up to us to stop doing that.

 Going back in time, the original reason for concatenating multiple
 descriptors into a single file was that rsyncing many tiny files from one
 host to another was slow. So we appended server descriptors and
 extra-info descriptors into a single file. This works well when a file
 only holds the server descriptors or extra-info descriptors published
 within 1 hour or even 10 hours. It works less well when a file holds all
 server descriptors or extra-info descriptors synced from another
 CollecTor instance when starting a new instance (#20335), and even less
 well when importing one or more monthly tarballs containing server
 descriptors or extra-info descriptors (#27716).

 My suggestion is that we define a configurable limit for appended
 descriptor files of, say, 20 MiB. When storing a descriptor, we would
 check whether appending it to an existing descriptor file would exceed
 this limit and, if so, start a new descriptor file.
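
 To make the idea concrete, here is a minimal sketch of the size check. It
 is not taken from CollecTor: the class name, the method names, and the
 "<baseName>-<n>" file naming scheme are all hypothetical, and the 20 MiB
 default is just the example value from above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch of a size-limited append; not CollecTor's code. */
final class AppendLimitSketch {

  /** Assumed default for the configurable limit: 20 MiB. */
  static final long DEFAULT_LIMIT_BYTES = 20L * 1024L * 1024L;

  /**
   * Returns true if appending descriptorLength bytes to a file that
   * currently holds currentSize bytes would exceed limitBytes. An empty
   * file always accepts a descriptor, so that a single descriptor larger
   * than the limit still gets stored somewhere.
   */
  static boolean wouldExceedLimit(long currentSize, long descriptorLength,
      long limitBytes) {
    return currentSize > 0L && currentSize + descriptorLength > limitBytes;
  }

  /**
   * Appends the descriptor to the first file named baseName-0, baseName-1,
   * ... in dir that has room for it, creating that file if necessary, and
   * returns the file that was written to.
   */
  static Path store(Path dir, String baseName, byte[] descriptor,
      long limitBytes) throws IOException {
    for (int index = 0; ; index++) {
      Path candidate = dir.resolve(baseName + "-" + index);
      long currentSize = Files.exists(candidate) ? Files.size(candidate) : 0L;
      if (!wouldExceedLimit(currentSize, descriptor.length, limitBytes)) {
        Files.write(candidate, descriptor,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return candidate;
      }
    }
  }
}
```

 The sketch scans candidate files from index 0 on every call, which keeps
 it short. A real implementation would more likely remember the file it is
 currently appending to.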

 There are some technical details to work out, but I think they can be
 solved. I also don't expect this to require a lot of code or any complex
 code changes. The benefit would be that we could resolve #20395 and
 #27716 by implementing this.

 Thoughts on the general idea?

Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33502>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online
