[metrics-bugs] #31204 [Metrics/CollecTor]: Extend file objects in index.json to include descriptor types, publication times, and file digests

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Aug 5 20:04:14 UTC 2019


#31204: Extend file objects in index.json to include descriptor types, publication
times, and file digests
-------------------------------+--------------------------
 Reporter:  karsten            |          Owner:  karsten
     Type:  enhancement        |         Status:  accepted
 Priority:  Medium             |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+--------------------------
Changes (by karsten):

 * owner:  metrics-team => karsten
 * status:  new => accepted


Comment:

 I started working on this today. I do have some code here that supports
 running in the background using a thread pool, but I'll have to spend at
 least another day or two on this before it's ready for review.

 A few observations from writing this code and testing it locally:

  1. Reading tarballs to find out descriptor types and publication times is
 really time-consuming. A test run with 643M of data took roughly 10
 minutes on my laptop. For comparison, our archive is 95G in size, so about
 150 times the size; assuming indexing time scales linearly, a full run
 would take roughly 25 hours. We might want to index the archive on an
 external machine that is not the CollecTor host. And we need to be clear
 that the server will be busy for 10-20 minutes after creating new tarballs
 every 2 to 3 days. Neither of these is a major concern, just stating it.

  2. Interestingly, computing SHA-256 digests of tarballs only took about 5
 seconds of these 10 minutes, so that's really, really cheap compared to
 reading tarballs and extracting descriptor types and publication times.

  3. I wonder how it will work out in practice that these new fields will
 be blank for 10-20 minutes for newly created tarballs. In many cases,
 newly created tarballs replace existing tarballs from a few days ago for
 which these fields were available. One effect would be that the latest
 published timestamp for a given descriptor type will flap between, say,
 the middle of a month and the end of the previous month, only because the
 tarball for the current month is being replaced. Maybe we need to do
 something more elaborate where we put newly created tarballs into a
 staging area, parse them there, and then move them into place.
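
 To make the staging idea from the third point more concrete, here is a
 minimal sketch in Python. It is not the CollecTor code. The function names,
 the in-memory index dict, and the directory layout are all made up for
 illustration. It assumes each descriptor file in a tarball starts with an
 "@type" annotation line, and it leaves out publication times, because
 extracting those depends on the descriptor type.

```python
import hashlib
import os
import tarfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream the file through SHA-256 so large tarballs never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def index_tarball(path):
    """Hypothetical indexer: collect '@type' annotations and the file digest."""
    types = set()
    with tarfile.open(path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Assumption: the first line of each contained file is "@type <name> <version>".
            first_line = tar.extractfile(member).readline()
            first_line = first_line.decode("utf-8", "replace").strip()
            if first_line.startswith("@type "):
                types.add(first_line[len("@type "):])
    return {"types": sorted(types), "sha256": sha256_of_file(path)}


def publish(staged_path, public_dir, index):
    """Index a staged tarball first, then move it into place."""
    staged_path = Path(staged_path)
    entry = index_tarball(staged_path)  # the slow part runs before publication
    target = Path(public_dir) / staged_path.name
    # os.replace overwrites an existing target; the swap is atomic on POSIX
    # as long as staging and public directories are on the same file system.
    os.replace(staged_path, target)
    index[staged_path.name] = entry
    return entry
```

 With this ordering the old tarball and its index entry stay in place until
 the new tarball has been parsed, so the fields would not go blank in
 between. Whether that fits how tarballs are created today is something I
 have not checked.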

 I'll think more about these issues (mainly the third one) and work more on
 the code as time permits. Grabbing the ticket, because it doesn't really
 make sense for somebody else to re-do what I did so far.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31204#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

