[tor-bugs] #31204 [Metrics/CollecTor]: Extend file objects in index.json to include descriptor types, publication times, and file digests

Fri Jul 19 07:06:05 UTC 2019

#31204: Extend file objects in index.json to include descriptor types, publication
times, and file digests
-----------------------------------+--------------------------
     Reporter:  karsten            |      Owner:  metrics-team
         Type:  enhancement        |     Status:  new
     Priority:  Medium             |  Milestone:
    Component:  Metrics/CollecTor  |    Version:
     Severity:  Normal             |   Keywords:
Actual Points:                     |  Parent ID:
       Points:                     |   Reviewer:
      Sponsor:                     |
-----------------------------------+--------------------------
 atagar suggested extending file objects in CollecTor's `index.json` to
 include descriptor types, publication times, and file digests.

 As of now, file objects in the `index.json` file have the following
 fields:

  - `"path"`: Relative path of the file.
  - `"size"`: Size of the file in bytes.
  - `"last_modified"`: Timestamp when the file was last modified, using the
 pattern `"YYYY-MM-DD HH:MM"` in UTC.
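
 For illustration, here is a minimal Java sketch of producing a
 `"last_modified"` value in that pattern. The class and method names are
 hypothetical and not taken from CollecTor. Note that the pattern letters in
 the field description are the human-readable form; in
 `DateTimeFormatter`, `YYYY` means week-based year and `DD` means day of
 year, so the sketch uses `yyyy-MM-dd HH:mm` instead.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper, not part of CollecTor. */
final class IndexTimestamps {

  // "YYYY-MM-DD HH:MM" from the field description, written with the
  // pattern letters that java.time expects, and pinned to UTC.
  private static final DateTimeFormatter LAST_MODIFIED =
      DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm").withZone(ZoneOffset.UTC);

  private IndexTimestamps() {}

  /** Formats a modification time given in milliseconds since the epoch. */
  static String formatLastModified(long epochMillis) {
    return LAST_MODIFIED.format(Instant.ofEpochMilli(epochMillis));
  }
}
```

 The argument could come from `File.lastModified()`, which returns
 milliseconds since the epoch; seconds are dropped by the pattern.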

 The new fields could be defined as follows, though this is very much
 subject to discussion on this ticket:

  - `"types"`: List of descriptor types as found in `@type` annotations of
 contained descriptors (optional).
  - `"first_published"`: Earliest published timestamp (or similar) of
 contained descriptors (optional).
  - `"last_published"`: Latest published timestamp (or similar) of
 contained descriptors (optional).
  - `"sha256"`: SHA-256 digest of the file, encoded as base64 (optional).
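
 To make two of the proposed fields more concrete, here is a rough Java
 sketch of how `"types"` and `"sha256"` might be computed for a single
 uncompressed descriptor file. Everything in it is an assumption for
 discussion: the class and method names are hypothetical, the `"types"`
 value is assumed to be the text following `@type ` (including the version
 number), and tarballs or compressed files are not handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical helper, not part of CollecTor. */
final class NewIndexFields {

  private NewIndexFields() {}

  /** Returns the distinct, sorted values of all "@type" lines in the file. */
  static List<String> types(Path file) throws IOException {
    TreeSet<String> found = new TreeSet<>();
    // ISO-8859-1 maps every byte to a character, so decoding cannot fail
    // on descriptors that contain non-UTF-8 bytes.
    try (BufferedReader reader =
        Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("@type ")) {
          found.add(line.substring("@type ".length()).trim());
        }
      }
    }
    return new ArrayList<>(found);
  }

  /** Returns the SHA-256 digest of the file's bytes as padded base64. */
  static String sha256Base64(Path file) throws IOException {
    MessageDigest digest;
    try {
      digest = MessageDigest.getInstance("SHA-256");
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform implementation.
      throw new IllegalStateException(e);
    }
    byte[] buffer = new byte[8192];
    try (InputStream in = Files.newInputStream(file)) {
      int read;
      while ((read = in.read(buffer)) != -1) {
        digest.update(buffer, 0, read); // stream, so large files fit in memory
      }
    }
    return Base64.getEncoder().encodeToString(digest.digest());
  }
}
```

 Both methods read the whole file, so in practice they would probably be
 combined into a single pass. The publication timestamps are left out
 because they depend on descriptor-specific parsing.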

 All of these new fields seem reasonable to add, and I don't see why we
 wouldn't want them. The index will get bigger, but that sounds
 acceptable. The coding effort is non-zero, which we should acknowledge.
 All in all, though, I don't see a blocker for doing this.

 Implementation note: What all these new fields have in common is that
 they're not plain file attributes that we can easily obtain from Java's
 `File` class. We'll have to open and read files in order to obtain them,
 and that's very time-consuming. I could see us doing this in a background
 thread (or thread pool) started by CollecTor's `CreateIndexJson.java`, with
 a state file of some sort to avoid reprocessing files that haven't
 changed. As long as this thread (pool) hasn't finished processing a file,
 the index would simply omit these new fields (not the file itself!), which
 is why the fields are defined as optional above.
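
 A rough sketch of that idea in Java, under several assumptions: the class
 and method names are hypothetical, a file counts as unchanged if its path,
 size, and modification time are unchanged, and the state is kept in memory
 only (writing it to a state file is left out). The index writer calls
 `lookup` for each file and omits the optional fields whenever the result is
 empty.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch, not part of CollecTor. V holds the extra fields. */
final class BackgroundFieldCache<V> {

  private final Map<String, V> done = new ConcurrentHashMap<>();
  private final Set<String> pending = ConcurrentHashMap.newKeySet();
  private final ExecutorService pool;
  private final Function<Path, V> reader; // must not return null

  BackgroundFieldCache(int threads, Function<Path, V> reader) {
    this.pool = Executors.newFixedThreadPool(threads);
    this.reader = reader;
  }

  /**
   * Returns the extra fields if this version of the file has already been
   * processed; otherwise schedules processing and returns empty.
   */
  Optional<V> lookup(Path file, long size, long lastModifiedMillis) {
    // A changed size or modification time yields a new key, so changed
    // files are reprocessed and unchanged files are not.
    String key = file + "|" + size + "|" + lastModifiedMillis;
    V value = done.get(key);
    if (value != null) {
      return Optional.of(value);
    }
    if (pending.add(key)) { // submit each file version at most once at a time
      pool.submit(() -> {
        try {
          done.put(key, reader.apply(file));
        } finally {
          pending.remove(key); // a failed read is retried on a later lookup
        }
      });
    }
    return Optional.empty();
  }

  /** Stops accepting new work and waits for scheduled work to finish. */
  void shutdownAndWait(long timeoutSeconds) throws InterruptedException {
    pool.shutdown();
    pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
  }
}
```

 With this shape, the first index run after a file appears lists the file
 without the new fields, and a later run picks them up once the background
 work has completed.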

 What else did I miss? atagar, please fill in any thoughts that I left out.

 Once we agree on the spec here, this could be a fine little project for a
 volunteer.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31204>