[tor-bugs] #19791 [Metrics/metrics-lib]: Use CollecTor's index.json for download; adapt current download to use new date format

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Aug 24 09:11:50 UTC 2016


#19791: Use CollecTor's index.json for download; adapt current download to use new
date format
---------------------------------+-----------------------------------
 Reporter:  karsten              |          Owner:  iwakeh
     Type:  defect               |         Status:  needs_revision
 Priority:  High                 |      Milestone:  metrics-lib 1.4.0
Component:  Metrics/metrics-lib  |        Version:
 Severity:  Normal               |     Resolution:
 Keywords:                       |  Actual Points:
Parent ID:                       |         Points:
 Reviewer:                       |        Sponsor:
---------------------------------+-----------------------------------
Changes (by karsten):

 * status:  needs_review => needs_revision


Comment:

 Alright, let's go through the last four commits in that branch:

 91fdb9b is already contained in master as 1dab40d.

 a74563a looks good and is merged to my branch (see below) as 7a0e109.

 1276c14 looks good, too, and is merged to my branch as e9c0731.  Thanks
 for caring about tests!

 32f628f is preliminary merged to my branch as 82eb1cf, but I have a couple
 of questions and change requests, as follows:

 Commit bf5ba78 in [https://gitweb.torproject.org/user/karsten/metrics-
 lib.git/log/?h=task-19791 my task-19791 branch] contains a few trivial
 fixes.  Please take a look and let me know if you're okay with those.

 Regarding the new subpackage `org.torproject.descriptor.index`, I'm yet
 unclear whether that's supposed to be "visible" to API users or not, that
 is, if they're supposed to import and/or use those classes directly.  If
 it is, that's going to grow the overall interface quite a bit, requiring
 us to keep the public parts stable in subsequent releases.  For example,
 I'm unclear whether CollecTor is supposed to instantiate its own `*Node`
 classes when writing its `index.json*` files.  Ideally, metrics-lib would
 hide away those classes by providing some kind of
 `DescriptorIndexGenerator` interface that takes a local directory as
 parameter and writes one or more `index.json[*]` files, but I'm not sure
 what your plan is there.  If the plan is to hide that package from users,
 how about we rename it to `org.torproject.descriptor.impl.index` to make
 that clearer?  We could even create similar implementation subpackages in
 the future to clean up the `impl` package a bit, but that's not something
 that users would have to care about.

 Can we avoid changing the parameter usage of
 `DescriptorIndexCollector#collectDescriptors()` by overloading that method
 in the interface?  How about we add a new method `collectDescriptor(String
 collecTorIndexUrl, String collecTorIndexPath, String[] remoteDirectories,
 ...)` and use an implementation-specific default for `collecTorIndexPath`
 in the current method, such as `"/index.html"` for
 `DescriptorCollectorImpl` and `"/index.json.xz"` for
 `DescriptorIndexCollector`?  If we retain backward compatibility here, I'd
 say we could switch to `DescriptorIndexCollector` in 1.5.0 when this class
 has seen some more testing.  Otherwise I'd suggest waiting for 2.0.0.

 The condition `if (!filepath.exists() && filepath.mkdirs())` in
 `DescriptorIndexCollector#fetchRemoteFiles()` looks wrong.  Shouldn't it
 be `!filepath.mkdirs()` there?  I didn't test this, but the API says that
 `mkdirs()` returns "true if and only if the directory was created, along
 with all necessary parent directories; false otherwise".

 Is `FileNode#lastModifiedMillis()` called often?  If so, it may be better
 to use `ParseHelper#getDateFormat()` to obtain a thread-safe `DateFormat`
 instance for the given format and store it in a map.  That avoids creating
 single-use instances of that class over and over.  We'd have to make that
 method public for this, but that shouldn't be an issue.  However, if it's
 not called very often, we can skip this.

 In `IndexNode#retrieveFiles()` I'm not yet clear what situations lead to
 returning the half-done map in the middle of the loop.  If one invalid
 remote directory leads to that case, I think we should either warn and
 skip that remote directory or return an empty map.  Otherwise, the order
 of remote directories would affect the result, and that might lead to bugs
 in user applications that are quite hard to find.

 Can we pick a smaller `test.json` and compressed equivalents in
 `src/test/resources/`?  For example, we could take the `index.json` from
 the main CollecTor instance and throw out all files that haven't been
 modified in the past, say, four weeks.  Or did you include a large file on
 purpose to check something that cannot be tested with a smaller file?

 I'd say once we have resolved these questions, it's good to go into 1.4.0,
 even today.  Thanks for working on this!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/19791#comment:10>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list