[tor-bugs] #18910 [Metrics/CollecTor]: distributing descriptors accross CollecTor instances

Thu Oct 13 19:23:14 UTC 2016

#18910: distributing descriptors accross CollecTor instances
-------------------------------+---------------------------------
 Reporter:  iwakeh             |          Owner:  iwakeh
     Type:  enhancement        |         Status:  needs_review
 Priority:  High               |      Milestone:  CollecTor 1.1.0
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:  ctip               |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------------------

Comment (by iwakeh):

 I think below I reply to all of [comment:25] except the paths structures,
 which will have to wait for the commit of the tests.
 Changes can be found in the new commit of the
 [https://gitweb.torproject.org/user/iwakeh/collector.git/commit/?h=task-18910
 -first-sync-2&id=aafedaa74e8451303e8cdaa882fd2821927d29ab second branch].

 > ...  - All classes in the new `persist` package contains unused imports
 or unused attributes.  Can you remove those in a fixup commit?

 Removed in second branch.

 >
 >  - Can we somehow use `printf` patterns for putting together paths to
 make that bunch of `DescriptorPersistence` classes easier to read?  It
 should be possible to express all paths using placeholders for
 dates/times, strings, and chars.  And if we care about path separators
 being platform independent, we could split the resulting path at `/`
 characters and put it back together using `Paths.get()`.  Untested example
 for a bridge extra-info descriptor:
 >
 > {{{
 > String.format("bridge-descriptors/%tY/%<tm/extra-infos/%c/%c/%s",
 >     desc.getPublishedMillis(),
 >     desc.getExtraInfoDigest().charAt(0),
 >     desc.getExtraInfoDigest().charAt(1),
 >     desc.getExtraInfoDigest());
 > }}}
 >

 What makes this more readable to me, is the code formatting, i.e. one line
 per path element.  Whereas, I don't read string format patterns that
 often, which makes it less readable to me.  (String.format is also locale
 dependent and would have to have an additional argument. )
 Added the one line per path element to the new branch as well as some
 comments for the path elements. Hope this works for you too?

 >  - Building upon the previous idea, maybe we can avoid having
 `DescriptorPersistence` classes at all if all we need is a type annotation
 and two `printf` patterns.  We could define an order of `printf` arguments
 for all patterns, like: published, received, source, digest, first char of
 digest, second char of digest.  That would really save a few lines of
 code, wouldn't it?  Maybe something for later?

 The path deriving code surely can be done nicer.  The main goal for the
 current code was to extract the somewhat hidden (in huge methods/classes)
 path structures and have them in a defined location, which opens the door
 for future simplifications.
 Having persistence classes might be useful for other types of persistence
 or the statistics (cf. #20335, #20345).

 > ...
 >  - In `ConsensusPersistence`, the check for `null ==
 desc.getConsensusFlavor()` seems too fragile with respect to flavors we'll
 add in the future.  We wouldn't notice except for microdesc consensuses
 being overwritten.  We should check the actual consensus flavor if it's
 not `null`.

 If flavor is something other than "microdesc", the consensus type will be
 plain consensus.

 >
 >  - In `DescriptorPersistence`, I wonder if we should not accept
 `StandardOpenOption` or even `StandardOpenOption ...` to distinguish
 replace vs. append but rather just a boolean.  It seems like an
 implementation level detail we're handing out to the caller.  Are we sure
 we're implementing all options and in particular combination of options
 correctly?  Those checks where we're only looking at how many such option
 values we're being given look really fragile.  If we only want to support
 replace vs. append, we should only provide those two options in the
 interface.
 >
 >  - Same class, method `storeAll()` should be more explicit in the method
 name or at least documentation that `recent` will only be written if
 writing `archive` is successful.
 >

 The StandardOpenOptions are defined in the java api, so there is no
 implementation to be done.  There are more than two options of interest
 (afaict): create_new, append, and truncate_existing, i.e. do not do
 anything in case the file exists, append, and replace.
 I added more javadoc and overloaded the methods, so we have methods w/o
 options, and other methods with exactly one option per file.

 >  - Same class, methods `storeRecent()` and `storeOut()` might have wrong
 argument orders in their `Paths.get()` calls, but I'm not sure.

 I renamed the parameters to `recentRoot` and `outRoot` to better express
 their meaning.

 >
 >  - In `PersistenceUtils`, method `storeToFileSystem()`, that check for
 `append.length == 0` could be changed to `append.length <= i` to make the
 method a tiny bit more robust against wrong usage.

 I thought the same, so the method wasn't in use and is now removed.

 >
 >  - Same class, are the `dateDependentPath*()` methods used anywhere?  If
 not and if there are no plans to use them, can you remove them?

 Those are gone in the second branch.

 > ...
 >  - The following didn't become entirely clear from reading the code, so
 let me ask: are files written to temporary files like `.somefile.tmp` and
 later renamed to `somefile`?  We should make sure we're doing that for
 files in `recent/` that we append descriptors, because otherwise those
 half-written files will be indexed and made available, which we should
 avoid.  Basically, if a file may change after being written, let's write
 it to a temp file and only rename it to its final file name when we're
 sure it won't change anymore.

 Files for 'recent' are first written to a temporary file and then moved to
 the final location.

 >  - Similar to the previous comments, I found a few formattings that
 should have been caught by checkstyle.

 There were no complaints, checked again for the changes for the commit on
 the second branch.

 Regarding @type annotations:  I introduced an `Annotation` enum, which
 contains all @type annotations.  All occurances of @type strings are
 replaced.  So, now these annotations are in one place.  But, don't they
 actually belong to metrics-lib?  There the parser could check for unknown
 annotation versions or types.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/18910#comment:45>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online