[tor-bugs] #20228 [Metrics/CollecTor]: Append all votes with same valid-after time to a single file in `recent/`

Tor Bug Tracker & Wiki blackhole at torproject.org
Wed Oct 5 12:19:25 UTC 2016


#20228: Append all votes with same valid-after time to a single file in `recent/`
-------------------------------+---------------------
 Reporter:  karsten            |          Owner:
     Type:  enhancement        |         Status:  new
 Priority:  High               |      Milestone:
Component:  Metrics/CollecTor  |        Version:
 Severity:  Normal             |     Resolution:
 Keywords:                     |  Actual Points:
Parent ID:                     |         Points:
 Reviewer:                     |        Sponsor:
-------------------------------+---------------------
Changes (by karsten):

 * priority:  Medium => High


Comment:

 I'd like us to move forward here, ideally with descriptors grouped by
 download time and both of us being fully convinced that it's the best way
 forward. :)

 So, let me give you some background on where the `recent/` folder comes
 from.

 A few years back, there was just the `archive/` folder with tarballs that
 were updated every few days.  All services like Tor Metrics, ExoneraTor,
 and Onionoo were running on the same host as CollecTor and using
 CollecTor's directory structure for importing new descriptors.  This was
 very convenient for running these services, but of course very fragile,
 and it made it impossible for others to run similar services.  That's when
 I turned CollecTor into its own service.

 The new CollecTor service had a local directory called `rsync/`, the
 predecessor of `recent/`, which had just the newest files that other
 services would download via `rsync` rather than http.  The idea was to
 provide the latest 72 hours of descriptors, so that services can miss
 updates for up to 3 days (a weekend) without having to fall back to
 importing tarballs from the `archive/` directory.  This fixed the problem
 of running all services on one machine, but it didn't allow others to run
 services.  We quickly learned that rsyncing thousands or even hundreds of
 thousands of files did not scale, so we appended many small descriptors
 into one file per CollecTor update run.
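
 To illustrate the appending step, here is a minimal sketch in Python.  It
 is not CollecTor's actual code: the function names and the file-name
 scheme are hypothetical, and it assumes each descriptor is preceded by an
 `@type` annotation line so that readers can split the file again later.

```python
from datetime import datetime, timezone
from pathlib import Path

# Assumption: votes are stored with this annotation line in front of them.
VOTE_ANNOTATION = "@type network-status-vote-3 1.0\n"


def run_file_name(download_time: datetime) -> str:
    """Name the file after the time of the update run (hypothetical scheme)."""
    return download_time.strftime("%Y-%m-%d-%H-%M-%S") + "-votes"


def append_descriptors(recent_dir: Path, download_time: datetime,
                       descriptors: list[str]) -> Path:
    """Append all descriptors fetched in one update run to a single file."""
    recent_dir.mkdir(parents=True, exist_ok=True)
    target = recent_dir / run_file_name(download_time)
    with target.open("a", encoding="utf-8") as out:
        for descriptor in descriptors:
            out.write(VOTE_ANNOTATION)
            # Make sure every descriptor ends with a newline before the next one.
            out.write(descriptor if descriptor.endswith("\n") else descriptor + "\n")
    return target
```

 With this, a client transfers one file per update run instead of one file
 per descriptor.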

 At some point we made that `rsync/` directory available via http as
 `recent/` and taught Onionoo et al. to download descriptors from there
 instead of relying on a local `rsync` command to magically fetch them.
 This is when other services could first enter the game.  It's also when
 users started browsing the `recent/` directory to have an easy way to
 download descriptors---but that was mostly coincidence and a nice side
 effect.

 Now we're considering changing the directory structure to make it even
 more efficient for services to keep up to date.  Merging votes into single
 files reduces the `index.json*` size while keeping the service exactly as
 useful for other services.  Something that we'll make a bit more difficult
 is accessibility for humans, because they cannot locate a vote as easily
 anymore.

 Also consider a feature request that people ask for every so often:
 provide a search for raw descriptors.  This is something that folks like
 directory authority operators or others who debug the network would find
 really useful.  And these folks might be sad that votes are appended to
 single files and stored by download time rather than valid-after time.
 But it's again coincidence that votes are easily locatable by valid-after
 time.  On the other hand, if a user searches for something different, like
 a relay fingerprint or IP address, they'll likely have to download the
 latest few votes and search locally.
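
 Such a local search could look roughly like the following Python sketch.
 It assumes the downloaded file is a concatenation of descriptors that each
 start with an `@type` annotation line; the function names are made up for
 this example, and the matching is a plain substring test rather than real
 descriptor parsing.

```python
def split_descriptors(text: str) -> list[str]:
    """Split a concatenated file into descriptors at each "@type" line."""
    descriptors = []
    current = []
    for line in text.splitlines(keepends=True):
        # A new annotation line closes the descriptor collected so far.
        if line.startswith("@type ") and current:
            descriptors.append("".join(current))
            current = []
        current.append(line)
    if current:
        descriptors.append("".join(current))
    return descriptors


def find_descriptors(text: str, needle: str) -> list[str]:
    """Return every descriptor containing the needle, e.g. an IP address."""
    return [d for d in split_descriptors(text) if needle in d]
```

 A user would run `find_descriptors()` over each of the latest few files
 and look at whatever comes back.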

 So, we might even go one step further and store ''all'' descriptors in the
 `recent/` folder by download time.  That would include consensuses, of
 which there is usually only one per CollecTor update run.  The upside would
 be that it'd become more obvious that all files contain the download time,
 not the published or valid-after time.

 All in all, I'd like to consider the `recent/` folder as an update channel
 for services rather than something that humans browse.  I'm not going to
 stop them from doing that, but I'm very hesitant to make the original use
 case of that directory less useful by supporting this new use case.
 Grouping files by valid-after time would do exactly that, by forcing
 services to download multiple files containing many descriptors they
 already know.
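
 To make the update-channel idea concrete, here is a small Python sketch of
 what a service does on each run.  The flat list of `{"path": ...}` entries
 is a simplified stand-in for the real `index.json*` contents, and the
 function name is hypothetical.

```python
def files_to_fetch(index_files: list[dict], already_fetched: set[str]) -> list[str]:
    """Return the paths listed in the index that the service has not fetched yet."""
    new_paths = [
        entry["path"]
        for entry in index_files
        if entry["path"] not in already_fetched
    ]
    # Sorting keeps the download order stable between runs.
    return sorted(new_paths)
```

 If files are written once per update run and never appended to later, every
 path returned here contains only descriptors the service has not seen
 before.  If files were instead grouped by valid-after time and kept growing,
 the service would also have to compare sizes or timestamps and re-download
 files it has mostly processed already.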

 Somebody should go and write a descriptor database that takes CollecTor's
 `recent/` folder as input and provides a search interface that returns raw
 descriptors.

 I hope this makes sense.  Please let me know if it doesn't!  And thanks
 for reading this wall of text. ;)

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20228#comment:5>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

