[tor-bugs] #13600 [Onionoo]: Improve bulk imports of descriptor archives

Wed Oct 29 11:39:24 UTC 2014

#13600: Improve bulk imports of descriptor archives
-------------------------+---------------------
 Reporter:  karsten      |          Owner:
     Type:  enhancement  |         Status:  new
 Priority:  normal       |      Milestone:
Component:  Onionoo      |        Version:
 Keywords:               |  Actual Points:
Parent ID:               |         Points:
-------------------------+---------------------
 We need to improve bulk imports of descriptor archives.  Whenever somebody
 wants to initialize Onionoo with existing data, they'll need to process
 years of descriptors.  The current code is not at all optimized for that;
 it's designed for running once per hour and updating things as quickly
 as possible.  Let's fix that and support bulk imports better.

 Here's what we should do:

  - We define a new directory `in/archive/` where operators can put
 descriptor archives fetched from CollecTor.  Whenever there are files in
 that directory we import them first (before descriptors in `in/recent/`).
 In particular, we iterate over files twice: in the first iteration we look
 at the first contained descriptor to determine its type, and in the second
 iteration we parse files containing server descriptors and then files
 containing other descriptors.  (This order is important for computing
 advertised bandwidth fractions, which only works if we parse server
 descriptors before consensuses.)  This process will take a very long time,
 so we should log whenever we complete a tarball, and ideally we'd print out
 how many tarballs we have already parsed and how many more we need to parse.
  - We add a new command-line switch `--update-only` for only updating
 status files and not downloading descriptors or writing document files.
 Operators could then import archives, which would take days or even weeks,
 and then switch to downloading and processing recent descriptors.  My
 branch task-12651-2 is a major improvement here, because it ensures that
 ''all'' documents will be written once the bulk import is done, not just
 the ones for relays and bridges that were contained in recent descriptors.
 Future command-line options would be `--download-only` and `--write-only`
 for the other two phases, and `--single-run`, which does what is currently
 the default and becomes useful once we switch from being called by cron
 every hour to scheduling our own hourly runs internally.
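
 To illustrate the two-pass ordering and the progress logging described in
 the first item, here is a minimal sketch.  It is not taken from the Onionoo
 code base: the class and method names are hypothetical, the file names are
 only examples, and it assumes that the first line of the first contained
 descriptor is a CollecTor-style `@type` annotation.  Whether bridge server
 descriptors need the same treatment as relay server descriptors is also an
 assumption here.  Reading the tarballs themselves is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/* Hypothetical sketch; none of these names exist in Onionoo. */
public class ArchiveImportOrder {

  /* First pass: decide from the first line of the first contained
   * descriptor whether a file holds server descriptors.  Assumes a
   * line like "@type server-descriptor 1.0"; treating
   * "bridge-server-descriptor" the same way is an assumption. */
  static boolean isServerDescriptor(String firstLine) {
    if (firstLine == null || !firstLine.startsWith("@type ")) {
      return false;
    }
    String[] parts = firstLine.split(" ");
    return parts.length >= 2 && parts[1].endsWith("server-descriptor");
  }

  /* Second pass order: files with server descriptors first, then all
   * other files, keeping the original order within each group. */
  static List<String> importOrder(Map<String, String> firstLineByFile) {
    List<String> serverFiles = new ArrayList<>();
    List<String> otherFiles = new ArrayList<>();
    for (Map.Entry<String, String> e : firstLineByFile.entrySet()) {
      if (isServerDescriptor(e.getValue())) {
        serverFiles.add(e.getKey());
      } else {
        otherFiles.add(e.getKey());
      }
    }
    List<String> ordered = new ArrayList<>(serverFiles);
    ordered.addAll(otherFiles);
    return ordered;
  }

  /* Progress message to log after each completed tarball. */
  static String progressLine(int done, int total, String fileName) {
    return String.format("Parsed %d of %d tarballs (%d remaining): %s",
        done, total, total - done, fileName);
  }

  public static void main(String[] args) {
    /* Example input: file name mapped to the first line found in it. */
    Map<String, String> files = new LinkedHashMap<>();
    files.put("consensuses-2014-09.tar",
        "@type network-status-consensus-3 1.0");
    files.put("server-descriptors-2014-09.tar",
        "@type server-descriptor 1.0");
    files.put("extra-infos-2014-09.tar", "@type extra-info 1.0");

    List<String> ordered = importOrder(files);
    for (int i = 0; i < ordered.size(); i++) {
      /* Parsing of ordered.get(i) would happen here. */
      System.out.println(progressLine(i + 1, ordered.size(),
          ordered.get(i)));
    }
  }
}
```

 With the example input, the server descriptor tarball is handled before
 the consensus tarball even though it was listed second, which is the
 property that the advertised bandwidth fractions depend on.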

 I somewhat expect us to run into memory problems when importing months or
 even years of data at once.  So, part of the challenge here will be to
 keep an eye on memory usage and fix any memory issues.
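
 One simple way to keep an eye on memory usage during a long import would
 be to log heap usage after each completed tarball.  The following is only
 a sketch under that assumption; the class and method names are
 hypothetical, and it relies on nothing beyond the standard
 `java.lang.Runtime` methods.

```java
/* Hypothetical sketch; not part of Onionoo. */
public class MemoryLog {

  private static final long MIB = 1024L * 1024L;

  /* Heap currently in use by the JVM, in MiB. */
  static long usedMebibytes() {
    Runtime rt = Runtime.getRuntime();
    return (rt.totalMemory() - rt.freeMemory()) / MIB;
  }

  /* Upper bound the JVM will attempt to use for the heap, in MiB. */
  static long maxMebibytes() {
    return Runtime.getRuntime().maxMemory() / MIB;
  }

  /* Log line to print after each completed tarball. */
  static String memoryLine(long usedMib, long maxMib) {
    return String.format("Heap in use: %d MiB of %d MiB max",
        usedMib, maxMib);
  }

  public static void main(String[] args) {
    System.out.println(memoryLine(usedMebibytes(), maxMebibytes()));
  }
}
```

 A steadily growing "in use" figure across tarballs would hint at state
 that is kept in memory longer than necessary.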

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13600>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online