[tor-bugs] #13600 [Onionoo]: Improve bulk imports of descriptor archives

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Jun 23 08:44:56 UTC 2015


#13600: Improve bulk imports of descriptor archives
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by karsten):

 Replying to [comment:9 leeroy]:
 > No problem. It looks like this branch and deployed Onionoo produce
 slightly different results when processing the same data set (recent 73h).
 I attach a sample (onionoo_k is this branch). I'll test some multiple
 archive imports on this branch.
 >
 > In ''status'':
 >
 >  * The timestamp (?) after the country code is sometimes set to -1.

 I think this one is harmless.  If you're curious, you can read more about
 this by reading the comment in NodeStatus starting with "This is a
 (possibly surprising) hack...".

 > In ''out'':
 >
 >  * Some bandwidth documents have an extra value.

 This one should be harmless, too.  This has to do with running the hourly
 updater at a later time and compressing bandwidth intervals lying farther
 in the past.  We simply don't need the 15-minute precision anymore when
 we're outside of the 3-day graph interval.  There would be similar
 compressions once we're outside the 1-week, 1-month, etc. intervals.
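
 For illustration, that kind of compression can be sketched roughly as
 follows.  This is a hypothetical standalone example, not Onionoo's
 actual code; it just sums 15-minute byte counts into hour-aligned
 buckets, which is why re-running the updater later can legitimately
 change the values in older bandwidth documents:

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/* Hypothetical illustration only: sum 15-minute bandwidth byte counts
 * into hour-aligned buckets.  This mimics the kind of compression that
 * makes bandwidth intervals lose their 15-minute precision once they
 * fall outside the 3-day graph interval. */
public class BandwidthCompressor {

  private static final long ONE_HOUR_MILLIS = 60L * 60L * 1000L;

  /* Keys are interval-start timestamps in milliseconds, values are
   * bytes transferred in that 15-minute interval. */
  public static SortedMap<Long, Long> compressToHours(
      SortedMap<Long, Long> quarterHourIntervals) {
    SortedMap<Long, Long> hourIntervals = new TreeMap<>();
    for (Map.Entry<Long, Long> e : quarterHourIntervals.entrySet()) {
      /* Snap each interval start down to the containing hour. */
      long hourStart = e.getKey() - (e.getKey() % ONE_HOUR_MILLIS);
      hourIntervals.merge(hourStart, e.getValue(), Long::sum);
    }
    return hourIntervals;
  }
}
```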

 > __Importing multiple months:__ I know Onionoo can, because I tested it
 (testing it on this branch), but should it be encouraged? The current load
 on memory is rather high. If someone tries to import a year of archives at
 once, can the current heap dependency be guaranteed not to induce a
 failure?  Maybe this won't be that big a deal. Just warn the operator to
 limit the number of months at a time until other tickets deal with the
 heap load. Something to add to the documentation?

 Yes, this is something we could add to the documentation.  Unfortunately,
 reducing memory requirements enough to import multiple months or even
 years of descriptors is tough, because that's a very different use case
 from running the updater once per hour with only one hour of descriptors.
 When in doubt, I optimized the process in favor of the hourly update
 process.  That's why I'd prefer to add a warning to the documentation.
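
 Until heap usage improves, an operator could split a bulk import into
 monthly batches by hand.  A hypothetical helper along these lines could
 group archives by month; the "YYYY-MM" naming pattern is an assumption
 based on CollecTor-style tarball names, not something Onionoo provides:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/* Hypothetical helper: group archive file names by the "YYYY-MM" part
 * of CollecTor-style tarball names, so an operator can import one month
 * per updater run and keep peak heap usage bounded. */
public class MonthlyBatcher {

  private static final Pattern MONTH = Pattern.compile("\\d{4}-\\d{2}");

  public static SortedMap<String, List<String>> groupByMonth(
      List<String> fileNames) {
    SortedMap<String, List<String>> batches = new TreeMap<>();
    for (String name : fileNames) {
      Matcher m = MONTH.matcher(name);
      String month = m.find() ? m.group() : "unknown";
      batches.computeIfAbsent(month, k -> new ArrayList<>()).add(name);
    }
    return batches;
  }
}
```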

 > __Input validation:__ I saw metrics-lib included some packages for
 compressed file handling so I tried importing from .xz instead of tarball.
 Some validation of the input archives might be worthwhile. Bad things will
 happen to the log when this is attempted.

 True!  I just created #16424 to support importing .xz-compressed
 tarballs.  In general, Onionoo is not very robust against invalid input
 provided by the ''service operator'', because so far the service
 operator has also been the main developer.  But let's try to fix that
 and make it
 more robust, if we can.
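
 A simple guard along these lines could reject unsupported archive names
 up front instead of filling the log with parse errors.  This is just a
 sketch: the suffix list is an assumption, and adding ".tar.xz" is what
 #16424 asks for:

```java
/* Hypothetical input check: accept only archive names with suffixes the
 * importer actually understands, and complain early instead of letting
 * the parser fail later.  The suffix list is an assumption; ".tar.xz"
 * reflects the support requested in #16424. */
public class ArchiveInputValidator {

  private static final String[] SUPPORTED_SUFFIXES = { ".tar", ".tar.xz" };

  public static boolean isSupportedArchive(String fileName) {
    for (String suffix : SUPPORTED_SUFFIXES) {
      if (fileName.endsWith(suffix)) {
        return true;
      }
    }
    return false;
  }
}
```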

 > __Parsing archives:__ Parse history doesn't include archives, and
 archives aren't removed after parsing. DescriptorDownloader cannot now
 remove the archives (current behavior) because it only considers the
 recent folder.

 Oh, I don't think Onionoo should remove tarballs from the archive
 directory after parsing them, because it didn't place them there
 beforehand.  What we could do, however, is add a parse history for files
 in the archive directory; see the newly created #16426.
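
 Such a parse history could be as simple as remembering each archive
 file's last-modified time and skipping files that haven't changed.  A
 hypothetical in-memory sketch follows; the actual design for #16426 may
 well differ, and a real implementation would persist the map to disk:

```java
import java.util.HashMap;
import java.util.Map;

/* Hypothetical in-memory sketch of a parse history for the archive
 * directory (the real #16426 design may differ): remember each file's
 * last-modified time and skip files that haven't changed since the
 * last run. */
public class ArchiveParseHistory {

  private final Map<String, Long> lastParsed = new HashMap<>();

  /* A file needs parsing if we've never seen it, or if it changed
   * after we last parsed it. */
  public boolean needsParsing(String path, long lastModifiedMillis) {
    Long seen = lastParsed.get(path);
    return seen == null || seen < lastModifiedMillis;
  }

  public void recordParsed(String path, long lastModifiedMillis) {
    lastParsed.put(path, lastModifiedMillis);
  }
}
```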

 > __Parsing archives:__ If --single-run or --update-only is used with
 archives that have ''already been parsed'', they will be parsed again.
 This leads to a change in the size of the status folder. It becomes
 smaller for the same number of archive-sourced files. I didn't try to
 determine the reason for this change at the time. I intend to revisit this
 potential problem to see if the same thing happens, and why. It might be
 interesting if the change also happens during re-processing of recent data
 (which may happen when restoring a backup of data).

 It would be interesting to learn more about that directory becoming
 smaller.  For now, I'll assume it's related to the differences stated
 above.  But if you spot an actual bug there, please mention it here or
 open a new ticket.

 Thanks for trying this out and sending feedback here!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13600#comment:10>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

