[tor-bugs] #13600 [Onionoo]: Improve bulk imports of descriptor archives

Tor Bug Tracker & Wiki blackhole at torproject.org
Sun Jun 28 00:38:32 UTC 2015


#13600: Improve bulk imports of descriptor archives
-----------------------------+-----------------
     Reporter:  karsten      |      Owner:
         Type:  enhancement  |     Status:  new
     Priority:  normal       |  Milestone:
    Component:  Onionoo      |    Version:
   Resolution:               |   Keywords:
Actual Points:               |  Parent ID:
       Points:               |
-----------------------------+-----------------

Comment (by leeroy):

 Thank you for clearing up the slight differences you mentioned. I was
 hoping those were minor. There were other differences, but they were
 clearly trivial (such as omitting rDNS, or using the IP address when rDNS
 is unresolved). I'll take another look at the code in NodeStatus.

 __Input validation:__ Excellent, I was thinking this too! If extra
 validation is going to be performed, it's also worth looking into
 streaming data directly from the archives. I suspect this would be a
 significant advantage, because there would no longer be any need for
 extra disk space to hold the uncompressed tarball.
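
 As a minimal sketch of that idea (assuming Apache Commons Compress plus
 XZ for Java are available; ArchiveStreamReader and handleDescriptorFile()
 are made-up names, and the hook is where the existing descriptor parsing
 would take over), each tar entry can be read straight out of the
 compressed stream:

 {{{
 #!java
 import java.io.BufferedInputStream;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.io.InputStream;

 import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
 import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
 import org.apache.commons.compress.compressors.xz.XZCompressorInputStream;
 import org.apache.commons.compress.utils.IOUtils;

 /** Streams descriptor files out of a .tar.xz archive one entry at a
  *  time, so the tarball never has to be uncompressed to disk. */
 public class ArchiveStreamReader {

   public static void readArchive(String tarXzPath) throws IOException {
     try (InputStream fileIn = new BufferedInputStream(
             new FileInputStream(tarXzPath));
         InputStream xzIn = new XZCompressorInputStream(fileIn);
         TarArchiveInputStream tarIn = new TarArchiveInputStream(xzIn)) {
       TarArchiveEntry entry;
       while ((entry = tarIn.getNextTarEntry()) != null) {
         if (entry.isDirectory()) {
           continue;
         }
         /* Only the current entry is held in memory; the rest of the
          * archive stays compressed on disk. */
         byte[] rawDescriptorBytes = IOUtils.toByteArray(tarIn);
         handleDescriptorFile(entry.getName(), rawDescriptorBytes);
       }
     }
   }

   /* Hypothetical hook: hand the raw bytes to the existing parsing code
    * instead of printing a summary. */
   private static void handleDescriptorFile(String name, byte[] raw) {
     System.out.println(name + ": " + raw.length + " bytes");
   }
 }
 }}}

 The same approach should work for .tar.bz2 archives by swapping in
 BZip2CompressorInputStream.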

 __Parsing archives:__ Sounds good. I was thinking of at least warning the
 operator about an accumulation of archives, but with #16424 this isn't as
 much of a problem.

 __Importing multiple months:__ I was testing this while also looking into
 reproducing the smaller directory of parsed data. I got the out-of-memory
 heap error while using --update-only with '''two''' months of data. It
 occurred at approximately 80% of the run (judging by elapsed time), during
 consensus parsing (judging by the stack trace). So parsing itself is very
 sensitive to heap memory. I have some thoughts on how to solve this.
 Besides the disk-based data structures to reduce heap dependency, I'll
 take another look at metrics-lib to see whether it could benefit from
 lexer/parser improvements. The heap dependency during parsing could be
 reduced, while also improving ease of maintenance, by using a
 grammar-based recognizer, streaming reads (from the archives), and
 lock-free (CAS) lists. Done right, this gives a parse stage that scales
 with I/O and combines parsing and writing, which reduces the heap
 requirement.
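
 To illustrate the lock-free hand-off, here's a rough sketch (the class
 name ParseWritePipeline, the soft limit, and the String item type are all
 placeholders) of a CAS-based queue sitting between a streaming parse
 stage and a writer stage, so parsed data is flushed to disk instead of
 piling up on the heap:

 {{{
 #!java
 import java.io.IOException;
 import java.io.Writer;
 import java.util.concurrent.ConcurrentLinkedQueue;
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicInteger;
 import java.util.concurrent.locks.LockSupport;

 /** Hands parsed items from a streaming reader to a writer through a
  *  lock-free (CAS-based) queue, so parsed data is written out as soon
  *  as it is produced instead of accumulating on the heap. */
 public class ParseWritePipeline {

   private static final int SOFT_LIMIT = 10_000;

   private final ConcurrentLinkedQueue<String> queue =
       new ConcurrentLinkedQueue<>();
   private final AtomicInteger queued = new AtomicInteger(0);
   private final AtomicBoolean parsingDone = new AtomicBoolean(false);

   /* Parse stage: the reader thread calls this per parsed item; it backs
    * off briefly when the writer falls behind, so heap use stays bounded. */
   public void offerParsed(String parsedLine) {
     while (queued.get() > SOFT_LIMIT) {
       LockSupport.parkNanos(1_000_000L); // ~1 ms back-off
     }
     queue.offer(parsedLine);
     queued.incrementAndGet();
   }

   /* Called once by the reader thread after the last item is offered. */
   public void finishParsing() {
     parsingDone.set(true);
   }

   /* Write stage: a separate thread drains the queue straight to disk. */
   public void drainTo(Writer out) throws IOException {
     while (!parsingDone.get() || !queue.isEmpty()) {
       String line = queue.poll();
       if (line == null) {
         LockSupport.parkNanos(1_000_000L); // nothing to write yet
         continue;
       }
       queued.decrementAndGet();
       out.write(line);
       out.write('\n');
     }
     out.flush();
   }
 }
 }}}

 ConcurrentLinkedQueue uses CAS internally, so the two stages never block
 each other on a lock; the soft limit just keeps the parse stage from
 outrunning the writer and re-creating the heap problem.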

 __Parsing archives:__ Due to the out-of-memory error I restarted this
 test with a smaller data set. I too hope it's harmless, but having seen
 it I don't want to rule it out until I can prove otherwise. I'll notify
 you here once I know for sure.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/13600#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

