[tor-bugs] #6569 [Stem]: Stem network status document parsing

Thu Aug 9 13:29:23 UTC 2012

#6569: Stem network status document parsing
--------------------+-------------------------------------------------------
 Reporter:  neena   |          Owner:  neena       
     Type:  task    |         Status:  needs_review
 Priority:  normal  |      Milestone:              
Component:  Stem    |        Version:              
 Keywords:          |         Parent:              
   Points:          |   Actualpoints:              
--------------------+-------------------------------------------------------
Changes (by neena):

  * status:  needs_revision => needs_review


Comment:

 Replying to [comment:2 atagar]:
 > > stem/descriptor/__init__.py
 > > +  import stem.descriptor.networkstatus_descriptor
 > Wrong import and mix of new/old usage. I'm guessing that the tests don't
 cover parse_file()? If it is being tested then this might be due to a
 phantom networkstatus_descriptor.pyc (I hate it when that happens).
 Yeah, it was a phantom pyc. fixed.

 > > +class DescriptorParser:
 > > +  """
 >
 > Needs to extend object. I'm also kinda worried about how it holds on to
 a copy of both _raw_content and lines. Ideally it would have neither. I
 realize that the Descriptor class has a _raw_content attribute for its
 __str__() method. In DescriptorParser's case though _raw_content is
 presently unused and it should be lazily fetched since we're dealing with
 *far* larger amounts of data here. My suggestion would be for the
 constructor to accept a file object, then read from it on demand. This
 will let us have constant memory usage.
 >
 > Speaking of which does this now work when processing a full consensus?
 I'm not spotting the test we talked about for this though maybe I'm
 missing it...
 It does. The test_cached_consensus is parsing the full consensus. I set it
 to fail if it uses > 100m, and it failed randomly a few times, so I made
 that 200m.
 Though, I still don't know why :/

 > This style of helper class generally hurts code readability. I made the
 ControlLine class because controller responses are generally structured
 like a stack, and I wanted to avoid the numerous regexes that TorCtl uses
 to handle it. That said, I'd love to get rid of the ControlLine class if
 we had a good alternative.
 Initially, I started with editing the existing helper methods, but, it
 seemed easier to do this then.

 > So in summary:
 > * I'd suggest lazily evaluating what you need from a file object. See
 the server or extrainfo descriptor's parse_file() functions - they read
 from a file line by line and yield descriptors so its memory usage isn't
 greater than a single descriptor.
 Done.

 > * I'd also suggest replacing this helper class with helper functions
 that give you back python primitives. This module has several such helpers
 that we might be able to expand to meet your purposes.
 Agreed, and done. I cheated by using StringIO objects in most places. But,
 I don't think that should affect performance too much. I wouldn't mind
 replacing them with lists of strings though.

 > Thoughts?
 >
 > > stem/descriptor/networkstatus.py
 > > +  :var bool is_valid: **\*** router is valid
 > > +  :var bool is_guard: **\*** router is suitable for use as an entry
 guard
 > > +  :var bool is_named: **\*** router is named
 >
 > Interesting, I hadn't thought of doing a series of boolean flags for
 this. However, I'd suggest adding a Flags enum and making this a tuple
 instead. Users can do the same with fewer methods and also query the
 RouterDescriptor for all of its flags (which you can't do with these
 booleans). This would look like...
 Yeah... this definitely makes a lot more sense.


 Replying to [comment:3 atagar]:
 > Oh, forgot to mention - with the flags we'll need both a 'flags' and
 'unknown_flags' attribute. Anything that isn't in the Flag enum would go
 into 'unknown_flags'. We should also assert in the test_cached_consensus()
 that this is empty so we're alerted if Tor adds a new flag.
 Since we aren't doing anything special with the flags, we could just reuse
 the known_flags data from the NetworkStatusDocument. Folks then won't have
 to update stem when a new flag is added.

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6569#comment:4>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online