[tor-bugs] #6569 [Stem]: Stem network status document parsing

Tor Bug Tracker & Wiki torproject-admin at torproject.org
Tue Aug 7 15:03:37 UTC 2012


#6569: Stem network status document parsing
--------------------+-------------------------------------------------------
 Reporter:  neena   |          Owner:  neena         
     Type:  task    |         Status:  needs_revision
 Priority:  normal  |      Milestone:                
Component:  Stem    |        Version:                
 Keywords:          |         Parent:                
   Points:          |   Actualpoints:                
--------------------+-------------------------------------------------------
Changes (by atagar):

  * status:  needs_review => needs_revision


Comment:

 Hi Ravi. Great to see that this is progressing!

 > stem/descriptor/__init__.py
 > +  import stem.descriptor.networkstatus_descriptor

 Wrong import and mix of new/old usage. I'm guessing that the tests don't
 cover parse_file()? If it is being tested then this might be due to a
 phantom networkstatus_descriptor.pyc (I hate it when that happens).

 > +def _strptime(string, validate = True, optional = False):
 > +  try:
 > +    return datetime.datetime.strptime(string, "%Y-%m-%d %H:%M:%S")
 > +  except ValueError, exc:
 > +    if validate or not optional: raise exc

 Please end with "else: return None". I know this is the default behavior
 when there isn't a return, but explicit is better than implicit. :)

 > +class DescriptorParser:
 > +  """
 > +  Helper class to parse documents.
 > +
 > +  :var str line: current line to be being parsed
 > +  :var list lines: list of remaining lines to be parsed
 > +  """

 Needs to extend object. I'm also kinda worried about how it holds on to a
 copy of both _raw_content and lines. Ideally it would have neither. I
 realize that the Descriptor class has a _raw_content attribute for its
 __str__() method. In DescriptorParser's case though _raw_content is
 presently unused and it should be lazily fetched since we're dealing with
 *far* larger amounts of data here. My suggestion would be for the
 constructor to accept a file object, then read from it on demand. This
 will let us have constant memory usage.

 Speaking of which does this now work when processing a full consensus? I'm
 not spotting the test we talked about for this though maybe I'm missing
 it...

 This class reminds me a lot of the ControlLine class. I didn't need this
 style of helper for the server or extrainfo descriptor classes... are you
 sure that it's really necessary here? The descriptors do a lot of this
 (first pass parsing, 'opt' prefix handling, etc) via the
 _get_descriptor_components() helper, and your read_block() method reminds
 me a bit of the _get_pseudo_pgp_block() helper.

 This style of helper class generally hurts code readability. I made the
 ControlLine class because controller responses are generally structured
 like a stack, and I wanted to avoid the numerous regexes that TorCtl uses
 to handle it. That said, I'd love to get rid of the ControlLine class if
 we had a good alternative.

 So in summary:
 * I'd suggest lazily evaluating what you need from a file object. See the
 server or extrainfo descriptor's parse_file() functions - they read from a
 file line by line and yield descriptors so its memory usage isn't greater
 than a single descriptor.

 * I'd also suggest replacing this helper class with helper functions that
 give you back python primitives. This module has several such helpers that
 we might be able to expand to meet your purposes.

 Thoughts?

 > stem/descriptor/networkstatus.py
 > +  :var bool is_valid: **\*** router is valid
 > +  :var bool is_guard: **\*** router is suitable for use as an entry
 guard
 > +  :var bool is_named: **\*** router is named

 Interesting, I hadn't thought of doing a series of boolean flags for this.
 However, I'd suggest adding a Flags enum and making this a tuple instead.
 Users can do the same with fewer methods and also query the
 RouterDescriptor for all of its flags (which you can't do with these
 booleans). This would look like...

 {{{
 Flag = stem.util.enum.Enum(
   ("GUARD", "Guard"),
   ("NAMED", "Named"),
   ...
 )
 }}}

 ... then used like...

 {{{
 >>> my_router_descriptor = RouterDescriptor(...)
 >>> print Flag.NAMED in my_router_descriptor.flags
 True

 >>> for flag in my_router_descriptor.flags:
 ...   print flag
 Named
 Guard
 Valid
 }}}

 I'll hold off on reviewing the rest for the moment since the
 DescriptorParser might lead to quite a bit of refactoring (sorry about
 that!).

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6569#comment:2>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list