[tor-bugs] #6569 [Stem]: Stem network status document parsing

Wed Aug 15 17:34:03 UTC 2012

#6569: Stem network status document parsing
--------------------+-------------------------------------------------------
 Reporter:  neena   |          Owner:  neena       
     Type:  task    |         Status:  needs_review
 Priority:  normal  |      Milestone:              
Component:  Stem    |        Version:              
 Keywords:          |         Parent:              
   Points:          |   Actualpoints:              
--------------------+-------------------------------------------------------
Changes (by neena):

  * status:  needs_revision => needs_review

Comment:

 Replying to [comment:5 atagar]:
 > > +      desc =
 stem.descriptor.networkstatus.NetworkStatusDocument(descriptor_file.read())
 >
 > Again this is frontloading the read rather than passing the file object
 itself to the NetworkStatusDocument. This means that the memory usage is,
 at a minimum, as large as the whole consensus file. If we process
 individual consensus entries then that'll be reduced to the size of an
 individual consensus entry.
 >
 > I realize that this is a large change but it's likely necessary to avoid
 swapping on some systems.
 >
 Eek. fixed.
 As it is right now, NetworkStatusDocument parses descriptor data.
 stem.descriptor.networkstatus.parse_file does handle the router
 descriptors individually. So, if ever there is a need to parse
 NetworkStatusDocuments from a string, we can use the class directly.

 > I probably would have done the same though, on reflection, this can
 actually be reduced to...
 >
 > {{{
 > if line.startswith("opt "):
 >   line = line[4:]
 >
 > keyword = line.split(" ", 1)[0].strip()
 > }}}
 >
 > ... this is because split doesn't need the whitespace to have a first
 element...
 Fixed.

 > > +try:
 > > +  from cStringIO import StringIO
 > > +except:
 > > +  from StringIO import StringIO
 >
 > Out of curiosity when is cStringIO unavailable? Iirc we're using it
 elsewhere too so if it, say, isn't available on python 2.5 then that would
 be good to know.
 From what I understand cStringIO is unavailable on 'some platforms'. Not
 sure what those are.

 > Duplicate assignment. Personally I think the first is more readable
 though up to you. If you like the second better then please put the flag
 list into a temporary variable first to avoid having a single enormous
 line.
 Fixed.

 > Also, minor nitpick: when doing a multi-line block please place the last
 parentheses at the same indentation as the block. Ie...
 and fixed.

 Replying to [comment:6 atagar]:
 > > stem/descriptor/__init__.py
 > > # Metrics descriptor handling. These contain a single descriptor per
 file.
 >
 > This comment of mine is no longer accurate with this addition. Maybe
 this would now be a little nicer as chained iterators...

 It is accurate now. I modified stem.descriptor.networkstatus.parse_file to
 return NetworkStatusDocument objects instead of iterating over the router
 descriptors.

 > Lets use _peek_line() here. We could probably replace this function
 with...
 done

 > Missing a parameter and out of order. The opt handling in this method is
 more complicated than it needs to be - like _peek_keyword() lets just
 strip it off of the line variable if it's there.
 ugh, fixed.

 > > +  elif line.startswith("opt"):
 > > +    # if this is something new we don't recognize
 > > +    # ignore it and go to the next line
 > > +    descriptor_file.readline()
 > > +    return _read_keyword_line(keyword, descriptor_file, optional)
 >
 > This doesn't look right. I think you meant "elif optional:", having this
 start with 'opt' really doesn't mean anything.
 If it starts with "opt ", it should mean it's something from the future
 that the parser doesn't understand yet. We should be ignoring that and
 using the next line.

 > Also, when checking for an 'opt' prefix please do a startswith() check
 for 'opt ' so you don't match against, say, our new "optometrist" keyword.
 :P
 Eek, yes, I thought I fixed that.

 > These pydocs look really identical to me. How to these functions differ?
 Can we get rid of _read_keyword_line_str()?
 I added _read_keyword_line_str to handle lists of strings, because I
 didn't want to use StringIO initially. We could get rid of it and convert
 everything to use StringIO. But, I'd rather leave it as it is, and
 eventually convert everything to use lists of strings internally.

 > > +def _read_until_keywords(keywords, descriptor_file, inclusive =
 False, ignore_first = False):
 > > ...
 > > +  :param bool ignore_first: doesn't check if the first line read has
 one of the given keywords
 >
 > This new parameter seems oddly specific. The problem that you're trying
 to solve is that a router status entry starts with a 'r ' line and
 continues until... well, that part isn't defined in the spec. At present
 you're hardcoding our check to be for...
 >
 > > +_router_desc_end_kws = ["r", "bandwidth-weights", "directory-footer",
 "directory-signature"]
 >
 > This strikes me as being both a bit hacky and vulnerable to breakage if
 tor adds a new footer value. I'd suggest asking for clarification from
 Nick - this looks like a spec bug to me.

 It isn't exactly hacky. From the order in the dir-spec the first line
 after the router descriptors will be
 1. directory-footer if consensus method >= 9
 2. bandwidth-weights if it's a consensus (at most for consensus, it says)
 3. directory-signature otherwise

 If tor adds a new footer value, it would start with the "opt" keyword.
 This would be ignored because of "  elif line.startswith("opt"):".

 I also committed the microdescriptor flavoured consensus parsing code.
 Should I create a seperate ticket for that? or will this do?

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/6569#comment:7>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online