[tor-bugs] #9380 [BridgeDB]: BridgeDB should use stem for parsing descriptors according to torspec

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Aug 26 10:39:29 UTC 2014


#9380: BridgeDB should use stem for parsing descriptors according to torspec
-------------------------+-------------------------------------------------
     Reporter:  sysrqb   |      Owner:  isis
         Type:           |     Status:  needs_revision
  enhancement            |  Milestone:
     Priority:  normal   |    Version:
    Component:           |   Keywords:  stem,bridgedb-0.2.x,bridgedb-
  BridgeDB               |  parsers
   Resolution:           |  Parent ID:
Actual Points:           |
       Points:           |
-------------------------+-------------------------------------------------

Comment (by isis):

 Replying to [comment:16 atagar]:
 > Hi isis, here's a quick code review for the file you pointed me toward
 > ([https://gitweb.torproject.org/user/isis/bridgedb.git/blob/refs/heads/fix/9380-stem_r3:/lib/bridgedb/parse/descriptors.py descriptors.py])...
 >
 > ----
 >
 > {{{
 > if not descriptors.startswith("published"):
 >   precise = datetime.datetime.now().isoformat(sep=chr(0x20))
 >   timestamp = precise.rsplit('.', 1)[0]
 >   descriptors = "published {t}\n{d}".format(t=timestamp, d=descriptors)
 > }}}
 >
 > Interesting, any reason to do `sep=chr(0x20)` rather than `sep=' '`?
 >

 tl;dr: The `datetime` module is braindamaged. Or Python2.x string types
 are braindamaged. Or both.

 This was done to fix a unicode error occurring on some of the test boxes,
 caused by the implicit concatenation/formatting of a byte string
 (`datetime` only returns ASCII strings) with what might be a `unicode`
 string on some systems. Though I don't remember which system was having
 the error, or whether it was ponticum (the BridgeDB production server).

 > Regardless, this would be simpler as...
 >
 > {{{
 > if not descriptors.startswith("published"):
 >   published_line = "published %s\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
 >   descriptors = published_line + descriptors
 > }}}

 Using `strftime()` certainly looks nicer! I'll have to try that and see if
 I hit the same unicode problem again. Or just
 [https://trac.torproject.org/projects/tor/ticket/12951#ticket fix the
 problem] that the bridge networkstatus documents don't have a `published`
 line. :)
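
 For reference, the two approaches produce identical timestamp strings (a
 quick sketch, using a fixed datetime so the output is deterministic):

```python
import datetime

now = datetime.datetime(2014, 8, 26, 10, 39, 29, 123456)

# Current approach: ISO format with a space separator, microseconds stripped.
precise = now.isoformat(sep=' ')           # '2014-08-26 10:39:29.123456'
via_isoformat = precise.rsplit('.', 1)[0]

# Suggested approach: strftime() with an explicit format string.
via_strftime = now.strftime('%Y-%m-%d %H:%M:%S')

assert via_isoformat == via_strftime == '2014-08-26 10:39:29'
```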

 > ----
 >
 > {{{
 > routers = networkstatus.BridgeNetworkStatusDocument(descriptors,
 >   validate=validate)
 > }}}
 >
 > This is something we've discussed several times, but are you sure
 > BridgeNetworkStatusDocument is what you want? Your earlier comment
 > indicates that these files are simply a list of router status entries (no
 > consensus header or footer). If that's the case then you could simplify
 > this whole function to be something like...
 >
 > {{{
 > def parseNetworkStatusFile(filename, validate=True):
 >   with open(filename) as descriptor_file:
 >     return list(stem.descriptor.router_status_entry._parse_file(
 >       descriptor_file,
 >       validate,
 >       entry_class=stem.descriptor.router_status_entry.RouterStatusEntryV2,
 >     ))
 > }}}
 >
 > This will provide you a list of RouterStatusEntryV2 entries. I think you
 > really want RouterStatusEntryV3, but your current code is providing you
 > RouterStatusEntryV2 instances (maybe a bug).

 Thanks for the clarification! (Thanks also for causing me to
 [https://trac.torproject.org/projects/tor/ticket/12953 find a bug in
 leekspin].)

 Either way (with your code or mine), Stem is getting angry about the
 `flag-thresholds` line in the networkstatus file:

 {{{
 ValueError: Router status entries (v3) must have a 'r' line:
 flag-thresholds stable-uptime=613624 stable-mtbf=2488616 fast-speed=15000
 guard-wfu=98.000% guard-tk=691200 guard-bw-inc-exits=55000 guard-bw-exc-
 exits=55000 enough-mtbf=1 ignoring-advertised-bws=0
 }}}

 It stops getting angry if the networkstatus document isn't validated, i.e.
 `parseNetworkStatusFile(filename, validate=False)`. (We should probably
 just leave it at `False` for this one, since
 [https://trac.torproject.org/projects/tor/ticket/12254 there isn't a
 BridgeAuth signature] or anything else, other than sound formatting, to
 validate anyway.)

 > ----
 >
 > {{{
 > routers = [router for router in document]
 > return routers
 > }}}
 >
 > This could also be...
 >
 > {{{
 > return list(document)
 > }}}
 >

 Oh, fancy! Thanks, atagar! I've changed it to your version.

 > ----
 >
 > {{{
 > for descriptor in descriptors:
 >   router = descriptors.pop(descriptors.index(descriptor))
 > }}}
 >
 > This has a couple issues...
 >
 >  * You're modifying a list while you iterate over it. This is a bad idea
 > in just about any language (iirc java would raise a
 > ConcurrentModificationException).
 >  * Doing this is both pointless and very, very inefficient. You're doing
 > an O(n) iteration over the descriptors, then an O(n) lookup to find the
 > index, then another O(n) modification. In other words this is O(n^2) twice.
 >
 > This would be both simpler and more efficient as...
 >
 > {{{
 > while descriptors:
 >   router = descriptors.pop()
 > }}}
 >
 > That said, this whole function can be simplified as...
 >
 > {{{
 > def deduplicate(descriptors):
 >   # Dictionary of {fingerprint: descriptor} for the newest descriptor
 >   # with a given fingerprint.
 >
 >   fp_to_descriptor = {}
 >   duplicates = []
 >
 >   for desc in descriptors:
 >     if desc.fingerprint in fp_to_descriptor:
 >       conflict = fp_to_descriptor[desc.fingerprint]
 >
 >       if desc.published > conflict.published:
 >         fp_to_descriptor[desc.fingerprint] = desc
 >         duplicates.append(conflict)
 >       else:
 >         duplicates.append(desc)
 >     else:
 >       fp_to_descriptor[desc.fingerprint] = desc
 >
 >   return fp_to_descriptor.values()
 > }}}
 >
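
 (The mutation hazard atagar describes is easy to demonstrate with plain
 lists, independent of stem: removing the current element shifts the rest
 of the list left, so the iterator silently skips every other element.)

```python
items = [1, 2, 3, 4]
seen = []

for item in items:
    seen.append(item)
    # Mutating the list while iterating: the pop() shifts the remaining
    # elements left, so the next iteration skips one.
    items.pop(items.index(item))

assert seen == [1, 3]    # the loop never saw 2 or 4
assert items == [2, 4]   # half the list was never removed
```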

 This has been refactored as a mix between my original code and yours
 (because I wanted to keep the warnings for misbehaving OR implementations
 which might be submitting multiple bridge descriptors as we likely have
 seen in the past):

 {{{
 def deduplicate(descriptors):
     """Deduplicate some descriptors, returning only the newest for each router.

     .. note:: If two descriptors for the same router are discovered, AND both
         descriptors have the **same** published timestamp, then the router's
         fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router
         to be malicious (deliberately, or unintentionally).

     :param list descriptors: A list of
         :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
         :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
         :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
     """
     duplicates = {}
     nonDuplicates = {}

     for descriptor in descriptors:
         fingerprint = descriptor.fingerprint

         logging.debug("Deduplicating %s descriptor for router %s"
                       % (str(descriptor.__class__).rsplit('.', 1)[1],
                          safelog.logSafely(fingerprint)))

         if fingerprint in nonDuplicates.keys():
             # We already found a descriptor for this fingerprint:
             conflict = nonDuplicates[fingerprint]

             # If the descriptor we are currently parsing is newer than the
             # last one we found:
             if descriptor.published > conflict.published:
                 # And if this is the first duplicate we've found for this
                 # router, then create a list in the ``duplicates`` dictionary
                 # for the router:
                 if not fingerprint in duplicates.keys():
                     duplicates[fingerprint] = list()
                 # Add this to the list of duplicates for this router:
                 duplicates[fingerprint].append(conflict)
                 # Finally, put the newest descriptor in the ``nonDuplicates``
                 # dictionary:
                 nonDuplicates[fingerprint] = descriptor
             # Same thing, but this time the one we're parsing is older:
             elif descriptor.published < conflict.published:
                 if not fingerprint in duplicates.keys():
                     duplicates[fingerprint] = list()
                 duplicates[fingerprint].append(descriptor)
             # This *shouldn't* happen. It would mean that two descriptors for
             # the same router had the same timestamps, probably meaning there
             # is a severely-messed up OR implementation out there. Let's log
             # its fingerprint (no matter what!) so that we can look up its
             # ``platform`` line in its server-descriptor and tell whoever
             # wrote that code that they're probably (D)DOSing the Tor network.
             else:
                 logging.warn(("Duplicate descriptor with identical timestamp "
                               "(%s) for router with fingerprint '%s'!")
                              % (descriptor.published, fingerprint))

         # Hoorah! No duplicates! (yet...)
         else:
             nonDuplicates[fingerprint] = descriptor

     logging.info("Descriptor deduplication finished.")
     logging.info("Number of duplicates: %d" % len(duplicates))
     for (fingerprint, dittos) in duplicates.items():
         logging.info("  For %s: %d duplicates"
                      % (safelog.logSafely(fingerprint), len(dittos)))
     logging.info("Number of non-duplicates: %d" % len(nonDuplicates))

     return nonDuplicates
 }}}
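
 (For the record, the core of the logic above, minus the logging, can be
 sanity-checked with stand-in objects. `FakeDescriptor` and `dedup_core`
 below are hypothetical, mimicking only the `fingerprint` and `published`
 attributes that the deduplication relies on.)

```python
import collections
import datetime

# Hypothetical stand-in for a stem descriptor: only the two attributes
# the deduplication logic looks at.
FakeDescriptor = collections.namedtuple('FakeDescriptor',
                                        'fingerprint published')

def dedup_core(descriptors):
    """Keep only the newest descriptor seen for each fingerprint."""
    newest = {}
    for desc in descriptors:
        conflict = newest.get(desc.fingerprint)
        if conflict is None or desc.published > conflict.published:
            newest[desc.fingerprint] = desc
    return newest

older = FakeDescriptor('A' * 40, datetime.datetime(2014, 8, 25))
newer = FakeDescriptor('A' * 40, datetime.datetime(2014, 8, 26))
other = FakeDescriptor('B' * 40, datetime.datetime(2014, 8, 26))

result = dedup_core([older, newer, other])
assert result['A' * 40] is newer   # the older duplicate was dropped
assert result['B' * 40] is other
```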

 > ----
 >
 > {{{
 > descriptors.extend([router for router in document])
 > }}}
 >
 > Again, there's no reason to do list comprehension to just convert an
 > iterable to a list. Calling...
 >
 > {{{
 > [router for router in document]
 > }}}
 >
 > ... is the same as...
 >
 > {{{
 > list(document)
 > }}}
 >

 Also changed.
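
 (For reference, the equivalence holds for any iterable, including one-shot
 generators like stem's document iterators:)

```python
# Any iterable stands in for the stem document here; a generator makes the
# point that list() consumes it just like the comprehension would.
document = (router for router in ['r1', 'r2', 'r3'])
assert list(document) == ['r1', 'r2', 'r3']

# A fresh comprehension over the same data gives the same result:
assert [router for router in ['r1', 'r2', 'r3']] == ['r1', 'r2', 'r3']
```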

 > Cheers! -Damian

 Thanks for the code review!

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/9380#comment:17>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list