[tor-bugs] #9380 [BridgeDB]: BridgeDB should use Stem for parsing descriptors

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Feb 23 04:11:48 UTC 2015


#9380: BridgeDB should use Stem for parsing descriptors
-------------------------+-------------------------------------------------
     Reporter:  sysrqb   |      Owner:  isis
         Type:           |     Status:  needs_review
  enhancement            |  Milestone:
     Priority:  blocker  |    Version:
    Component:           |   Keywords:  stem, bridgedb-0.3.0, bridgedb-
  BridgeDB               |  parsers, isis2014Q3Q4, isisExB
   Resolution:           |  Parent ID:
Actual Points:           |
       Points:           |
-------------------------+-------------------------------------------------

Comment (by atagar):

 Hi Isis, here's some quick feedback...

 -----

 {{{
 from stem.descriptor import extrainfo_descriptor
 from stem.descriptor import server_descriptor
 }}}

 Unused.

 -----

 {{{
 def parseServerDescriptorsFile(filename, validate=True):
     """Parse a file which contains ``@type bridge-server-descriptor``s.

     .. note:: ``validate`` defaults to ``False`` because there appears to
 be a
         bug in Leekspin, the fake descriptor generator, where Stem thinks
 the
         fingerprint doesn't match the key…

     .. note:: We have to lie to Stem, pretending that these are ``@type
         server-descriptor``s, **not** ``@type bridge-server-descriptor``s.
         See ticket `#11257`_.

     .. _`#11257`: https://trac.torproject.org/projects/tor/ticket/11257

     :param str filename: The file to parse descriptors from.
     :param bool validate: Whether or not to validate descriptor
         contents. (default: ``False``)
     :rtype: list
     :returns: A list of
         :api:`stem.descriptor.server_descriptor.RelayDescriptor`s.
     """
     logging.info("Parsing server descriptors with Stem: %s" % filename)
     descriptorType = 'server-descriptor 1.0'
     document = parse_file(filename, descriptorType, validate=validate)
     routers = list(document)
     return routers
 }}}

 This whole function is just 25 lines for doing...

 {{{
 routers = list(parse_file(filename, 'server-descriptor 1.0', validate =
 validate))
 }}}

 Really doesn't seem like it's providing any value.

 ----

 {{{
 lib/bridgedb/configure.py:

 def loadConfig(configFile=None, configCls=None):
     """Load configuration settings on top of the current settings.
     ...
 }}}

 You might want to take a look at...

   https://stem.torproject.org/api/util/conf.html

 This is a module used by stem and arm so they can have nice config
 files...

   https://gitweb.torproject.org/stem.git/tree/test/settings.cfg
   https://gitweb.torproject.org/arm.git/tree/armrc.sample

 It provides type inferencing, defaults, etc without doing exec() as you
 presently are (... which is almost never a great idea). You can then use
 config_dict() to specify the config values each file cares about and their
 default values. For example...

   https://gitweb.torproject.org/arm.git/tree/arm/header_panel.py#n27

 Then when the config for that singleton is read all the configurations get
 the updated values.

 ----

 {{{
 document = parse_file(filename, descriptorType, validate=validate)

 try:
   for router in document:
     descriptors.append(router)
 except (ValueError, ProtocolError) as error:
   logging.error(
     ("Stem exception while parsing extrainfo descriptor from "
     "file '%s':\n%s") % (filename, str(error)))
     _copyUnparseableDescriptorFile(filename)
 }}}

 No reason for the for loop, same as...

 {{{
 descriptors = list(parse_file(filename, descriptorType,
 validate=validate))
 }}}

 ----

 {{{
 def parseNetworkStatusFile(filename, validate=True, skipAnnotations=True,
                            descriptorClass=RouterStatusEntryV3):
     """Parse a file which contains an ``@type bridge-networkstatus``
 document.

     See `ticket #12254 <https://bugs.torproject.org/12254>`__ for why
     networkstatus-bridges documents don't look anything like the
 networkstatus
     v2 documents that they are purported to look like. They are missing
 all
     headers, and the entire footer including authority signatures.
 }}}

 Actually, you could probably simply parse it like a normal network status
 document but without validation.

 ----

 {{{
 def deduplicate(descriptors):
     """Deduplicate some descriptors, returning only the newest for each
 router.

     .. note:: If two descriptors for the same router are discovered, AND
 both
         descriptors have the **same** published timestamp, then the
 router's
         fingerprint WILL BE LOGGED ON PURPOSE, because we assume that
 router
         to be malicious (deliberately, or unintentionally).

     :param list descriptors: A list of
         :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
 :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
         :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
     """
     duplicates = {}
     nonDuplicates = {}

     for descriptor in descriptors:
         fingerprint = descriptor.fingerprint

         logging.debug("Deduplicating %s descriptor for router %s"
                       % (str(descriptor.__class__).rsplit('.', 1)[1],
                          safelog.logSafely(fingerprint)))

         if fingerprint in nonDuplicates.keys():
             # We already found a descriptor for this fingerprint:
             conflict = nonDuplicates[fingerprint]

             # If the descriptor we are currently parsing is newer than the
             # last one we found:
             if descriptor.published > conflict.published:
                 # And if this is the first duplicate we've found for this
                 # router, then create a list in the ``duplicates``
 dictionary
                 # for the router:
                 if not fingerprint in duplicates.keys():
                     duplicates[fingerprint] = list()
                 # Add this to the list of duplicates for this router:
                 duplicates[fingerprint].append(conflict)
                 # Finally, put the newest descriptor in the
 ``nonDuplicates``
                 # dictionary:
                 nonDuplicates[fingerprint] = descriptor
             # Same thing, but this time the one we're parsing is older:
             elif descriptor.published < conflict.published:
                 if not fingerprint in duplicates.keys():
                     duplicates[fingerprint] = list()
                 duplicates[fingerprint].append(descriptor)
             # This *shouldn't* happen. It would mean that two descriptors
 for
             # the same router had the same timestamps, probably meaning
 there
             # is a severely-messed up OR implementation out there. Let's
 log
             # its fingerprint (no matter what!) so that we can look up its
             # ``platform`` line in its server-descriptor and tell whoever
             # wrote that code that they're probably (D)DOSing the Tor
 network.
             else:
                 raise DescriptorWarning(
                     ("Duplicate descriptor with identical timestamp (%s)
 for "
                      "router with fingerprint '%s'!")
                     % (descriptor.published, fingerprint))

         # Hoorah! No duplicates! (yet...)
         else:
             nonDuplicates[fingerprint] = descriptor

     logging.info("Descriptor deduplication finished.")
     logging.info("Number of duplicates: %d" % len(duplicates))
     for (fingerprint, dittos) in duplicates.items():
         logging.info("  For %s: %d duplicates"
                      % (safelog.logSafely(fingerprint), len(dittos)))
     logging.info("Number of non-duplicates: %d" % len(nonDuplicates))

     return nonDuplicates
 }}}

 Off the top of my head think the following does the same as all the
 above...

 {{{
 # makes a map of {fingerprint => [all descriptors with that fingerprint]}

 fingerprint_to_descriptors = {}

 for desc in descriptors:
   fingerprint_to_descriptors.setdefault(desc.fingerprint,
 []).append(descriptor)

 # provide back the latest

 return [sorted(descriptors, key = lambda d: d.published, reversed =
 True)[0] for descriptors in fingerprint_to_descriptors.values()]
 }}}

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/9380#comment:40>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online


More information about the tor-bugs mailing list