[tor-bugs] #9380 [BridgeDB]: BridgeDB should use stem for parsing descriptors according to torspec
Tor Bug Tracker & Wiki
blackhole at torproject.org
Tue Aug 26 10:39:29 UTC 2014
#9380: BridgeDB should use stem for parsing descriptors according to torspec
-------------------------+-------------------------------------------------
     Reporter:  sysrqb   |          Owner:  isis
         Type:  enhancement  |     Status:  needs_revision
     Priority:  normal   |      Milestone:
    Component:  BridgeDB |        Version:
   Resolution:           |       Keywords:  stem, bridgedb-0.2.x,
                         |                  bridgedb-parsers
Actual Points:           |      Parent ID:
       Points:           |
-------------------------+-------------------------------------------------
Comment (by isis):
Replying to [comment:16 atagar]:
> Hi isis, here's a quick code review for the file you pointed me toward
> ([https://gitweb.torproject.org/user/isis/bridgedb.git/blob/refs/heads/fix/9380-stem_r3:/lib/bridgedb/parse/descriptors.py
> descriptors.py])...
>
> ----
>
> {{{
> if not descriptors.startswith("published"):
>     precise = datetime.datetime.now().isoformat(sep=chr(0x20))
>     timestamp = precise.rsplit('.', 1)[0]
>     descriptors = "published {t}\n{d}".format(t=timestamp, d=descriptors)
> }}}
>
> Interesting, any reason to do `sep=chr(0x20)` rather than `sep=' '`?
>
tl;dr: The `datetime` module is braindamaged. Or Python2.x string types
are braindamaged. Or both.

This was done to fix a unicode error on some of the test boxes, caused by
the implicit concatenation/formatting of a `str` (`datetime` only returns
ascii strings) with what might be a `unicode` string on some systems.
Though I don't remember which system was hitting the error, or whether it
was ponticum (the BridgeDB production server).
> Regardless, this would be simpler as...
>
> {{{
> if not descriptors.startswith("published"):
>     published_line = "published %s\n" % datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
>     descriptors = published_line + descriptors
> }}}
Using `strftime()` certainly looks nicer! I'll have to try that and see if
I hit the same unicode problem again. Or just
[https://trac.torproject.org/projects/tor/ticket/12951#ticket fix the
problem] that the bridge networkstatus documents don't have a `published`
line. :)
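For the record, the two spellings produce the same ASCII timestamp, so
either works for prepending a synthetic `published` line (a self-contained
illustration with a fixed time, not BridgeDB's actual code path):

```python
import datetime

# Fixed time so the example is deterministic:
now = datetime.datetime(2014, 8, 26, 10, 39, 29, 123456)

# isoformat() includes microseconds, which rsplit() then strips off:
precise = now.isoformat(sep=' ')
timestamp_iso = precise.rsplit('.', 1)[0]

# strftime() emits exactly the wanted precision in one step:
timestamp_strf = now.strftime('%Y-%m-%d %H:%M:%S')

assert timestamp_iso == timestamp_strf == '2014-08-26 10:39:29'
```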
> ----
>
> {{{
> routers = networkstatus.BridgeNetworkStatusDocument(descriptors,
> validate=validate)
> }}}
>
> This is something we've discussed several times but are you sure
BridgeNetworkStatusDocument is what you want? Your earlier comment
indicates that these files are simply a list of router status entries (no
consensus header or footer). If that's the case then you could simplify
this whole function to be something like...
>
> {{{
> def parseNetworkStatusFile(filename, validate=True):
>     with open(filename) as descriptor_file:
>         return list(stem.descriptor.router_status_entry._parse_file(
>             descriptor_file,
>             validate,
>             entry_class=stem.descriptor.router_status_entry.RouterStatusEntryV2,
>         ))
> }}}
>
> This will provide you a list of RouterStatusEntryV2 entries. I think you
really want RouterStatusEntryV3, but your current code is providing you
RouterStatusEntryV2 instances (maybe a bug).
Thanks for the clarification! (Thanks also for causing me to
[https://trac.torproject.org/projects/tor/ticket/12953 find a bug in
leekspin].)
Either way (with your code or mine), Stem is getting angry about the
`flag-thresholds` line in the networkstatus file:
{{{
ValueError: Router status entries (v3) must have a 'r' line:
flag-thresholds stable-uptime=613624 stable-mtbf=2488616 fast-speed=15000
guard-wfu=98.000% guard-tk=691200 guard-bw-inc-exits=55000 guard-bw-exc-
exits=55000 enough-mtbf=1 ignoring-advertised-bws=0
}}}
It stops getting angry if the networkstatus document isn't validated, i.e.
`parseNetworkStatusFile(filename, validate=False)`. (We should probably
just leave it at `False` for this one, since
[https://trac.torproject.org/projects/tor/ticket/12254 there isn't a
BridgeAuth signature] or anything else, other than sound formatting, to
validate anyway.)
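The idea behind lenient parsing can be sketched without stem: a strict
parser rejects any line (like `flag-thresholds`) appearing before the
first `r` line, while a lenient one just skips it. This is a hypothetical
toy parser for illustration only, not stem's actual implementation:

```python
def split_router_entries(lines, validate=True):
    """Group lines into per-router entries, each starting with an 'r' line."""
    entries, current = [], None
    for line in lines:
        if line.startswith('r '):
            current = [line]
            entries.append(current)
        elif current is not None:
            current.append(line)
        elif validate:
            # Strict mode: reject header lines a v3 entry parser doesn't know.
            raise ValueError("line before first 'r' line: %r" % line)
        # Lenient mode: silently skip lines like 'flag-thresholds'.
    return entries

doc = [
    'flag-thresholds stable-uptime=613624 fast-speed=15000',
    'r Foo AAAA 2014-08-26 10:00:00 1.2.3.4 443 0',
    's Fast Running Stable',
]
assert len(split_router_entries(doc, validate=False)) == 1
```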
> ----
>
> {{{
> routers = [router for router in document]
> return routers
> }}}
>
> This could also be...
>
> {{{
> return list(document)
> }}}
>
Oh, fancy! Thanks, atagar! I've changed it to your version.
> ----
>
> {{{
> for descriptor in descriptors:
> router = descriptors.pop(descriptors.index(descriptor))
> }}}
>
> This has a couple issues...
>
> * You're modifying a list while you iterate over it. This is a bad idea
in just about any language (iirc java would raise a
ConcurrentModificationException).
> * Doing this is both pointless and very, very inefficient. You're doing
an O(n) iteration over the descriptors, then an O(n) lookup to find the
index, then another O(n) modification. In other words this is O(n^2) twice.
>
> This would be both simpler and more efficient as...
>
> {{{
> while descriptors:
> router = descriptors.pop()
> }}}
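To see the difference concretely: popping inside a `for` loop shifts the
remaining elements so the iterator skips some of them, while the
`while`/`pop()` drain visits every one (a self-contained illustration):

```python
items = ['a', 'b', 'c', 'd']

# Buggy pattern: mutating the list while iterating over it.
seen_buggy = []
broken = list(items)
for item in broken:
    seen_buggy.append(broken.pop(broken.index(item)))
# Each pop shifts the list left, so the iterator jumps past elements:
assert seen_buggy == ['a', 'c']

# Correct pattern: drain the list with a while loop.
seen_ok = []
drain = list(items)
while drain:
    seen_ok.append(drain.pop())  # O(1) pop from the end
assert sorted(seen_ok) == items
```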
>
> That said this whole function can be simplified as...
>
> {{{
> def deduplicate(descriptors):
>     # Dictionary of {fingerprint: descriptor} for the newest descriptor
>     # with a given fingerprint.
>
>     fp_to_descriptor = {}
>     duplicates = []
>
>     for desc in descriptors:
>         if desc.fingerprint in fp_to_descriptor:
>             conflict = fp_to_descriptor[desc.fingerprint]
>
>             if desc.published > conflict.published:
>                 fp_to_descriptor[desc.fingerprint] = desc
>                 duplicates.append(conflict)
>             else:
>                 duplicates.append(desc)
>         else:
>             fp_to_descriptor[desc.fingerprint] = desc
>
>     return fp_to_descriptor.values()
> }}}
>
This has been refactored as a mix of my original code and yours (because I
wanted to keep the warnings for misbehaving OR implementations which might
be submitting multiple bridge descriptors, as we have likely seen in the
past):
{{{
def deduplicate(descriptors):
    """Deduplicate some descriptors, returning only the newest for each
    router.

    .. note:: If two descriptors for the same router are discovered, AND both
        descriptors have the **same** published timestamp, then the router's
        fingerprint WILL BE LOGGED ON PURPOSE, because we assume that router
        to be malicious (deliberately, or unintentionally).

    :param list descriptors: A list of
        :api:`stem.descriptor.server_descriptor.RelayDescriptor`s,
        :api:`stem.descriptor.extrainfo_descriptor.BridgeExtraInfoDescriptor`s,
        or :api:`stem.descriptor.router_status_entry.RouterStatusEntryV2`s.
    """
    duplicates = {}
    nonDuplicates = {}

    for descriptor in descriptors:
        fingerprint = descriptor.fingerprint
        logging.debug("Deduplicating %s descriptor for router %s"
                      % (str(descriptor.__class__).rsplit('.', 1)[1],
                         safelog.logSafely(fingerprint)))

        if fingerprint in nonDuplicates.keys():
            # We already found a descriptor for this fingerprint:
            conflict = nonDuplicates[fingerprint]

            # If the descriptor we are currently parsing is newer than the
            # last one we found:
            if descriptor.published > conflict.published:
                # And if this is the first duplicate we've found for this
                # router, then create a list in the ``duplicates`` dictionary
                # for the router:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                # Add this to the list of duplicates for this router:
                duplicates[fingerprint].append(conflict)
                # Finally, put the newest descriptor in the ``nonDuplicates``
                # dictionary:
                nonDuplicates[fingerprint] = descriptor
            # Same thing, but this time the one we're parsing is older:
            elif descriptor.published < conflict.published:
                if not fingerprint in duplicates.keys():
                    duplicates[fingerprint] = list()
                duplicates[fingerprint].append(descriptor)
            # This *shouldn't* happen. It would mean that two descriptors for
            # the same router had the same timestamps, probably meaning there
            # is a severely-messed up OR implementation out there. Let's log
            # its fingerprint (no matter what!) so that we can look up its
            # ``platform`` line in its server-descriptor and tell whoever
            # wrote that code that they're probably (D)DOSing the Tor network.
            else:
                logging.warn(("Duplicate descriptor with identical timestamp "
                              "(%s) for router with fingerprint '%s'!")
                             % (descriptor.published, fingerprint))
        # Hoorah! No duplicates! (yet...)
        else:
            nonDuplicates[fingerprint] = descriptor

    logging.info("Descriptor deduplication finished.")
    logging.info("Number of duplicates: %d" % len(duplicates))
    for (fingerprint, dittos) in duplicates.items():
        logging.info("  For %s: %d duplicates"
                     % (safelog.logSafely(fingerprint), len(dittos)))
    logging.info("Number of non-duplicates: %d" % len(nonDuplicates))
    return nonDuplicates
}}}
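Either version implements the same newest-wins rule, which can be exercised
with stand-in descriptor objects (a `namedtuple` here, for illustration
only; real inputs would be stem descriptor instances):

```python
import collections

# Stand-in for a stem descriptor: just the two fields deduplication uses.
Desc = collections.namedtuple('Desc', ['fingerprint', 'published'])

def deduplicate(descriptors):
    # Keep only the newest descriptor for each fingerprint.
    fp_to_descriptor = {}
    for desc in descriptors:
        conflict = fp_to_descriptor.get(desc.fingerprint)
        if conflict is None or desc.published > conflict.published:
            fp_to_descriptor[desc.fingerprint] = desc
    return list(fp_to_descriptor.values())

# ISO timestamps compare correctly as strings (lexicographic order).
old = Desc('AAAA', '2014-08-25 00:00:00')
new = Desc('AAAA', '2014-08-26 00:00:00')
other = Desc('BBBB', '2014-08-26 00:00:00')

result = deduplicate([old, new, other])
assert len(result) == 2
assert new in result and old not in result
```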
> ----
>
> {{{
> descriptors.extend([router for router in document])
> }}}
>
> Again, there's no reason to do list comprehension to just convert an
iterable to a list. Calling...
>
> {{{
> [router for router in document]
> }}}
>
> ... is the same as...
>
> {{{
> list(document)
> }}}
>
Also changed.
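(The equivalence holds for any iterable, e.g. a generator standing in for
the parsed document:)

```python
# A generator standing in for the stem document object:
document = (router for router in ('router1', 'router2'))

# list() consumes the iterable, same as the comprehension would:
as_list = list(document)
assert as_list == ['router1', 'router2']

# On a reusable iterable, both spellings give identical results:
assert [router for router in as_list] == list(as_list)
```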
> Cheers! -Damian
Thanks for the code review!
--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/9380#comment:17>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online