[metrics-bugs] #30369 [Metrics/Library]: Fix regular expression in descriptor parser to correctly recognize bandwidth files

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu May 2 18:53:27 UTC 2019


#30369: Fix regular expression in descriptor parser to correctly recognize
bandwidth files
---------------------------------+----------------------
     Reporter:  karsten          |      Owner:  karsten
         Type:  defect           |     Status:  assigned
     Priority:  Medium           |  Milestone:
    Component:  Metrics/Library  |    Version:
     Severity:  Normal           |   Keywords:
Actual Points:                   |  Parent ID:
       Points:                   |   Reviewer:
      Sponsor:                   |
---------------------------------+----------------------
 We're using a regular expression on the first 100 characters of a
 descriptor to recognize bandwidth files. More specifically, if a
 descriptor starts with ten digits followed by a newline, we parse it as a
 bandwidth file. (This is ugly, but the legacy bandwidth file format
 doesn't give us much of a choice.)

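 This regular expression is broken. `String.matches` only returns true if
 the pattern matches the entire string, which here is the first 100
 characters of a descriptor. Our pattern only matches a string that
 consists of exactly ten digits and a newline, so it does not recognize
 any real bandwidth file.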
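
 To illustrate the failure, here is a minimal sketch of the current check
 in isolation. The class name, the method name, and the sample strings are
 made up for this example; only the pattern is copied from
 `DescriptorParserImpl`.

```java
public class OldPatternDemo {

  /** Pattern currently used in DescriptorParserImpl (see the diff). */
  static final String OLD_PATTERN = "^[0-9]{10}\\n";

  /** Hypothetical helper wrapping the existing check. */
  static boolean oldCheck(String firstLines) {
    // String.matches succeeds only if the pattern covers the whole string.
    return firstLines.matches(OLD_PATTERN);
  }

  public static void main(String[] args) {
    // Made-up inputs: a bare timestamp line, and a timestamp line followed
    // by more content, as in any real file.
    String onlyTimestamp = "1556822007\n";
    String withContent = "1556822007\nsome=content\nmore=content\n";

    System.out.println(oldCheck(onlyTimestamp)); // true
    System.out.println(oldCheck(withContent));   // false
  }
}
```

 The second input is the realistic case, and it is the one the current
 pattern rejects.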

 Suggested fix:

 {{{
 diff --git a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
 index 119fe09..08ac909 100644
 --- a/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
 +++ b/src/main/java/org/torproject/descriptor/impl/DescriptorParserImpl.java
 @@ -132,7 +132,7 @@ public class DescriptorParserImpl implements DescriptorParser {
            sourceFile);
      } else if (fileName.contains(LogDescriptorImpl.MARKER)) {
        return LogDescriptorImpl.parse(rawDescriptorBytes, sourceFile, fileName);
 -    } else if (firstLines.matches("^[0-9]{10}\\n")) {
 +    } else if (firstLines.matches("(?s)[0-9]{10}\\n.*")) {
        /* Identifying bandwidth files by a 10-digit timestamp in the first line
         * breaks with files generated before 2002 or after 2286 and when the next
         * descriptor identifier starts with just a timestamp in the first line
 }}}

 Explanation:

  - We don't need to start the pattern with `^`, because `String.matches`
 requires the regular expression to match the whole string anyway.
  - The `(?s)` part enables the dotall mode. From the
 `java.util.regex.Pattern` documentation: ''"In dotall mode, the
 expression . matches any character, including a line terminator. By
 default this expression does not match line terminators. Dotall mode can
 also be enabled via the embedded flag expression (?s). (The s is a
 mnemonic for "single-line" mode, which is what this is called in Perl.)"''
  - We need to end the pattern with `.*` to match any characters following
 the first newline, which also includes newlines due to the previously
 enabled dotall mode.
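
 The three points above can be checked with a small sketch. The class
 name, the method name, and the sample strings are hypothetical; the
 first pattern is the one from the suggested fix, and the other two are
 variants built only to show what each part of it does.

```java
import java.util.regex.Pattern;

public class DotallPatternDemo {

  /** Pattern from the suggested fix. */
  static final String FIXED = "(?s)[0-9]{10}\\n.*";

  /** Variant without dotall mode, for comparison only. */
  static final String WITHOUT_DOTALL = "[0-9]{10}\\n.*";

  /** Variant that keeps the redundant leading anchor, for comparison only. */
  static final String WITH_ANCHOR = "(?s)^[0-9]{10}\\n.*";

  /** Hypothetical helper: applies a pattern to the whole input string. */
  static boolean check(String pattern, String firstLines) {
    return firstLines.matches(pattern);
  }

  public static void main(String[] args) {
    // Made-up inputs standing in for the first characters of a descriptor.
    String bandwidthLike = "1556822007\nsome=content\nmore=content\n";
    String otherDescriptor = "@type server-descriptor 1.0\nrouter example\n";

    System.out.println(check(FIXED, bandwidthLike));          // true
    System.out.println(check(FIXED, otherDescriptor));        // false

    // Without dotall, ".*" stops at the next newline, so the rest of the
    // string stays unmatched and the whole-string match fails.
    System.out.println(check(WITHOUT_DOTALL, bandwidthLike)); // false

    // The leading "^" changes nothing when the whole string must match.
    System.out.println(check(WITH_ANCHOR, bandwidthLike));    // true

    // The embedded flag (?s) is equivalent to passing Pattern.DOTALL.
    boolean viaFlag = Pattern.compile("[0-9]{10}\\n.*", Pattern.DOTALL)
        .matcher(bandwidthLike).matches();
    System.out.println(viaFlag);                              // true
  }
}
```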

 I'll create a branch for this in a minute.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/30369>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

