[tor-bugs] #21933 [Metrics/Onionoo]: Fix deserialization of UTF-8 characters in details statuses and documents

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Apr 13 09:45:04 UTC 2017


#21933: Fix deserialization of UTF-8 characters in details statuses and documents
---------------------------------+--------------------------
     Reporter:  karsten          |      Owner:  metrics-team
         Type:  defect           |     Status:  new
     Priority:  Medium           |  Milestone:
    Component:  Metrics/Onionoo  |    Version:
     Severity:  Normal           |   Keywords:
Actual Points:                   |  Parent ID:
       Points:                   |   Reviewer:
      Sponsor:                   |
---------------------------------+--------------------------
 While looking into the encoding issue of different Onionoo instances
 producing different contact string encodings (#15813), I tracked down a
 somewhat related issue where all UTF-8 characters get converted to `?`.

 The issue lies in how we avoid storing UTF-8 characters in details
 statuses and details documents by escaping those characters instead.  We
 do this correctly for the serialization part but incorrectly for the
 deserialization part.

 We have two choices here: we could either give up on the escaping part
 and just store UTF-8 characters directly, or we could fix the
 deserialization part.  I have a fix for the latter and ran out of time
 for the former, but maybe the former would be the better fix.  I'm
 including my fix here anyway:

 {{{
 diff --git a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
 index 39d6271..b6b1c4c 100644
 --- a/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
 +++ b/src/main/java/org/torproject/onionoo/docs/DocumentStore.java
 @@ -496,22 +496,25 @@ public class DocumentStore {
      if (!parse) {
        return this.retrieveUnparsedDocumentFile(documentType,
            documentString);
 -    } else if (documentType.equals(DetailsDocument.class)
 -        || documentType.equals(BandwidthDocument.class)
 +    } else if (documentType.equals(BandwidthDocument.class)
          || documentType.equals(WeightsDocument.class)
          || documentType.equals(ClientsDocument.class)
          || documentType.equals(UptimeDocument.class)) {
        return this.retrieveParsedDocumentFile(documentType,
            documentString);
 +    } else if (documentType.equals(DetailsStatus.class)
 +        || documentType.equals(DetailsDocument.class)) {
 +      if (documentType.equals(DetailsStatus.class)) {
 +        documentString = "{" + documentString + "}";
 +      }
 +      documentString = StringUtils.replace(documentString, "\\u", "\\\\u");
 +      return this.retrieveParsedDocumentFile(documentType, documentString);
      } else if (documentType.equals(BandwidthStatus.class)
          || documentType.equals(WeightsStatus.class)
          || documentType.equals(ClientsStatus.class)
          || documentType.equals(UptimeStatus.class)
          || documentType.equals(UpdateStatus.class)) {
        return this.retrieveParsedStatusFile(documentType, documentString);
 -    } else if (documentType.equals(DetailsStatus.class)) {
 -      return this.retrieveParsedDocumentFile(documentType, "{"
 -          + documentString + "}");
      } else {
        log.error("Parsing is not supported for type "
            + documentType.getName() + ".");
 }}}

 The main difference here is the `StringUtils.replace()` part.  Without
 this line (so, on current master) we pass a string containing `\uxxxx`
 to `Gson.fromJson()`, which unescapes it and turns it into the
 corresponding Unicode character.  So far so good.  But when we later
 write this status or document back to disk, `DocumentStore#writeToFile`
 writes the bytes as `US-ASCII`, which replaces all non-ASCII characters
 with `?`.
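
 Here is a minimal standalone sketch of that round trip (not Onionoo
 code; the class name is made up for illustration) showing where the
 character gets lost:

 {{{
 import com.google.gson.Gson;
 import java.nio.charset.StandardCharsets;

 public class EscapeRoundTrip {
   public static void main(String[] args) {
     // The JSON input contains the six characters \u00e9, which Gson
     // decodes into the actual character 'é'.
     String parsed = new Gson().fromJson("\"caf\\u00e9\"", String.class);
     System.out.println(parsed);  // café
     // Writing the result back as US-ASCII replaces the unmappable
     // character with '?', which is what happens in writeToFile.
     byte[] written = parsed.getBytes(StandardCharsets.US_ASCII);
     System.out.println(new String(written, StandardCharsets.US_ASCII));  // caf?
   }
 }
 }}}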

 The patch fixes this by replacing `\uxxxx` in the file content with
 `\\uxxxx`, which `Gson.fromJson()` will not treat as an escape sequence.
 We already have code in place that reverses this double escaping, see
 `DetailsStatus#unescapeJson`.  So, the patch fixes the problem by
 keeping things escaped until they are used.
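
 To see the effect of the doubled backslash in isolation (again just a
 sketch; the field name is made up, but the `StringUtils.replace()` call
 is the same one the patch uses):

 {{{
 // org.apache.commons.lang3.StringUtils, as in the patch.
 String fileContent = "{\"contact\":\"caf\\u00e9\"}";
 // Double the backslash so that Gson sees a literal backslash-u
 // sequence rather than an escape to decode.
 String doubled = StringUtils.replace(fileContent, "\\u", "\\\\u");
 // Gson now parses the value as the six literal characters \u00e9,
 // which survive a US-ASCII write; DetailsStatus#unescapeJson later
 // reverses the double escaping when the value is actually used.
 }}}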

 Again, I think the cleaner fix would be to give up on escaping UTF-8
 characters and just switch to storing UTF-8 directly.  What might make
 this a little harder is that we'll have to make sure this works
 correctly for existing files.  And if it does not, we'll need to
 special-case those somehow.  But maybe the patch above helps to come up
 with this cleaner fix.
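
 For what it's worth, the UTF-8 round trip that such a fix would rely on
 is straightforward (a sketch, assuming we switch the charset used for
 reading and writing document files):

 {{{
 // If we read and write document files as UTF-8 rather than US-ASCII,
 // the decoded character survives the round trip without any escaping.
 String parsed = new Gson().fromJson("\"caf\\u00e9\"", String.class);
 byte[] written = parsed.getBytes(StandardCharsets.UTF_8);
 System.out.println(new String(written, StandardCharsets.UTF_8));  // café
 }}}

 Since US-ASCII is a strict subset of UTF-8, existing files would at
 least remain byte-for-byte readable under the new charset.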

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/21933>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

