[tor-dev] Proposal 285: Directory documents should be standardized as UTF-8

teor teor2345 at gmail.com
Mon Nov 13 22:28:44 UTC 2017


> On 14 Nov 2017, at 05:51, Nick Mathewson <nickm at torproject.org> wrote:
> 
> Filename: 285-utf-8.txt
> Title: Directory documents should be standardized as UTF-8
> Author: Nick Mathewson
> Created: 13 November 2017
> Status: Open
> 
> 1. Summary and motivation
> 
>    People frequently want to include non-ASCII text in their router
>    descriptors.  The Contact line is a favorite place to do this, but in
>    principle the platform line would also be pretty logical.
> 
>    Unfortunately, there's no specified way to encode non-ASCII in our
>    directory documents.
> 
>    Fortunately, almost everybody who does it, uses UTF-8 anyway.

How many current descriptors will be rejected as non-UTF-8?

>    As we move towards Rust support in Tor, we gain another motivation
>    for standardizing on UTF-8, since Rust's native strings strongly prefer
>    UTF-8.
> 
>    So, in this proposal, we describe a migration path to having all
>    directory documents be fully UTF-8.
> 
> 2. Proposal
> 
>    First, we should have Tor relays reject ContactInfo lines (and any
>    other lines copied directly into router descriptors) that are not
>    UTF-8.

How do we define UTF-8?

Do we exclude all invalid byte sequences?
Do we exclude all invalid code points (some libraries don't)?
https://en.m.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Do we reject unassigned or reserved code points?
Do we reject private use code points?
https://en.m.wikipedia.org/wiki/Unicode#General_Category_property

How do we avoid tying ourselves to a particular version of Unicode?
(By accepting reserved code points? Some libraries don't do this.)
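
To make these questions concrete, here is a minimal sketch in Python. It is
not Tor code; the helper name is_strict_utf8 is hypothetical, and Python's
built-in strict codec stands in for whatever decoder Tor would use. It shows
that rejecting invalid byte sequences and rejecting particular classes of
code point are two separate policy decisions:

```python
import unicodedata


def is_strict_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")  # errors="strict" is the default
        return True
    except UnicodeDecodeError:
        return False


# Invalid byte sequences: a strict decoder refuses all of these.
overlong_nul = b"\xc0\x80"        # overlong two-byte encoding of U+0000
surrogate = b"\xed\xa0\x80"       # encoded surrogate U+D800
beyond_max = b"\xf4\x90\x80\x80"  # would decode to U+110000, past U+10FFFF
truncated = b"\xe2\x82"           # first two bytes of a three-byte sequence

# Well-formed sequences whose code points a stricter policy might refuse.
private_use = "\ue000".encode("utf-8")   # general category Co (private use)
noncharacter = "\uffff".encode("utf-8")  # general category Cn (not assigned)

for sample in (overlong_nul, surrogate, beyond_max, truncated):
    print(sample, "valid" if is_strict_utf8(sample) else "rejected")

for sample in (private_use, noncharacter):
    code_point = sample.decode("utf-8")
    print(sample, is_strict_utf8(sample), unicodedata.category(code_point))
```

The byte-level check does not depend on the Unicode version. A check based on
general categories does, because the unicodedata tables change between
releases; that is the version-coupling problem raised above.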

Will we allow a byte order mark?
(We can't during the transition, because it doesn't parse as ASCII.
And we probably shouldn't for any verbatim lines, because they
are copied into the middle of the descriptor.)
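
A small Python sketch of the byte order mark problem follows. The helper names
and the sample contact line are made up for illustration. The UTF-8 BOM is the
three bytes EF BB BF, none of which is ASCII, and an ordinary UTF-8 decoder
keeps it as U+FEFF at the start of the decoded text:

```python
import codecs


def starts_with_bom(line: bytes) -> bool:
    """True if the line begins with the UTF-8 byte order mark (EF BB BF)."""
    return line.startswith(codecs.BOM_UTF8)


def is_ascii(line: bytes) -> bool:
    """True if every byte in the line is in the 7-bit ASCII range."""
    try:
        line.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical verbatim line that an editor saved with a leading BOM.
bom_line = codecs.BOM_UTF8 + b"contact someone"

print(starts_with_bom(bom_line))        # the BOM is detected
print(is_ascii(bom_line))               # an ASCII-only parser would choke
print(repr(bom_line.decode("utf-8")))   # the BOM survives as U+FEFF
```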

How do we carry forward existing ASCII restrictions into UTF-8?

We will need to update the directory spec to acknowledge that
contact and platform lines may be parsed as UTF-8 or
ASCII-including-arbitrary-bytes-except-NUL, and that they are
terminated by single-byte newlines regardless.
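
One way to read that requirement, sketched in Python: split on the single
byte 0x0A first, then try to decode each line. This is only an illustration
under my own assumptions, with hypothetical function names and sample lines.
Splitting before decoding is safe because every byte of a multi-byte UTF-8
sequence is 0x80 or above, so 0x0A can never appear inside one:

```python
from typing import List, Union


def split_document_lines(doc: bytes) -> List[bytes]:
    """Split on single-byte newlines only, refusing documents with NUL."""
    if b"\x00" in doc:
        raise ValueError("NUL byte in document")
    lines = doc.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    return lines


def decode_line(line: bytes) -> Union[str, bytes]:
    """Return the line as text if it is UTF-8, else the raw bytes."""
    try:
        return line.decode("utf-8")
    except UnicodeDecodeError:
        return line  # legacy line: arbitrary bytes, kept undecoded


# Illustrative input: one ASCII line, one UTF-8 line, one Latin-1 line.
doc = b"platform Tor on Linux\ncontact Andr\xc3\xa9\ncontact Andr\xe9\n"

for raw in split_document_lines(doc):
    print(repr(decode_line(raw)))
```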

How do we deal with format confusion attacks?

UTF-8 has a few alternative whitespace characters. These could
be used in an attack that confuses either humans viewing the file,
or automated software:

If a human uses a UTF-8 compatible viewer or editor, it likely shows
Unicode newlines and ASCII newlines in an identical way. Similarly,
it may show Unicode spaces and ASCII spaces in the same way.
This may confuse the human reader.

Similarly, if automated software parses using a Unicode whitespace
or newline character class, it will mis-parse directory documents.
(Our Rust protover code looks for ASCII spaces, so it appears to
be fine.)
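
Here is a short Python illustration of that mis-parsing. The sample strings
are made up. Python's str.splitlines() and argument-less str.split() use
Unicode-aware character classes, so they disagree with a parser that only
recognizes the ASCII newline and the ASCII space:

```python
# U+2028 LINE SEPARATOR hidden inside what should be one contact line.
text = "contact Alice\u2028router Mallory\n"

ascii_lines = [line for line in text.split("\n") if line]
unicode_lines = text.splitlines()

print(len(ascii_lines))    # ASCII-only parser sees one line
print(len(unicode_lines))  # Unicode-aware parser sees two lines

# U+2003 EM SPACE between two protover-style entries.
protover = "Link=1-4\u2003LinkAuth=1"

ascii_tokens = protover.split(" ")  # splits on the ASCII space only
unicode_tokens = protover.split()   # splits on any Unicode whitespace

print(ascii_tokens)
print(unicode_tokens)
```

A parser that sticks to byte-level b"\n" and b" " comparisons avoids the
disagreement, which matches the ASCII-space behavior described for the Rust
protover code.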

Note that we already have this issue with line feeds and carriage
returns, which I thought we had solved by banning carriage returns
in directory documents. But it appears we allow "any printing ASCII
character". (We will have to edit this to include Unicode.)

https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218

>    At the same time, we should have authorities reject any router
>    descriptors or extrainfo documents that are not valid UTF-8.
>    Simultaneously, we can have all Tor instances reject all
>    non-directory-descriptor directory documents that are not UTF-8,
>    since none should exist today.

If we apply the existing restrictions in dir-spec, which require
non-directory-descriptor directory documents to be ASCII, they will
also be UTF-8.

Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?
Do we expect to migrate these to non-ASCII UTF-8 at some point?

Also, does "non-directory-descriptor directory documents" mean we
can reject non-UTF-8 microdescriptors? I think we should.

Does the NS consensus contain any lines that are copied verbatim from
descriptors?

>    Finally, once the authorities have updated, we should have all Tor
>    instances reject all directory documents that are not UTF-8.  (We
>    should not take this step until the authorities have upgraded, or
>    else the behavior of updated and non-updated clients could be
>    distinguished.)
> 
> 2.1. Hidden service descriptors' encrypted bodies
> 
>    For the encrypted bodies of hidden service descriptors, we cannot
>    reject them at the authority level, and so we need to take a slightly
>    different approach to prevent client fingerprinting attacks.
> 
>    First, we should make Tor instances start warning about any hidden
>    service descriptors whose bodies, post-decryption, contain non-utf-8
>    plaintext.  At the same time, we add a consensus parameter to
>    indicate that hidden service descriptors with non-utf-8 plantexts

typo: plaintexts 

>    should be rejected entirely: "reject-encrypted-non-utf-8".  If that
>    parameter is set to 1, then hidden service clients will not only
>    warn, but reject the descriptors.
> 
>    Once the vast majority of clients are running versions that support
>    the "reject-encrypted-non-utf-8" parameter, that parameter can be set
>    to 1.
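
The warn-then-reject behavior described in section 2.1 could look roughly
like the following Python sketch. It is an assumption-laden illustration, not
Tor's implementation: the function name, the logger name, and the way the
consensus parameter value is passed in are all hypothetical.

```python
import logging

log = logging.getLogger("hs_desc_sketch")  # hypothetical logger name


def accept_decrypted_body(plaintext: bytes,
                          reject_encrypted_non_utf_8: int = 0) -> bool:
    """Decide whether a decrypted descriptor body should be accepted.

    reject_encrypted_non_utf_8 stands in for the value of the proposed
    "reject-encrypted-non-utf-8" consensus parameter (0 when unset).
    """
    try:
        plaintext.decode("utf-8")
        return True
    except UnicodeDecodeError:
        # Always warn; reject only when the parameter is set to 1.
        log.warning("descriptor plaintext is not valid UTF-8")
        return reject_encrypted_non_utf_8 != 1


print(accept_decrypted_body(b"ordinary plaintext"))
print(accept_decrypted_body(b"\xff\xfe", reject_encrypted_non_utf_8=0))
print(accept_decrypted_body(b"\xff\xfe", reject_encrypted_non_utf_8=1))
```

Because every client reads the same consensus value, clients that support
the parameter switch from warning to rejecting at the same time.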

We also can't reject bridge descriptors at the authority level.
(Bridge clients download bridge descriptors directly from bridges.)
Do we need bridge clients to also use this consensus parameter?

T