[tor-dev] Proposal 285: Directory documents should be standardized as UTF-8

teor teor2345 at gmail.com
Wed Jan 10 00:19:54 UTC 2018



> On 10 Jan 2018, at 04:34, Nick Mathewson <nickm at alum.mit.edu> wrote:
> 
> On Mon, Nov 13, 2017 at 5:28 PM, teor <teor2345 at gmail.com> wrote:
>> On 14 Nov 2017, at 05:51, Nick Mathewson <nickm at torproject.org> wrote:
>> 
>> Filename: 285-utf-8.txt
>> Title: Directory documents should be standardized as UTF-8
>> Author: Nick Mathewson
>> Created: 13 November 2017
>> Status: Open
>> 
>> 1. Summary and motivation
>> 
>>   People frequently want to include non-ASCII text in their router
>>   descriptors.  The Contact line is a favorite place to do this, but in
>>   principle the platform line would also be pretty logical.
>> 
>>   Unfortunately, there's no specified way to encode non-ASCII in our
>>   directory documents.
>> 
>>   Fortunately, almost everybody who does it uses UTF-8 anyway.
>> 
>> 
>> How many current descriptors will be rejected as non-UTF-8?
> 
> I think that when last I checked, the number was something like 3.
> 
>>   As we move towards Rust support in Tor, we gain another motivation
>>   for standardizing on UTF-8, since Rust's native strings strongly prefer
>>   UTF-8.
>> 
>>   So, in this proposal, we describe a migration path to having all
>>   directory documents be fully UTF-8.
>> 
>> 2. Proposal
>> 
>>   First, we should have Tor relays reject ContactInfo lines (and any
>>   other lines copied directly into router descriptors) that are not
>>   UTF-8.
>> 
>> 
>> How do we define UTF-8?
> 
> I tried to do so as follows:
> 
>   We define the allowable set of UTF-8 as:
>        * Encoding the codepoints U+01 through U+10FFFF,
>        * but excluding the codepoints U+D800 through U+DFFF,

Apart from excluding U+0000, these are called "Unicode Scalar Values".
https://www.unicode.org/glossary/#unicode_scalar_value

Let's reference that.

>        * each encoded with the shortest possible encoding.
>        * without any BOM
> 
> Are there other restrictions we should make?  If so, how should we phrase them?

These seem fine, and not tied to a particular Unicode version.

But I don't know enough about Unicode to know if there is anything else we should
specify.

I know how we'd do this in C (raw bytes with a check before parsing), and I think
we can do this in Rust using char:
https://doc.rust-lang.org/1.0.0/unicode/char/
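
As a rough sketch of the check described above, and assuming we do it on raw
bytes before parsing: Rust's std::str::from_utf8 already rejects overlong
encodings and the surrogate range U+D800 through U+DFFF, so only NUL and the
BOM need extra checks. The names check_directory_utf8 and Rejection are
hypothetical, not existing Tor code, and rejecting U+FEFF anywhere (rather
than only at the start) is my reading of "without any BOM".

```rust
/// Why a byte string fails the proposed directory-document UTF-8 rules.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum Rejection {
    /// Not well-formed UTF-8: stray bytes, overlong encodings, or
    /// encoded surrogates (U+D800 through U+DFFF).
    Malformed,
    /// Contains U+0000, which is outside U+01 through U+10FFFF.
    ContainsNul,
    /// Contains U+FEFF. Assumption: a BOM is rejected at any position.
    ContainsBom,
}

/// Check raw bytes against the allowable set quoted above and, on
/// success, return them as a borrowed &str without copying.
pub fn check_directory_utf8(bytes: &[u8]) -> Result<&str, Rejection> {
    // The standard library validator enforces shortest-form encoding
    // and refuses surrogate code points.
    let text = std::str::from_utf8(bytes).map_err(|_| Rejection::Malformed)?;
    if text.contains('\0') {
        return Err(Rejection::ContainsNul);
    }
    if text.contains('\u{FEFF}') {
        return Err(Rejection::ContainsBom);
    }
    Ok(text)
}
```

Nothing here consults a Unicode character database, so the check does not
depend on a particular Unicode version.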


Here are some other things we might want to document:

Unassigned Code Points

Accepting arbitrary unassigned Unicode code points may cause issues for some
parsers, because as far as I am aware, parsers typically only handle a particular
Unicode version. We should note this in the spec.

The potential attack here is that Tor accepts a newly introduced character, and
a downstream parser rejects it. But that's not Tor's problem.

The right way for parsers to handle this is to replace unknown characters with
an appropriate replacement character. (Unicode has rules for this.) Or possibly
throw an error. We can't make this decision for them: it depends on the goals of
the parser.

Equality and Normalisation

We should also make sure that equality is specified as byte-for-byte equality.
This means that several different byte sequences could be visually similar, and
even have identical normalised forms, but we would treat them as different.

Unicode has several levels of normalisation:
https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization
We should not require any of them in our inputs.
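
To illustrate byte-for-byte equality, here is a small sketch. The function
name same_field and the two example strings are mine, not from the spec. Both
strings render as the same word, and they share a normalised form, but their
UTF-8 bytes differ, so they would be treated as different values.

```rust
/// Precomposed form: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
pub const PRECOMPOSED: &str = "caf\u{E9}";
/// Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
pub const DECOMPOSED: &str = "cafe\u{301}";

/// Compare two field values strictly by their encoded bytes, with no
/// normalisation and no case folding. (Hypothetical helper.)
pub fn same_field(a: &str, b: &str) -> bool {
    a.as_bytes() == b.as_bytes()
}
```

Here same_field(PRECOMPOSED, DECOMPOSED) is false, even though most viewers
display the two strings identically.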

Again, normalisation may be a potential issue for parsers. Again, we can't
decide how they will want to handle it, but we should document it.

Also, if we change our minds about this in future, we can make tor relays
normalise the contents of their descriptors, and the authority implementation
will continue to work. And then we can make authorities reject non-normalised
inputs a few releases later.

> [...]
>> How do we carry forward existing ASCII restrictions into UTF-8?
> 
> I don't understand this question.

I think it was intended as a general question.
Then I wrote some specific questions.

>> We will need to update the directory spec to acknowledge that
>> contact and platform lines may be parsed as UTF-8 or
>> ASCII-including-arbitrary-bytes-except-NUL, and that they are
>> terminated by single-byte newlines regardless.
> 
> Ack.
> 
>> How do we deal with format confusion attacks?
>> 
>> UTF-8 has a few alternative whitespace characters. These could
>> be used in an attack that confuses either humans viewing the file,
>> or automated software:
>> 
>> If a human uses a UTF-8 compatible viewer or editor, it likely shows
>> Unicode newlines and ASCII newlines in an identical way. Similarly,
>> it may show Unicode spaces and ASCII spaces in the same way.
>> This may confuse the human reader.
> 
> Right.  I don't see an obvious attack here, but we should keep it in mind.
> 
> Do you have a different suggestion of what to do here?

No, I really think this is like the potential parser bugs: not our problem.
People should get a better editor. And editors should get better.

>> Similarly, if automated software parses using a Unicode whitespace
>> or newline character class, it will mis-parse directory documents.
>> (Our Rust protover code looks for ASCII spaces, so it appears to
>> be fine.)
>> 
>> Note that we already have this issue with line feeds and carriage
>> returns, which I thought we had solved by banning carriage returns
>> in directory documents. But it appears we allow "any printing ASCII
>> character". (We will have to edit this to include Unicode.)
> 
> Also let's consider all the nonprinting ASCII: it's already a
> potential display problem if you're using a bad editor, or whatever.

Yes. Just like we can't decide how editors or parsers handle bad ASCII,
we can't decide how they handle bad (or new) Unicode.
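
For the whitespace case quoted above, a minimal sketch of how the two kinds
of parser disagree. The function names and the sample line are hypothetical,
and this is not the protover code. Splitting on the ASCII space alone keeps a
Unicode space inside its token, while Rust's split_whitespace uses the
Unicode White_Space property and produces an extra token.

```rust
/// Split only on the single-byte ASCII space (0x20).
pub fn split_ascii_space(line: &str) -> Vec<&str> {
    line.split(' ').collect()
}

/// Split on any character with the Unicode White_Space property.
/// A parser doing this would mis-parse a line containing such a character.
pub fn split_unicode_whitespace(line: &str) -> Vec<&str> {
    line.split_whitespace().collect()
}

/// Sample line: an ASCII space, then U+2003 EM SPACE between two words.
pub const SAMPLE: &str = "contact Alice\u{2003}Example";
```

With SAMPLE, the ASCII splitter yields two tokens and the Unicode-aware one
yields three.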

T

--
Tim Wilson-Brown (teor)

teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n