Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

13 Feb 2018

...
On 13 Feb 2018, at 10:55, isis agora lovecruft <isis@torproject.org> wrote:
> A couple outcomes of this:
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>        - NUL (0x00)
>        - Byte Order Mark (0xFEFF)

I want to clarify this point:

The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in
UTF-8 as the bytes 0xEF 0xBB 0xBF.
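
To make the encoding concrete, here is a small Rust sketch that encodes
U+FEFF with the standard library and returns the resulting bytes. The
function name bom_utf8_bytes is only illustrative and is not part of
tor or of the proposal.

```rust
/// Encode the Byte Order Mark (U+FEFF) as UTF-8 and return its bytes.
/// Hypothetical helper, for illustration only.
pub fn bom_utf8_bytes() -> Vec<u8> {
    // A UTF-8 encoded char needs at most 4 bytes.
    let mut buf = [0u8; 4];
    // encode_utf8 writes into buf and returns the encoded prefix as &mut str.
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Calling bom_utf8_bytes() yields the three bytes 0xEF 0xBB 0xBF, which
is the sequence a validator has to look for.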

Tor's C and Rust implementations of UTF-8 must be identical.

When we write the C implementation, we must reject NUL for
compatibility with C string functions.

When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust already implements
UTF-8 strings that accept NUL, so this will require custom code).

When we write the C and Rust implementations, we must reject BOM
because it's unnecessary. Rejecting BOM is recommended by the
relevant standard. (Rust already implements UTF-8 strings that accept
BOM, so this will require custom code).
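
As a rough sketch of what that custom code could look like on the Rust
side: validate with the standard library first, then apply the two
extra rejections. The names DirStrError and validate_dir_utf8 are
hypothetical, and rejecting U+FEFF anywhere in the input (not only at
the start) is my assumption here, not something the proposal text has
settled.

```rust
/// Reasons a byte string is rejected. Hypothetical type, not tor's API.
#[derive(Debug, PartialEq, Eq)]
pub enum DirStrError {
    /// The bytes are not well-formed UTF-8.
    InvalidUtf8,
    /// The string contains NUL (U+0000), which C string functions
    /// would treat as a terminator.
    ContainsNul,
    /// The string contains a Byte Order Mark (U+FEFF).
    ContainsBom,
}

/// Check that `bytes` is UTF-8 with no NUL and no BOM.
/// Assumption: U+FEFF is rejected at any position, not only at the start.
pub fn validate_dir_utf8(bytes: &[u8]) -> Result<&str, DirStrError> {
    // Rust's own check accepts both NUL and BOM, so it is only step one.
    let s = std::str::from_utf8(bytes).map_err(|_| DirStrError::InvalidUtf8)?;
    if s.contains('\0') {
        return Err(DirStrError::ContainsNul);
    }
    if s.contains('\u{FEFF}') {
        return Err(DirStrError::ContainsBom);
    }
    Ok(s)
}
```

A C implementation would need to accept and reject exactly the same
inputs, so test vectors like b"a\0b" and 0xEF 0xBB 0xBF followed by
text could be shared between the two.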

T

teor