[tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

teor teor2345 at gmail.com
Tue Feb 13 00:03:54 UTC 2018


> On 13 Feb 2018, at 10:55, isis agora lovecruft <isis at torproject.org> wrote:
> 
> A couple outcomes of this:
> 
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>    what passes for "canonicalised" "utf-8" in Rust.  In C, the
>    following will not be allowed (whereas they are allowed in Rust):
>        - NUL (0x00)
>        - Byte Order Mark (0xFEFF)

I want to clarify this point:

The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in
UTF-8 as the bytes 0xEF 0xBB 0xBF.
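
To make the encoding concrete, here is a minimal Rust sketch that
derives those three bytes from the scalar value using only the
standard library. The function name bom_utf8_bytes is made up for
illustration and is not part of Tor.

```rust
/// Encode U+FEFF (the Byte Order Mark) as UTF-8 and return the bytes.
/// `bom_utf8_bytes` is a hypothetical helper name used for illustration.
pub fn bom_utf8_bytes() -> Vec<u8> {
    // A `char` needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    // `encode_utf8` writes into `buf` and returns the encoded slice as a str.
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Calling bom_utf8_bytes() yields the three bytes 0xEF, 0xBB, 0xBF.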

Tor's C and Rust implementations of UTF-8 must be identical: they
must accept and reject exactly the same byte sequences.

When we write the C implementation, we must reject NUL for
compatibility with C string functions.

When we write the Rust implementation, we must reject NUL for
compatibility with the C implementation. (Rust already implements
UTF-8 strings that accept NUL, so this will require custom code).

When we write the C and Rust implementations, we must reject BOM
because it's unnecessary in a format that is always UTF-8. Forbidding
BOM in that case is recommended by the relevant standard (see, for
example, RFC 3629, section 6). (Rust already implements UTF-8 strings
that accept BOM, so this will require custom code).
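
As a rough illustration of the "custom code" on the Rust side, the
sketch below layers the two extra checks on top of the standard
library's UTF-8 validation. The names Utf8Reject and
validate_strict_utf8 are hypothetical, not Tor's actual API. It also
assumes that U+FEFF is rejected anywhere in the input; rejecting it
only at the start of a document is another possible reading.

```rust
/// Why an input was rejected. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub enum Utf8Reject {
    InvalidUtf8,
    ContainsNul,
    ContainsBom,
}

/// Validate `bytes` as UTF-8, then apply the two extra restrictions:
/// no NUL (U+0000) and no Byte Order Mark (U+FEFF).
/// Assumption: U+FEFF is rejected at any position, not only as a prefix.
pub fn validate_strict_utf8(bytes: &[u8]) -> Result<&str, Utf8Reject> {
    // Standard validation; this alone accepts both NUL and BOM.
    let s = std::str::from_utf8(bytes).map_err(|_| Utf8Reject::InvalidUtf8)?;
    // NUL would truncate the string when handed to C string functions.
    if s.contains('\0') {
        return Err(Utf8Reject::ContainsNul);
    }
    // BOM carries no information in a format that is always UTF-8.
    if s.contains('\u{FEFF}') {
        return Err(Utf8Reject::ContainsBom);
    }
    Ok(s)
}
```

With this sketch, validate_strict_utf8(b"a\0b") returns
Err(Utf8Reject::ContainsNul), and an input that starts with
0xEF 0xBB 0xBF returns Err(Utf8Reject::ContainsBom), although
std::str::from_utf8 accepts both. A C implementation would need to
make the same decisions on the same bytes.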

T

