
On 13 Feb 2018, at 10:55, isis agora lovecruft <isis@torproject.org> wrote:
> A couple outcomes of this:
>
> 1. What passes for "canonicalised" "utf-8" in C will be different to what
>    passes for "canonicalised" "utf-8" in Rust. In C, the following will
>    not be allowed (whereas they are allowed in Rust):
>
>    - NUL (0x00)
>    - Byte Order Mark (0xFEFF)
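
For illustration only, here is a minimal Rust sketch of what rejecting these two code points on top of the standard library's UTF-8 validation might look like. The function name validate_strict_utf8 is hypothetical and not part of Tor. Rejecting U+FEFF at any position, rather than only at the start of the input, is an assumption of this sketch.

```rust
/// Hypothetical helper (not Tor's actual API): accept only well-formed
/// UTF-8 that contains neither NUL (U+0000) nor U+FEFF.
fn validate_strict_utf8(bytes: &[u8]) -> Option<&str> {
    // Standard validation rejects malformed byte sequences, but it
    // accepts both NUL and U+FEFF (bytes 0xEF 0xBB 0xBF).
    let s = std::str::from_utf8(bytes).ok()?;

    // Extra restriction. Assumption of this sketch: U+FEFF is rejected
    // wherever it appears, not only as a leading byte order mark.
    if s.chars().any(|c| c == '\0' || c == '\u{FEFF}') {
        return None;
    }
    Some(s)
}
```

With this sketch, b"hello" is accepted, while b"a\0b" and any input containing the bytes 0xEF 0xBB 0xBF are rejected, even though std::str::from_utf8 alone accepts all three.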
I want to clarify this point:

The Byte Order Mark is Unicode Scalar 0xFEFF, encoded in UTF-8 as the bytes 0xEF 0xBB 0xBF.

Tor's C and Rust implementations of UTF-8 must be identical.

When we write the C implementation, we must reject NUL for compatibility with C string functions.

When we write the Rust implementation, we must reject NUL for compatibility with the C implementation. (Rust already implements UTF-8 strings that accept NUL, so this will require custom code.)

When we write the C and Rust implementations, we must reject BOM because it's unnecessary. Rejecting BOM is recommended by the relevant standard. (Rust already implements UTF-8 strings that accept BOM, so this will require custom code.)

T