On 13 Feb 2018, at 10:55, isis agora lovecruft <isis@torproject.org> wrote:
A couple outcomes of this:
- What passes for "canonicalised" "utf-8" in C will be different to what passes for "canonicalised" "utf-8" in Rust. In C, the following will not be allowed (whereas they are allowed in Rust):
  - NUL (0x00)
  - Byte Order Mark (0xFEFF)
I want to clarify this point:
The Byte Order Mark is the Unicode scalar value U+FEFF, encoded in UTF-8 as the bytes 0xEF 0xBB 0xBF.
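To make the encoding concrete, here is a small Rust sketch that derives those three bytes from the scalar value using only the standard library. The function name bom_utf8_bytes is made up for illustration and is not part of Tor's code.

```rust
/// Hypothetical helper (not part of Tor): returns the UTF-8 encoding
/// of the Byte Order Mark, U+FEFF.
fn bom_utf8_bytes() -> Vec<u8> {
    // A char needs at most 4 bytes in UTF-8.
    let mut buf = [0u8; 4];
    // encode_utf8 writes the encoding into buf and returns the
    // written prefix as a string slice.
    '\u{FEFF}'.encode_utf8(&mut buf).as_bytes().to_vec()
}
```

Comparing the result against [0xEF, 0xBB, 0xBF] is an easy sanity check for either implementation.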
Tor's C and Rust implementations of UTF-8 must be identical.
When we write the C implementation, we must reject NUL for compatibility with C string functions.
When we write the Rust implementation, we must reject NUL for compatibility with the C implementation. (Rust's built-in UTF-8 strings accept NUL, so this will require custom code.)
When we write the C and Rust implementations, we must reject BOM because it's unnecessary: the Unicode Standard neither requires nor recommends a BOM in UTF-8. (Rust's built-in UTF-8 strings accept BOM, so this will require custom code.)
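As a rough sketch of what the "custom code" on the Rust side might look like, the check below layers the two extra rules on top of the standard library's UTF-8 validation. The name is_tor_utf8 is hypothetical, and the sketch assumes U+FEFF is rejected anywhere in the input rather than only at the start; the final rule may well differ.

```rust
/// Hypothetical validator (not Tor's actual implementation).
/// Accepts `bytes` only if they are well-formed UTF-8 and contain
/// neither NUL (U+0000) nor the Byte Order Mark (U+FEFF).
/// Assumption: U+FEFF is rejected at any position, not just the start.
fn is_tor_utf8(bytes: &[u8]) -> bool {
    match std::str::from_utf8(bytes) {
        // Well-formed UTF-8: apply the two extra restrictions.
        Ok(s) => !s.chars().any(|c| c == '\0' || c == '\u{FEFF}'),
        // Not well-formed UTF-8 at all.
        Err(_) => false,
    }
}
```

Note that std::str::from_utf8 on its own accepts both b"a\0b" and a BOM-prefixed string, which is why the extra pass over the characters is needed. A C implementation would have to apply the same two rules for the results to match.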
T