
Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
1. What passes for "canonicalised" "utf-8" in C will be different to what passes for "canonicalised" "utf-8" in Rust. In C, the following will not be allowed (whereas they are allowed in Rust): - NUL (0x00) - Byte Order Mark (0xFEFF)
Much of the metrics software is written in Java. Java strings can contain NUL, but Java's UTF-8 decoding assumes that there is no BOM. If a BOM appears, it would be interpreted as data and, I assume, parsing would probably fail. Should the whole document be rejected if it contains a NUL or BOM, or should a parser strip these values and carry on as if they had never been there?
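To make the two options concrete, here is a minimal sketch in Python (chosen only for brevity; the same question applies to the Java tools). The names `decode_document`, `DocumentRejected` and the `strict` flag are hypothetical and do not come from the proposal or from any existing parser.

```python
BOM = "\ufeff"  # U+FEFF, encoded in UTF-8 as the bytes EF BB BF
NUL = "\x00"


class DocumentRejected(ValueError):
    """Hypothetical error raised when the strict policy meets a NUL or BOM."""


def decode_document(raw: bytes, strict: bool = True) -> str:
    """Decode a directory document and apply one of the two policies.

    strict=True  rejects the whole document if it contains NUL or BOM.
    strict=False strips every NUL and BOM and carries on.
    """
    # The plain "utf-8" codec (unlike "utf-8-sig") keeps a BOM as U+FEFF.
    text = raw.decode("utf-8")
    if strict:
        for name, char in (("NUL", NUL), ("BOM", BOM)):
            if char in text:
                raise DocumentRejected(f"document contains {name}")
        return text
    # Lenient policy: remove the offending characters wherever they appear.
    return text.replace(NUL, "").replace(BOM, "")
```

With `strict=False`, the input `b"\xef\xbb\xbfrouter foo\x00\n"` decodes to `"router foo\n"`; with the default it raises `DocumentRejected`. Which of the two behaviours is right is exactly what I would like the proposal to state.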
2. Directory document keywords MUST be printable ASCII.
This can be validated. Should a single document keyword containing printable non-ASCII be enough to reject the document, or should a parser try to recover? I'd really like to see a section in the proposal about how parsers should react when they find something unexpected; otherwise the parsers may all end up doing different things.
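As a rough illustration of what such validation could look like, the Python sketch below reports every keyword that is not printable ASCII and leaves the reject-or-recover decision to the caller. It assumes, for illustration only, that the keyword is the first space-delimited token on each non-empty line and that "printable" means the range 0x21 to 0x7E; `is_printable_ascii` and `bad_keywords` are hypothetical names.

```python
from typing import List, Tuple


def is_printable_ascii(token: str) -> bool:
    """True if token is non-empty and every character is in 0x21..0x7E."""
    return bool(token) and all(0x21 <= ord(ch) <= 0x7E for ch in token)


def bad_keywords(document: str) -> List[Tuple[int, str]]:
    """Return (line number, keyword) for each keyword that fails the check.

    Assumption: the keyword is the first space-delimited token of a line.
    """
    bad = []
    for lineno, line in enumerate(document.split("\n"), start=1):
        if not line:
            continue  # skip empty lines, including the one after a final "\n"
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            bad.append((lineno, keyword))
    return bad
```

A strict parser would reject the document as soon as `bad_keywords` returns anything; a lenient one might skip just the offending lines. Without guidance in the proposal, both are defensible, which is the problem.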
3. This change may break some descriptor/consensus/document parsers. If you are the maintainer of a parser, you may want to start thinking about this now.
For the metrics tools there are some guidelines on this that we can follow:

https://docs.oracle.com/javase/tutorial/i18n/text/design.html

The other language would be Python (for stem), but Python developers have probably got a good understanding of unicode/str/bytes by now. (In Python 3, when decoding as UTF-8, a BOM is not stripped and is interpreted as data, and a str can contain a NUL.)

Thanks,
Iain.