Re: [tor-dev] [prop-meeting] [prop#285] "Directory documents should be standardized as UTF-8"

13 Feb 2018

Hi,

On 12/02/18 23:55, isis agora lovecruft wrote:
...
> 1. What passes for "canonicalised" "utf-8" in C will be different to
>     what passes for "canonicalised" "utf-8" in Rust.  In C, the
>     following will not be allowed (whereas they are allowed in Rust):
>         - NUL (0x00)
>         - Byte Order Mark (0xFEFF)
Much of the metrics software is written in Java. Java strings allow NUL
to appear, but Java assumes that there is no BOM. If a BOM does appear,
it would be interpreted as data and, I assume, parsing would probably
fail. Should the whole document be rejected if it contains a NUL or BOM,
or should these characters be stripped, with parsing carrying on as if
they had never been there?
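
To make the two options concrete, here is a minimal sketch in Python
(chosen only for brevity; the same question applies to the Java tools).
The names decode_strict and decode_lenient are hypothetical and do not
come from the proposal or from any existing parser. Which of the two
policies is correct is exactly the open question.

```python
NUL = "\x00"
BOM = "\ufeff"  # U+FEFF; encoded in UTF-8 as the bytes EF BB BF


def decode_strict(raw: bytes) -> str:
    """Policy A (hypothetical): reject the whole document on NUL or BOM."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid UTF-8
    for name, char in (("NUL", NUL), ("BOM", BOM)):
        if char in text:
            raise ValueError(f"document contains {name}")
    return text


def decode_lenient(raw: bytes) -> str:
    """Policy B (hypothetical): strip NUL and BOM, then carry on."""
    text = raw.decode("utf-8")
    return text.replace(NUL, "").replace(BOM, "")
```

With the strict policy a document such as b"\xef\xbb\xbfrouter ..." is
refused outright; with the lenient one it parses as if the BOM had never
been there. If parsers pick different policies, they will disagree about
whether the same document is valid.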
...
> 2. Directory document keywords MUST be printable ASCII.
This can be validated. Should a single document keyword containing
printable non-ASCII be enough to reject the document, or should a parser
try to recover?

I'd really like to see a section in the proposal about how parsers
should react when they find something unexpected; otherwise all the
parsers may end up doing different things.
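
As a rough illustration of the "reject" behaviour for keywords, here is
a small Python sketch. It rests on several assumptions of mine that the
proposal would need to settle: that "printable ASCII" means the range
0x21..0x7E (space excluded, since it separates a keyword from its
arguments), that the keyword is the first space-delimited token of every
non-empty line, and that one bad keyword rejects the whole document. The
function names are hypothetical.

```python
def is_printable_ascii(keyword: str) -> bool:
    # Assumption: printable ASCII is 0x21..0x7E; an empty keyword is invalid.
    return bool(keyword) and all(0x21 <= ord(c) <= 0x7E for c in keyword)


def check_keywords(text: str) -> None:
    """Raise ValueError for the first line whose keyword is not printable ASCII.

    Simplification: every non-empty line is treated as a keyword line.
    """
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line:
            continue
        keyword = line.split(" ", 1)[0]
        if not is_printable_ascii(keyword):
            raise ValueError(f"line {lineno}: bad keyword {keyword!r}")
```

A recovering parser would instead skip or repair the offending line at
the point where this sketch raises, which is where implementations could
start to diverge.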
...
> 3. This change may break some descriptor/consensus/document parsers.
>     If you are the maintainer of a parser, you may want to start
>     thinking about this now.
For the metrics tools there are some guidelines we can follow:
https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other
language would be Python (for stem), but Python developers probably have
a good understanding of unicode/str/bytes by now. (In Python 3, decoding
as UTF-8 does not strip a BOM, so it is interpreted as data, and a str
can contain NUL.)
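
The Python 3 behaviour can be checked with a few lines. The sample bytes
are made up for illustration; "utf-8" and "utf-8-sig" are the standard
library codec names, the latter being the variant that removes a leading
BOM.

```python
raw = b"\xef\xbb\xbfrouter\x00example"  # UTF-8 BOM, then text with a NUL

plain = raw.decode("utf-8")      # BOM is kept as the character U+FEFF
sig = raw.decode("utf-8-sig")    # a leading BOM is removed

print(plain.startswith("\ufeff"))  # True: the BOM is treated as data
print(sig.startswith("router"))    # True: the BOM has been stripped
print("\x00" in sig)               # True: a str can hold NUL
```

So a Python parser only gets BOM stripping if it opts in via
"utf-8-sig", and NUL needs an explicit check in either case.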

Thanks,
Iain.

Iain Learmonth