<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div></div><div><br></div><div>On 13 Feb 2018, at 21:55, Iain Learmonth &lt;<a href="mailto:irl@torproject.org">irl@torproject.org</a>&gt; wrote:<br><br></div><blockquote type="cite"><div><span>Hi,</span><br><span></span><br><span>On 12/02/18 23:55, isis agora lovecruft wrote:</span><br><blockquote type="cite"><span> 1. What passes for "canonicalised" "utf-8" in C will be different to</span><br></blockquote><blockquote type="cite"><span>    what passes for "canonicalised" "utf-8" in Rust.  In C, the</span><br></blockquote><blockquote type="cite"><span>    following will not be allowed (whereas they are allowed in Rust):</span><br></blockquote><blockquote type="cite"><span>        - NUL (0x00)</span><br></blockquote><blockquote type="cite"><span>        - Byte Order Mark (0xFEFF)</span><br></blockquote><span></span><br><span>Much of the metrics software is written in Java. Java strings allow for</span><br><span>NUL to appear, but assume that there is no BOM. If a BOM appears, then</span><br><span>this would be interpreted as data and, I assume, parsing would probably</span><br><span>fail. Should the whole document be rejected if it contains a NUL or BOM,</span><br><span>or should these values be stripped and then carry on parsing as if it</span><br><span>never happened?</span><br></div></blockquote><div><br></div><div>Directory authorities and bridge clients already reject descriptors that</div><div>contain NUL. (This is an artefact of the C implementation: the descriptor</div><div>is seen as truncated, so it won't parse.)</div><div><br></div><div>We should specify rejection for BOM as well.</div><br><blockquote type="cite"><div><blockquote type="cite"><span> 2. Directory document keywords MUST be printable ASCII.</span><br></blockquote><span></span><br><span>This can be validated. 
Should a single document keyword containing</span><br><span>printable non-ASCII be enough to reject the document, or should a parser</span><br><span>try to recover?</span><br></div></blockquote><div><br></div><div>If parsers want to be consistent with the Tor implementation, they should</div><div>reject.</div><br><blockquote type="cite"><div><span>I'd really like to see a section in the proposal about how parsers</span><br><span>should react when they find something unexpected, otherwise all the</span><br><span>parsers may end up doing different things.</span><br></div></blockquote><div><br></div><div>+1</div><br><blockquote type="cite"><div><blockquote type="cite"><span> 3. This change may break some descriptor/consensus/document parsers.</span><br></blockquote><blockquote type="cite"><span>    If you are the maintainer of a parser, you may want to start</span><br></blockquote><blockquote type="cite"><span>    thinking about this now.</span><br></blockquote><span></span><br><span>For the metrics tools there are some guidelines on this we can follow:</span><br><span><a href="https://docs.oracle.com/javase/tutorial/i18n/text/design.html">https://docs.oracle.com/javase/tutorial/i18n/text/design.html</a>. The other</span><br><span>language would be Python (for stem), but Python developers have probably</span><br><span>got a good understanding of unicode/str/bytes by now. 
(In Python 3: when</span><br><span>using UTF-8, BOM will not be stripped and will be interpreted as data,</span><br><span>and you can have a NUL in a str).</span><br></div></blockquote><div><br></div>Python for txtorcon<div>Rust for Tor's experimental protover implementation</div><div><br></div><div>And perhaps others:</div><div><a href="https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries">https://stem.torproject.org/faq.html#are-there-any-other-controller-libraries</a></div><div><a href="https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations">https://trac.torproject.org/projects/tor/wiki/doc/ListOfTorImplementations</a><br><br><div>T</div></div></body></html>