<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div><span></span></div><div><meta http-equiv="content-type" content="text/html; charset=utf-8"><div><span></span></div><div><div>On 14 Nov 2017, at 05:51, Nick Mathewson &lt;<a href="mailto:nickm@torproject.org">nickm@torproject.org</a>&gt; wrote:<br><br></div><blockquote type="cite"><div><div dir="ltr"><div>Filename: 285-utf-8.txt</div><div>Title: Directory documents should be standardized as UTF-8</div><div>Author: Nick Mathewson</div><div>Created: 13 November 2017</div><div>Status: Open</div><div><br></div><div>1. Summary and motivation</div><div><br></div><div>   People frequently want to include non-ASCII text in their router</div><div>   descriptors.  The Contact line is a favorite place to do this, but in</div><div>   principle the platform line would also be pretty logical.</div><div><br></div><div>   Unfortunately, there's no specified way to encode non-ASCII in our</div><div>   directory documents.</div><div><br></div><div>   Fortunately, almost everybody who does it, uses UTF-8 anyway.</div></div></div></blockquote><div><br></div><div>How many current descriptors will be rejected as non-UTF-8?</div><br><blockquote type="cite"><div><div dir="ltr"><div>   As we move towards Rust support in Tor, we gain another motivation</div><div>   for standardizing on UTF-8, since Rust's native strings strongly prefer</div><div>   UTF-8.</div><div><br></div><div>   So, in this proposal, we describe a migration path to having all</div><div>   directory documents be fully UTF-8.</div><div><br></div><div>2. 
Proposal</div><div><br></div><div>   First, we should have Tor relays reject ContactInfo lines (and any</div><div>   other lines copied directly into router descriptors) that are not</div><div>   UTF-8.</div></div></div></blockquote><div><br></div>How do we define UTF-8?</div><div><br></div><div>Do we exclude all invalid byte sequences?</div><div>Do we exclude all invalid code points (some libraries don't)?</div><div><a href="https://en.m.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences">https://en.m.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences</a></div><div><br></div><div>Do we reject unassigned or reserved code points?</div><div>Do we reject private use code points?</div><div><a href="https://en.m.wikipedia.org/wiki/Unicode#General_Category_property">https://en.m.wikipedia.org/wiki/Unicode#General_Category_property</a></div><div><br></div><div><span style="background-color: rgba(255, 255, 255, 0);">How do we avoid tying ourselves to a particular version of Unicode?</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">(By accepting reserved code points? 
Some libraries don't do this.)</span></div><div><span style="background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="background-color: rgba(255, 255, 255, 0);">Will we allow a byte order mark?</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">(We can't during the transition,</span><span style="background-color: rgba(255, 255, 255, 0);"> because it doesn't parse as ASCII.</span></div><div>And we probably shouldn't for any verbatim lines, because they</div><div>are copied into the middle of the descriptor.)</div><div><div><br></div>How do we carry forward existing ASCII restrictions into UTF-8?<div><br></div><div>We will need to update the directory spec to acknowledge that</div><div>contact and platform lines may be parsed as UTF-8 or</div><div>ASCII-including-arbitrary-bytes-except-NUL, and that they are</div><div>terminated by single-byte newlines regardless.</div><div><br></div><div>How do we deal with format confusion attacks?</div><div><br></div><div>Unicode has a few alternative whitespace characters. These could</div><div>be used in an attack that confuses either humans viewing the file,</div><div>or automated software:</div><div><br></div><div>If a human uses a UTF-8 compatible viewer or editor, it likely shows</div><div>Unicode newlines and ASCII newlines in an identical way. Similarly,</div><div>it may show Unicode spaces and ASCII spaces in the same way.</div><div>This may confuse the human reader.</div><div><br></div><div>Similarly, if automated software parses using a Unicode whitespace</div><div>or newline character class, it will mis-parse directory documents.</div><div>(Our Rust protover code looks for ASCII spaces, so it appears to</div><div>be fine.)</div><div><br></div><div>Note that we already have this issue with line feeds and carriage</div><div>returns, which I thought we had solved by banning carriage returns</div><div>in directory documents. But it appears we allow "any printing ASCII</div><div>character". 
(We will have to edit this to include Unicode.)</div><div><br></div><div><a href="https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218">https://gitweb.torproject.org/torspec.git/tree/dir-spec.txt#n218</a></div><div><br></div><blockquote type="cite"><div dir="ltr"><div>   At the same time, we should have authorities reject any router</div><div>   descriptors or extrainfo documents that are not valid UTF-8.</div><div>   Simultaneously, we can have all Tor instances reject all</div><div>   non-directory-descriptor directory documents that are not UTF-8,</div><div>   since none should exist today.</div></div></blockquote><div><br></div><div>If we apply the existing restrictions in dir-spec, which require</div><div><span style="background-color: rgba(255, 255, 255, 0);">non-directory-descriptor directory </span>documents to be ASCII, they will</div><div>also be UTF-8.</div><div><br></div><div>Isn't it confusing to say "UTF-8", when what we really mean is "ASCII"?</div><div>Do we expect to migrate these to non-ASCII UTF-8 at some point?</div><div><br></div><div>Also, does "<span style="background-color: rgba(255, 255, 255, 0);">non-directory-descriptor directory documents" mean we</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">can reject non-UTF-8 </span><span style="background-color: rgba(255, 255, 255, 0);">microdescriptors? I think we should.</span></div><div><span style="background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="background-color: rgba(255, 255, 255, 0);">Does the NS consensus contain any lines that are copied verbatim from</span></div><div><span style="background-color: rgba(255, 255, 255, 0);">descriptors?</span></div><br><blockquote type="cite"><div dir="ltr"><div>   Finally, once the authorities have updated, we should have all Tor</div><div>   instances reject all directory documents that are not UTF-8.  
(We</div><div>   should not take this step until the authorities have upgraded, or</div><div>   else the behavior of updated and non-updated clients could be</div><div>   distinguished.)</div><div><br></div><div>2.1. Hidden service descriptors' encrypted bodies</div><div><br></div><div>   For the encrypted bodies of hidden service descriptors, we cannot</div><div>   reject them at the authority level, and so we need to take a slightly</div><div>   different approach to prevent client fingerprinting attacks.</div><div><br></div><div>   First, we should make Tor instances start warning about any hidden</div><div>   service descriptors whose bodies, post-decryption, contain non-utf-8</div><div>   plaintext.  At the same time, we add a consensus parameter to</div><div>   indicate that hidden service descriptors with non-utf-8 plantexts</div></div></blockquote><div><br></div><div>typo: plaintexts </div><br><blockquote type="cite"><div dir="ltr"><div>   should be rejected entirely: "reject-encrypted-non-utf-8".  If that</div><div>   parameter is set to 1, then hidden service clients will not only</div><div>   warn, but reject the descriptors.</div><div><br></div><div>   Once the vast majority of clients are running versions that support</div><div>   the "reject-encrypted-non-utf-8" parameter, that parameter can be set</div><div>   to 1.</div></div></blockquote><br></div></div><div>We also can't reject bridge descriptors at the authority level.</div><div>(Bridge clients download bridge descriptors directly from bridges.)</div><div>Do we need bridge clients to also use this consensus parameter?</div><div><br></div><div>T</div></body></html>