On 14 Feb 2018, at 11:03, Damian Johnson <atagar@torproject.org> wrote:
For the metrics tools there are some guidelines on this we can follow: https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other language would be Python (for Stem), but Python developers have probably got a good understanding of unicode/str/bytes by now. (In Python 3, when decoding as UTF-8, a BOM is not stripped and is interpreted as data, and a str can contain a NUL.)
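To make the Python 3 remark concrete, here is a minimal sketch using only the standard library. The sample bytes are made up for illustration and are not taken from a real descriptor.

```python
import codecs

# Hypothetical input: a UTF-8 BOM, a non-ASCII name, and a trailing NUL byte.
raw = codecs.BOM_UTF8 + b"contact J\xc3\xb6rg\x00"

# The plain "utf-8" codec keeps the BOM, which shows up as U+FEFF in the str.
text = raw.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if one is present.
stripped = raw.decode("utf-8-sig")

has_bom = text.startswith("\ufeff")
has_nul = "\x00" in text

print(repr(text))
print(repr(stripped))
```

A parser that wants neither would have to reject or strip both explicitly after decoding.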
Hi Iain. Actually, for Stem I'm really looking forward to this too. Stem has special handling for the contact and platform fields (iirc the only spots where non-ASCII content can presently appear). Stem's parsers and API will be simpler once everything is uniformly UTF-8. :P
Possibly a stupid question, but is there any reason not to require the whole descriptor document to be printable characters?
Requiring printable ASCII throughout the document means that people can't spell their names and email addresses correctly in contact lines.
Requiring printable Unicode introduces a dependency on a particular Unicode version, because we don't know whether code points in currently unallocated blocks will turn out to be printable or not.
I think we could make platform lines printable ASCII without losing much. Unless there are platforms that have non-ASCII names?
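As a rough illustration of the two options, here is a sketch in Python. The helper names and sample lines are hypothetical. Note that `str.isprintable()` answers according to the Unicode database bundled with the interpreter (reported by `unicodedata.unidata_version`), which is exactly the version dependency described above.

```python
import unicodedata

def is_printable_ascii(line: str) -> bool:
    # Printable ASCII is the fixed range 0x20 (space) through 0x7E (~).
    return all(0x20 <= ord(ch) <= 0x7E for ch in line)

def is_printable_unicode(line: str) -> bool:
    # Depends on the Unicode database shipped with this Python build;
    # unassigned code points are treated as non-printable.
    return line.isprintable()

platform_line = "Tor 0.3.2.9 on Linux"   # made-up example
contact_line = "J\u00f6rg <jorg at example dot com>"  # made-up example

print("Unicode database version:", unicodedata.unidata_version)
print(is_printable_ascii(platform_line), is_printable_ascii(contact_line))
print(is_printable_unicode(platform_line), is_printable_unicode(contact_line))
```

The ASCII check gives the same answer everywhere, while the Unicode check can change when the bundled database is updated.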
T
-- Tim Wilson-Brown (teor)
teor2345 at gmail dot com
PGP C855 6CED 5D90 A0C5 29F6 4D43 450C BA7F 968F 094B
ricochet:ekmygaiu4rzgsk6n