For the metrics tools, there are some guidelines we can follow: https://docs.oracle.com/javase/tutorial/i18n/text/design.html. The other language would be Python (for Stem), but Python developers probably have a good understanding of unicode/str/bytes by now. (In Python 3, when decoding as UTF-8, a BOM is not stripped and is interpreted as data, and a str can contain a NUL.)
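To make the Python 3 note concrete, here is a minimal sketch of both behaviours. The sample bytes and the helper names (decode_keep_bom, decode_strip_bom) are made up for illustration and are not taken from Stem or any descriptor spec; only the standard "utf-8" and "utf-8-sig" codecs are used.

```python
import codecs


def decode_keep_bom(raw: bytes) -> str:
    """Hypothetical helper: plain UTF-8 decode, a leading BOM stays as U+FEFF."""
    return raw.decode("utf-8")


def decode_strip_bom(raw: bytes) -> str:
    """Hypothetical helper: 'utf-8-sig' drops one leading BOM if present."""
    return raw.decode("utf-8-sig")


# Made-up sample: a UTF-8 BOM, then a line with an embedded NUL byte.
sample = codecs.BOM_UTF8 + b"contact Alice\x00Example"

text_plain = decode_keep_bom(sample)
text_sig = decode_strip_bom(sample)

# The BOM survives the plain decode as the first character of the str.
print(text_plain.startswith("\ufeff"))   # True
# The 'utf-8-sig' codec removes it, so the result is one character shorter.
print(len(text_plain) - len(text_sig))   # 1
# The NUL is an ordinary character in the str either way.
print("\x00" in text_sig)                # True
```

So a parser that wants to reject or ignore a BOM or NUL has to do so explicitly; the decode step alone will not.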
Hi Iain. Actually, for Stem I'm really looking forward to this too. Stem has special handling for the contact and platform fields (iirc the only spots where non-ASCII content can presently appear). Stem's parsers and API will be simpler once everything is uniformly UTF-8. :P
Possibly a stupid question, but is there any reason not to require the whole descriptor document to be printable characters?