[tor-bugs] #18938 [Core Tor/Tor]: Authorities should reject non-ASCII content in ExtraInfo descriptors

Mon Jul 11 00:18:57 UTC 2016

#18938: Authorities should reject non-ASCII content in ExtraInfo descriptors
----------------------------------+------------------------------------
 Reporter:  teor                  |          Owner:
     Type:  defect                |         Status:  new
 Priority:  Medium                |      Milestone:  Tor: 0.2.9.x-final
Component:  Core Tor/Tor          |        Version:
 Severity:  Normal                |     Resolution:
 Keywords:  needs-proposal-maybe  |  Actual Points:
Parent ID:  #18656                |         Points:  1
 Reviewer:                        |        Sponsor:
----------------------------------+------------------------------------

Comment (by teor):

 Replying to [comment:16 cypherpunks]:
 > As one of the people with non-ascii ContactInfo, I strongly advise
 > against making that config ascii-only. It might not be obvious to
 > english-native speakers, but in countries with non-ascii characters in
 > their language the introduction of IDN and non-ascii mail addresses was
 > a major advance; restricting this would be a step backward, which will
 > probably need to be corrected again in the future when non-ascii mail
 > addresses become more ubiquitous.
 >
 > I would prefer for all UTF-8 chars to be usable in the ContactInfo,
 > which also allows to not have to transliterate your name into ascii.

 Currently, the Tor ContactInfo and Platform consist of arbitrary binary
 data, terminated by an ASCII linefeed byte.

 There's no indication of how they should be interpreted - whether they're
 a particular extended-ASCII codepage, or UTF-8, or something else.

 If the ContactInfo and Platform are UTF-8, it's entirely safe to parse the
 entire file as UTF-8, then restrict all other lines to ASCII. It's also
 entirely safe to parse the file as ASCII, except for the ContactInfo and
 Platform, which can be any bytes except ASCII LF. (UTF-8 encodes 0-127 as
 0-127, and never maps any other characters to bytes 0-127.)
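
 As a rough sketch of the second approach (this is not Tor's parser;
 check_descriptor and FREE_FORM_KEYWORDS are hypothetical names, and the
 one-keyword-per-line layout is a simplifying assumption), the check could
 look something like this in Python:

```python
# Hypothetical set of keywords whose values may carry non-ASCII bytes.
FREE_FORM_KEYWORDS = (b"contact", b"platform")


def check_descriptor(raw: bytes) -> bool:
    """Return True if every line is ASCII, except that the free-form
    lines may instead be any valid UTF-8."""
    # Splitting on the LF byte is safe before decoding: in UTF-8 the
    # byte 0x0A only ever encodes U+000A itself.
    for line in raw.split(b"\n"):
        keyword = line.split(b" ", 1)[0]
        try:
            if keyword in FREE_FORM_KEYWORDS:
                line.decode("utf-8")   # strict: rejects malformed UTF-8
            else:
                line.decode("ascii")   # rejects any byte >= 0x80
        except UnicodeDecodeError:
            return False
    return True


# Every byte of a multi-byte UTF-8 sequence is >= 0x80, so a non-ASCII
# character can never introduce a stray byte in the 0-127 range.
assert all(b >= 0x80 for b in "é日😀".encode("utf-8"))
```

 Note that this only checks well-formedness of the bytes; it says nothing
 about Unicode versions, normalization, or confusable characters.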

 Some encodings for the ContactInfo and Platform may even produce linefeed
 bytes, which is clearly unsuitable.
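
 UTF-16 is one such encoding. A small illustration (the character is just
 an example I picked, nothing from a real descriptor):

```python
# U+010A is LATIN CAPITAL LETTER C WITH DOT ABOVE.
name = "\u010a"

utf16 = name.encode("utf-16-le")  # two bytes: 0x0A 0x01
utf8 = name.encode("utf-8")       # two bytes: 0xC4 0x8A

# In UTF-16 the low byte of this character is the LF byte, so a
# parser that splits on LF would cut the line in the middle of it.
has_lf_utf16 = b"\n" in utf16

# The UTF-8 form contains no byte below 0x80, so no stray LF.
has_lf_utf8 = b"\n" in utf8
```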

 I think we have 3 options:

 1. We could specify that ContactInfo and Platform must be valid UTF-8,
 and validate them. But I'd hate to have to import a changing series of
 Unicode libraries to do this, or specify a particular Unicode version, or
 deal with the character ambiguities or parser security risks Unicode
 entails. (Yes, there are attacks on Unicode parsers - remember the iPhone
 emoji bug?)

 2. We could keep the current spec, which is under-specified, and leave
 them as arbitrary bytes in an unspecified encoding. But this is not
 ideal: how can a relay operator be contacted when the encoding of their
 address is unclear?

 3. We could require relay operators to have an ASCII email address (it
 could be another account, an alias, a transliteration, or an IDN ASCII
 encoding). That way there's no encoding ambiguity, and people whose
 descriptor-viewing or mail programs don't understand UTF-8 can still
 email operators. It's onerous for those whose names are not ASCII, but so
 is the risk of being uncontactable via non-Unicode descriptor readers
 and/or mail programs.
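
 To illustrate what the IDN ASCII encoding in option 3 looks like, here is
 a minimal sketch using Python's built-in "idna" codec (which implements
 the older IDNA 2003 rules). The ascii_contact helper and the example
 address are made up for illustration. It only converts the domain; the
 part before the @ still has to be ASCII already, for example an alias:

```python
def ascii_contact(local_part: str, domain: str) -> str:
    """Build an all-ASCII address from an ASCII local part and a
    possibly internationalized domain name (hypothetical helper)."""
    # Raises UnicodeEncodeError if the local part is not plain ASCII.
    local_part.encode("ascii")
    # The idna codec converts each non-ASCII label to its xn-- form.
    ascii_domain = domain.encode("idna").decode("ascii")
    return local_part + "@" + ascii_domain


address = ascii_contact("operator", "bücher.example")
# address is now "operator@xn--bcher-kva.example"
```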

 What do you think, cypherpunks?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/18938#comment:17>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online