[tor-bugs] #10703 [TorBrowserButton]: Fallback charset enables fingerprinting of bundle localization

Thu Jan 23 05:58:25 UTC 2014

#10703: Fallback charset enables fingerprinting of bundle localization
------------------------------+---------------------------
 Reporter:  dcf               |          Owner:  mikeperry
     Type:  defect            |         Status:  new
 Priority:  normal            |      Milestone:
Component:  TorBrowserButton  |        Version:
 Keywords:  tbb-fingerprints  |  Actual Points:
Parent ID:                    |         Points:
------------------------------+---------------------------
 Torbutton has the `spoof_english` pref that changes the value of the
 `Accept-Language` header to `en-us,en;q=0.5`; this cloaks what particular
 localized bundle you may be using. But localized bundles still differ in
 their default (fallback) charset. By figuring out what characters a byte
 sequence decodes as, it's possible to find out what charset is in use.

 The attack goes like this. The web server sends an HTML page with no
 declared charset, neither in the HTTP header (`Content-Type`) nor in the
 HTML (`<meta charset=...>`). The HTML contains one or more byte sequences
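 that stand for different characters in different charsets. JavaScript in
 the HTML measures the size of the rendered characters; since different
 characters generally render at different sizes, the measurement reveals
 which charset was used. By including a few different byte sequences, it's
 probably possible to tell apart all the default charsets that TBB
 localizations use.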
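
 A minimal sketch of the server side of such a probe, using only the Python
 standard library. The names `PROBE`, `build_probe_page`, `ProbeHandler` and
 `serve` are made up for illustration, and writing the measurement into
 `document.title` is just a placeholder; a real test page would have to
 compare the measured size against known values for each charset.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Probe bytes that decode to different characters under different fallback
# encodings (the same example sequence as used later in this ticket).
PROBE = b"\xc3\xa3"


def build_probe_page(probe: bytes = PROBE) -> bytes:
    """Return an HTML document that declares no encoding anywhere."""
    head = (
        b"<!DOCTYPE html>\n<html><head><title>probe</title></head><body>\n"
        b'<span id="probe" style="font-size: 64px">'
    )
    tail = (
        b"</span>\n<script>\n"
        b'var el = document.getElementById("probe");\n'
        b"// Rendered size depends on which characters the bytes decoded to.\n"
        b'document.title = el.offsetWidth + "x" + el.offsetHeight;\n'
        b"</script>\n</body></html>\n"
    )
    return head + probe + tail


class ProbeHandler(BaseHTTPRequestHandler):
    """Serve the probe page for every GET request."""

    def do_GET(self):
        body = build_probe_page()
        self.send_response(200)
        # Deliberately no "; charset=..." parameter on the media type.
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the probe server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ProbeHandler).serve_forever()
```

 Calling `serve()` and loading `http://127.0.0.1:8000/` in differently
 localized bundles should then show whether the measured size differs.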

 It looks like our current bundles may come with any of 6 different default
 charsets:
  * [https://en.wikipedia.org/wiki/UTF-8 utf-8]: ar fa
  * [https://en.wikipedia.org/wiki/ISO/IEC_8859-1 iso-8859-1]: de es-ES fr
 it nl pt-PT vi
  * [https://en.wikipedia.org/wiki/ISO/IEC_8859-2 iso-8859-2]: pl
  * [https://en.wikipedia.org/wiki/Windows-1251 windows-1251]: ru
  * [https://en.wikipedia.org/wiki/EUC-KR#EUC-KR euc-kr]: ko
  * [https://en.wikipedia.org/wiki/GBK gbk]: zh
 I found these by grepping the langpacks' unpacked `*.xpi` files for
 "[http://kb.mozillazine.org/Firefox_:_FAQs_:_About:config_Entries#Intl.
 intl.charset.default]".
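
 The same search can be sketched in Python. This assumes the langpacks have
 already been extracted into a directory tree and that the pref appears on a
 line of the form `intl.charset.default=ISO-8859-1`; both the line format
 and the helper name `find_fallback_charsets` are assumptions, not something
 checked against every langpack.

```python
import re
from pathlib import Path

# Assumed line format: intl.charset.default=ISO-8859-1
PREF_RE = re.compile(r"^\s*intl\.charset\.default\s*=\s*(\S+)", re.MULTILINE)


def find_fallback_charsets(root):
    """Map each file under root that sets intl.charset.default to its value."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Undecodable bytes are replaced; only the ASCII pref line matters.
        text = path.read_text(encoding="utf-8", errors="replace")
        match = PREF_RE.search(text)
        if match:
            found[str(path)] = match.group(1).lower()
    return found
```

 Running it over one directory per locale gives a path-to-charset mapping
 that can be grouped by value to reproduce a list like the one above.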

 As an example of how byte sequences can be variously decoded, here are
 decodings of "\xc3\xa3":
  * utf-8: ã
  * iso-8859-1: Ã£
  * iso-8859-2: ĂŁ
  * windows-1251: ГЈ
  * euc-kr: 찾
  * gbk: 茫
 That is, an HTML page can contain the sequence "\xc3\xa3" and it will
 render as different characters depending on the charset in effect.
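
 The decodings above can be checked with Python's built-in codecs. This is
 only a sketch: the codec names are Python's aliases for the six charsets,
 and the helper `decode_all` is a made-up name. Python's decoders may also
 differ in corner cases from the ones in Gecko, though that should not
 matter for this byte sequence.

```python
PROBE = b"\xc3\xa3"
CHARSETS = ["utf-8", "iso-8859-1", "iso-8859-2", "windows-1251", "euc-kr", "gbk"]


def decode_all(data, charsets=CHARSETS):
    """Decode the same bytes under each charset and return name -> text."""
    return {name: data.decode(name) for name in charsets}


# Print code points rather than the characters themselves, so the output
# does not depend on the terminal's own encoding.
for name, text in decode_all(PROBE).items():
    print(name, " ".join(f"U+{ord(ch):04X}" for ch in text))
```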

 A possible solution is just to force intl.charset.default to UTF-8 in all
 localizations. Here are some Mozilla bugs I found that are relevant to
 setting this pref to UTF-8:
 [https://bugzilla.mozilla.org/show_bug.cgi?id=910165 910165]
 [https://bugzilla.mozilla.org/show_bug.cgi?id=406498 406498]
 [https://bugzilla.mozilla.org/show_bug.cgi?id=536506 536506]
 [https://bugzilla.mozilla.org/show_bug.cgi?id=910169 910169].

 Also see
 https://developer.mozilla.org/en-US/docs/Localizations_and_character_encodings#Specifying_the_fallback_encoding,
 which indicates that Firefox's behavior with respect to the fallback
 charset will change:
 > As of Firefox 28, this section is obsolete, since the preference
 intl.charset.default no longer exists. The mapping from locales onto
 fallback encodings is now built into Gecko itself.
 In the best case, this could be interpreted to mean that the
 `spoof_english` setting will become sufficient, and the fallback will
 become as it would be for en-US. Or it might just mean that the preference
 is moved to somewhere inside Gecko. It seems the relevant bug is
 [https://bugzilla.mozilla.org/show_bug.cgi?id=910192 910192: Get rid of
 intl.charset.default as a localizable pref and deduce the fallback...].

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/10703>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online