[tor-bugs] #22233 [Core Tor/Tor]: Reconsider behavior on .z URLs with Accept-Encoding header

Thu Oct 18 01:46:58 UTC 2018

#22233: Reconsider behavior on .z URLs with Accept-Encoding header
-------------------------------------------------+-------------------------
 Reporter:  nickm                                |          Owner:  ahf
     Type:  defect                               |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:  Tor:
                                                 |  unspecified
Component:  Core Tor/Tor                         |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:  034-triage-20180328,                 |  Actual Points:
  034-removed-20180328                           |
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
                                                 |  Sponsor4
-------------------------------------------------+-------------------------

Comment (by Hello71):

 Replying to [comment:4 yawning]:
 > Replying to [comment:3 arma]:
 > > FYI, my wget didn't send any accept-encoding header. Neither did
 > > Sebastian's. Maybe Yawning's did? You can tell it to *add* an
 > > accept-encoding header, but then what do you expect.
 >
 > `wget http://example.com` on my system does this:
 >
 > {{{
 > GET / HTTP/1.1
 > User-Agent: Wget/1.19.1 (linux-gnu)
 > Accept: */*
 > Accept-Encoding: identity
 > Host: example.com
 > Connection: Keep-Alive
 > }}}
 >
 > Python's HTTP client also includes the header with `identity`.
 >
 > > I think the issue here is more that there are two ways to indicate you
 > > want compression -- adding a .z to the url, and saying so in the
 > > accept-encoding header -- and we should build the two by two decision
 > > matrix and do the smart thing for all four cases.
 >
 > Yes.  The existing code tries to treat `.z` as `Accept-Encoding:
 > deflate`, which is a shortcut, and not always correct.  Assuming we do not
 > want to double compress, what I would consider working behavior looks
 > like:
 >
 > || File         || Accept-Encoding     || Action ||
 > || `foo`        || N/A                 || `foo` ||
 > || `foo`        || `identity`          || `Content-Encoding: identity`, `foo` ||
 > || `foo`        || `deflate`           || `Content-Encoding: deflate`, `deflate(foo)` ||
 > || `foo`        || `identity, deflate` || `Content-Encoding: deflate`, `deflate(foo)` ||
 > || `foo`        || `identity, gzip`    || `Content-Encoding: gzip`, `gzip(foo)` ||
 > || `foo`        || `gzip`              || `Content-Encoding: gzip`, `gzip(foo)` ||
 > || `foo`        || `deflate, gzip`     || `Content-Encoding: gzip`, `gzip(foo)` ||
 > || `foo.z`      || N/A                 || `deflate(foo)` ||
 > || `foo.z`      || `identity`          || `Content-Encoding: identity`, `deflate(foo)` ||
 > || `foo.z`      || `deflate`           || `406 Not Acceptable` ||
 > || `foo.z`      || `identity, deflate` || `Content-Encoding: identity`, `deflate(foo)` ||
 > || `foo.z`      || `identity, gzip`    || `Content-Encoding: identity`, `deflate(foo)` ||
 > || `foo.z`      || `gzip`              || `406 Not Acceptable` ||
 > || `foo.z`      || `deflate, gzip`     || `406 Not Acceptable` ||
 >
 > (`gzip` used as a placeholder algorithm for "Something that is supported
 > that is not `deflate`")
 >
 > The current code mishandles the cases in the table that should either
 > double compress or return `406`.

 I believe this is not consistent with modern HTTP and web client behavior.
 I am fairly sure that modern web clients do one of the following. Either
 they:

 1. send Accept-Encoding: deflate, gzip (or gzip, deflate)
 2. if the response is Content-Encoding: deflate or gzip, transparently
 decompress it.
 3. process the decompressed content as the type indicated in Content-Type.

 or they:

 1. do not send Accept-Encoding, or send Accept-Encoding: identity
 2. do not decompress the content
 3. process the content as the type indicated in Content-Type.
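
 The two client behaviors above can be sketched as a single decoding step.
 This is only an illustration, not code from any real client: the function
 name `decode_body` is hypothetical, and it assumes that HTTP `deflate`
 means the zlib format, which is what Python's `zlib.decompress` expects by
 default.

```python
import gzip
import zlib


def decode_body(content_encoding, body):
    """Return the bytes a client would hand to its Content-Type handler.

    content_encoding is the value of the response's Content-Encoding
    header, or None if the header is absent. (Hypothetical helper.)
    """
    if content_encoding is None:
        # No Content-Encoding: the client must not decompress anything.
        return body
    coding = content_encoding.strip().lower()
    if coding == "identity":
        return body
    if coding == "gzip":
        return gzip.decompress(body)
    if coding == "deflate":
        # Assumption: "deflate" is a zlib-wrapped deflate stream.
        return zlib.decompress(body)
    raise ValueError("unsupported Content-Encoding: %r" % content_encoding)
```

 The point for this ticket is the first branch: a body that arrives without
 Content-Encoding is passed through untouched, even if it happens to be
 deflated data.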

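 Note that not sending any Accept-Encoding should be handled like sending
 Accept-Encoding: identity. Strictly speaking, RFC 7231
 (https://tools.ietf.org/html/rfc7231#section-5.3.4) says that a request
 without the header expresses no preference, but it also warns that such a
 client may not be able to process every encoding, so sending the content
 unencoded is the safe choice.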

 I am fairly sure that this behavior also does not depend on the file
 extension of the URL. Therefore, it is not correct to return 406 if the
 server thinks that compressing the content is stupid. (Note that this is
 not just the case for gzipped files; it also applies to image files, video
 files, font files, and so on: too many for the browser to even attempt to
 make a comprehensive list of file extensions.) Instead, the server should
 simply not compress the content, not send Content-Encoding: identity, and
 send it as is. You can see this behavior if you execute, for example,
 `curl --compressed -v torproject.org`. Compression is offered, but the
 server doesn't want to bother, so it just doesn't compress it. This is
 supported by
 https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Accept-Encoding,
 which says "As long as the identity value, meaning no encoding, is not
 explicitly forbidden, by an identity;q=0 or a *;q=0 without another
 explicitly set value for identity, the server must never send back a 406
 Not Acceptable error.".
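
 To make the quoted rule concrete, here is a rough sketch of the only
 check that could justify a 406. The helper names `parse_accept_encoding`
 and `identity_forbidden` are made up for this example, and the parsing is
 deliberately simplified (it is not a full RFC 7231 header parser).

```python
def parse_accept_encoding(header):
    """Parse an Accept-Encoding value into a {coding: qvalue} dict."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # a coding listed without a qvalue defaults to q=1
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed qvalue as "not acceptable"
        prefs[coding.strip().lower()] = q
    return prefs


def identity_forbidden(header):
    """True only if the client explicitly ruled out the identity coding."""
    if header is None:
        return False
    prefs = parse_accept_encoding(header)
    if "identity" in prefs:
        return prefs["identity"] == 0
    # No explicit identity entry: only a "*;q=0" wildcard forbids it.
    return prefs.get("*", 1.0) == 0
```

 With this check, `deflate, gzip` or plain `gzip` never forbids identity,
 so none of the `foo.z` rows in the earlier table would be a 406.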

 Therefore, I think your table should look more like this:

 > || File         || Accept-Encoding     || Action ||
 > || `foo`        || none or `identity`  || no Content-Encoding, `foo` ||
 > || `foo`        || `deflate`           || `Content-Encoding: deflate`, `deflate(foo)` ||
 > || `foo`        || `gzip`              || `Content-Encoding: gzip`, `gzip(foo)` ||
 > || `foo`        || `deflate, gzip`     || `Content-Encoding: deflate` or `gzip`, `deflate(foo)` or `gzip(foo)` respectively ||
 > || `foo.z`      || none or `identity`  || no Content-Encoding, `deflate(foo)` ||
 > || `foo.z`      || `deflate`           || no Content-Encoding, `deflate(foo)` ||
 > || `foo.z`      || `gzip`              || no Content-Encoding, `deflate(foo)` ||
 > || `foo.z`      || `deflate, gzip`     || no Content-Encoding, `deflate(foo)` ||
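
 As a sanity check, the table above can be written as a small function.
 This is a sketch under assumptions, not Tor's directory code: `respond`
 and `accepted_codings` are hypothetical names, the document is passed in
 as raw bytes, `deflate` is taken to mean the zlib format, and where the
 table allows either `deflate` or `gzip` the sketch arbitrarily picks
 `deflate`.

```python
import gzip
import zlib


def accepted_codings(accept_encoding):
    """Return the codings named with a non-zero qvalue (simplified)."""
    if not accept_encoding:
        return set()
    names = set()
    for item in accept_encoding.split(","):
        coding, _, params = item.strip().partition(";")
        if params.replace(" ", "").lower() in ("q=0", "q=0.0"):
            continue  # explicitly refused by the client
        names.add(coding.strip().lower())
    return names


def respond(url_path, document, accept_encoding=None):
    """Return (content_encoding_or_None, body) for one request."""
    if url_path.endswith(".z"):
        # The .z resource *is* the deflated document: never compress it
        # again and never label it with Content-Encoding.
        return None, zlib.compress(document)
    accepted = accepted_codings(accept_encoding)
    if "deflate" in accepted:
        # Arbitrary preference; the table allows gzip here as well.
        return "deflate", zlib.compress(document)
    if "gzip" in accepted:
        return "gzip", gzip.compress(document)
    # No header, or only identity: send the document as is.
    return None, document
```

 For example, `respond("foo.z", doc, "gzip")` returns `(None, ...)` with a
 deflated body, so a client that only decompresses when it sees
 Content-Encoding will keep the `.z` payload intact.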

 I doubt there exist any actual modern web clients that do not fit one of
 these. If there are, it's probably fine to send them whatever as long as
 they accept it, explicitly or implicitly.

 Note that this guarantees that anybody who requests `foo` will see the
 actual contents of `foo` in their browser, or saved to their disk or
 whatever. Additionally, anybody who requests `foo.z` will always receive a
 deflated version of `foo`, and (theoretically) will not have their browser
 decompress it behind their backs. Also, we do not unnecessarily compress
 anything twice.

 For what it's worth, my wget also sends `Accept-Encoding: identity` by
 default. I'm using wget 1.19.5.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/22233#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online