[tor-commits] r24670: {projects} Rework draft based on Nick's suggestions: improve abstract, (projects/articles/browser-privacy)

Mike Perry mikeperry-svn at fscked.org
Tue Apr 26 09:35:53 UTC 2011


Author: mikeperry
Date: 2011-04-26 09:35:52 +0000 (Tue, 26 Apr 2011)
New Revision: 24670

Modified:
   projects/articles/browser-privacy/W3CIdentity.bib
   projects/articles/browser-privacy/W3CIdentity.tex
Log:
Rework draft based on Nick's suggestions: improve abstract,
tone down use of identity, introduce ideas better.

Problem is, we've now bled into 6 pages. We need to trim some
fat.



Modified: projects/articles/browser-privacy/W3CIdentity.bib
===================================================================
--- projects/articles/browser-privacy/W3CIdentity.bib	2011-04-26 01:37:07 UTC (rev 24669)
+++ projects/articles/browser-privacy/W3CIdentity.bib	2011-04-26 09:35:52 UTC (rev 24670)
@@ -95,3 +95,13 @@
   author= {Mozilla},
   note = {\url{https://mozillalabs.com/personas/}}
 }
+
+@Misc{rfc2965,
+   author =      {D. Kristol and L. Montulli},
+   title =       {HTTP State Management Mechanism},
+   howpublished = {IETF RFC 2965},
+   month =       {October},
+   year =        {2000},
+   note =        {\url{http://www.rfc-editor.org/rfc/rfc2965.txt}},
+}
+

Modified: projects/articles/browser-privacy/W3CIdentity.tex
===================================================================
--- projects/articles/browser-privacy/W3CIdentity.tex	2011-04-26 01:37:07 UTC (rev 24669)
+++ projects/articles/browser-privacy/W3CIdentity.tex	2011-04-26 09:35:52 UTC (rev 24670)
@@ -17,7 +17,7 @@
 
 \begin{document}
 
-\title{Bridging the Disconnect Between Web Identity and User Perception}
+\title{Bridging the Disconnect Between Web Privacy and User Perception}
 
 \author{Mike Perry \\ The Internet \\ mikeperry at torproject.org}
 
@@ -32,10 +32,13 @@
 and the reality of their relationship with the websites they visit. This
 position paper explores this disconnect and provides some recommendations for
 making the technical reality of the web match user perception, through both
-technical improvements as well as user interface cues. By looking at all of
-the elements of tracking as though they collectively comprise "User Identity",
-we can make better decisions about improvements to both the technical and the
-interface aspects of authentication and privacy.
+technical improvements as well as user interface cues. We frame the core
+technical problem as one of ``linkability'' -- the level of correlation
+between various online activities that the user naturally expects to be
+independent. We look to address the issue of unexpected linkability through
+improvements to the web's origin model, as well as through user interface
+cues about the set of accumulated identifiers that can be said to comprise
+a user's online identity.
 
 \end{abstract}
 
@@ -44,8 +47,8 @@
 The prevailing revenue model of the web is an appealing one. Web users receive
 unfettered, frictionless access to an extensive variety of information sources
 in exchange for viewing advertising. This advertising is more valuable if each
-advertisement is more relevant to the current activity, and if possible, more
-relevant to the current user.
+advertisement is relevant to the current activity, and if possible, relevant
+to the current user.
 
 The cost of this is that user privacy on the web is a nightmare. There is
 ubiquitous tracking, unseen partnership agreements and data exchange, and
@@ -56,7 +59,7 @@
 The problem is that the revenue model of the web has incentivized companies to
 find ways to continue to track users against their will, even if those users
 are attempting to protect themselves through currently available methods.
-Starting with the infamous "Flash cookies", we have progressed through a
+Starting with the infamous ``Flash cookies'', we have progressed through a
 seemingly endless arms race of secondary identifiers and tracking information:
 visited history, cache, font and system data, desktop resolution, keystroke
 timing, and so on and so forth\cite{wsj-fingerprinting}.
@@ -71,46 +74,61 @@
 
 To understand and evaluate potential solutions and improvements to this status
 quo, we must explore the disconnect between user experience and the way the
-web actually functions with respect to tracking and identity.
+web actually functions with respect to user tracking.
 
-% FIXME: Do we need this paragraph?
-%To this end, the rest of this document is structured as follows: First, we
-%examine user identity on the web, comparing the average user's perspective to
-%what actually is happening technically behind the scenes, and noting the major
-%disconnects. We then examine solutions attempting to bridge this disconnect
-%from two different directions.
+%
+% 20:16 < nickm> Not "identity-based", though.  identity-separation,
+% identity-isolation.  "nym" and "pseudonym" are also fine words
+% 20:18 < armadev> i'm still not entirely clear on what you mean by the
+% identity model. i am guessing it's "the user thinks of his web 
+%                 experience in terms of whether the website can recognize
+%                 him", but i think that's not it. i want clearer 
+%                definitions up front, and then i can help with terms. :)
 
-We only consider implementations that involve privacy-by-design.
-Privacy-by-policy approaches such as Do Not Track will not be discussed.
 
-\section{User Identity on the Web}
+To this end, the rest of this document is structured as follows: First, we
+examine how the user perceives their privacy on the web, comparing the average
+user's perspective to what actually is happening technically behind the
+scenes, and noting the major disconnects. We then examine solutions attempting
+to bridge this disconnect from two different directions, corresponding to the
+two major sources of disconnect\footnotemark. The first direction is addressing
+the linkability issues inherent in the multi-origin model of the web itself.
+The second direction is improving user cues and browser interface to suggest a
+coherent concept of identity to the user, which more accurately reflects the
+set of unique identifiers they have accumulated. Both of these directions can
+be pursued independently.
 
-To properly examine this privacy problem, we must probe into the details of
-both what a User's perception of their identity is, as well as the technical
-realities of what goes into web authentication and tracking.
+\footnotetext{We only consider implementations that involve privacy-by-design.
+Privacy-by-policy approaches such as Do Not Track will not be discussed.}
 
-\subsection{User Perception of Identity}
+\section{User Privacy on the Web}
 
-Instinctively, users define their privacy in terms of their identity, in terms
+To properly examine the privacy problem, we must probe both the average user's
+perception of what their ``web identity'' is and the technical
+realities of web authentication and tracking.
+
+\subsection{User Perception of Privacy}
+
+Instinctively, users define their privacy in terms of their identity: in terms
 of how they have interacted with a site in order to inform it of who they are.
 Typically, the user's perception of their identity on the web is usually a direct
-function of the mechanisms used for strong authentication for particular sites.
+function of the identifiers used for strong authentication for particular sites.
 
 For example, users expect that logging in to Facebook creates a relationship
 in their browsers when facebook.com is present in the URL bar, but they are
-likely not aware that this also extends to their activity on other, arbitrary
-sites that happen to include "Like this on Facebook" buttons or
+typically not aware that this also extends to their activity on other, arbitrary
+sites that happen to include ``Like this on Facebook'' buttons or
 Facebook-sourced advertising content.
 
-Many, if not most, users expect that when they log out of a site their
-relationship ends and that any associated tracking should be over. Even
-users who are aware of cookies can be prone to believing that clearing the
-cookies and private browsing data related to a particular site is sufficient
-to end their relationship with that site.
+Many, if not most, users expect that when they log out of a site, their
+relationship ends and that any associated tracking should be over. Even users
+who are aware of cookies can be prone to believing that clearing the cookies
+related to a particular site is sufficient to end their relationship with that
+site.
 
 Neither of these beliefs has any relation to reality.
 
-\subsection{Technical Reality of Identity}
+\subsection{The Technical Reality of Privacy}
 
 The technical reality of the web today is that users are usually wrong about
 their authentication status with respect to a particular site, and are almost
@@ -118,31 +136,44 @@
 pages. The default experience is such that all of this data exchange is
 concealed from the user.
 
-So then what is identity? In terms of authentication, it would at first appear
-to be cookies, HTTP Auth tokens, and client TLS certificates. However, even this
-begins to break down. High-security websites are already using fingerprinting
-as an auxiliary second factor of authentication\cite{security-fingerprinting},
-and online data aggregators utilize everything they can to build complete
-portraits of users' identities\cite{tracking-identity}.
+So then what comprises the user's web identity for tracking purposes? In terms
+of authentication, it would at first appear to be limited to cookies, HTTP
+Auth tokens, and client TLS certificates. However, this identifier-based
+approach breaks down quickly on the modern web. High-security websites are
+already using fingerprinting as an auxiliary second factor of
+authentication\cite{security-fingerprinting}, and online data aggregators
+utilize everything they can to build complete portraits of users'
+identities\cite{tracking-identity}.
 
-Identity then is a superset of all the authentication tokens used by the
+Despite what the user may believe, their actual web identity then is a
+superset of all the stored identifiers and authentication tokens used by the
 browser. It is the ability to link a user's activity in one instance to their
 activity in another instance, be it across time, or even on the very same page
 due to multiple content origins.
 
-\subsection{Identity as Linkability}
+Therefore, instead of viewing the user's identity as the sum of their
+identifiers, or as their relationship to individual websites, it is best to
+view it as the ability to link activity from one website to activity in
+another website. We will call this property ``user linkability''.
 
-When expanded to cover all items that enable or substantially contribute to
-Linkability, a lot more components of the browser are now in scope. We will
-briefly enumerate these components.
+\subsection{User Privacy as Linkability}
 
+In terms of what the user actually expects, user privacy is more accurately
+modeled as the level of linkability between subsequent actions on the web, as
+opposed to the mere sum of their unique identifiers and authentication tokens.
+
+When privacy is expanded to cover all items that enable or substantially
+contribute to linkability, a lot more components of the browser are now in
+scope. We will briefly enumerate these components.
+
 First, the obvious properties are found in the state of the browser: cookies,
 DOM storage, cache, cryptographic tokens and cryptographic state, and
-location. These are what technical people tend to think of first when it comes
-to private browsing and identity, but they are not the whole story.
+location. These identifiers are what technical people tend to think of first
+when it comes to user identity and private browsing, but they are not the
+whole story.
 
 Next, we have long-term properties of the browser itself. These include the
-User Agent String, the list of installed plugins, rendering capabilities,
+User Agent string, the list of installed plugins, rendering capabilities,
 window decoration size, and browser widget size.
 
 Then, we have properties of the computer. These include desktop size, IP
@@ -156,7 +187,7 @@
 \subsection{Developing a Threat Model}
 
 Unfortunately, just about every browser property and functionality is a
-potential fingerprinting target. In order to properly address the network
+potential linkability target. In order to properly address the network
 adversary on a technical level, we need a metric to measure linkability of the
 various browser properties that extend beyond any stored origin-related state.
 
@@ -172,102 +203,114 @@
 
 \footnotetext{In particular, the test does not take in all aspects of
 resolution information. It did not calculate the size of widgets, window
-decoration, or toolbar size. We believe this may add high amounts of entropy
-to the screen field. It also did not measure clock offset and other time-based
-fingerprints. Furthermore, as new browser features are added, this experiment
-should be repeated to include them.}
+decoration, or toolbar size. We believe these resolution-related properties
+may add high amounts of entropy to the resolution component. It also did not
+measure clock offset and other time-based fingerprints. Furthermore, as new
+browser features are added, the experiment should be repeated to include
+them.}
 
 This metric also indicates that it is beneficial to standardize on
 implementations of fingerprinting resistance where possible. More
 implementations using the same defenses means more users with similar
-fingerprints, which means less entropy in the metric.
+fingerprints, which means less entropy in the metric. Similarly, uniform
+feature deployment leads to less entropy in the metric.
 
 \section{Matching User Perception with Reality}
 
-When the concept of user identity is expanded to cover all aspects of
-linkability, addressing the problem of the disconnect between user perception
-and reality becomes clearer. For users to have privacy, and for private
-browsing modes to function, the relationship between a user and a site must be
-understood by that user.
+For users to have privacy, and for private browsing modes to function, the
+relationship between a user and a site must be understood by that user.
 
 It is apparent that the user experiences disconnect with the technical
 realities of the web on two major fronts: the average user does not grasp the
 privacy implications of the multi-origin model, nor are they given a clear
-concept of identity to grasp the privacy implications of the union of the
-trackable components of their browsers.
+concept of browser identity to grasp the privacy implications of the union
+of the linkable components of their browsers.
 
 We will now examine examples of attempts at reducing this disconnect on each
-of these two fronts.
+of these two fronts. Note that these two fronts are orthogonal. Approaches from
+them may be combined, or used independently.
 
-Note that identity-based approaches and the origin-based approaches are
-orthogonal. They may be combined, or used independently.
+\subsection{Improving the Origin Model}
 
-\subsection{Origin-Based Approaches}
+The current identifier origin model used by the web is fundamentally flawed
+when viewed from the perspective of meeting the expectations of the user.
+Unique, globally linkable identifiers can be transmitted for arbitrary content
+elements on any page, which can be sourced from anywhere without user
+interaction or awareness.
 
-Origin-based approaches seek to improve the technical behavior of the browser
-to make linkability less implicit and more consent-driven. In short, these
-approaches seek to make the web behave more like users currently assume it
-behaves by anchoring browser state to top-level origins as opposed to
-associating it with arbitrary content elements.
+However, the behavior of identifiers and linkable attributes can be improved
+to make linkability less implicit and more consent-driven without the need for
+a cumbersome, interventionist user interface. Where explicit identifiers exist,
+they should be tied to the pair of the top-level origin and the third-party
+content origin. Where linkable attributes exist, they should be obfuscated
+on a per-origin basis.
 
-The earliest relevant example of this work is SafeCache\cite{safecache}.
+An early relevant example of this idea is SafeCache\cite{safecache}.
 SafeCache seeks to reduce the ability for 3rd party content elements to use
 the cache to store identifiers. It does this by limiting the scope of the
-cache to the origin in the url bar. This has the effect that commonly sourced
-content elements are fetched and cached repeatedly, but this is the desired
-property. Each of these prevalent content elements can be crafted to include
-unique identifiers for each user, tracking users who attempt to avoid tracking
-by clearing cookies.
+cache to the top-level origin in the URL bar. This has the effect that
+commonly sourced content elements are fetched and cached repeatedly, but this
+is the desired property. Each of these prevalent content elements can be
+crafted to include unique identifiers for each user, tracking users who
+attempt to avoid tracking by clearing cookies.
 
-Mozilla has a wonderful example of an origin-based improvement written by Dan
-Witte and buried on their wiki\cite{thirdparty}. It describes a new dual-keyed
-origin for cookies, so that cookies would only be transmitted if they matched
-both the top level origin and the third party origin involved in their
-creation. This approach would go a long way towards preventing implicit
-tracking across multiple websites.
+The Mozilla development wiki describes an improvement to the origin model for
+cookie transmission, written by Dan Witte\cite{thirdparty}. Witte describes a
+new dual-keyed origin for cookies, so that cookies would only be transmitted if
+they matched both the top-level origin and the third-party origin involved in
+their creation. This approach would go a long way towards preventing implicit
+tracking across multiple websites, and has some interesting properties that
+make user interaction with content elements more explicitly tied to the
+current site.
 
 Similarly, one could imagine this two-level origin isolation being deployed to
 improve similar issues with DOM Storage and cryptographic tokens.
 
-Making the origin model for browser identifiers more closely match the user
+Making the origin model for browser identifiers more closely match user
 activity and user expectation has other advantages as well. With a clear
-distinction between 3rd party and top-level cookies, the privacy settings
-window could have a user-intuitive way of representing the user's relationship
-with different origins, perhaps by using only the favicon of that top level
-origin to represent all of the browser state accumulated by that origin. The
-user could delete the entire set of browser state (cookies, cache, storage,
-cryptographic tokens) associated with a site simply by removing its favicon
-from their privacy info panel.
+distinction between 3rd party and top-level cookies due to double-keying, the
+privacy settings window could have a user-intuitive way of representing the
+user's relationship with different origins, perhaps by using only the favicon
+of that top-level origin to represent all of the browser state accumulated by
+that origin. The user could delete the entire set of browser state (cookies,
+cache, storage, cryptographic tokens) associated with a site simply by
+removing its favicon from their privacy info panel.
 
-The problem with origin-based approaches is that individually, they do not
-fully address the entire linkability problem unless the same restriction is
-applied uniformly to all aspects of stored browser state, and all other
-linkability issues are dealt with. Behind-the-scenes partnerships can easily
-allow companies to continue to link users to their identities through any
-aspect of browser state that is not properly compartmentalized to the top
-level origin and bound to the same rules.
+The problem with improvements to the origin model is that individually,
+they do not fully address the entire linkability problem unless the same
+restriction is applied uniformly to all aspects of stored browser state, and
+all other linkability issues are dealt with. Behind-the-scenes partnerships
+can easily allow companies to continue to link users to their identities
+through any linkable aspect of browser state that is not properly
+compartmentalized to the top-level origin and bound to the same rules as all
+other linkable state.
 
-However, linkability based on browser properties is very amenable to this
-model. In particular, one can imagine per-origin plugin permissions,
-per-origin limits on the number of fonts that can be used, and randomized
-window-specific time offsets.
+However, linkability based on fingerprintable browser properties is also
+amenable to improvement under this model. In particular, one can imagine
+per-origin plugin loading permissions, per-origin limits on the number of
+fonts that can be used, and randomized window-specific time offsets.
 
 So, while these approaches are in fact useful for bringing the technical
 realities of the web closer to what the user assumes is happening, they must
 be deployed uniformly, with a consistent top-level origin restriction model.
-This may take significant coordination and standardization efforts.
+This may take significant coordination and standardization efforts. Without
+this, it is necessary to fill the remaining linkability gaps by presenting
+the user with a visual representation of their overall web identity.
 
-\subsection{Identity-Based Approaches}
+\subsection{Conveying Identity to the User}
 
-We will now discuss what we call the identity-based approaches to privacy.
-These approaches, whether explicitly or implicitly, all model the user's web
-identity as the entirety of the user's state for all origins.
+Even if the origin model of identifier transmission and other linkable
+attributes is altered uniformly to be more in line with what users expect, it
+is likely that the average user will still experience privacy benefits if the
+browser conveys the sum of all linkable information as a single, storable,
+mutable, and clearable user identity.
 
-The key advantage of identity-based approaches is that they can be simpler
-than origin-based approaches when used to improve the privacy problem on their
-own.
+Providing this concept of identity to the user is also simpler than origin
+improvements, as it does not require extensive compatibility testing or
+standards coordination.
 
-While the earliest example of an identity-based approach is our own work on
+% XXX: Do we need to even mention torbutton?
+While one of the earliest examples of an identity-based approach is our own work on
 Torbutton\cite{torbutton}, Torbutton deserves poor marks for both simplicity
 and usability\cite{not-to-toggle}. Torbutton attempts to isolate the user's
 non-Tor activity from their Tor activity, effectively providing the user with
@@ -275,7 +318,7 @@
 between these two identities.
 
 Firefox Private Browsing Mode is very similar, in that it allows users to
-switch between their normal browsing and a "private" clean slate.
+switch between their normal browsing and a ``private'' clean slate.
 
 % FIXME: This paragraph can go if we need space:
 Both Firefox PBM and Torbutton suffer from usability issues, primarily because
@@ -292,7 +335,7 @@
 Firefox and Torbutton to provide the user with great fine-grained control.
 
 Google Chrome's Incognito Mode comes the closest to conveying this idea of
-"Incognito identity" to the user, and the implementation is also simpler as a
+``Incognito identity'' to the user, and the implementation is also simpler as a
 result. The Incognito Mode window is a separate, stylized window that clearly
 conveys an alternate identity is in use for this window, which can be used
 concurrent to the non-private identity. This appears to lead to less mode
@@ -305,10 +348,10 @@
 effort. It also allows them to tweak browser properties and permissions
 specifically for this profile.
 
-The Mozilla Weave project appears to be proposing an identity-based method of
-managing, syncing, and storing authentication tokens, and also has use cases
-described for multiple users of a single browser\cite{weave-manager}. It is
-the closest idea on paper to what we envision as the way to bridge user
+The Mozilla Weave project appears to be proposing an identity-oriented method
+of managing, syncing, and storing authentication tokens, and also has use
+cases described for multiple users of a single browser\cite{weave-manager}. It
+is the closest idea on paper to what we envision as the way to bridge user
 assumptions with reality.
 
 We believe that the user interface of the browser should convey a sense of
@@ -342,28 +385,36 @@
 This is especially true of cellular IP networks.}
 
 Linkability solutions within the identity framework would be similar to the
-origin-based solutions, except they would be properties of the entire browser
+origin model solutions, except they would be properties of the entire browser
 or browser profile, and would be obfuscated only once per identity switch.
 
-% FIXME: Elaborate?
-
 \section{Conclusions}
 
-There is a demand for private browsing, and we believe that solid private
-browsing modes can be created. In order to do this, we need solid analysis of
-the threat models involved, and we need standardization for many aspects of
-defense.
+The appeal of the prevailing revenue model of the web and the difficulties
+associated with altering browser behavior have lulled us into accepting user
+deception as the norm for web use. The average user completely lacks the
+understanding needed to grasp how web tracking is carried out. This disconnect
+in understanding is extreme to the point where moral issues arise about the
+level of consent actually involved in web use and associated tracking.
 
-However, there is currently a huge disconnect between user privacy and
-identity due to both the multi-origin nature of the web, and the failure of
-browsers to adequately convey a sense of identity to the user. It is possible
-to bridge this disconnect both by addressing the issues with the multi-origin
-model, as well as providing the user with an explicit representation of their
-web identity, and with control over this identity.
+In fact, standardization efforts seemed to realize this problem early on but
+failed to create feasible recommendations for improving the situation. RFC
+2965 governing HTTP State Management mandated in section 3.3.6 that
+third-party origins must not cause the browser to transmit cookies unless the
+interaction is ``verifiable'' and readily apparent to the user\cite{rfc2965}.
+In section 6, it also strongly suggested that informed consent and user
+control should govern the interaction of users with tracking identifiers.
 
-% XXX: The dangers of adblockers and filters + the long-term imperative of
-% improving privacy for the continued use of the advertising revenue model.
+Without changes to browser behavior, browser interface, or both, such informed
+consent is simply not possible on today's web. Several examples from academia
+and practice show that it is possible to bridge this disconnect by addressing
+the linkability issues with the web's origin model with minimal breakage.
+Additionally, the first steps towards providing the user with an explicit
+representation of their web identity have been taken.
 
+The pieces are in place to build robust private browsing modes based on these
+two approaches, and metrics exist to measure their success.
+
 \bibliographystyle{plain} \bibliography{W3CIdentity}
 
 \clearpage

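The dual-keyed cookie origin described in the W3CIdentity.tex changes above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the design on the Mozilla wiki or any browser's implementation: the class name DoubleKeyedCookieJar, its methods, and the example.* origins are all hypothetical. It shows cookies being stored under the pair (top-level origin, third-party origin), so a third-party element only gets back the cookie that was set under the same URL-bar site.

```python
from collections import defaultdict


class DoubleKeyedCookieJar:
    """Toy cookie store keyed by (top-level origin, content origin).

    Hypothetical illustration only; real browsers track far more state.
    """

    def __init__(self):
        # (top_level_origin, content_origin) -> {cookie name: value}
        self._jar = defaultdict(dict)

    def set_cookie(self, top_level_origin, content_origin, name, value):
        # Record the cookie under the pair that was active at creation time.
        self._jar[(top_level_origin, content_origin)][name] = value

    def cookies_for_request(self, top_level_origin, content_origin):
        # Only cookies created under the same pair are transmitted.
        return dict(self._jar.get((top_level_origin, content_origin), {}))

    def clear_site(self, top_level_origin):
        # Drop everything accumulated under one URL-bar origin, including
        # state set by third-party content embedded on that site.
        for key in [k for k in self._jar if k[0] == top_level_origin]:
            del self._jar[key]


jar = DoubleKeyedCookieJar()
jar.set_cookie("https://news.example", "https://widgets.example", "uid", "abc123")

# Same top-level site: the embedded widget sees its cookie again.
same_site = jar.cookies_for_request("https://news.example", "https://widgets.example")

# Different top-level site: the same widget origin gets nothing to link with.
other_site = jar.cookies_for_request("https://shop.example", "https://widgets.example")
```

Keying on the pair rather than on the content origin alone is what removes the shared identifier across sites; clear_site mirrors the idea in the draft of deleting all state tied to one top-level origin in a single action.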

