[tor-commits] [tor-browser-spec/master] Clarify website traffic fingerprinting material a bit.

Mon Apr 28 15:18:48 UTC 2014

commit 0916cc399ec107a7a2aa39df232e191b3a105bd2
Author: Mike Perry <mikeperry-git at fscked.org>
Date:   Fri Mar 8 00:44:57 2013 -0800

    Clarify website traffic fingerprinting material a bit.
    
    Add citations, and improve phrasing.
---
 docs/design/design.xml |   68 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 44 insertions(+), 24 deletions(-)
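
Note on the Base Rate Fallacy paragraph touched by this patch: the argument
can be illustrated with a short Bayes' rule calculation. This is a minimal
sketch and not part of the commit. The function name `precision` and the
rates used (95% true positive rate, 1% false positive rate, and the three
base rates) are hypothetical values chosen only for illustration; they are
not measurements from any of the cited papers.

```python
def precision(tpr: float, fpr: float, base_rate: float) -> float:
    """Return P(visit really is the monitored site | classifier flagged it).

    tpr:       true positive rate of the classifier (assumed value)
    fpr:       false positive rate of the classifier (assumed value)
    base_rate: fraction of all observed traffic that is the monitored site
    """
    true_alarms = tpr * base_rate            # flagged and actually the site
    false_alarms = fpr * (1.0 - base_rate)   # flagged but something else
    return true_alarms / (true_alarms + false_alarms)


if __name__ == "__main__":
    # Hypothetical rates: the same classifier, evaluated against a balanced
    # lab dataset and against progressively rarer targets.
    for base_rate in (0.5, 0.01, 0.0001):
        p = precision(tpr=0.95, fpr=0.01, base_rate=base_rate)
        print(f"base rate {base_rate:<7} -> precision {p:.4f}")
```

Under these assumed rates, the fraction of alarms that are correct falls
sharply as the monitored site becomes a smaller share of total traffic, even
though the classifier itself is unchanged.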

diff --git a/docs/design/design.xml b/docs/design/design.xml
index f1e3f49..f8beb13 100644
--- a/docs/design/design.xml
+++ b/docs/design/design.xml
@@ -738,41 +738,56 @@ was formerly available only to Javascript.
      <para>
 
 Website traffic fingerprinting is an attempt by the adversary to recognize the
-encrypted traffic patterns of specific websites. The most comprehensive study
-of the statistical properties of this attack against Tor was done by <ulink
+encrypted traffic patterns of specific websites. The most comprehensive
+study of the statistical properties of this attack against Tor was done by
+<ulink
 url="http://lorre.uni.lu/~andriy/papers/acmccs-wpes11-fingerprinting.pdf">Panchenko
 et al</ulink>. Unfortunately, the publication bias in academia has encouraged
 the production of a number of follow-on attack papers claiming "improved"
-success rates using this attack in recognizing only very small numbers of
-websites. Despite these subsequent results, we are skeptical of the efficacy
-of this attack in a real world scenario, especially in the face of any defenses.
+success rates, which are enabled primarily by taking a number of shortcuts
+(such as classifying only very small numbers of websites, neglecting to
+publish ROC curves or at least false positive rates, and/or omitting the
+effects of dataset size on their results). Despite these subsequent
+"improvements" (which in some cases amusingly claim to completely invalidate
+any attempt at defense), we are skeptical of the efficacy of this attack in a
+real world scenario, <emphasis>especially</emphasis> in the face of any
+defenses.
 
      </para>
      <para>
 
-In general, with machine learning, as you increase the number of
-categories to classify with few reliable features to extract, either true
-positive accuracy goes down or the false positive rate goes up.
+In general, with machine learning, as you increase the <ulink
+url="https://en.wikipedia.org/wiki/VC_dimension">number and/or complexity of
+categories to classify</ulink> while maintaining a limit on reliable feature
+information you can extract, you eventually run out of descriptive feature
+information, and either true positive accuracy goes down or the false positive
+rate goes up. This error is called the <ulink
+url="http://www.cs.washington.edu/education/courses/csep573/98sp/lectures/lecture8/sld050.htm">bias
+in your hypothesis space</ulink>. In fact, even for unbiased hypothesis
+spaces, the number of training examples required to achieve a reasonable error
+bound is <ulink
+url="https://en.wikipedia.org/wiki/Probably_approximately_correct_learning#Equivalence">a
+function of the number of categories</ulink> you need to classify.
 
      </para>
       <para>
 
 
 In the case of this attack, the key factors that increase the classification
-requirements (and thus hinder a real world adversary who attempts this attack)
+complexity (and thus hinder a real world adversary who attempts this attack)
 are large numbers of dynamically generated pages, partially cached content,
 and non-web activity in the "Open World" scenario of the entire Tor network.
-This large set of classification categories is further confounded by a poor
-and often noisy available featureset, which is also realtively easy for the
-defender to manipulate.
+This high level of classification complexity is further confounded by a noisy
+and low resolution featureset, one which is also relatively easy for the
+defender to manipulate at low cost.
 
      </para>
      <para>
 
-In fact, the ocean of possible Tor Internet activity makes it a certainty that
-an adversary attempting to classify a large number of sites with poor feature
-resolution will ultimately be overwhelmed by false positives. This problem is
-known in the IDS literature as the <ulink
+In fact, the ocean of Tor Internet activity (at least, when compared to a lab
+setting) makes it a certainty that an adversary attempting to classify a large
+number of sites with poor feature resolution will ultimately be overwhelmed by
+false positives. This problem is known in the IDS literature as the <ulink
 url="http://www.raid-symposium.org/raid99/PAPERS/Axelsson.pdf">Base Rate
 Fallacy</ulink>, and it is the primary reason that anomaly and activity
 classification-based IDS and antivirus systems have failed to materialize in
@@ -1780,7 +1795,7 @@ audio and video objects.
    </para>
   </sect2>
 -->
-  <sect2 id="other">
+  <sect2 id="OtherSecurity">
    <title>Other Security Measures</title>
    <para>
 
@@ -1847,14 +1862,19 @@ developed SPDY as opposed simply extending HTTP to improve pipelining.
      </para>
      <para>
 
-Knowing this, we created the defense as an <ulink
+Knowing this, we created this defense as an <ulink
 url="https://blog.torproject.org/blog/experimental-defense-website-traffic-fingerprinting">experimental
 research prototype</ulink> to help evaluate what could be done in the best
-case with full server support (ie with SPDY).  Unfortunately, the bias in
-favor of compelling attack papers has caused academia to thus far ignore our
-requests, instead publishing only cursory (yet "devastating") evaluations that
-fail to provide even simple statistics such as the rates of actual pipeline
-utilization during their evaluations.
+case with full server support. Unfortunately, the bias in favor of compelling
+attack papers has caused academia to ignore this request thus far, instead
+publishing only cursory (yet "devastating") evaluations that fail to provide
+even simple statistics such as the rates of actual pipeline utilization during
+their evaluations, in addition to the other shortcomings and shortcuts <link
+linkend="website-traffic-fingerprinting">mentioned earlier</link>. We can
+accept that our defense might fail to work as well as others (in fact we
+expect it), but unfortunately the very same shortcuts that provide excellent
+attack results also allow the conclusion that all defenses are broken forever.
+So sadly, we are still left in the dark on this point.
 
      </para>
       </blockquote>
@@ -1864,7 +1884,7 @@ utilization during their evaluations.
      <para>
 
 In order to inform the user when their Tor Browser is out of date, we perform a
-privacy-preserving update check in the asynchronously in the background. The
+privacy-preserving update check asynchronously in the background. The
 check uses Tor to download the file <ulink
 url="https://check.torproject.org/RecommendedTBBVersions">https://check.torproject.org/RecommendedTBBVersions</ulink>
 and searches that version list for the current value for the local preference