[tor-dev] GoSC - Website Fingerprinting project

Mike Perry mikeperry at torproject.org
Wed Mar 19 17:22:54 UTC 2014


Marc Juarez:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Hi everyone,
> 
> Thanks for the answers. Some inline comments below.
> 
> On 03/12/2014 09:25 PM, Roger Dingledine wrote:
> > It sounds like the implementation might be the easy part, compared to the
> > design part. And since GSoC is mostly about implementation, and we would
> > want to be confident you won't spend all summer distracted by design,
> > it would be best if your proposal includes a lot of the design.
> 
> Yes, the project has an important research component that is still
> ongoing. I guess this proposal is not well suited for a GSoC project.
> Still, I am very interested in getting involved in the Tor dev community
> and specially in the implementation of website fingerprinting
> countermeasures. So, I would like to volunteer for this while I continue
> with the research.

I disagree. 

I think prototyping a solution is a great GSoC project, assuming you can
begin work on the prototype immediately and work on it during the
summer. We have a fairly clear idea what we think a padding protocol
should look like (at least for an initial prototype), and if you keep
the scope narrow, you should be able to generate running code as a
pluggable transport done during the summer project portion, at the very
least.

If it ends up too easy, you can always start the work on proposing and
transferring your padding protocol to tor-core itself, but I still think
prototyping in PT land first is a wise move.

I would consider mentoring such an app (at the very least as a secondary
mentor), and I suspect other people on the pluggable transport side
would as well.

I still encourage you to at least submit an application and see what
happens, but keep it focused on the coding side.

I look forward to seeing a GSoC proposal from you, and keep in mind that
the deadline is Friday. We are also allowed to ask you for updates and
clarifications to your proposal all the way up until April 18th.

> On 03/12/2014 09:25 PM, Roger Dingledine wrote:
> > It seems like some of the approaches would best be done inside Tor (as
> > modifications to the Tor program), and some of them would best be done
> > in a separate pluggable transport? Or should they all be done in a PT?
> > Can the bandwidth shaping in Scramblesuit (obfsproxy) be used as a
> > building block here?
> 
> On Wed 12 Mar 2014 11:14:06 PM CET, George Kadianakis wrote:
> > If we ever wanted to deploy these anti-traffic-analysis PTs to the
> > whole network, we would have to add PT support to all clients (HSes
> > might also benefit from this) and to all relays.
> 
> On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry at torproject.org
> > This sounds like a good summer-sized amount of work. I think I am in
> > agreement with George that pluggable transports are a good place to
> > start for prototyping this work. That way, you can experiment with
> > custom padding protocols easily, without needing to make invasive
> > changes to tor-core for each revision, each time.
> 
> Yes, PTs look like a good starting point. However, I think that for a
> final countermeasure it is required to pad until the middle-node (to
> frustrate and adversary who controls the guard) and, as George said, I
> guess deeper modifications will be needed in this case.

Right. I suggested PTs because they will be easy to reign in scope and
rapid prototype. I am also interested in padding to the middle node, but
be aware that in both cases (Guard and middle), your defense should be
tunable in terms of the capacity it consumes. It will be possible (and
likely) that we'll have consensus parameters that specify how much
traffic is available for padding to both Guard and Middle node, when we
roll this out.

I actually missed George's concerns about having to roll out PTs to all
nodes. I don't think that really matters or is a concern. If the PT
design looks good and functions well (especially if you have a working
classifier), it will be much easier to make a convincing argument to
deploy the mechanisms in tor-core, and will also likely make that
implementation work smoother and more focused on exactly what you
discover you do need from the prototype.

If it turns out to be too easy, you can begin work on implementing (or
at least proposing) your mechanisms in tor-core.
 
> On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry at torproject.org
> > For example, it would be neat to be able to transmit a set of statistics
> > to your bridge node during the connection handshake or with the circuit
> > setup, so that you don't have to always request downstream padding cells
> > with a upstream cell, and downstream padding can asynchronously arrive
> > according to some probability or histrogram distribution you specify.
> >
> > You could also obviously specify a number of cells to send in response
> > to a padding cell request (from O..N, where N is some reasonable cap
> > similar to a largeish web object size). The current Tor link padding
> > protocol supports neither of these operations.
> 
> Definitely. These are the type of building blocks I was referring to.
> Any padding-based countermeasure would benefit of the modifications you
> proposed. Something that I think it will be needed for any smart defense
> that tries to minimize the overhead is some regularly updated database
> of webpage traces. Since websites change constantly and we should assume
> that the adversary will train on the most recent data, the defense has
> to act accordingly.

Well maybe. It will be important to have the padding protocol support
primitives such that we can deploy a wide variety of padding mechanisms.

It may be the case that some statistical padding models don't require
frequent update, at the expense of decreased traffic efficiency (which
depending on overall accuracy reduction, may be an acceptable
tradeoff).

I am very optimistic about the efficacy of any defenses. One major dirty
secret of all of this work is that in the "Open World", typically only
one single web page (or a very small number) are recognized. When that
classifier is strained with more labels, I suspect it will quickly
collapse on its own.
 
> On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry at torproject.org
> > Related: Do you happen to have any existing classifier code working
> > already, by any chance?
> 
> Yes. We are using a modified version of Tao’s classifier (edit-distance
> based SVM) for the research project.
> 
> On 03/19/2014 04:25 AM, Kevin P Dyer wrote:
> > If It helps, the code [2] from our website fingerprinting paper [1] is
> > public. It includes the edit-distance classifier [3] from [4], which
> > wasn't reported on in [1], I believe.
> 
> Thanks, Kevin. In particular, we are using the Damerau-Levenshtein
> distance instead of the plain Levenshtein, which also takes into account
> transpositions. I think it describes better network traffic, specially
> when RP is used.
> 
> On Tue, Mar 18, 2014 at 7:30 PM, Mike Perry <mikeperry at torproject.org
> > One of the ideas I've been considering is taking a closer look at the
> > nearest-neighbor edit distances between page class labels, for the edit
> > distance based classifiers. This distance provides us with an estimate
> > of the ideal minimum cover traffic we will need to make testing
> > instances jump from one nearest-neighbor label to another (causing a
> > false positive). It will also decrease as the world size increases (more
> > class labels in the same amount of N-dimensional space).
> > 
> > A successful defense should change of the distribution of edit distances
> > of test instances around their class labels (it will increase the
> > intra-class variance) and this in turn will increase the size of the
> > threshold around class labels for a given accuracy rate, reducing
> > accuracy or increasing false positives.
> 
> Yes, that sounds very interesting. I guess something relevant in such a
> scheme is the distribution used to pick the web page in the k-NN cluster
> that will be used to generate the cover traffic. It would be interesting
> to see what is the distribution that minimizes the overhead introduced
> by this cover traffic while preventing statistical attacks.

Right. Or modeling the nature of the edit distances between many
different labels. At scale, statistical patterns may emerge between
clusters of labels. I suspect there are certain types of edit distances
that are fairly common between many similar sites.

-- 
Mike Perry
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 801 bytes
Desc: Digital signature
URL: <http://lists.torproject.org/pipermail/tor-dev/attachments/20140319/aa6d4d60/attachment.sig>


More information about the tor-dev mailing list