[tor-dev] WTF-PAD and the future

Fri Jul 27 19:03:15 UTC 2018

George Kadianakis:
> Hello Mike,
> 
> I had a talk with Marc and Mohsen today about WTF-PAD. I now understand
> much more about WTF-PAD and how it works with regards to histograms.  I
> think I might even understand enough to start some sort of conversation
> about it:
> 
> Here are some takeaways:
> 
> 1) Marc and Mohsen think that WTF-PAD might not be the way forward
>    because of its various drawbacks and its complexity. Apparently there
>    are various attacks on WTF-PAD that Roger has discovered (SENDME
>    cells side-channels?) and also the deep learning crowd has done some
>    pretty good damage to the WTF-PAD padding (90%-60% accuracy?). They
>    also told me that achieving needed precision on the timings might be
>    a PITA.

Are there citations for any of this? Last I heard Matt Wright was
working on a deep learning study but the results were mixed.

Furthermore, we need to do adversarial learning and other optimizations
on these histograms to tune them. They are a generalized approach. Just
like it is not a valid evaluation to train a classifier on a dataset and
then add a new defense and show that it can't classify the defended
traffic using the old model, it is similarly not accurate to develop an
attack on WTF-PAD with a new classifier without also adversarially
optimizing the WTF-PAD histograms under that classifier. When you do
this, your results are not invalidating WTF-PAD, they are only
invalidating the histograms that were tuned against the previous
classifier/attack.
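
As a rough illustration of the evaluation loop described above, here is a
minimal Python sketch. Everything in it is hypothetical: `defend`,
`train_attack`, and `accuracy` are placeholder callables, and the plain
search over `candidates` stands in for whatever optimization would
actually be used. The only point it makes is that the attack is retrained
on defended traffic for each candidate histogram set before the defense
is judged.

```python
def tune_defense(candidates, defend, train_attack, accuracy, train, test):
    """Pick the candidate parameters that a retrained attack handles worst.

    All callables are hypothetical placeholders, not part of any Tor or
    WTF-PAD code base:
      defend(trace, params)   -> defended trace
      train_attack(traces)    -> attack model trained on defended traces
      accuracy(model, traces) -> attack accuracy in [0, 1]
    """
    best = None
    for params in candidates:
        defended_train = [defend(t, params) for t in train]
        defended_test = [defend(t, params) for t in test]
        # The attacker adapts to the defense: retrain for every candidate.
        model = train_attack(defended_train)
        acc = accuracy(model, defended_test)
        # Keep the candidate that leaves the adapted attacker least accurate.
        if best is None or acc < best[1]:
            best = (params, acc)
    return best
```

A low accuracy for one fixed set of histograms under this loop says
something about those histograms only, not about the framework.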

The same thing applies to the SENDME concern. The core piece of the
SENDME issue is "Tor should never send more than 1000 cells without a
SENDME. So *IF* I can tell which cells are SENDMEs, and *IF* I see more
than 1000 cells between them, then AHA I know that some cells are
actually padding and not real traffic".

Both of these are very big *IF*s, and even if they were shown to be
valid assumptions (which AFAIK they have not been), that does not mean
that it is actually useful for a classifier to know the percentage of
padding after 1000 cells, and it also does not mean that there isn't a
simple tweak to the histograms that encodes what looks like SENDME
transmission to that classifier.
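
To make the two *IF*s concrete, here is a minimal Python sketch of the
inference as stated above. It assumes the observer already has a
per-cell guess of which cells are SENDMEs (the first *IF*), and it
treats the trace as a single labelled sequence, ignoring direction. The
function name and the input format are made up for illustration; only
the 1000-cell limit comes from the description above.

```python
WINDOW = 1000  # limit quoted above: at most 1000 cells without a SENDME


def min_padding_per_gap(cells, window=WINDOW):
    """Lower bound on padding cells in each gap between suspected SENDMEs.

    `cells` is an iterable of booleans, True where the observer believes
    the cell is a SENDME. This labelling is an assumption, not something
    the observer is known to be able to produce.
    """
    excess = []
    run = 0
    for is_sendme in cells:
        if is_sendme:
            # Any cells beyond the window cannot all be real traffic.
            excess.append(max(0, run - window))
            run = 0
        else:
            run += 1
    # Account for the trailing run after the last suspected SENDME.
    excess.append(max(0, run - window))
    return excess


trace = [False] * 1200 + [True] + [False] * 300
print(min_padding_per_gap(trace))  # [200, 0]
```

Even when this works, the output is only a lower bound on the padding
count per gap. Whether that number helps a classifier is the second,
separate question.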

> 2) From what I understand you are also hoping to use WTF-PAD to protect
>    against circuit fingerprinting and not just website
>    fingerprinting. They told me that while this might be plausible,
>    there is no current research on how well it can achieve that.  Are we
>    hoping to do that? And what research remains here? How can I help?
>    Which parts of the Tor circuit protocol are we hoping to hide?

I am designing WTF-PAD to be a framework for deploying padding against
arbitrary traffic analysis attacks. It is meant to allow us to define
histograms on the fly (in the Tor consensus) as these are studied. The
fact that they have not yet been studied is not super relevant to
deploying the framework for it now.

> 3) Marc and Mohsen suggested using application-layer defences because
>    the application-layer has much better view of the actual structures
>    that are sent on the wire, instead of the black box view that the
>    network layer has.
> 
>    In particular they were mainly concerned about onion services
>    fingerprinting because they are part of a restricted closed world,
>    whereas they were less concerned about the entire internet because of
>    its vast size.
> 
>    They suggested that we could investigate using the service-side
>    "alpaca" library for onion services (e.g. as part of securedrop?)
>    which should resolve the most pressing concern of HS identification.

I mean yeah application-layer defenses are useful for website traffic
fingerprinting, but that is a very narrow slice of the traffic analysis
problems that I want this framework to solve.

WTF-PAD doesn't rule out hidden service operators using alpaca,
either.

> 4) They also told me of research by Tobias Pulls which eliminates the
>    needs for histograms in WTF-PAD and instead it samples from the
>    probability distribution directly. They think that this can simplify
>    things somewhat. Any thoughts on this?

Yes this is actually exactly what I want to do with the next iteration
of WTF-PAD! The question is what form/model to use for these probability
distributions. Right now we're encoding inter-burst and inter-packet
timings with some weird geometric distribution determining how long
these bursts should go on for, when it might be more natural to encode
and sample from length-based distributions/histograms.

(Histograms vs distributions is not the problem -- it's what they encode
and how they encode it that matters).
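
The difference between the two sampling styles can be sketched in a few
lines of Python. This is an illustration only: the bin layout, the
function names, and the choice of a log-normal distribution are
assumptions, not the WTF-PAD histogram format or the approach in
Tobias's work.

```python
import random


def sample_from_histogram(edges, counts, rng):
    """Pick a bin with probability proportional to its count, then draw a
    delay uniformly inside that bin. `edges` has len(counts) + 1 entries."""
    i = rng.choices(range(len(counts)), weights=counts, k=1)[0]
    return rng.uniform(edges[i], edges[i + 1])


def sample_from_distribution(mu, sigma, rng):
    """Draw a delay directly from a parametric distribution (log-normal
    here, purely as an example)."""
    return rng.lognormvariate(mu, sigma)


rng = random.Random(0)
# Hypothetical inter-packet delay bins, in seconds.
edges = [0.0, 0.01, 0.1, 1.0]
counts = [5, 3, 1]
print(sample_from_histogram(edges, counts, rng))
print(sample_from_distribution(-4.0, 1.0, rng))
```

Both functions return a delay. What differs is how the shape is
described (a table of bins versus a couple of parameters), which is
separate from the question of what quantity is being encoded.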

I don't see this paper on Tobias's website. Is it up anywhere yet?

> Let me know what you think. I still don't understand the entire space
> completely yet, so please be gentle. ;) 

I hope I was gentle enough. If there's anything that triggers rage mode
in me more than someone being wrong on the internet, it's FUD and
hand-wringing being spread on the internet. ;)

-- 
Mike Perry