Hi folks,
Here are some brainstorming thoughts to get us started improving our
"unclassifiable protocols" idea.
----------------------------------------------------------------------
For context, right now we have essentially one deployed design in this
space, which is obfs4:
https://gitweb.torproject.org/pluggable-transports/obfs4.git/tree/doc/obfs4…
This protocol does a short handshake (one client message, one server
message) to establish session keys, then it just sends bytes back and
forth with an (additional) layer of encryption, so no payload bytes
help an attacker do content-based protocol classification.
----------------------------------------------------------------------
One of the more thorough analyses of obfs4 comes from the FTE folks:
https://www.freehaven.net/anonbib/#ccs2015obfsdetection
where they show a classifier that looks at the entropy of the first
server-side packet plus the length of that initial message, and they
can consistently pick out obfs4 flows with only a 0.25% false
positive rate compared to background traffic they recorded at their
university.
That false positive rate is where it gets interesting: it sounds good
until we realize that it means it flags 1 in every 400 flows as being
obfs4, i.e. it falsely tags about 36000 flows in the 14M flow dataset.
And the false positives aren't randomly distributed either: 1.22% of
their SSL flows are falsely tagged as obfs4, and it's the ones where
the SNI in the SSL handshake has a lot of entropy (because everything
else in the SSL handshake is already random). Also, entire protocols
like RTMP get caught up and tagged as obfs4, because their handshake
starts out with a block of random bytes.
A key takeaway is that if they ran this classifier, with its high true
positive rate and low false positive rate, on a large country's network
traffic, approximately none of the flows that it would tag as obfs4 would
actually be obfs4. Their Section 7 in that paper has a good discussion
of the base rate issues and how they apply here.
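To make the base rate point concrete, here's a back-of-the-envelope
calculation (Python). The number of genuine obfs4 flows in the trace and
the true positive rate are made-up assumptions; the point is just that
even a great-sounding 0.25% false positive rate swamps the true
positives:

    # Toy base-rate calculation. The count of genuine obfs4 flows (100)
    # and the TPR are assumptions for illustration, not numbers from the paper.
    total_flows = 14_000_000
    obfs4_flows = 100                    # assumed base rate
    other_flows = total_flows - obfs4_flows

    tpr = 0.98                           # assumed true positive rate
    fpr = 0.0025                         # 0.25% FPR from the paper

    true_pos = tpr * obfs4_flows         # ~98 flows
    false_pos = fpr * other_flows        # ~35,000 flows

    precision = true_pos / (true_pos + false_pos)
    print(f"flagged flows: {true_pos + false_pos:,.0f}")
    print(f"fraction of flagged flows that are really obfs4: {precision:.2%}")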
Are there other papers that design or analyze classifiers for obfs4?
----------------------------------------------------------------------
The good news: It would seem that obfs4 is most effective in the "long
tail of varied traffic" scenario. That is, considering the spectrum
between "small corporate network" and "China backbone", obfs4 needs that
more broad background traffic in order to make its false positives too
painful to block.
The bad news: I still worry about an attack that puts together many
building blocks, each of which individually is like the "1% false
positive rate" classifier in the above paper, but that together drive
their false positive rate low enough that blocking is safe to do. One
observation there is that the more complexity there is to a protocol,
the harder it is to "really" look like it, i.e. to match it in all
dimensions at once.
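A rough sketch of why this composition worries me: if the censor
requires several building-block classifiers to all agree, and
(optimistically for them) the false positives of each block are roughly
independent, then the combined false positive rate is the product of the
individual rates, while the true positive rate decays much more slowly.
The per-block numbers here are invented for illustration:

    # Sketch: ANDing together k independent building-block classifiers.
    # Per-block rates are illustrative assumptions, not measured values.
    block_tpr = 0.95     # each block catches 95% of our flows
    block_fpr = 0.01     # each block falsely flags 1% of background flows

    for k in range(1, 5):
        print(f"{k} block(s): TPR={block_tpr**k:.1%}  FPR={block_fpr**k:.6%}")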
Consider this lovingly handcrafted graph, where the X axis is how
thoroughly we try to look like some expected protocol, and the Y axis
is how close we can get to making the adversary unwilling to block us:
1_|                                                     |_1
  |                                                     |
  |\                                                   /|
  | \                                                 / |
  |  \                                               /  |
  |   \                                             /   |
0_|    \___________________________________________/    |_0
  |                                                     |
There's a huge valley in the middle, where we're trying to look like
something, but we're not doing it well at all, so there is little damage
from blocking us.
The ramp on the right is the game that FTE tries to play, where in theory
if they're not perfectly implementing their target protocol then the
adversary can deploy a distinguisher and separate their traffic from
the "real" protocol, but in practice there are enough broken and weird
implementations of the protocol in the wild that even being "close enough"
might cause the censor to hesitate.
And the ramp on the left is our unclassifiable traffic game, where we
avoid looking like any particular expected protocol, and instead rely
on the long tail of legitimate traffic to include something that we
blend with.
Observation: *both* of these ramps are in the "broken in theory, but works
in practice" situation. I started out thinking it was just the obfs4 one,
and that the FTE one was somehow better grounded in theory. But neither
side is actually trying to build the ideal perfect protocol, whatever
that even is. The game in both cases is about false positives, which
come from messy (i.e. hard to predict) real-world background traffic.
One research question I wrestled with while making the graph: which ramp
is steeper? That is, which of these approaches is more forgiving? Does one
of them have a narrower margin for error than the other, where you really
have to be right up against the asymptote or it doesn't work so well?
For both approaches, their success depends on the variety of expected
background traffic.
The steepness of the right-hand (look-like-something) ramp also varies
greatly based on the specific protocol it's trying to look like. At first
glance we might think that the more complex the protocol, the better
you're going to need to be at imitating it in all dimensions. That is,
the more aspects of the protocol you need to get right, the more likely
you'll slip up on one of them. But competing in the other direction:
the more complex the protocol, the more broken and weird implementations
there could be in practice.
I raise the protocol complexity question here because I think it
has something subtle and critical to do with the look-like-nothing
strategy. Each dimension of the protocol represents another opportunity
to deploy a classifier building block, where each classifier by itself
is too risky to rely on, but the composition of these blocks produces
the killer distinguisher. One of the features of the unclassifiable
protocol that we need to maintain, as we explore variations of it, is
the simplicity: it needs to be the case that the adversary can't string
together enough building-block classifiers to reach high confidence. We
need to force them into building classifiers for *every other protocol*,
rather than letting them build a classifier for our protocol.
(I'll also note that I'm mushing together the notion of protocol
complexity with other notions like implementation popularity and
diversity: a complex proprietary protocol with only one implementation
will be no fun to imitate, but the same level of complexity where every
vendor implements their own version will be much more workable.)
----------------------------------------------------------------------
I've heard two main proposed ways in which we could improve obfs in
theory -- and hopefully thus in practice too:
(A) Aaron's idea of using the latest adversarial machine learning
approaches to evolve a traffic transform that resists classification.
That is, play the classifiers off against our transform, in many
different background traffic scenarios, such that we end up with a
transform that resists classification (low true positive and/or high
false positive rate) in many of the scenarios.
(B) Philipp's idea from ScrambleSuit of having the transform be
parameterized, and for each bridge we choose and stick with a given
set of parameters. That way we're not generating *flows* that each
aim to blend in, but rather we're generating bridges that each aim
to blend differently (there's a small sketch of this below). This
diversity should adapt well to many different background traffic
scenarios because in every scenario some bridges might be identified
but some bridges will stay under the radar.
At first glance these two approaches look orthogonal, i.e. we can do both
of them at once. For example, Aaron's approach tells us the universe of
acceptable parameters, and Philipp's approach gives us diversity within
that universe.
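As a strawman for how (B) can work (ScrambleSuit does something in this
spirit): derive the transform's distributions deterministically from a
per-bridge secret, so every bridge looks different on the wire, but a
client that knows the secret ends up with the same parameters as the
bridge. The parameter names and ranges below are invented for
illustration:

    import hashlib
    import random

    def bridge_parameters(bridge_secret: bytes) -> dict:
        # Hash the per-bridge secret into a PRNG seed, then draw transform
        # parameters from that PRNG. Same secret -> same parameters on
        # client and bridge; different bridges -> different wire shape.
        seed = hashlib.sha256(b"obfs-parameters" + bridge_secret).digest()
        rng = random.Random(seed)
        return {
            "min_padding":  rng.randint(0, 32),
            "max_padding":  rng.randint(64, 1400),
            "mean_iat_ms":  rng.uniform(0, 50),                 # inter-arrival target
            "length_hist":  [rng.random() for _ in range(16)],  # packet-length weights
        }

    print(bridge_parameters(b"bridge-A-secret"))
    print(bridge_parameters(b"bridge-A-secret") ==
          bridge_parameters(b"bridge-B-secret"))   # False: bridges differ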
Aaron: How do we choose the parameter space for our transforms to
explore? How much of that can be automated, and how much needs to be
handcrafted by subject-matter experts? I see how deep learning can produce
a magic black-box classifier, but I don't yet see how that approach can
present us with a magic black-box transform.
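To make the (A)-style play-off loop a bit more concrete (it doesn't
answer the black-box-transform question), here is a deliberately tiny
sketch: the "transform" is just a target packet-length distribution, the
censor's building block is a single threshold on mean packet length, and
a random search keeps whichever parameters the censor classifies worst
across a few background scenarios. Everything here (the feature, the
classifier, the scenarios, the numbers) is a placeholder assumption; the
real thing would use richer features and an actual learning framework
rather than random search:

    import random

    def flows(mu, sigma, n=200):
        # A "flow" here is just its mean packet length; entirely synthetic.
        return [max(1.0, random.gauss(mu, sigma)) for _ in range(n)]

    # Made-up background traffic scenarios the censor might see.
    scenarios = [flows(600, 300), flows(1200, 200), flows(300, 150)]

    def censor_accuracy(ours, background):
        # The censor's building block: the best single threshold on mean
        # packet length for separating our flows from the background.
        labeled = [(x, 1) for x in ours] + [(x, 0) for x in background]
        best = 0.5
        for t in range(0, 2000, 25):
            acc = sum((x > t) == bool(y) for x, y in labeled) / len(labeled)
            best = max(best, acc, 1.0 - acc)   # either side of the threshold
        return best

    # Random search over transform parameters, keeping the set whose
    # worst-case distinguishability across the scenarios is lowest.
    best_params, best_score = None, 1.0
    for _ in range(100):
        params = (random.uniform(100, 1500), random.uniform(50, 400))
        ours = flows(*params)
        score = max(censor_accuracy(ours, bg) for bg in scenarios)
        if score < best_score:
            best_params, best_score = params, score

    print("chosen params:", best_params,
          "worst-case censor accuracy:", round(best_score, 3))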
And as a last note, it would sure be nice to produce transforms that
are robust relative to background traffic, i.e. to not be brittle or
overfit to a particular scenario. Otherwise we're giving up one of our few
advantages in the arms race, which is that right now we force the censor
to characterize the traffic -- including expected future traffic! --
and assess whether it's safe to block us.
There. Hopefully some of these ideas will cause you to have better
ideas. :)
--Roger
Hi Philipp!
I am finishing up my Defcon slides, and I realized I don't know where
to point people for setting up obfs4 bridges.
Google sends me to https://support.torproject.org/operators/operators-6/
but I bet that's not up-to-date (but maybe it should become up-to-date).
On the tor-relays list I see
https://trac.torproject.org/projects/tor/wiki/doc/PluggableTransports/obfs4…
which is (a) a really long url, and (b) an intimidatingly long page.
And I hear irl is working on a 'tor-bridge' meta-deb:
https://bugs.torproject.org/31153
So: assuming I want to tell a bunch of people to do something in a week
and a half, using a url they can hastily scribble down from my slide,
what should I tell them? :)
It is ok if the url doesn't say the right thing today but we're confident
it will say the right thing on the morning of Aug 8.
Thanks,
--Roger
I found a more extensive description of the MASQUE protocol:
<https://davidschinazi.github.io/masque-drafts/draft-schinazi-masque.html>
Here are my key take-aways after reading the draft:
* MASQUE enables circumvention by hiding a circumvention proxy behind a
web server, similar to Sergey's httpsproxy. Clients "unlock" the web
server's circumvention feature by using the newly-proposed HTTP
Transport Authentication Standard:
<https://tools.ietf.org/html/draft-schinazi-httpbis-transport-auth-00>
In a nutshell, clients need to send a CONNECT request with a transport
authentication header to .well-known/masque/initial. Crucially,
MASQUE defends against active probing by responding with "405 Method
Not Allowed" to failed authentication attempts -- the same response
one would get for an unexpected CONNECT request. This prevents
censors from learning whether a web server supports MASQUE. (There's
a rough sketch of this decision logic after this list.)
* Once a client has "unlocked" a MASQUE server, it can tunnel several
types of traffic over it: use it as an HTTP proxy, send DNS requests,
and do both UDP and TCP proxying.
* MASQUE supports both HTTP/3 (over QUIC) and HTTP/2 (over TLS 1.3).
There is a fallback mechanism from HTTP/3 to HTTP/2 to provide a
disincentive for censors to block QUIC or HTTP/3.
* MASQUE only provides obfuscation and does not provide anonymity. The
document suggests onion routing in Section 2.4 to work around this
shortcoming.
* MASQUE does not defend against traffic analysis but QUIC supports
padding, so there's a mechanism to mitigate this problem. Traffic
analysis defence is left for future work in Section 7.2.
* QUIC has a "connection migration" feature that allows clients to
seamlessly switch end-to-end connections from one MASQUE server to
another.
* Similar to onion services, MASQUE makes it possible to expose a
server behind NAT by using a MASQUE server as a rendezvous point.
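To spell out the probing-resistance idea from the first bullet, here's a
rough sketch of the server-side decision. The header name and the
check_transport_auth() helper are placeholders, not the draft's exact
wire format; the point is that every failure path collapses into the
answer an ordinary web server would give:

    # Rough sketch of MASQUE-style probing resistance on the server side.
    # "Transport-Authentication" and check_transport_auth are placeholders;
    # see draft-schinazi-httpbis-transport-auth-00 for the real mechanism.
    MASQUE_PATH = "/.well-known/masque/initial"

    def handle_request(method, path, headers, check_transport_auth):
        if method == "CONNECT" and path == MASQUE_PATH:
            token = headers.get("Transport-Authentication")
            if token is not None and check_transport_auth(token):
                return "2xx: unlocked, start proxying for this client"
        if method == "CONNECT":
            # Failed auth, wrong path, or an unrelated CONNECT all get the
            # same answer an ordinary web server would give, so an active
            # prober learns nothing about MASQUE support.
            return "405 Method Not Allowed"
        return "serve the normal website"

    print(handle_request("CONNECT", MASQUE_PATH, {}, lambda token: False))
    # -> "405 Method Not Allowed", indistinguishable from a non-MASQUE server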
There's an IETF mailing list for the project:
<https://mailarchive.ietf.org/arch/browse/masque/>
MASQUE may make a good pluggable transport and we should engage early to
make this happen. I intend to suggest this idea on their mailing list
soon.
Cheers,
Philipp
These are in response to a conversation with phw and cohosh about
observing how people use Tor Browser.
Here are links relating to the 2015 UX Sprint, at which we did
one-on-one observations of a few users.
https://blog.torproject.org/ux-sprint-2015-wrapup
https://trac.torproject.org/projects/tor/wiki/org/meetings/2015UXsprint
The videos of the sessions are here:
https://people.torproject.org/~dcf/uxsprint2015/
The source code for our screen recording setup is here. I'm attaching
the README.
https://www.bamsoftware.com/git/repo.eecs.berkeley.edu/tor-ux.git
Here's a rough description of the whole process.
We ran user experiments over two days. We were hampered by late
recruiting (Berkeley only approved the experiment literally the
day before) and we only had five participants, three on Saturday
and two on Sunday. But that number turned out to be plenty
because of how labor-intensive the process was.
We had all the devs in a room downstairs with a projector. The
experiments were run upstairs in a smaller room on a laptop.
What we did was screencast the laptop screen to the projector
downstairs (using VLC streaming through an SSH tunnel).
Developers could watch user interactions live, and we also
recorded the whole thing. We audio recorded the subjects' voices
on handheld audio recorders (we encouraged them to think out
loud as they were using the software). The idea was to
transcribe the audio and add it to the screen recordings as
subtitles. That was a huge amount of work, but the videos are
really interesting and enlightening.
We gave participants the choice of a Windows or Mac laptop. It
happens that all of them chose Mac, though we had Windows ready
to go. Windows was a big pain to set up with VLC and SSH; I
might do that differently a second time.
We originally planned to run up to two experiments
simultaneously. That would have been way too hectic. It's a lot
of work greeting people, doing paperwork, doing the experiment,
making sure you reset the computer in between experiments--we
really needed an extra person downstairs all that time. With
more researchers (i.e., people allowed to interact with
participants) it would be possible.
We budgeted one hour of one-on-one time with each participant.
It ended up taking between 15 and 30 minutes of actual
experiment for each (that's how long the resultant videos are),
but that doesn't count all the paperwork and setup time. 60
minutes of actual experiment would have been too long. It could
be that long if it were not one on one. But also, there was high
variance in how long things took to complete. Some users took
longer to install than others, and one took a long time trying
to find a specific UI element.
The repo for our PETS 2017 paper is here:
https://github.com/lindanlee/PETS2017-paper
The screen capture videos from a pilot session of recording are here:
https://github.com/lindanlee/PETS2017-paper/tree/master/sessions/pre/videos
Scripts relating to setting up the simulated censorship firewall are
here:
https://github.com/lindanlee/PETS2017-paper/tree/master/experiment
I skimmed Matic et al.'s NDSS'17 paper "Dissecting Tor Bridges: a
Security Evaluation of Their Private and Public Infrastructures":
<https://censorbib.nymity.ch/pdf/Matic2017a.pdf>
Below are the points that stood out the most to me. Note however that
the study is from 2017 and some numbers may no longer be valid.
* With the exception of China and the U.S., default bridges served by
far the most users. The popularity of default bridges makes them a
single point of failure that is easy for censors to block.
We have been losing default bridges over the last few months and
should be recruiting new operators:
<https://trac.torproject.org/projects/tor/wiki/doc/TorBrowser/DefaultBridges>
We have a list of criteria for new operators at the bottom of the
page. If you are interested in running a default bridge, please get
in touch.
* Four OR ports (443, 8443, 444, and 9001) were used much more often
than others. Today, 19% of bridges use port 9001 and 17% use port
443. Port 9001 is problematic because it is an attractive target for
Internet-wide scans. So is port 443, but it has some merit for users
who seek to circumvent corporate firewalls. Is there any reason at
all to have bridges listen on port 9001? If not, we should ask these
operators to pick a new port.
* If a censor discovers a bridge and the bridge runs an SSH server,
the censor can fetch its fingerprint and use Shodan to find other SSH
servers with the same fingerprint (see the sketch after this list).
These servers may also be running
bridges. Bridge-specific SSH keys would fix this problem. We may
want to create a "bridge opsec guide" for subtle issues like these.
* Shodan and Censys are search engines for Internet-wide scans. The
authors were successful in using these datasets to find 35% of
"public" bridges, i.e., bridges that publish their server descriptor
to the bridge authority. The idea is to look for certificates that
resemble a Tor bridge and then actively probe the port to confirm this
suspicion. This is a tricky balancing act: most bridges should
probably avoid the ports that Shodan and Censys scan but we need a few
of them for users whose firewalls whitelist these ports.
* Many bridges run a mix of transports, some resistant *and* some
vulnerable to the GFW's active probing attacks. The vulnerable
protocols are a
liability to the resistant protocols. We fixed this issue in BridgeDB
(#28655) but it remains a problem for Internet-wide scans: if a censor
discovers a bridge via a port 9001 scan, obfs4's probing-resistance
doesn't help. The need to have an open OR port (#7349) remains a
painful issue.
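On the SSH key reuse point above: the fingerprint a censor would pivot
on is just a hash of the host's public key, so reusing a key across
machines makes them trivially linkable. A small sketch of computing the
OpenSSH-style SHA256 fingerprint from a known_hosts/authorized_keys-style
line (the key material below is fake):

    import base64
    import hashlib

    def ssh_fingerprint(pubkey_line: str) -> str:
        # Lines look like "ssh-ed25519 <base64 key blob> comment".
        # OpenSSH's SHA256 fingerprint is the unpadded base64 of
        # sha256(key blob).
        blob = base64.b64decode(pubkey_line.split()[1])
        digest = hashlib.sha256(blob).digest()
        return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")

    # Fake key blob, shared by two "hosts" to show the linkability problem.
    fake_blob = base64.b64encode(b"\x00\x00\x00\x0bssh-ed25519" + b"\x00" * 32).decode()
    host_a = f"ssh-ed25519 {fake_blob} bridge.example"
    host_b = f"ssh-ed25519 {fake_blob} unrelated.example"

    print(ssh_fingerprint(host_a))
    print(ssh_fingerprint(host_a) == ssh_fingerprint(host_b))  # True: same key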
Also, wouldn't it be useful to have a mechanism to instruct bridges to
stop serving a transport? At this point, there is no reason to still
serve obfs2, obfs3, or ScrambleSuit.
Cheers,
Philipp