[tor-project] Constructing a real-world dataset for studying website fingerprinting

Jansen, Robert G CIV USN NRL (5543) Washington DC (USA) rob.g.jansen at nrl.navy.mil
Thu Apr 20 21:16:04 UTC 2023


Hello Tor friends,

We are planning to construct a real-world dataset for studying Tor website
fingerprinting that researchers and developers can use to evaluate potential
attacks and to design informed defenses that improve Tor’s resistance to such
attacks. We believe the dataset will help us make Tor safer, because it will
allow us to design defenses that can be shown to protect *real* Tor traffic
instead of *synthetic* traffic. This will help ground our evaluation of proposed
defenses in reality and help us more confidently decide which, if any, defense
is worth deploying in Tor.

We have submitted detailed technical plans for constructing a dataset to the Tor
Research Safety Board and after some iteration have arrived at a plan in which
we believe the benefits outweigh the risks. We are now sharing an overview of
our plan with the broader community to provide an opportunity for comment.

More details are below. Please let us know if you have comments.

Peace, love, and positivity,
Rob

P.S. Apologies for posting near the end of the work-week, but I wanted to get
this out in case people want to talk to me about it in Costa Rica.

===

BACKGROUND

Website fingerprinting attacks distill the traffic patterns observed between a
client and its Tor entry relay into a sequence of packet directions: -1 if a
packet is sent toward the destination, +1 if a packet is sent toward the client.
An attacker can collect a list of these directions and then train machine
learning classifiers to associate a website domain name or URL with the
particular list of directions observed when visiting that website. Once this
training is done, whenever the attacker observes a new list of directions it can
use the trained model to predict which website corresponds to that pattern.

For example, suppose [-1,-1,+1,+1] is associated with website1 and [-1,+1,-1,+1]
is associated with website2. There are two steps in an attack:

Step 1:
In the first step the attacker itself visits website1 and website2 many times
and learns:
[-1,-1,+1,+1] -> website1
[-1,+1,-1,+1] -> website2
It trains a machine learning model to learn this association. 

Step 2:
In the second step, with the trained model in hand, the attacker monitors a Tor
client (maybe the attacker is the client’s ISP, or some other entity in a
position to observe a client’s traffic) and when it observes the pattern:
[-1,-1,+1,+1]
the model will predict that the client went to website1. This example is
*extremely* simplified, but I hope it gives an idea of how the attack works.
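
As a purely illustrative sketch (this is not our measurement or attack code; the
toy traces, labels, and classifier choice are assumptions made only for this
example), the two steps above map onto training and then applying an ordinary
classifier over direction sequences:

# Toy example only: a tiny classifier over fixed-length direction
# sequences (-1 toward the destination, +1 toward the client).
# Real attacks use far longer traces and stronger models.
from sklearn.ensemble import RandomForestClassifier

# Step 1: the attacker crawls the sites itself and records directions.
training_traces = [
    [-1, -1, +1, +1],  # observed while visiting website1
    [-1, +1, -1, +1],  # observed while visiting website2
]
training_labels = ["website1", "website2"]
model = RandomForestClassifier(n_estimators=10)
model.fit(training_traces, training_labels)

# Step 2: the attacker observes a client's trace and predicts the site.
observed = [[-1, -1, +1, +1]]
print(model.predict(observed))  # -> ['website1']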

PROBLEM

Because researchers don’t know which websites Tor users are visiting, it’s hard
to do a very good job creating a representative dataset that can be used to
accurately evaluate attacks or defenses (i.e., to emulate steps 1 and 2). The
standard technique has been to just select popular websites from top website
lists (e.g., Alexa or Tranco) and then set up a Tor webpage crawler to visit the
front-pages of those websites over and over and over again. Then they use that
data to write papers. This approach has several problems:

- Low traffic diversity: Tor users don’t only visit front-pages. For example,
they may conduct a web search and then click a link that brings them directly to
an internal page of a website. The patterns produced from front-page visits may
be simpler and unrepresentative of the patterns that would be observed from more
complicated internal pages.

- Low browser diversity: Research from Marc Juarez [0] and others has shown that
the webpage crawlers used by researchers lack diversity in important respects,
which causes us to overestimate the accuracy of website fingerprinting (WF)
attacks. For example, the browser versions, configuration choices, variation in
behavior (e.g., using multiple tabs at once), and network location of the client
can all significantly affect the observable traffic patterns in ways that a
crawler methodology does not capture.

- Data staleness: Researchers collect data over a short time-frame and then
evaluate attacks against this static dataset. In the real world, websites are
updated over time, and a model trained on an old version of a website may not
transfer to the new version.

In addition to the above problems in methodology, current research also imposes
incidental costs on the Tor network:

- Network overhead: Machine learning is a hot topic, and several research groups
have crawled tens of thousands of websites over Tor many times each. While each
individual page load might be insignificant compared with the normal usage of
Tor, crawling adds additional load to the network and can contribute to
congestion and performance bottlenecks.

Researchers have been designing attacks that are shown to be extremely accurate
when evaluated with the above synthetic crawling methodology. But because of the
above problems, we don't properly understand the *true* threat these attacks
pose to the Tor network. It is possible that the simplicity of the crawling
approach is what makes the attacks work well, and that the attacks would not
work as well if evaluated with more realistic traffic and browser diversity.

PLAN

So our goal is to construct a real-world dataset for studying Tor website
fingerprinting that researchers and developers can use to evaluate potential
attacks and to design informed defenses that improve Tor’s resistance to such
attacks. This dataset would enable researchers to use a methodology that does
not have any of the above limitations. We believe that such a dataset will help
us make Tor safer, because it will allow us to design defenses that can be shown
to protect *real* Tor traffic instead of *synthetic* traffic. This would lead to
a better understanding of proposed defenses and enable us to more confidently
decide which, if any, defense is worth deploying in Tor.

The dataset will be constructed from a 13-week exit relay measurement that is
based on the measurement process established in recent work [1]. The primary
information being measured is the directionality of the first 5k cells sent on a
measurement circuit, and a keyed HMAC of the first domain name requested on the
circuit. We also record circuit and cell timestamps relative to the start of the
measurement. The measurement data is compressed, encrypted using a
public-key encryption scheme (the secret key is stored offline), and then
temporarily written to persistent storage before being securely retrieved from
the relay machine.
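
To illustrate the keyed HMAC step only (the key handling, digest choice, and
example domain below are placeholder assumptions, not the parameters of our
actual measurement code), the per-circuit record of the requested domain would
look something like:

# Sketch of recording a keyed HMAC of a requested domain name.
# The key source, digest, and encoding here are assumptions for
# illustration; only the digest would be stored, never the domain.
import hashlib
import hmac
import os

measurement_key = os.urandom(32)  # placeholder; a fixed secret key in practice
first_domain = "example.com"      # first domain name requested on the circuit

digest = hmac.new(measurement_key, first_domain.encode("utf-8"),
                  hashlib.sha256).hexdigest()
print(digest)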

We hope that this dataset can become a standard tool that website fingerprinting
researchers and developers can use to (1) accelerate their study of attacks and
defenses, and (2) produce evaluations and results that are more directly
applicable to the Tor network. We plan to share it only upon request with other
researchers who appear to come from verifiable research organizations, such as
students from well-known universities. We will require researchers with whom we
share the data to (1) keep the data private, and (2) direct others who want a
copy of the data to us, to mitigate unauthorized sharing.

[0] A Critical Evaluation of Website Fingerprinting Attacks. Juarez et al., CCS 2014. https://www1.icsi.berkeley.edu/~sadia/papers/ccs-webfp-final.pdf

[1] Online Website Fingerprinting: Evaluating Website Fingerprinting Attacks on Tor in the Real World. Cherubin et al., USENIX Security 2022. https://www.usenix.org/conference/usenixsecurity22/presentation/cherubin



