[tor-project] Constructing a real-world dataset for studying website fingerprinting

Tobias Pulls tobias.pulls at kau.se
Mon Apr 24 22:40:13 UTC 2023


On 24/04/2023 16:17, Jansen, Robert G CIV USN NRL (5543) Washington DC
(USA) via tor-project wrote:
>> Do you have any approaches or thoughts around pruning the dataset
>> or further refining labels beyond the first domain?
> 
> I should note that we will include both the HMAC of first domain and
> an HMAC of its shortest private suffix (computed using Mozilla’s
> public suffix list and libpsl (publicsuffix.org
> <http://publicsuffix.org/>)).
> 
> IMO, the first domain is by far the most important. Other domains
> accessed on a typical TB circuit will primarily be due to fetching
> the embedded objects fetched during page loads. We’ll get a new
> circuit if the domain in the URL bar changes, and that will have its
> own first domain.
> 
> I’m trying to collect the minimal thing here to balance privacy and
> utility; while not perfect from a ML “give me all the data”
> perspective, I think the dataset would allow us to make significant
> moves in the right direction.

Cool with the suffix addition! The public suffix list is a good source. 
Agree that the first domain is by far the most important. Some noisy 
function over the later domains could provide plenty of utility with few 
downsides, though.
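To make the labeling scheme concrete, here is a minimal sketch of the keyed-HMAC labels described above, using Python's stdlib `hmac`. The tiny hardcoded suffix set is a toy stand-in only; a real deployment would use libpsl with Mozilla's public suffix list, as Rob describes.

```python
import hashlib
import hmac

# Toy stand-in for Mozilla's public suffix list (publicsuffix.org);
# a real deployment would use libpsl instead of this hardcoded set.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}

def shortest_private_suffix(domain: str) -> str:
    """Return the registrable domain: one label below the longest
    matching public suffix (e.g. example.co.uk for www.example.co.uk)."""
    labels = domain.lower().rstrip(".").split(".")
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[max(i - 1, 0):])
    return domain  # no matching suffix; fall back to the full name

def domain_labels(domain: str, key: bytes) -> tuple[str, str]:
    """Keyed HMACs of the first domain and its shortest private suffix."""
    mac = lambda d: hmac.new(key, d.encode(), hashlib.sha256).hexdigest()
    return mac(domain), mac(shortest_private_suffix(domain))

# Both labels for one first domain observed on a circuit.
first, suffix = domain_labels("www.example.co.uk", key=b"secret-key")
```

With the offline secret key, the suffix label lets different subdomains of the same registrable domain be grouped without revealing the domain itself.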

For example, a less cumbersome approach could be a bucketed count of the 
number of *distinct* domains looked up in each trace (0-5, 6-10, 11-15, 
16+). That, together with the traces, would allow more informative 
labels. It's already known that the difference between standard and 
safest TB security levels (i.e., JavaScript enabled or not) is 
significant, and it'd be great to make the labeling heuristics more 
solid.
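Something like the following sketch is what I have in mind; the bucket edges and names are illustrative, not a proposal for the exact implementation:

```python
def distinct_domain_bucket(domains: list[str]) -> str:
    """Bucketed count of *distinct* domains looked up in a trace.

    Coarse buckets leak little on their own, but still help label
    traces as, e.g., likely JavaScript-heavy page loads versus
    single-domain fetches. Bucket edges here are illustrative only.
    """
    n = len({d.lower() for d in domains})
    if n <= 5:
        return "0-5"
    if n <= 10:
        return "6-10"
    if n <= 15:
        return "11-15"
    return "16+"
```

Note that only the distinct count is bucketed; repeated lookups of the same domain don't move a trace between buckets.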

> First, I don’t think “junk” will be anywhere near the problem you
> imagine. Yes, there will be some noise from non-TB activity. But I
> believe that the *overwhelming* activity is people using TB going to
> top websites (see our IMC 2018 paper [0]). Second, you will be able
> to count the number of circuits that exist for each domain key. While
> this may not exactly match up with Tranco, it will probably be pretty
> close [0]. Third, I’m very happy to introduce this noise into the WF
> evaluation process. The adversary has to deal with it at least during
> testing, so you should too :)
> 
> [0] https://www.robgjansen.com/publications/torusage-imc2018.pdf

I'm familiar with your IMC paper; it's nice work! Are you inferring TB 
over non-TB usage from the observed torproject.org primary-domain spam, 
or from something else in the paper that enables you to differentiate? 
The ratio of TB traffic in the proposed dataset would be fantastic to 
know, and linking it to traces would be even better.

The issue isn't dealing with the junk traffic in testing (that part is 
fine); it's being able to deal with it properly in training without 
having to repeat the data collection. Being able to tell more about the 
traces would also make it easier to reason about testing results.

What percentage of exit bandwidth are you planning to use for the 
13-week collection? Even with a 0.1% exit probability over 13 weeks, 
extrapolating from your IMC paper, there should be 1000+ samples for the 
Tranco top-1k. That would be great, if it's mostly TB traffic.
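For what it's worth, that extrapolation is just proportional scaling; a one-liner sketch with purely hypothetical inputs (none of these numbers are from the paper):

```python
def expected_samples(total_visits_per_week: float,
                     exit_probability: float,
                     weeks: int,
                     num_sites: int) -> float:
    """Back-of-envelope: expected traces per site captured at the exit,
    assuming visits are spread uniformly over the sites and that the
    exit sees traffic in proportion to its exit probability. All
    inputs are hypothetical placeholders to be filled in."""
    return total_visits_per_week * exit_probability * weeks / num_sites

# Hypothetical: 10M weekly visits, 0.1% exit probability, 13 weeks, top-1k.
per_site = expected_samples(10_000_000, 0.001, 13, 1000)
```

The real numbers depend on actual TB usage rates, which is exactly why knowing the TB ratio in the dataset matters.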

>> - Filled with an unknown rate of junk, the dataset alone will be
>> insufficient to train WF attacks in noisy environments (you want
>> ~1000+ samples per class of something coherent, at least with
>> current SOTA DL models pushed to their limits; as we move on to
>> transformers, probably even more). Suppose you subscribe to the
>> claims of data staleness being a factor. In that case, researchers
>> cannot even go through the very time-consuming process of collecting
>> adequate labeled training data in the same way to show success on
>> the proposed dataset. The exit collection vantage point makes this
>> worse.
>> 
> 
> We’ll have enough data that concept drift can be studied within our
> set exclusively. If our set isn’t big enough, or if it becomes stale
> after a while and we’re wanting more realistic threat estimates with
> current data, then yeah that’s another TRSB submission and we’ll have
> to think through the risks and relative benefits again. As it is now,
> we have a very large step forward in terms of benefit because we are
> starting from zero.

Without adequate labeling I have my doubts, at least with current 
attacks. I guess it'll trigger more research.

> If we determine with this first set that we really need a continuous
> stream of the latest data and lots more of it, and want to establish
> some permanent measurement setup, that of course would be useful but
> also has its own risks. Right now I would argue against a permanent
> measurement because we haven’t even done the more basic thing yet.

The dataset you propose is already massive; more data isn't the issue. 
More refined labeling, please!

>> - Using the dataset as a basis for simulating defenses, one would
>> have to simulate the corresponding client and middle traces to feed
>> into the simulated defenses based on exit traces. Kinda messy with
>> poor signalling for Tor-network characteristics in the dataset.
>> When collecting at the client, getting realistic client traces (for
>> the particular configuration) is basically for free. At the same
>> time, middles change for every circuit, so a half-assed approach
>> seems to get you far (speaking from experience of half-assing
>> things here!). I worry about half-assing both client and middle
>> traces, however.
> 
> This is solvable. You could construct a website simulator using the
> downstream cell streams, which we collect on the exit before Tor’s
> queuing plays a part in messing with the timing. Once you have that,
> use Shadow to fetch the simulated sites through a simulated Tor
> network, which should do a reasonable (and getting better) job of
> adding back in the performance effects on top of the cell streams.
> Not perfect, but maybe passes your half-ass bar?

Cool, maybe it's good enough? A plus is that one could simulate many 
more traces at clients and middles. For most destinations/circuits, I'd 
guess the destination<->exit latency is significantly lower than 
client<->exit, so it should be possible to simulate blocking defenses 
more accurately. It's not a replacement for implementations and real 
defended datasets though; I hope we can agree on that?

> 
>> If we make this proposed dataset and its method the bar of doing
>> "real-world WF", it might lead to too high of a bar. We want more
>> real-world implementations and data collection with defenses, not
>> less, I think.
> 
> Yes, I absolutely want us to raise the bar for doing this type of
> research. If that means professors can write fewer papers, and the
> ones we get will be more informative, then I will have succeeded!
> 

Please think of the professors and PhD students! ;) Attacks are already 
closed-world perfect on unprotected data, so the dataset will surely 
help trigger more publishing on attacks. I'm mostly concerned about the 
bar for defenses, and about the community emphasizing simulation over 
implementation as a consequence. We want more implemented and ultimately 
deployed defenses, right?

>> Sorry if the above may come across as a bit negative, it's not my
>> intent: I *want* the dataset you describe, we can learn a lot from
>> it for sure. Wish I had a chance to chat in person in Costa Rica!
>> Please don't feel obliged to reply, just food for thought.
>> 
> 
> How could I resist responding? :)
> 
> I really do appreciate your thoughts, thanks for sharing!

Thanks, likewise. I was thinking the beach would be nicer! ;)

Best,
Tobias

> 
> Peace, love, and positivity, Rob
> 
>> Best, Tobias
>> 
>> 
On 20/04/2023 23:16, Jansen, Robert G CIV USN NRL (5543) Washington DC 
(USA) via tor-project wrote:
>>> Hello Tor friends,
>>>
>>> We are planning to construct a real-world dataset for studying Tor
>>> website fingerprinting that researchers and developers can use to
>>> evaluate potential attacks and to design informed defenses that
>>> improve Tor's resistance to such attacks. We believe the dataset
>>> will help us make Tor safer, because it will allow us to design
>>> defenses that can be shown to protect *real* Tor traffic instead of
>>> *synthetic* traffic. This will help ground our evaluation of
>>> proposed defenses in reality and help us more confidently decide
>>> which, if any, defense is worth deploying in Tor.
>>>
>>> We have submitted detailed technical plans for constructing a
>>> dataset to the Tor Research Safety Board and after some iteration
>>> have arrived at a plan in which we believe the benefits outweigh
>>> the risks. We are now sharing an overview of our plan with the
>>> broader community to provide an opportunity for comment. More
>>> details are below. Please let us know if you have comments.
>>>
>>> Peace, love, and positivity,
>>> Rob
>>>
>>> P.S. Apologies for posting near the end of the work-week, but I
>>> wanted to get this out in case people want to talk to me about it
>>> in Costa Rica.
>>>
>>> === BACKGROUND
>>>
>>> Website fingerprinting attacks distill traffic patterns observed
>>> between a client and Tor entry into a sequence of packet
>>> directions: -1 if a packet is sent toward the destination, +1 if a
>>> packet is sent toward the client. An attacker can collect a list of
>>> these directions and then train machine learning classifiers to
>>> associate a website domain name or URL with the particular list of
>>> directions observed when visiting that website. Once trained, when
>>> the model observes a new list of directions it can predict which
>>> website corresponds to that pattern.
>>>
>>> For example, suppose [-1,-1,+1,+1] is associated with website1 and
>>> [-1,+1,-1,+1] is associated with website2. There are two steps in
>>> an attack:
>>>
>>> Step 1: The attacker itself visits website1 and website2 many
>>> times and learns:
>>>   [-1,-1,+1,+1] -> website1
>>>   [-1,+1,-1,+1] -> website2
>>> It trains a machine learning model to learn this association.
>>>
>>> Step 2: With the trained model in hand, the attacker monitors a Tor
>>> client (maybe the attacker is the client's ISP, or some other
>>> entity in a position to observe a client's traffic) and when it
>>> observes the pattern [-1,-1,+1,+1], the model will predict that the
>>> client went to website1.
>>>
>>> This example is *extremely* simplified, but I hope it gives an idea
>>> of how the attack works.
>>>
>>> PROBLEM
>>>
>>> Because researchers don't know which websites Tor users are
>>> visiting, it's hard to do a very good job creating a representative
>>> dataset that can be used to accurately evaluate attacks or defenses
>>> (i.e., to emulate steps 1 and 2). The standard technique has been
>>> to just select popular websites from top website lists (e.g., Alexa
>>> or Tranco) and then set up a Tor webpage crawler to visit the
>>> front-pages of those websites over and over and over again. Then
>>> they use that data to write papers. This approach has several
>>> problems:
>>>
>>> - Low traffic diversity: Tor users don't only visit front-pages.
>>> For example, they may conduct a web search and then click a link
>>> that brings them directly to an internal page of a website. The
>>> patterns produced from front-page visits may be simpler and
>>> unrepresentative of the patterns that would be observed from more
>>> complicated internal pages.
>>>
>>> - Low browser diversity: It has been shown by research from Marc
>>> Juarez [0] and others that webpage crawlers used by researchers
>>> lack diversity in important aspects that cause us to overestimate
>>> the accuracy of WF attacks. For example, the browser versions,
>>> configuration choices, variation in behavior (e.g., using multiple
>>> tabs at once), and network location of the client can all
>>> significantly affect the observable traffic patterns in ways that a
>>> crawler methodology does not capture.
>>>
>>> - Data staleness: Researchers collect data over a short time-frame
>>> and then evaluate the attacks assuming this static dataset. In the
>>> real world, websites are being updated over time, and a model
>>> trained on an old version of a website may not transfer to the new
>>> version.
>>>
>>> In addition to the above problems in methodology, current research
>>> also causes incidental consequences for the Tor network:
>>>
>>> - Network overhead: machine learning is a hot topic and several
>>> research groups have crawled tens of thousands of websites over Tor
>>> many times each. While each individual page load might be
>>> insignificant compared with the normal usage of Tor, crawling does
>>> add additional load to the network and can contribute to congestion
>>> and performance bottlenecks.
>>>
>>> Researchers have been designing attacks that are shown to be
>>> extremely accurate using the above synthetic crawling methodology.
>>> But because of the above problems, we don't properly understand the
>>> *true* threat of the attack against the Tor network. It is possible
>>> that the simplicity of the crawling approach is what makes the
>>> attacks work well, and that the attacks would not work as well if
>>> evaluated with more realistic traffic and browser diversity.
>>>
>>> PLAN
>>>
>>> So our goal is to construct a real-world dataset for studying Tor
>>> website fingerprinting that researchers and developers can use to
>>> evaluate potential attacks and to design informed defenses that
>>> improve Tor's resistance to such attacks. This dataset would enable
>>> researchers to use a methodology that does not have any of the
>>> above limitations. We believe that such a dataset will help us make
>>> Tor safer, because it will allow us to design defenses that can be
>>> shown to protect *real* Tor traffic instead of *synthetic* traffic.
>>> This would lead to a better understanding of proposed defenses and
>>> enable us to more confidently decide which, if any, defense is
>>> worth deploying in Tor.
>>>
>>> The dataset will be constructed from a 13-week exit relay
>>> measurement that is based on the measurement process established in
>>> recent work [1]. The primary information being measured is the
>>> directionality of the first 5k cells sent on a measurement circuit,
>>> and a keyed HMAC of the first domain name requested on the circuit.
>>> We also measure relative circuit and cell timestamps (relative to
>>> the start of measurement). The measurement data is compressed,
>>> encrypted using a public-key encryption scheme (the secret key is
>>> stored offline), and then temporarily written to persistent storage
>>> before being securely retrieved from the relay machine.
>>>
>>> We hope that this dataset can become a standard tool that website
>>> fingerprinting researchers and developers can use to (1) accelerate
>>> their study of attacks and defenses, and (2) produce evaluation and
>>> results that are more directly applicable to the Tor network. We
>>> plan to share it upon request only with other researchers who
>>> appear to come from verifiable research organizations, such as
>>> students from well-known universities. We will require researchers
>>> with whom we share the data to (1) keep the data private, and (2)
>>> direct others who want a copy of the data to us, to mitigate
>>> unauthorized sharing.
>>>
>>> [0] A Critical Evaluation of Website Fingerprinting Attacks.
>>> Juarez et al., CCS 2014.
>>> https://www1.icsi.berkeley.edu/~sadia/papers/ccs-webfp-final.pdf
>>> [1] Online Website Fingerprinting: Evaluating Website
>>> Fingerprinting Attacks on Tor in the Real World. Cherubin et al.,
>>> USENIX Security 2022.
>>> https://www.usenix.org/conference/usenixsecurity22/presentation/cherubin
>>>
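[Editorial aside: the two attack steps in the quoted BACKGROUND section can be made concrete in a few lines. This is a deliberately naive sketch where the "model" is just nearest-neighbor matching on direction lists; real attacks use deep learning, and the traces below are the toy ones from the example.]

```python
def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where two equal-length direction lists differ."""
    return sum(x != y for x, y in zip(a, b))

def train(samples: list[tuple[list[int], str]]):
    """Step 1: the attacker records (direction list, website) pairs.
    A nearest-neighbor 'model' is simply the training set itself."""
    return samples

def predict(model, observed: list[int]) -> str:
    """Step 2: label a newly observed pattern with the closest known one."""
    return min(model, key=lambda s: hamming(s[0], observed))[1]

# Toy traces: -1 toward the destination, +1 toward the client.
model = train([([-1, -1, +1, +1], "website1"),
               ([-1, +1, -1, +1], "website2")])
guess = predict(model, [-1, -1, +1, +1])  # predicts "website1"
```

Patterns that don't exactly match any training sample still get the nearest label, which is why noisy, unlabeled junk in training data matters so much for real classifiers.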
_______________________________________________
>>> tor-project mailing list tor-project at lists.torproject.org 
>>> https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-project

