Link padding and the intersection attack

Fri Aug 16 10:12:57 UTC 2002

More on the intersection attack...

> > > Scenario 1 also covers "Alice has an account on slashdot.org from which
> > > she writes comments", "Alice uses Freedom to check her nym's mail at
> > > the popserver", and "Alice mails a bunch of reply blocks to a Mixminion
> > > nymserver so it can deliver her mail".
> >
> > In my opinion, this is too hard. Pseudonymity requires an entirely
> > different infrastructure than anonymity (end to end padding). Sorry.
>
> Really? But this is going to be a common use of the system. Or at least,
> it will be unless we explicitly tell people they won't get anonymity if
> they do it. (And even then...)

Yes. My opinion is that you have to do something like Paul proposed to
protect against it. A real time anonymity infrastructure like OR or Tarzan
will not do -- you need to introduce time delay to give everyone a chance
to put themselves into the anonymity set.

> > Scenario 2 -- I agree with. Here is my view on the way the attacker can
> > analyse his observations:
> >
> > He watches Bob and computes anonymity sets every time there is a request
> > coming to her. Suppose they are as follows:
> > {1,2,3,4,5};{1,6,7,8,9};{1,2,6,10},{1,3,10}
> >
> > Now he comuptes all the possible combinations of what could have happened.
> > There are many, eg:
> >
> > 1->Bob;
> > 1->Bob;
> > 1->Bob;
> > 1->Bob; (call this 1^4)
> >
> > Or, 1^2;2;3, or 1^2;10^2. Etc.
> >
> > Now we need to build a probability distribution over these traces.  If you
> > assume that P(X^2) = P(X;Y) where X \= Y, then basically all the possible
> > traces will have identical probabilities. Now you need to convince me that
> > P(X^2) = k*P(X;Y) (X \= Y \and k = *large* constant).
>
> Translating this back into English, you want me to convince you that the
> the chance of Alice going to Bob twice is much higher than the chance
> of Alice going to Bob and Arnold going to Bob. (Is that accurate?)

Yes, that is correct.

> Is this the right way to look at it? I've already argued above that in
> some scenarios, Alice *will* make multiple connections to Bob.

There is a difference. If you *know* that two connections from the same
guy was made, you can just do the intersection of anonymity sets. If you
don't, the above is the way to go.

> Whether there are a bunch of other people talking to Bob, and thus perhaps
> it's much more common to find two unrelated connections to Bob than two
> connections from the same Alice ... I'm not sure that matters. I guess my

????

> point is that we can look at the chances that Arnold (an ordinary guy) is
> in the set of suspects, based on some assumed distribution of recipients
> (I think randomly choosing recipients is a fine first model), and if we
> find an Alice that's in the set more often than she should be, she is more
> suspicious. The more often she's in the set, the more suspicious she is.

She could be talking to *someone else* at the same time. Let me produce a
refinement of the model to try to convince you:
Lets' say we have 4 senders on the internet (A-D) and 4 receivers (1-4)

4 connections occur involving these guys. (no other connection which
occurs beforehand involves A-D, 1-4)

Here are some anonymity sets. They have constraints on them.
Round 1:
1:{ABCD}
2:{ABCD}
3:{BCD}
4:{CD}

Round 2:
3:{ABCD}
4:{ABCD}
1:{BCD}
2:{CD}

Round 3:
1:{ABCD}
2:{ABCD}
3:{BCD}
4:{CD}

Round 4:
2:{ABCD}
4:{ABCD}
1:{BCD}
3:{CD}

Now we compute all the possibilities of who was communicating to whom at
each round, eg:

Round 1:
1->A
2->B
3->C
4->D

Round 2:
1->B
2->C
3->A
4->D

Round 3:
1->A
2->C
3->B
4->D

Round 4:
1->B
2->A
3->C
4->D

In summary, our hypothesis is:
1:A;B;A;B
2:B;C;C;A
3:C;A;B;C
4:D;D;D;D

Now we are going to obtain a probability distribution over those
hypotheses. If we assume that everyone is as likely to communicate to
everyone, then all we need to do is multiply the k_i factors corresponding
to:
1:k_1*k_1*k_2*k_2
2:k_1*k_1*k_2*k_1
3:k_1*k_1*k_1*k_2
4:k_1*k_2*k_3*k_4

Now multiply those number together to obtain how likely your hypothesis
was to have happened. Now, compute this for all possible hypothesis and
make a probability distrubution out of them. (i.e. divide each one by the
sum of all numbers). Now the probability of the above scenario is the
probability that (4->D)^4 or that 4 sends 4 messages to D. Still with me?

Ok, The above uses anonymity sets, you can do it using anonymity
probability distributions as well, as in [SD02]. I demonstrated it here
for simplicity.

> > Note that Scenario 1 is basically knowing that the trace we are looking
> > for is X^4 and deducing that 1^4 is the only possible trace.
>
> It's a bit more complex than that -- we may not see all of the connections
> from Alice, so a 1^3 may be her if there's no 1^4.

The above assumes that you do. (But yes, you can include the null sender
and receiver which means you do not see who Alice sent a message to or Bob
received a message from)

> Does this mean you agree that this is a strong attack? :)

I'll leave you to think about that :-))

> Can you tie the k_i values into something real so we can have a better
> intuition about them?

Well -- if you had a log file for a bunch of servers, you would be able to
estimate k_i's. For email, intuitively, k_i is large -- you are much more
liekly to send an email to someone you have sent an email to before than
just find someone random and sedn them an email. But yes, estimating these
is a slight difficulty -- you need lots of logs.

A.

------------------
Andrei Serjantov
Queens' College
Cambridge CB3 9ET
http://www.cl.cam.ac.uk/~aas23/