arma at mit.edu
Tue Mar 9 23:39:55 UTC 2010
On Mon, Feb 15, 2010 at 09:05:54PM +0100, Karsten Loesing wrote:
> >> However, I cannot take
> >> changing IP addresses into account for this analysis, because I removed
> >> the IP addresses when sanitizing the bridge descriptors.
> > What's the process by which we sanitize them? It seems that a fine
> > solution would be to hash the IP addresses keyed with a secret that
> > remains constant across the hashes. So you could tell if the IP addresses
> > are the same without being able to tell what they are. The main challenge
> > there is keeping the secret somewhere secret in between batches (and
> > maybe rotating the secret monthly, for some level of forward secrecy).
> Yes, we can do something like that. I assume that it'll keep my server
> busy for a day or two to parse all the descriptors once more. But I can
> Instead of the secret input to the hash function, how about we
> concatenate bridge identity and IP address as input? Note that we
> don't put the bridge identity in the sanitized descriptor, but only
> its hash. That way we'd avoid using a secret that we'll lose or forget
> anyway and have something reproducible. To be precise, this is what I
> have in mind:
> sanitized bridge identity = H(bridge identity)
> sanitized IP address = H(bridge identity + IP address)[:4]
Interesting idea. This approach clearly does leak more information:
if you learn the bridge identity at any point, you can guess-and-check
past IP addresses for the bridge.
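
To make the guess-and-check concern concrete, here is a minimal sketch of the
quoted scheme. Everything specific in it is an assumption for illustration:
the thread does not say which hash H is (SHA-1 is used here), how identity and
IP address are encoded before concatenation (raw bytes here), and the function
names are hypothetical.

```python
import hashlib
import ipaddress


def sanitize_identity(identity: bytes) -> str:
    """H(bridge identity), with SHA-1 assumed as H."""
    return hashlib.sha1(identity).hexdigest()


def sanitize_ip(identity: bytes, ip: str) -> str:
    """H(bridge identity + IP address)[:4], written as a dotted quad."""
    packed = ipaddress.IPv4Address(ip).packed  # 4 raw address bytes
    digest = hashlib.sha1(identity + packed).digest()
    return str(ipaddress.IPv4Address(digest[:4]))


def guess_ips(identity: bytes, sanitized_ip: str, network: str):
    """Yield every address in `network` that hashes to `sanitized_ip`.

    Anyone who learns the bridge identity can run this, because no
    secret is involved in the hash input.
    """
    for candidate in ipaddress.ip_network(network):
        if sanitize_ip(identity, str(candidate)) == sanitized_ip:
            yield str(candidate)


# Dummy 20-byte identity and a documentation-range address.
identity = bytes(range(20))
published = sanitize_ip(identity, "192.0.2.77")
recovered = list(guess_ips(identity, published, "192.0.2.0/24"))
```

The search above covers only a /24 to keep it quick; the same loop over the
whole IPv4 space is about 2^32 hash computations, which is well within reach
of a single machine.
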
The next question is then: so what? Is that something we want to protect?
There are two benefits to leaking this information. First, we can generate
incremental updates to the sanitized bridge descriptor database, and
they will be compatible sanitized-IP-address-wise with the existing
database. That makes updates more convenient on our side. Second, it is
possible to ask questions about where bridges have been over the space
of months, not just inside a given month. It's not clear that we plan
to ask those questions right now, though.
So the conclusion is either "A) yes, we should do it that way, the
information leak is not a big deal", or "B) let's do it the safer way for
now, to get the answers we are looking for now; and if later we decide we
want more detailed answers, we still have the original bridge descriptors,
and we can publish slightly less sanitized data at the point we decide
I'm not sure there's a clear answer, but my instinct is to go for B.
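
For comparison, a sketch of what option B could look like, assuming it means
the keyed hash discussed earlier in the thread with a secret that is rotated
monthly. HMAC-SHA256, the 32-byte secret length, and the function names are
assumptions, not something the thread specifies.

```python
import hashlib
import hmac
import ipaddress
import secrets


def new_monthly_secret() -> bytes:
    """Generate a fresh random secret; it must be stored privately
    for the month and discarded afterwards."""
    return secrets.token_bytes(32)


def sanitize_ip_keyed(secret: bytes, ip: str) -> str:
    """Keyed hash of the IP address, truncated to 4 bytes and written
    as a dotted quad. Equal addresses map to equal outputs only while
    the same secret is in use."""
    packed = ipaddress.IPv4Address(ip).packed
    tag = hmac.new(secret, packed, hashlib.sha256).digest()
    return str(ipaddress.IPv4Address(tag[:4]))


march_secret = new_monthly_secret()
first = sanitize_ip_keyed(march_secret, "192.0.2.77")
second = sanitize_ip_keyed(march_secret, "192.0.2.77")
```

Within one month `first` and `second` match, so address changes are still
visible. Once the secret is rotated and destroyed, sanitized addresses can no
longer be linked across months or checked against guesses.
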
> Note that only the first 4 bytes of the result are used, because the
> result is written as the bridge's IP address, covering the entire range
> between 0.0.0.0 and 255.255.255.255. Of course, there's a reasonable
> chance for collisions for a bridge identity with two different IP
Right -- the birthday paradox brings us to "once we've looked at 65k
addresses, we should expect a collision".
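
The 65k figure is the square root of the 2^32 output space. A small sketch of
the usual birthday approximation, assuming the truncated hash outputs behave
like uniform random 32-bit values (function names are hypothetical):

```python
import math

SPACE = 2 ** 32  # number of possible 4-byte outputs


def collision_probability(n: int, space: int = SPACE) -> float:
    """Approximate chance that n uniform draws contain a collision,
    using 1 - exp(-n(n-1) / (2 * space))."""
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * space))


def draws_for_probability(p: float, space: int = SPACE) -> int:
    """Approximate number of draws needed to reach collision
    probability p, using sqrt(2 * space * ln(1 / (1 - p)))."""
    return math.ceil(math.sqrt(2.0 * space * math.log(1.0 / (1.0 - p))))


at_65k = collision_probability(2 ** 16)
even_odds = draws_for_probability(0.5)
```

Under this approximation the collision chance at 65,536 addresses is a bit
under 40%, and even odds are reached at roughly 77,000 addresses, so 65k is
the right order of magnitude.
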
> But I want the network status to contain all relevant
> information rather than re-assembling network status entries and bridge
> descriptors (which could contain more information in their contact
> line). Are there better ways to add 20 bytes to the network status? We
> might still add the full hash to the descriptor's contact line.
So far we've been trying to make sure that the sanitized descriptors
we publish still happen to conform to dir-spec.txt. At some point this
technique is going to break down. We shouldn't be too afraid to abandon
that technique when it gets too burdensome, so long as we still give
people tools that can parse whatever format we publish.