Bridge stability

Wed Mar 10 14:20:24 UTC 2010

On 3/10/10 12:39 AM, Roger Dingledine wrote:
> On Mon, Feb 15, 2010 at 09:05:54PM +0100, Karsten Loesing wrote:
>>>> However, I cannot take
>>>> changing IP addresses into account for this analysis, because I removed
>>>> the IP addresses when sanitizing the bridge descriptors.
>>>
>>> What's the process by which we sanitize them? It seems that a fine
>>> solution would be to hash the IP addresses keyed with a secret that
>>> remains constant across the hashes. So you could tell if the IP addresses
>>> are the same without being able to tell what they are. The main challenge
>>> there is keeping the secret somewhere secret in between batches (and
>>> maybe rotating the secret monthly, for some level of forward secrecy).
>>
>> Yes, we can do something like that. I assume that it'll keep my server
>> busy for a day or two to parse all the descriptors once more. But I can
>> do that.
>>
>> Instead of the secret input to the hash function, how about we
>> concatenate bridge identity and IP address as input? Note that we
>> don't put the bridge identity in the sanitized descriptor, but only
>> its hash. That way we'd avoid using a secret that we'll lose or forget
>> anyway and have something reproducible. To be precise, this is what I
>> have in mind:
>>
>>   sanitized bridge identity = H(bridge identity)
>>
>>   sanitized IP address = H(bridge identity + IP address)[:4]
> 
> Interesting idea. This approach clearly does leak more information:
> if you learn the bridge identity at any point, you can guess-and-check
> past IP addresses for the bridge.
> 
> The next question is then: so what? Is that something we want to protect?

A fine question. I don't think this is something we want to protect. My
understanding of bridges is that they should make it hard for an
adversary to block the entry points to the Tor network. That means we
shouldn't reveal current bridge IP addresses, nor bridge identities
which can be used to learn about current and future IP addresses.

But why should we care about past IP addresses of a bridge? What would
the adversary---who learns about a bridge identity somehow---do with
this piece of information? Tell that someone has been using Tor via this
bridge in the past when connecting to that IP address? Is this something
we want to protect? That would imply that it's considered a security
feature that bridges change their IP addresses on a regular basis. What
about bridges on static IP addresses: when an adversary learns about
such a bridge, does that mean its past users are more screwed than the
past users of a bridge on a dynamic IP address?

The question is: What are we trying to protect? I'm happy to protect
past IP addresses of a bridge if there's a reason to do so. But knowing
what is worth protecting and what is not would be helpful. After all,
not publishing any bridge descriptors would give us the best protection; but
that's not what we want.
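
To make the guess-and-check concern concrete, here is a rough sketch of
what an adversary who has learned a bridge identity could do against
H(bridge identity + IP address)[:4]. SHA-256 as H, the all-zero 20-byte
identity, and the 192.0.2.0/24 candidate range are assumptions for
illustration only; the function names are hypothetical.

```python
import hashlib
import ipaddress


def sanitized_ip(identity: bytes, ip: str) -> bytes:
    """H(bridge identity + IP address)[:4], with SHA-256 assumed as H."""
    packed = ipaddress.IPv4Address(ip).packed  # the 4 raw address bytes
    return hashlib.sha256(identity + packed).digest()[:4]


def guess_and_check(identity: bytes, target: bytes, candidates):
    """Return every candidate address whose sanitized form equals target."""
    return [ip for ip in candidates if sanitized_ip(identity, ip) == target]


# Hypothetical inputs: a made-up identity and a documentation-only range.
identity = bytes(20)
candidates = [str(a) for a in ipaddress.ip_network("192.0.2.0/24").hosts()]
target = sanitized_ip(identity, "192.0.2.77")  # value seen in a descriptor

matches = guess_and_check(identity, target, candidates)
```

With a small candidate set (say, one ISP's address range) a match is
almost certainly the real past address. Scanning all 2^32 addresses would
yield about one false match on average in addition to the real one,
because only 4 bytes of the hash are kept.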

> There are two benefits to leaking this information. First, we can generate
> incremental updates to the sanitized bridge descriptor database, and
> they will be compatible sanitized-IP-address-wise with the existing
> database. That makes updates more convenient on our side.

Yes, not including a monthly changing secret in the hash function makes
the sanitized descriptors more useful for statistics.

> Second, it is
> possible to ask questions about where bridges have been over the space
> of months, not just inside a given month. It's not clear that we plan
> to ask those questions right now, though.

Unclear. I don't think we'll be asking these questions.

> So the conclusion is either "A) yes, we should do it that way, the
> information leak is not a big deal", or "B) let's do it the safer way for
> now, to get the answers we are looking for now; and if later we decide we
> want more detailed answers, we still have the original bridge descriptors,
> and we can publish slightly less sanitized data at the point we decide
> we should".
> 
> I'm not sure there's a clear answer, but my instinct is to go for B.

Okay. I went for B by taking the hash of the bridge's IP address plus a
fixed secret string that I use for all bridges. I'm still hesitant to
publish these descriptors, though. We might be giving away too much by
including the bridge's country code (which can be a country with only
very few IP addresses) plus H(IP address + secret)[:4]. Maybe we should
do H(IP address + bridge identity + secret)[:4] or something.
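
As a minimal sketch of the two variants, assuming SHA-256 as H, plain
byte concatenation for "+", and a placeholder secret (none of which is
fixed by this thread), the sanitized address could be computed like
this. The function name and the example inputs are hypothetical.

```python
import hashlib
import ipaddress


def sanitize_ip(ip: str, secret: bytes, identity: bytes = b"") -> str:
    """Return H(IP address + bridge identity + secret)[:4] as a dotted quad.

    With identity left empty this is the H(IP address + secret)[:4] variant.
    SHA-256 is an assumption; the thread only speaks of a hash function H.
    """
    packed = ipaddress.IPv4Address(ip).packed  # the 4 raw address bytes
    digest = hashlib.sha256(packed + identity + secret).digest()
    # The first 4 bytes are written back in IPv4 notation so that the
    # sanitized descriptor still carries something shaped like an address.
    return str(ipaddress.IPv4Address(digest[:4]))


secret = b"placeholder-secret"   # assumption: the real secret stays private
identity_a = bytes(20)           # hypothetical 20-byte bridge identities
identity_b = bytes([1]) * 20

same_for_all = sanitize_ip("192.0.2.1", secret)
per_bridge_a = sanitize_ip("192.0.2.1", secret, identity_a)
per_bridge_b = sanitize_ip("192.0.2.1", secret, identity_b)
```

In the first variant, two bridges that used the same address at different
times map to the same sanitized address; mixing in the bridge identity
removes that link while still showing when a single bridge changes its
address.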

In any case, I'm tempted not to update all the sanitized bridge
descriptors, but only those for December 2009 and January 2010 which I'm
using in the bridge-stability analysis. (I pondered using some 2008
descriptors, but they aren't as meaningful for the current bridge
stability situation.) How about I do the H(IP address + bridge identity
+ secret)[:4] thing and make these two tarballs available?

>> Note that only the first 4 bytes of the result are used, because the
>> result is written as the bridge's IP address, covering the entire range
>> between 0.0.0.0 and 255.255.255.255. Of course, there's a reasonable
>> chance for collisions for a bridge identity with two different IP
>> addresses.
> 
> Right -- the birthday paradox brings us to "once we've looked at 65k
> addresses, we should expect a collision".

Should be fine. Even if such a collision happens, it doesn't
significantly affect the analysis result.
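
For a rough feeling of the numbers, the usual birthday approximation can
be evaluated for a 32-bit output space. This is a back-of-the-envelope
sketch that assumes the truncated hashes are uniformly distributed; the
function names are hypothetical.

```python
import math


def expected_colliding_pairs(n: int, space: int = 2**32) -> float:
    """Expected number of colliding pairs among n uniform values."""
    return n * (n - 1) / (2 * space)


def collision_probability(n: int, space: int = 2**32) -> float:
    """Approximate probability of at least one collision among n values."""
    return 1.0 - math.exp(-expected_colliding_pairs(n, space))


pairs = expected_colliding_pairs(65536)   # about 0.5
prob = collision_probability(65536)       # about 0.39
```

So at 65k addresses we expect about half a colliding pair, and the
probability of seeing at least one collision is roughly 39%.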

>> But I want the network status to contain all relevant
>> information rather than re-assembling network status entries and bridge
>> descriptors (which could contain more information in their contact
>> line). Are there better ways to add 20 bytes to the network status? We
>> might still add the full hash to the descriptor's contact line.
> 
> So far we've been trying to make sure that the sanitized descriptors
> we publish still happen to conform to dir-spec.txt. At some point this
> technique is going to break down. We shouldn't be too afraid to abandon
> that technique when it gets too burdensome, so long as we still give
> people tools that can parse whatever format we publish.

True. So far it works okay. I'm trying to conform to dir-spec.txt as
long as possible. The tools I'm giving to people should, if anything,
become less complex, not more.

Thanks!
--Karsten