karsten.loesing at gmx.net
Thu Apr 1 12:04:29 UTC 2010
Picking up a discussion from three weeks ago. For those who don't
remember this discussion (and don't want to read up on everything), the
idea was to research bridge stability by looking at sanitized bridge
descriptors and see how long bridges were available after giving them to
Answering Paul's and Christian's mails first, and showing new results below.
On 3/10/10 5:29 PM, Paul Syverson wrote:
> On Wed, Mar 10, 2010 at 03:20:24PM +0100, Karsten Loesing wrote:
>> On 3/10/10 12:39 AM, Roger Dingledine wrote:
>>> On Mon, Feb 15, 2010 at 09:05:54PM +0100, Karsten Loesing wrote:
>>>>>> However, I cannot take
>>>>>> changing IP addresses into account for this analysis, because I removed
>>>>>> the IP addresses when sanitizing the bridge descriptors.
>>>>> What's the process by which we sanitize them? It seems that a fine
>>>>> solution would be to hash the IP addresses keyed with a secret that
>>>>> remains constant across the hashes. So you could tell if the IP addresses
>>>>> are the same without being able to tell what they are. The main challenge
>>>>> there is keeping the secret somewhere secret in between batches (and
>>>>> maybe rotating the secret monthly, for some level of forward secrecy).
>>>> Yes, we can do something like that. I assume that it'll keep my server
>>>> busy for a day or two to parse all the descriptors once more. But I can
>>>> do that.
>>>> Instead of the secret input to the hash function, how about we
>>>> concatenate bridge identity and IP address as input? Note that we
>>>> don't put the bridge identity in the sanitized descriptor, but only
>>>> its hash. That way we'd avoid using a secret that we'll lose or forget
>>>> anyway and have something reproducible. To be precise, this is what I
>>>> have in mind:
>>>> sanitized bridge identity = H(bridge identity)
>>>> sanitized IP address = H(bridge identity + IP address)[:4]
>>> Interesting idea. This approach clearly does leak more information:
>>> if you learn the bridge identity at any point, you can guess-and-check
>>> past IP addresses for the bridge.
>>> The next question is then: so what? Is that something we want to protect?
>> A fine question. I don't think this is something we want to protect. My
>> understanding of bridges is that they shall make it hard for an
>> adversary to block the entry points to the Tor network. That means we
>> shouldn't reveal current bridge IP addresses, nor bridge identities
>> which can be used to learn about current and future IP addresses.
>> But why should we care about past IP addresses of a bridge? What would
>> the adversary---who learns about a bridge identity somehow---do with
>> this piece of information? Tell that someone has been using Tor via this
>> bridge in the past when connecting to that IP address? Is this something
>> we want to protect? That would imply that it's considered a security
>> feature that bridges change their IP addresses on a regular basis. What
>> about bridges on static IP addresses: when an adversary learns about
>> such a bridge, does that mean its past users are more screwed than the
>> past users of a bridge on a dynamic IP address?
> Much of our motivation for using Tor is because you don't know what
> behavior you need to protect so be cautious. So similarly and purely
> speculatively, this means that bridges which are run by people who
> mostly didn't want to run public nodes but wanted to help would now
> have a public permanently confirmable record connected to something in
> the outside world. They weren't signing up expecting to have forward
> anonymity in any robust sense (at least I hope not), but without the
> record if they run a bridge (say from a static IP that they hold for
> an indefinite time) and then decide not to later, unless someone
> recorded that bridge usage at the time there is no public record of
> their participation. So it's a commitment that is less permanent hence
> less scary. If at some point in the future someone finds it useful to
> go through and look for IP addresses that have run bridges for
> whatever currently unimagined nefarious purpose, then it's better if
> that is not available. I'm not saying this trumps using hashed salted,
> etc. addresses in some publicly listed directory info for any reason,
> not even to compare it to the uses mentioned below. But you asked, so
> I tried to come up with an answer.
Okay, I agree with you that we should keep the IP addresses private.
I have changed the sanitizing process for two months of data as written
in my earlier mail. Specifically, I'm replacing IP addresses with
H(IP address + bridge identity + secret)[:4]
The resulting "IP address" helps us detect whether a specific bridge has
changed its IP address, but it shouldn't reveal anything else.
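To make the replacement concrete, here is a minimal sketch. It assumes SHA-256 for H and plain byte concatenation for "+" (neither is specified in this mail); the function name sanitize_ip and the sample inputs are hypothetical and only for illustration.

```python
import hashlib
import ipaddress


def sanitize_ip(ip: str, bridge_identity: bytes, secret: bytes) -> str:
    """Return H(IP address + bridge identity + secret)[:4] as a dotted quad.

    Assumptions: H is SHA-256 and "+" is byte concatenation.
    """
    ip_bytes = ipaddress.IPv4Address(ip).packed  # the 4 raw address bytes
    digest = hashlib.sha256(ip_bytes + bridge_identity + secret).digest()
    # Keep only the first 4 bytes so the result still looks like an IPv4 address.
    return str(ipaddress.IPv4Address(digest[:4]))


# Hypothetical inputs, for illustration only.
identity = bytes.fromhex("00" * 20)
secret = b"example-secret"
a = sanitize_ip("192.0.2.1", identity, secret)
b = sanitize_ip("192.0.2.1", identity, secret)
c = sanitize_ip("192.0.2.2", identity, secret)
print(a, a == b, a == c)
```

The same bridge on the same address always maps to the same sanitized value, so an address change shows up as a change in the sanitized field. Because the bridge identity is part of the input, two bridges sharing one address get unrelated values.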
Is everyone happy with this approach? If yes, I'll make the tarballs
with December 2009 and January 2010 sanitized bridge descriptors
available after the weekend (April 6).
>> The question is: What are we trying to protect? I'm happy to protect
>> past IP addresses of a bridge if there's a reason to do so. But knowing
>> what is worth protecting and what is not would be helpful. After all,
>> not publishing any bridge descriptors would give us best protection; but
>> that's not what we want.
>>> There are two benefits to leaking this information. First, we can generate
>>> incremental updates to the sanitized bridge descriptor database, and
>>> they will be compatible sanitized-IP-address-wise with the existing
>>> database. That makes updates more convenient on our side.
>> Yes, not including a monthly changing secret in the hash function makes
>> the sanitized descriptors more useful for statistics.
>>> Second, it is
>>> possible to ask questions about where bridges have been over the space
>>> of months, not just inside a given month. It's not clear that we plan
>>> to ask those questions right now, though.
>> Unclear. I don't think we'll be asking these questions.
>>> So the conclusion is either "A) yes, we should do it that way, the
>>> information leak is not a big deal", or "B) let's do it the safer way for
>>> now, to get the answers we are looking for now; and if later we decide we
>>> want more detailed answers, we still have the original bridge descriptors,
>>> and we can publish slightly less sanitized data at the point we decide
>>> we should".
>>> I'm not sure there's a clear answer, but my instinct is to go for B.
>> Okay. I went for B by taking the hash of the bridge's IP address plus a
>> fixed secret string that I use for all bridges. I'm still hesitant to
>> publish these descriptors, though. We might be giving away too much by
>> including the bridge's country code (which can be a country with only
>> very few IP addresses) plus H(IP address + secret)[:4]. Maybe we should
>> do H(IP address + bridge identity + secret)[:4] or something.
>> In any case, I'm tempted not to update all the sanitized bridge
>> descriptors, but only those for December 2009 and January 2010 which I'm
>> using in the bridge-stability analysis. (I pondered using some 2008
>> descriptors, but they aren't as meaningful for the current bridge
>> stability situation.) How about I do the H(IP address + bridge identity
>> + secret)[:4] thing and make these two tarballs available?
>>>> Note that only the first 4 bytes of the result are used, because the
>>>> result is written as the bridge's IP address, covering the entire range
>>>> between 0.0.0.0 and 255.255.255.255. Of course, there's a reasonable
>>>> chance for collisions for a bridge identity with two different IP
>>> Right -- the birthday paradox brings us to "once we've looked at 65k
>>> addresses, we should expect a collision".
>> Should be fine. Even if such a collision happens, it doesn't
>> significantly affect the analysis result.
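As a quick check of the 65k figure, here is a small sketch using the standard birthday approximation. It assumes the sanitized addresses behave like uniformly random 32-bit values; the function names are hypothetical.

```python
import math


def collision_probability(n: int, space: int = 2**32) -> float:
    """Approximate chance that n uniformly random values in `space` contain a repeat."""
    return 1.0 - math.exp(-n * (n - 1) / (2 * space))


def values_for_probability(p: float, space: int = 2**32) -> int:
    """Approximate number of values needed to reach collision probability p."""
    return math.ceil(math.sqrt(2 * space * math.log(1 / (1 - p))))


print(collision_probability(65536))
print(values_for_probability(0.5))
```

Under this approximation, 65536 values give a collision probability somewhat below one half. For the bridge analysis only collisions between addresses of the same bridge matter, so the relevant n is the number of addresses one bridge has used, which is far smaller.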
>>>> But I want the network status to contain all relevant
>>>> information rather than re-assembling network status entries and bridge
>>>> descriptors (which could contain more information in their contact
>>>> line). Are there better ways to add 20 bytes to the network status? We
>>>> might still add the full hash to the descriptor's contact line.
>>> So far we've been trying to make sure that the sanitized descriptors
>>> we publish still happen to conform to dir-spec.txt. At some point this
>>> technique is going to break down. We shouldn't be too afraid to abandon
>>> that technique when it gets too burdensome, so long as we still give
>>> people tools that can parse whatever format we publish.
>> True. So far it works okay. I'm trying to conform to dir-spec.txt as
>> long as possible. The tools I'm giving to people should already be less
>> complex, not more.
On 3/10/10 7:19 PM, Christian Fromme wrote:
> Hi Karsten,
> First of all, nice analysis!
> On Mon, Feb 15, 2010 at 9:29 AM, Karsten Loesing
> <karsten.loesing at gmx.net> wrote:
>> So, are these good news? Personally, I had expected worse results. During most of the time, availability is surprisingly high. An 80% chance of the bridges working even after 96 hours seems fair. That means in 1 out of 5 cases someone needs to send a second e-mail or make a second website request. We might even think (or have already thought) about implementing a bridge update functionality where users go to the bridge authority and exchange their broken bridges for working ones---as long as at least one of their bridges still works.
> Would it be possible for someone to tell the bridge authority that a
> bridge is down even though it is not? Maybe through a DoS attack? If
> this is the case, maybe this thing should be handled with care or
> otherwise there's a paranoid theoretical way someone could learn all
> bridge addresses.
> A way around that, if it is considered realistic enough, would be to
> tag some bridges 'private' or so and not to give those out with that
> update functionality.
A valid concern. I guess that only a subset of all bridges would be
distributed this way. But I don't want to enter the (interesting) field
of bridge distribution methods here.
Okay, here are the new results, this time taking IP address changes
into account. The graph shows the fraction of random bridge subsets
(containing 3 bridges, 1 of them running on port 443) that had at least
1 bridge running continuously throughout 6/24/48/96 hours. For example,
a value of 0.9 on the y axis means that 90% of all samples taken at the
date and time on the x axis were useful for another 6/24/48/96 hours.
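The sampling described above can be sketched roughly as follows. This is not the actual analysis code: the data layout, the function names, and the toy data are assumptions. Here "1 of them running on port 443" is read as one bridge drawn from those on port 443 plus two drawn from the rest, and a bridge counts as running in a given hour only if it was listed with the same sanitized IP address it had at sample time.

```python
import random


def runs_continuously(running_hours: set, start: int, duration: int) -> bool:
    """True if the bridge was seen running in every hour of [start, start + duration)."""
    return all(h in running_hours for h in range(start, start + duration))


def useful_fraction(bridges: dict, start: int, duration: int,
                    samples: int = 1000, seed: int = 0):
    """Fraction of random 3-bridge subsets with at least 1 bridge up for the whole interval.

    `bridges` maps a bridge name to (or_port, set of hours it was running).
    Each subset holds one bridge on port 443 and two others, all drawn from
    the bridges running at `start`. Returns None if no such subset exists.
    """
    rng = random.Random(seed)
    running = [b for b, (_, hours) in bridges.items() if start in hours]
    on_443 = [b for b in running if bridges[b][0] == 443]
    others = [b for b in running if bridges[b][0] != 443]
    if not on_443 or len(others) < 2:
        return None
    useful = 0
    for _ in range(samples):
        subset = [rng.choice(on_443)] + rng.sample(others, 2)
        if any(runs_continuously(bridges[b][1], start, duration) for b in subset):
            useful += 1
    return useful / samples


# Toy data (hypothetical): hours are counted from the start of the period.
toy = {
    "A": (443, set(range(0, 100))),  # up for the whole toy period
    "B": (9001, set(range(0, 10))),
    "C": (9001, set(range(0, 30))),
    "D": (443, set(range(0, 5))),
}
for hours in (6, 24, 48, 96):
    print(hours, useful_fraction(toy, start=0, duration=hours))
```

Running this for every sample time on the x axis and plotting one line per duration would give a graph of the kind described here.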
For comparison between the analyses with and without IP address changes,
here are the 96-hour lines from both analyses. The red line in the
following graph is equivalent to the purple line in the previous graph.
So, yes, when we take IP address changes into account, bridge stability
is worse than we thought from the first analysis. But still, it's not as
bad as one might have imagined. Note that the reasons for the drops on
December 11, 23, and 31 are probably problems with the bridge authority,
not with all bridges (I could make the analysis more precise by looking
at self-reported bridge uptimes, if required). That means that the 4
days before those drops are probably too low in the graphs. In this
case, the 96 hours line is between 60% and 90%. Or to rephrase that, in
3 out of 4 cases, people had a working bridge 4 days after requesting
bridge addresses, and the 4th person would have to request another set
I'd like to hear what others say about these results. Am I missing
something? Is "3 out of 4" a horrible result?