Publishing sanitized bridge descriptors

Karsten Loesing karsten.loesing at gmx.net
Tue Nov 10 17:06:26 UTC 2009


Hi everyone,

I'm planning to publish a sanitized version of the bridge descriptors  
that our bridge authority Tonga gathers. The general idea behind this  
is to make all data public that we gather for statistical purposes.  
There are several reasons for doing so: transparency towards our  
community, restricting ourselves to gathering only those statistics  
that we think are safe to make public, allowing others to do the same  
research as we do, etc.

The bridge descriptors contain IP addresses and other contact  
information of bridges that we don't want to give away. Doing so would  
defeat the purpose of bridges, after all.

Here are the steps that we're taking to remove all potentially  
sensitive information from bridge descriptors before publication:

1. Replace the bridge identity with its SHA1 value

Clients can request a bridge's current descriptor by sending its  
identity string to the bridge authority. This is a feature to make  
bridges on dynamic IP addresses useful. Therefore, the original  
identities (and anything that could be used to derive them) need to be  
removed from the descriptors. The bridge identity is replaced with its  
SHA1 hash value. The idea is to have a consistent replacement that  
remains stable over months or even years (without keeping a secret for  
a keyed hash function).
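
A minimal sketch of step 1, for illustration only (this is not the code from the sanitizer). It assumes the identity is available as a hex-encoded fingerprint and that the hash is taken over the decoded bytes; the class and method names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class IdentityHasher {

    /** Decodes a hex string (for example a 40-character fingerprint) into bytes. */
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Encodes bytes as upper-case hex. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xff));
        }
        return sb.toString();
    }

    /**
     * Returns the unkeyed SHA-1 of the identity bytes. The same input
     * always yields the same output, so no secret or state is needed.
     */
    static String sanitizeIdentity(String fingerprintHex) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            return toHex(sha1.digest(fromHex(fingerprintHex)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

Because the hash is unkeyed, the replacement stays the same across runs and machines, which is what keeps it stable over months or years.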

2. Remove all cryptographic keys and signatures

It would be straightforward to learn about the bridge identity from
the bridge's public key. Replacing keys with newly generated ones seemed
unnecessary (and would require keeping state over months or years), so
all cryptographic objects are simply removed.

3. Replace IP address with 127.0.0.1

Of course, the IP address needs to be removed, too. However, the IP  
address is resolved to a country code first and the result written to  
the contact line as "somebody at example dot de" for Germany, etc. The  
ports are kept unchanged though.

4. Replace contact information

If there is contact information in a descriptor, the contact line is
changed to "somebody at ...". If there is none, a contact line saying
"nobody at ..." is added so that the country code can still be recorded.
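
A rough sketch of how the contact line from steps 3 and 4 could be built, for illustration only. It assumes the country code has already been resolved from the IP address (the lookup itself is not shown), that the line starts with the keyword "contact", and that the "nobody" variant uses the same "example dot <cc>" suffix as the "somebody" variant; the class and method names are hypothetical.

```java
import java.util.Locale;

class ContactSanitizer {

    /**
     * Builds the replacement contact line.
     *
     * @param hadContact  whether the original descriptor had a contact line
     * @param countryCode two-letter country code resolved from the bridge's
     *                    IP address before the address was removed
     */
    static String sanitizeContact(boolean hadContact, String countryCode) {
        String who = hadContact ? "somebody" : "nobody";
        return "contact " + who + " at example dot "
                + countryCode.toLowerCase(Locale.ROOT);
    }
}
```

For a bridge in Germany that published contact information, this yields "contact somebody at example dot de".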

5. Replace nickname with Unnamed

Bridge nicknames might give hints about the location of the bridge if
chosen without care; e.g. a bridge nickname might be very similar to
the nicknames of the operator's relays, which might be located on
adjacent IP addresses. All bridge nicknames are therefore replaced with
the string Unnamed.
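
Steps 3 and 5 both touch the first line of a descriptor. The sketch below is an illustration only and assumes that line has the form "router <nickname> <address> <ORPort> <SOCKSPort> <DirPort>"; the class and method names are hypothetical.

```java
class RouterLineSanitizer {

    /**
     * Replaces the nickname with "Unnamed" and the address with 127.0.0.1,
     * keeping the three port fields unchanged.
     */
    static String sanitizeRouterLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length < 6 || !parts[0].equals("router")) {
            throw new IllegalArgumentException("not a router line: " + line);
        }
        return "router Unnamed 127.0.0.1 "
                + parts[3] + " " + parts[4] + " " + parts[5];
    }
}
```

For example, "router MyBridge 192.0.2.17 9001 0 0" becomes "router Unnamed 127.0.0.1 9001 0 0".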

Note that these processing steps only prevent people from learning  
about new bridge locations. People who already know a bridge identity  
or location can easily learn more about this bridge from the sanitized  
descriptors. This is useful for statistical analysis, e.g. to filter  
out bridges that have been running as relays before.
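
As an illustration of that kind of analysis, here is a small sketch that drops sanitized bridges whose original identity matches a known relay fingerprint. It assumes both the relay fingerprints and the sanitized identities are hex strings and that the sanitized identity is the SHA-1 of the decoded fingerprint bytes; all names are hypothetical.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class FormerRelayFilter {

    /** SHA-1 over the bytes of a hex-encoded fingerprint, as upper-case hex. */
    static String sha1Hex(String fingerprintHex) {
        byte[] in = new byte[fingerprintHex.length() / 2];
        for (int i = 0; i < in.length; i++) {
            in[i] = (byte) Integer.parseInt(
                    fingerprintHex.substring(2 * i, 2 * i + 2), 16);
        }
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(in);
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02X", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Keeps only sanitized identities that do not match a known relay. */
    static List<String> dropFormerRelays(List<String> sanitizedIds,
                                         Set<String> relayFingerprints) {
        Set<String> hashedRelays = new HashSet<>();
        for (String fp : relayFingerprints) {
            hashedRelays.add(sha1Hex(fp));
        }
        List<String> kept = new ArrayList<>();
        for (String id : sanitizedIds) {
            if (!hashedRelays.contains(id.toUpperCase(Locale.ROOT))) {
                kept.add(id);
            }
        }
        return kept;
    }
}
```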

The Java application that does all the parsing, replacing, and  
rewriting can be found here:

https://tor-svn.freehaven.net/svn/projects/archives/trunk/bridge-desc-sanitizer/

Here is a sample of the bridge descriptors of October 2008 (not 2009,  
in case there turn out to be sensitive parts in there):

http://freehaven.net/~karsten/volatile/bridges-2008-10.tar.bz2 (4.6 MB)

Are there any sensitive parts in that tarball that we don't want to  
publish?

Thanks,
--Karsten


