Hi Linus, Ian, list,
now that we have bridges running on IPv6 addresses, some bridge operators enabled that feature on their public bridges and published descriptors to the bridge authority.
I wonder how to sanitize these addresses for metrics data. (Currently, lines containing IPv6 addresses are simply discarded in the sanitized output.)
Here's how we sanitize IPv4 addresses (from https://metrics.torproject.org/formats.html#bridgedesc):
Replace IP address with IP address hash: Of course, the IP address needs to be removed, too. It is replaced with 10.x.x.x with x.x.x being the 3 byte output of H(IP address | bridge identity | secret)[:3]. The input IP address is the 4-byte long binary representation of the bridge's current IP address. The bridge identity is the 20-byte long binary representation of the bridge's long-term identity fingerprint. The secret is a 31-byte long secure random string that changes once per month for all descriptors and statuses published in that month. H() is SHA-256. The [:3] operator means that we pick the 3 most significant bytes of the result.
The idea is that it should be hard to derive the original IP address from the sanitized address. At the same time, it should be easy to notice whether the address of a given bridge has changed within the same month.
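To make the quoted IPv4 rule concrete, here is a minimal Python sketch of how I read it. The function name sanitize_ipv4 and the example inputs are made up for illustration; this is not the actual sanitizing code.

```python
import hashlib
import ipaddress


def sanitize_ipv4(address: str, fingerprint: bytes, secret: bytes) -> str:
    """Return 10.x.x.x with x.x.x = SHA-256(ip | fingerprint | secret)[:3]."""
    ip_bytes = ipaddress.IPv4Address(address).packed  # 4-byte binary form
    if len(fingerprint) != 20:
        raise ValueError("fingerprint must be 20 bytes")
    if len(secret) != 31:
        raise ValueError("secret must be 31 bytes")
    digest = hashlib.sha256(ip_bytes + fingerprint + secret).digest()
    # [:3] picks the first (most significant) 3 bytes of the digest.
    return "10." + ".".join(str(b) for b in digest[:3])


# Made-up inputs; a real secret would come from a secure random source
# and change once per month.
example_fingerprint = bytes(range(20))
example_secret = bytes(31)
print(sanitize_ipv4("198.51.100.7", example_fingerprint, example_secret))
```

The same bridge with the same address and the same monthly secret always maps to the same 10.x.x.x value, which is what makes address changes within a month visible.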
Here's my plan for IPv6 addresses:
- Shorter secret: For the hash function input, use the 16-byte long binary representation of the bridge's current IPv6 address, the 20 bytes of the fingerprint, but only a 19-byte long secure random string that changes once per month. The idea is to keep the input within one SHA-256 block (at most 447 bits before padding) as suggested by Ian on January 2, 2011 on this list: (16 + 20 + 19) * 8 = 440.
- Alternative to shorter secret: Use the same 31 byte long secret and live with the fact that the hash input now spans two SHA blocks. Maybe use an 83 byte long secret to fill two SHA blocks: (16 + 20 + 83) * 8 = 952, which stays within the 959 bits available in two blocks.
- Write 3 bytes of the sanitized IPv6 address in [::] notation. We're writing sanitized IPv4 addresses as 10.x.x.x. Is there a counterpart for IPv6 addresses? It should be obvious that these are "private" addresses, but I'd like to keep the notation unchanged to keep parsing tools simple.
- Alternative to using 3 bytes: Should we use fewer or more bytes from the SHA-256 output for IPv6 addresses?
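The block arithmetic and the [::] notation from the plan above can be sketched roughly as follows. All helper names are mine, and the fd00:: prefix is only a placeholder I picked from the unique local range (fc00::/7), not a decision; which prefix to use and how many digest bytes to keep are still the open questions.

```python
import hashlib
import ipaddress

FINGERPRINT_LEN = 20  # bytes of the bridge identity fingerprint


def sha256_blocks(message_len: int) -> int:
    """Number of 64-byte SHA-256 blocks needed for message_len bytes.

    Padding appends one 0x80 byte and an 8-byte length field, then
    zero-fills up to the next block boundary.
    """
    return (message_len + 1 + 8 + 63) // 64


def max_secret_len(address_len: int, blocks: int) -> int:
    """Longest secret keeping address | fingerprint | secret in `blocks`."""
    return blocks * 64 - 9 - address_len - FINGERPRINT_LEN


def sanitize_ipv6(address: str, fingerprint: bytes, secret: bytes,
                  prefix: str = "fd00::", hash_bytes: int = 3) -> str:
    """Return [prefix + first hash_bytes of SHA-256(ip | fingerprint | secret)].

    The prefix and hash_bytes defaults are assumptions for illustration.
    """
    ip_bytes = ipaddress.IPv6Address(address).packed  # 16-byte binary form
    digest = hashlib.sha256(ip_bytes + fingerprint + secret).digest()
    suffix = int.from_bytes(digest[:hash_bytes], "big")
    base = int(ipaddress.IPv6Address(prefix))
    return "[" + str(ipaddress.IPv6Address(base | suffix)) + "]"


# Block counts for the variants discussed above.
print(sha256_blocks(4 + 20 + 31))    # IPv4 input with the 31-byte secret
print(sha256_blocks(16 + 20 + 19))   # IPv6 input with the shorter secret
print(sha256_blocks(16 + 20 + 31))   # IPv6 input with the unchanged secret
print(max_secret_len(16, blocks=2))  # longest secret fitting two blocks

# Made-up inputs; a real secret would come from a secure random source.
example_fingerprint = bytes(range(20))
print(sanitize_ipv6("2001:db8::1", example_fingerprint, bytes(19)))
```

Keeping hash_bytes as a parameter makes it easy to compare outputs if we decide to use more than 3 bytes for IPv6.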
Thanks, Karsten