Filename: 298-canonical-families.txt
Title: Putting family lines in canonical form
Author: Nick Mathewson
Created: 31-Oct-2018
Status: Open
Target: 0.3.6.x
1. Introduction
With ticket #27359, we begin encoding microdescriptor families in
memory in a reference-counted form, so that if 10 relays all list the
same family, their family only needs to be stored once. For large
families, this has the potential to save a lot of RAM -- but only if
the families are the same across those relays.
Right now, family lines are often encoded in different ways, and
placed into consensuses and microdescriptor lines in whatever format
the relay reported.
This proposal describes an algorithm that authorities should use
while voting to place families into a canonical format.
This algorithm is forward-compatible, so that new family line formats
can be supported in the future.
2. The canonicalizing algorithm
To make a the family listed in a router descriptor canonical:
For all entries of the form $hexid=name or $hexid~name, remove
the =name or ~name portion.
Remove all entries of the form $hexid, where hexid is not 40
hexadecimal characters long.
If an entry is a valid nickname, put it into lower case.
If an entry is a valid $hexid, put it into upper case.
If there are any entries, add a single $hexid entry for the relay
in question, so that it is a member of its own family.
Sort all entries in lexical order.
Remove duplicate entries.
Note that if an entry is not of the form "nickname", "$hexid",
"$hexid=nickname" or "$hexid~nickname", then it will be unchanged:
this is what makes the algorithm forward-compatible.
3. When to apply this algorithm
We allocate a new consensus method number. When building a consensus
using this method or later, before encoding a family entry into a
microdescriptor, the authorities should apply the algorithm above.
Relay MAY apply this algorithm to their own families before
publishing them. Unlike authorities, relays SHOULD warn about
unrecognized family items.