Hi David!
A feature request based on (what I understand of) your hackweek project:
I often find myself in the position where I have a set of fingerprints,
and I want to know if I've "got them all". So something that inputs a
set of fingerprints, and outputs an ordered list of candidate "similar"
relays, would be a great tool.
What exactly we mean by similar, of course, will be a thing we need to
iterate on. If we had a huge dataset -- with ground truth -- we could
machine learning it and see how the black box gods answer. But while
that thought leads to "good idea, let's collect that data set", we need
something else in the mean time.
Here's a start:
* Component-wise similarity is computed differently for each component
(version/contactinfo/nickname is a string comparison, IP address is
looking for other relays on the same /x network, uptime is computing
the real uptime based on the descriptor timestamp minus the descriptor
uptime, etc),
* But the *strength* of a match is based on how few other relays
also matched. That is, we find a component match less compelling the
more relays there are that match. For example, maybe the /16 address
similarity is a weak match, because there are many relays on that same
/16, but the /24 address similarity is a strong match because there's
only one other relay closeby.
* We can start out with simple comparisons, but eventually we'll want
fuzzy comparisons, where e.g. we look at the Levenshtein distance for
nickname ("Anonymous1, Anonymous2, Anonymous3"), or check if the two
uptimes are within a threshold percentage of each other, etc.
* And then the overall similarity is a function of the strength of
each component-wise similarity. To start, we could just have it be
a linear combination of each, with equal weighting, which means the
more components have a strong match, the greater the overall similarity
score. There's a lot of opportunity for improvement here, like, weighting
certain components more, or non-linear weights where e.g. it's especially
interesting when both bandwidthrate and bandwidthburst are strong matches.
* So that's how we decide if *two* relays are similar to each other. But
if I give you *ten* relays, we need a way to figure out what those relays
have in common, to decide whether an eleventh relay is related to the
group as a whole. The easy version here is to just do pairwise comparisons
between each input relay and each consensus relay, and then sum the
overall similarities into a score for each candidate. This approach
should *implicitly* take into account similarity between input relays,
on the theory that if A is similar to C and B is similar to C, then A
is probably similar to B too. For extra credit, we could do some smarter
combination than just summing them, but I don't have anything in mind yet.
For comparison, check out phw's trnnr tool from the SybilHunter paper:
https://github.com/NullHypothesis/trnnr/blob/master/trnnr.py
which does a simplified version of the above (it only takes in one relay,
it doesn't do the "strength of match is inversely proportional to how much
of the population matches" part, and also it has a slick way of comparing
all the components at once by just turning them all into a big unified
string, which is simple but also has bizarre effects: every character in
the IP address string is equally important, and the importance of each
component is a function of how many characters it takes to represent it,
which is...weird. :) But in general, this script is a great start to
look at first.
--Roger