Hi David!
A feature request based on (what I understand of) your hackweek project:
I often find myself in the position where I have a set of fingerprints, and I want to know if I've "got them all". So something that inputs a set of fingerprints, and outputs an ordered list of candidate "similar" relays, would be a great tool.
What exactly we mean by similar, of course, will be a thing we need to iterate on. If we had a huge dataset -- with ground truth -- we could machine learning it and see how the black box gods answer. But while that thought leads to "good idea, let's collect that data set", we need something else in the mean time.
Here's a start:
* Component-wise similarity is computed differently for each component (version/contactinfo/nickname is a string comparison, IP address is looking for other relays on the same /x network, uptime is computing the real uptime based on the descriptor timestamp minus the descriptor uptime, etc),
* But the *strength* of a match is based on how few other relays also matched. That is, we find a component match less compelling the more relays there are that match. For example, maybe the /16 address similarity is a weak match, because there are many relays on that same /16, but the /24 address similarity is a strong match because there's only one other relay closeby.
* We can start out with simple comparisons, but eventually we'll want fuzzy comparisons, where e.g. we look at the Levenshtein distance for nickname ("Anonymous1, Anonymous2, Anonymous3"), or check if the two uptimes are within a threshold percentage of each other, etc.
* And then the overall similarity is a function of the strength of each component-wise similarity. To start, we could just have it be a linear combination of each, with equal weighting, which means the more components have a strong match, the greater the overall similarity score. There's a lot of opportunity for improvement here, like, weighting certain components more, or non-linear weights where e.g. it's especially interesting when both bandwidthrate and bandwidthburst are strong matches.
* So that's how we decide if *two* relays are similar to each other. But if I give you *ten* relays, we need a way to figure out what those relays have in common, to decide whether an eleventh relay is related to the group as a whole. The easy version here is to just do pairwise comparisons between each input relay and each consensus relay, and then sum the overall similarities into a score for each candidate. This approach should *implicitly* take into account similarity between input relays, on the theory that if A is similar to C and B is similar to C, then A is probably similar to B too. For extra credit, we could do some smarter combination than just summing them, but I don't have anything in mind yet.
For comparison, check out phw's trnnr tool from the SybilHunter paper: https://github.com/NullHypothesis/trnnr/blob/master/trnnr.py which does a simplified version of the above (it only takes in one relay, it doesn't do the "strength of match is inversely proportional to how much of the population matches" part, and also it has a slick way of comparing all the components at once by just turning them all into a big unified string, which is simple but also has bizarre effects: every character in the IP address string is equally important, and the importance of each component is a function of how many characters it takes to represent it, which is...weird. :) But in general, this script is a great start to look at first.
--Roger
network-health@lists.torproject.org