On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go throw the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that? I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
[0] https://en.wikipedia.org/wiki/Cosine_similarity
Cheers, Philipp