
On Tue, Aug 05, 2014 at 11:24:32AM -0400, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go throw the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that? I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
You only need O(n*log(n)) if you can define any similarity metric that respects the triangle inequality. There's a lot of research on data structures for this: https://en.wikipedia.org/wiki/Metric_tree -- Andrea Shepard <andrea@torproject.org> PGP fingerprint (ECC): BDF5 F867 8A52 4E4A BECF DE79 A4FF BC34 F01D D536 PGP fingerprint (RSA): 3611 95A4 0740 ED1B 7EA5 DF7E 4191 13D9 D0CF BDA5