On Tue, 05 Aug 2014 18:00:32 +0200 Karsten Loesing karsten@torproject.org wrote:
On 05/08/14 17:24, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go throw the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours tomorrow to clean it up and put it online.
I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
It can be O(n^2 log n).
If you concatenate the many vectors into a single one, the many cosine dot products are a single convolution operation, which is O(n log n) with the FFT. Correlate the big vector (haystack) with the search one (needle) -- indices of peaks give matches. This is cross-correlation.
(As Andrea mentioned, there are conditions. I think they're met.)
So if you want to find matches for some vector X = x1, x2, ... in the set of vectors A, B, ...
A = a1, a2, a3, ... B = b1, b2, ... etc.
then let Z = a1, a2, a3, b1, b2, ... and compute xcorr(Z, X).
Hopefully the result is almost entirely noise, with a few peaks. If there is a positive peak at some index i, then X matches the vector corresponding to the one with index i in Z.
The most useful vectors for correlation are ones which assign zero for "no measurement" (like inactivity) and which have a large variance, because that will bring down false positives significantly. Bandwidth per hour/day, for example, might be useful.
This is basically a kind of passive traffic confirmation attack. It would be funny to use it to detect traffic confirmation attacks.
Sounds great. I'm good at transforming relay data into numerical vectors (well, .csv files), but I have little to no experience with the subsequent analysis. Help much appreciated!
I could try what I described above, with the data "bandwidth over time" for some small set of relays, two of which were known to be colluding. (I apologize if that's already available, it's been a while since I used the Tor metrics.)
Mansour