# [tor-dev] Sybil attack detection

Mansour Moufid mansourmoufid at gmail.com
Wed Aug 6 18:21:48 UTC 2014

```On Tue, 05 Aug 2014 18:00:32 +0200
Karsten Loesing <karsten at torproject.org> wrote:

> On 05/08/14 17:24, Philipp Winter wrote:
> > On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
> >> Started looking into better algorithms to detect Sybil attacks on the
> >> Tor network.  Current thinking is that we should define relay similarity
> >> metrics like common IP address prefix length or time between first seen
> >> in a consensus, go throw the consensus archives, and see which relays
> >> look similar but are not defined to be in the same family.
> >
> > Do you already have some code or more details on that?
>
> Details, yes, see below.  Code, not really, but give me a few hours
> tomorrow to clean it up and put it online.
>
> > I'm quite
> > interested in this topic and I'm wondering if it wouldn't be best to
> > start with something simple like cosine similarity [0].  We would have
> > to transform a relay descriptor into a numerical vector and then
> > calculate the cosine similarity to other relay vectors.  However, this
> > might scale poorly as we would need (n^2)/2 comparisons.

It can be O(n^2 log n).

If you concatenate the many vectors into a single one, the many cosine
dot products are a single convolution operation, which is O(n log n)
with the FFT. Correlate the big vector (haystack) with the search one
(needle) -- indices of peaks give matches. This is cross-correlation.

(As Andrea mentioned, there are conditions. I think they're met.)

So if you want to find matches for some vector X = x1, x2, ... in
the set of vectors A, B, ...

A = a1, a2, a3, ...
B = b1, b2, ...
etc.

then let Z = a1, a2, a3, b1, b2, ... and compute xcorr(Z, X).

Hopefully the result is almost entirely noise, with a few peaks.
If there is a positive peak at some index i, then X matches the
vector corresponding to the one with index i in Z.

The most useful vectors for correlation are ones which assign zero
for "no measurement" (like inactivity) and which have a large
variance, because that will bring down false positives significantly.
Bandwidth per hour/day, for example, might be useful.

This is basically a kind of passive traffic confirmation attack. It
would be funny to use it to detect traffic confirmation attacks.

> Sounds great.  I'm good at transforming relay data into numerical
> vectors (well, .csv files), but I have little to no experience with
> the subsequent analysis.  Help much appreciated!

I could try what I described above, with the data "bandwidth over time"
for some small set of relays, two of which were known to be colluding.
(I apologize if that's already available, it's been a while since I
used the Tor metrics.)

Mansour
```