[tor-dev] Sybil attack detection

Karsten Loesing karsten at torproject.org
Tue Aug 5 16:00:32 UTC 2014


On 05/08/14 17:24, Philipp Winter wrote:
> On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
>> Started looking into better algorithms to detect Sybil attacks on the
>> Tor network.  Current thinking is that we should define relay similarity
>> metrics like common IP address prefix length or time between first seen
>> in a consensus, go through the consensus archives, and see which relays
>> look similar but are not declared to be in the same family.
> 
> Do you already have some code or more details on that?

Details, yes, see below.  Code, not really, but give me a few hours
tomorrow to clean it up and put it online.

> I'm quite
> interested in this topic and I'm wondering if it wouldn't be best to
> start with something simple like cosine similarity [0].  We would have
> to transform a relay descriptor into a numerical vector and then
> calculate the cosine similarity to other relay vectors.  However, this
> might scale poorly as we would need (n^2)/2 comparisons.
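The pairwise cosine similarity Philipp suggests could be sketched as below; the feature-vector layout (advertised bandwidth, uptime hours, number of exit ports) is purely hypothetical, just to make the idea concrete:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical relay feature vectors:
# (advertised bandwidth KB/s, uptime hours, number of exit ports)
relay_a = [5000.0, 720.0, 2.0]
relay_b = [5100.0, 700.0, 2.0]
print(cosine_similarity(relay_a, relay_b))
```

Note that cosine similarity only compares vector direction, not magnitude, so two relays whose features differ by a constant factor still come out as very similar.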

Sounds great.  I'm good at transforming relay data into numerical
vectors (well, .csv files), but I have little to no experience with the
subsequent analysis.  Help much appreciated!

> We might also want to weigh the vector's elements differently as some of
> them are easy to control for an attacker (advertised bandwidth, uptime,
> flags, exit policy) and others require more effort (IP address, ASN,
> observed bandwidth).  Like you mentioned, the key thing to look at might
> be time, i.e., uptime and derived features such as "total online time in
> last month" etc.

Makes sense.
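One simple way to encode that weighting is a weighted distance where each squared feature difference is scaled before summing; the weight values below are made up (nothing in this thread fixes concrete numbers), with cheap-to-fake features weighted low and expensive ones weighted high:

```python
import math

# Hypothetical per-feature weights: easy-to-control features
# (advertised bandwidth, uptime) low, hard-to-control ones
# (ASN match) high.
WEIGHTS = [0.2, 0.3, 1.0]

def weighted_distance(a, b, weights=WEIGHTS):
    """Weighted Euclidean distance between two feature vectors."""
    return math.sqrt(sum(w * (x - y) ** 2
                         for w, x, y in zip(weights, a, b)))

print(weighted_distance([1.0, 0.0, 1.0], [0.0, 0.0, 1.0]))
```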

> [0] https://en.wikipedia.org/wiki/Cosine_similarity

Here are the promised details from a private email thread (code follows
tomorrow, I hope):


I wrote a little script that fetches the latest Onionoo details JSON
file, computes pair-wise similarity metrics, and writes them to a .csv
file.  I put that file online (90M):

https://people.torproject.org/~karsten/volatile/relay-similarity.csv.gz

The contained columns are:

 - same_family: are the two relays in a mutual family relationship (1)
or not (0)?

 - common_address_prefix: how many bits does the common IP address
prefix have, from 0 to 32?  Example: 16 means same /16.

 - hours_between_first_seen: how many hours have passed between first
seeing the two relays in a consensus, starting at 0?

 - same_country: are the two relays located in the same country (1) or
not (0)?

 - same_region: are the two relays located in the same country and
region (1) or not (0)?

 - same_city: are the two relays located in the same country, region,
and city (1) or not (0)?

 - same_autonomous_system: are the two relays located in the same
autonomous system (1) or not (0)?

 - consensus_weight_difference: what's the absolute difference between
the consensus weight values, starting at 0?
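For illustration, the first two of these metrics could be computed roughly as follows; the timestamp format is an assumption (Onionoo's first_seen field uses "YYYY-MM-DD hh:mm:ss"), and the function names are mine, not the script's:

```python
import ipaddress
from datetime import datetime

def common_address_prefix(addr1, addr2):
    """Number of leading bits two IPv4 addresses share, from 0 to 32."""
    a = int(ipaddress.IPv4Address(addr1))
    b = int(ipaddress.IPv4Address(addr2))
    # XOR is zero in exactly the shared leading bits.
    return 32 - (a ^ b).bit_length()

def hours_between_first_seen(ts1, ts2):
    """Absolute difference in whole hours between two first-seen times."""
    fmt = "%Y-%m-%d %H:%M:%S"
    t1 = datetime.strptime(ts1, fmt)
    t2 = datetime.strptime(ts2, fmt)
    return int(abs((t2 - t1).total_seconds()) // 3600)

print(common_address_prefix("10.0.1.1", "10.0.2.1"))  # same /22
print(hours_between_first_seen("2014-08-01 00:00:00",
                               "2014-08-02 12:30:00"))
```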

I have at least 10 more similarity metrics on a list that I want to
implement later.  But these seem the most relevant.

And here's what I tried in R, using the first 500k lines of the .csv file:

# Load the first 500k lines of the similarity .csv and summarize.
o <- read.csv("relay-similarity-part.csv")
summary(o)

# Logistic regression: model same_family as a function of the
# common address prefix length.
glm.out <- glm(same_family ~ common_address_prefix,
               family = binomial(logit), data = o)
glm.out

# Write the four diagnostic plots to out1.png through out4.png.
png("out%1d.png", width = 720, height = 720, pointsize = 16)
plot(glm.out)
dev.off()

And this is the output I got:

===== BEGIN CONSOLE OUTPUT =====

  same_family       common_address_prefix hours_between_first_seen
 Min.   :0.000000   Min.   : 0.000        Min.   :    0
 1st Qu.:0.000000   1st Qu.: 0.000        1st Qu.: 1013
 Median :0.000000   Median : 1.000        Median : 2595
 Mean   :0.000112   Mean   : 1.453        Mean   : 6039
 3rd Qu.:0.000000   3rd Qu.: 2.000        3rd Qu.: 7137
 Max.   :1.000000   Max.   :32.000        Max.   :58678
  same_country     same_region        same_city       same_autonomous_system
 Min.   :0.0000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000
 Median :0.0000   Median :0.00000   Median :0.00000   Median :0.00000
 Mean   :0.1279   Mean   :0.02618   Mean   :0.02129   Mean   :0.01056
 3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000
 Max.   :1.0000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000
 consensus_weight_difference
 Min.   :     0
 1st Qu.:   160
 Median :  1223
 Mean   :  8608
 3rd Qu.:  5103
 Max.   :221998

Call:  glm(formula = same_family ~ common_address_prefix, family =
binomial(logit),
    data = o)

Coefficients:
          (Intercept)  common_address_prefix
             -10.2757                 0.2965

Degrees of Freedom: 499999 Total (i.e. Null);  499998 Residual
Null Deviance:	    1131
Residual Deviance: 853.3 	AIC: 857.3
null device
          1

===== END CONSOLE OUTPUT =====

https://people.torproject.org/~karsten/volatile/out1.png

https://people.torproject.org/~karsten/volatile/out2.png

https://people.torproject.org/~karsten/volatile/out3.png

https://people.torproject.org/~karsten/volatile/out4.png

I don't really understand much of that output or the produced graphs.
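For what it's worth, the fitted coefficients are log-odds: the model says log(p / (1 - p)) = -10.2757 + 0.2965 * common_address_prefix.  Plugging in the extremes gives a feel for what the fit means; this quick sketch just reuses the two numbers from the output above:

```python
import math

def family_probability(prefix_bits, intercept=-10.2757, slope=0.2965):
    """Predicted P(same_family) from the fitted logistic model above."""
    log_odds = intercept + slope * prefix_bits
    return 1.0 / (1.0 + math.exp(-log_odds))

# No shared prefix vs. identical address (/32).
print(family_probability(0))
print(family_probability(32))
```

So with no common prefix the predicted probability of a family relationship is tiny (a few in 100,000), while an identical address pushes it up to roughly 0.31 under this single-feature model.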

And of course, I'd like to include more than one similarity metric in
the analysis.  (I know the syntax for using glm() with more than one
input variable, but I'd expect results to be even harder to understand.)


All the best,
Karsten
