On 05/08/14 17:24, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go throw the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours tomorrow to clean it up and put it online.
I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
Sounds great. I'm good at transforming relay data into numerical vectors (well, .csv files), but I have little to no experience with the subsequent analysis. Help much appreciated!
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
Makes sense.
Here are the promised details from a private email thread (code follows tomorrow, I hope):
I wrote a little script that fetches the latest Onionoo details JSON file, computes pair-wise similarity metrics, and writes them to a .csv file. I put that file online (90M):
https://people.torproject.org/~karsten/volatile/relay-similarity.csv.gz
The contained columns are:
- same_family: are the two relays in a mutual family relationship (1) or not (0)?
- common_address_prefix: how many bits does the common IP address prefix have, from 0 to 32? Example: 16 means same /16.
- hours_between_first_seen: how many hours have passed between first seeing the two relays in a consensus, starting at 0?
- same_country: are the two relays located in the same country (1) or not (0)?
- same_region: are the two relays located in the same country and region (1) or not (0)?
- same_city: are the two relays located in the same country, region, and city (1) or not (0)?
- same_autonomous_system: are the two relays located in the same autonomous system (1) or not (0)?
- consensus_weight_difference: what's the absolute difference between the consensus weight values, starting at 0?
I have at least 10 more similarity metrics on a list that I want to implement later. But these see most relevant.
And here's what I tried in R, using the first 500k lines of the .csv file:
o <- read.csv("relay-similarity-part.csv") summary(o) glm.out = glm(same_family ~ common_address_prefix, binomial(logit), data = o) glm.out png("out%1d.png", width=720, height=720, pointsize=16) plot(glm.out) dev.off()
And this is the output I got:
===== BEGIN CONSOLE OUTPUT =====
same_family common_address_prefix hours_between_first_seen Min. :0.000000 Min. : 0.000 Min. : 0 1st Qu.:0.000000 1st Qu.: 0.000 1st Qu.: 1013 Median :0.000000 Median : 1.000 Median : 2595 Mean :0.000112 Mean : 1.453 Mean : 6039 3rd Qu.:0.000000 3rd Qu.: 2.000 3rd Qu.: 7137 Max. :1.000000 Max. :32.000 Max. :58678 same_country same_region same_city same_autonomous_system Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000 Mean :0.1279 Mean :0.02618 Mean :0.02129 Mean :0.01056 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000 consensus_weight_difference Min. : 0 1st Qu.: 160 Median : 1223 Mean : 8608 3rd Qu.: 5103 Max. :221998
Call: glm(formula = same_family ~ common_address_prefix, family = binomial(logit), data = o)
Coefficients: (Intercept) common_address_prefix -10.2757 0.2965
Degrees of Freedom: 499999 Total (i.e. Null); 499998 Residual Null Deviance: 1131 Residual Deviance: 853.3 AIC: 857.3 null device 1
===== END CONSOLE OUTPUT =====
https://people.torproject.org/~karsten/volatile/out1.png
https://people.torproject.org/~karsten/volatile/out2.png
https://people.torproject.org/~karsten/volatile/out3.png
https://people.torproject.org/~karsten/volatile/out4.png
I don't really understand much of that output or the produced graphs.
And of course, I'd like to include more than one similarity metric in the analysis. (I know the syntax for using glm() with more than one input variable, but I'd expect results to be even harder to understand.)
All the best, Karsten