[tor-dev] Sybil attack detection
karsten at torproject.org
Tue Aug 5 16:00:32 UTC 2014
On 05/08/14 17:24, Philipp Winter wrote:
> On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
>> Started looking into better algorithms to detect Sybil attacks on the
>> Tor network. Current thinking is that we should define relay similarity
>> metrics like common IP address prefix length or time between first seen
>> in a consensus, go throw the consensus archives, and see which relays
>> look similar but are not defined to be in the same family.
> Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours
tomorrow to clean it up and put it online.
> I'm quite
> interested in this topic and I'm wondering if it wouldn't be best to
> start with something simple like cosine similarity . We would have
> to transform a relay descriptor into a numerical vector and then
> calculate the cosine similarity to other relay vectors. However, this
> might scale poorly as we would need (n^2)/2 comparisons.
Sounds great. I'm good at transforming relay data into numerical
vectors (well, .csv files), but I have little to no experience with the
subsequent analysis. Help much appreciated!
> We might also want to weigh the vector's elements differently as some of
> them are easy to control for an attacker (advertised bandwidth, uptime,
> flags, exit policy) and others require more effort (IP address, ASN,
> observed bandwidth). Like you mentioned, the key thing to look at might
> be time, i.e., uptime and derived features such as "total online time in
> last month" etc.
>  https://en.wikipedia.org/wiki/Cosine_similarity
Here are the promised details from a private email thread (code follows
tomorrow, I hope):
I wrote a little script that fetches the latest Onionoo details JSON
file, computes pair-wise similarity metrics, and writes them to a .csv
file. I put that file online (90M):
The contained columns are:
- same_family: are the two relays in a mutual family relationship (1)
or not (0)?
- common_address_prefix: how many bits does the common IP address
prefix have, from 0 to 32? Example: 16 means same /16.
- hours_between_first_seen: how many hours have passed between first
seeing the two relays in a consensus, starting at 0?
- same_country: are the two relays located in the same country (1) or
- same_region: are the two relays located in the same country and
region (1) or not (0)?
- same_city: are the two relays located in the same country, region,
and city (1) or not (0)?
- same_autonomous_system: are the two relays located in the same
autonomous system (1) or not (0)?
- consensus_weight_difference: what's the absolute difference between
the consensus weight values, starting at 0?
I have at least 10 more similarity metrics on a list that I want to
implement later. But these see most relevant.
And here's what I tried in R, using the first 500k lines of the .csv file:
o <- read.csv("relay-similarity-part.csv")
glm.out = glm(same_family ~ common_address_prefix, binomial(logit),
data = o)
png("out%1d.png", width=720, height=720, pointsize=16)
And this is the output I got:
===== BEGIN CONSOLE OUTPUT =====
same_family common_address_prefix hours_between_first_seen
Min. :0.000000 Min. : 0.000 Min. : 0
1st Qu.:0.000000 1st Qu.: 0.000 1st Qu.: 1013
Median :0.000000 Median : 1.000 Median : 2595
Mean :0.000112 Mean : 1.453 Mean : 6039
3rd Qu.:0.000000 3rd Qu.: 2.000 3rd Qu.: 7137
Max. :1.000000 Max. :32.000 Max. :58678
same_country same_region same_city same_autonomous_system
Min. :0.0000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.0000 Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.1279 Mean :0.02618 Mean :0.02129 Mean :0.01056
3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.0000 Max. :1.00000 Max. :1.00000 Max. :1.00000
Min. : 0
1st Qu.: 160
Median : 1223
Mean : 8608
3rd Qu.: 5103
Call: glm(formula = same_family ~ common_address_prefix, family =
data = o)
Degrees of Freedom: 499999 Total (i.e. Null); 499998 Residual
Null Deviance: 1131
Residual Deviance: 853.3 AIC: 857.3
===== END CONSOLE OUTPUT =====
I don't really understand much of that output or the produced graphs.
And of course, I'd like to include more than one similarity metric in
the analysis. (I know the syntax for using glm() with more than one
input variable, but I'd expect results to be even harder to understand.)
All the best,
More information about the tor-dev