On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go through the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that? I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
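A minimal sketch of the idea in Python, with weighted cosine similarity. The feature names and weights below are made up for illustration; they are not taken from actual relay data:

```python
import math

def cosine_similarity(u, v, w=None):
    """Weighted cosine similarity between two feature vectors (1.0 = same direction)."""
    w = w or [1.0] * len(u)
    dot = sum(wi * a * b for wi, a, b in zip(w, u, v))
    nu = math.sqrt(sum(wi * a * a for wi, a in zip(w, u)))
    nv = math.sqrt(sum(wi * b * b for wi, b in zip(w, v)))
    return dot / (nu * nv)

# hypothetical relay vectors: [advertised_bw_kb, uptime_days, n_open_ports]
relay_a = [5000.0, 30.0, 3.0]
relay_b = [5100.0, 29.0, 3.0]
# weigh hard-to-fake features higher than easy-to-fake ones (weights are guesses)
weights = [0.2, 1.0, 0.5]
print(cosine_similarity(relay_a, relay_b, weights))
```

Picking the weights well is the hard part; the function itself is the easy part.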
[0] https://en.wikipedia.org/wiki/Cosine_similarity
Cheers, Philipp
On 05/08/14 17:24, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go through the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours tomorrow to clean it up and put it online.
I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
Sounds great. I'm good at transforming relay data into numerical vectors (well, .csv files), but I have little to no experience with the subsequent analysis. Help much appreciated!
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
Makes sense.
Here are the promised details from a private email thread (code follows tomorrow, I hope):
I wrote a little script that fetches the latest Onionoo details JSON file, computes pair-wise similarity metrics, and writes them to a .csv file. I put that file online (90M):
https://people.torproject.org/~karsten/volatile/relay-similarity.csv.gz
The contained columns are:
- same_family: are the two relays in a mutual family relationship (1) or not (0)?
- common_address_prefix: how many bits does the common IP address prefix have, from 0 to 32? Example: 16 means same /16.
- hours_between_first_seen: how many hours have passed between first seeing the two relays in a consensus, starting at 0?
- same_country: are the two relays located in the same country (1) or not (0)?
- same_region: are the two relays located in the same country and region (1) or not (0)?
- same_city: are the two relays located in the same country, region, and city (1) or not (0)?
- same_autonomous_system: are the two relays located in the same autonomous system (1) or not (0)?
- consensus_weight_difference: what's the absolute difference between the consensus weight values, starting at 0?
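For illustration, the common_address_prefix column could be computed like this in Python (a sketch only; the actual script may do it differently):

```python
import ipaddress

def common_address_prefix(a: str, b: str) -> int:
    """Number of leading bits two IPv4 addresses share, from 0 to 32."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()

print(common_address_prefix("192.0.2.10", "192.0.100.20"))  # 17, i.e. at least the same /16
```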
I have at least 10 more similarity metrics on a list that I want to implement later. But these seem most relevant.
And here's what I tried in R, using the first 500k lines of the .csv file:
o <- read.csv("relay-similarity-part.csv")
summary(o)
glm.out = glm(same_family ~ common_address_prefix, binomial(logit), data = o)
glm.out
png("out%1d.png", width=720, height=720, pointsize=16)
plot(glm.out)
dev.off()
And this is the output I got:
===== BEGIN CONSOLE OUTPUT =====
  same_family        common_address_prefix hours_between_first_seen
 Min.   :0.000000    Min.   : 0.000        Min.   :    0
 1st Qu.:0.000000    1st Qu.: 0.000        1st Qu.: 1013
 Median :0.000000    Median : 1.000        Median : 2595
 Mean   :0.000112    Mean   : 1.453        Mean   : 6039
 3rd Qu.:0.000000    3rd Qu.: 2.000        3rd Qu.: 7137
 Max.   :1.000000    Max.   :32.000        Max.   :58678
  same_country      same_region       same_city        same_autonomous_system
 Min.   :0.0000    Min.   :0.00000   Min.   :0.00000   Min.   :0.00000
 1st Qu.:0.0000    1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000
 Median :0.0000    Median :0.00000   Median :0.00000   Median :0.00000
 Mean   :0.1279    Mean   :0.02618   Mean   :0.02129   Mean   :0.01056
 3rd Qu.:0.0000    3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000
 Max.   :1.0000    Max.   :1.00000   Max.   :1.00000   Max.   :1.00000
 consensus_weight_difference
 Min.   :     0
 1st Qu.:   160
 Median :  1223
 Mean   :  8608
 3rd Qu.:  5103
 Max.   :221998

Call:  glm(formula = same_family ~ common_address_prefix, family = binomial(logit),
    data = o)

Coefficients:
          (Intercept)  common_address_prefix
             -10.2757                 0.2965

Degrees of Freedom: 499999 Total (i.e. Null);  499998 Residual
Null Deviance:      1131
Residual Deviance: 853.3        AIC: 857.3
null device
          1
===== END CONSOLE OUTPUT =====
https://people.torproject.org/~karsten/volatile/out1.png
https://people.torproject.org/~karsten/volatile/out2.png
https://people.torproject.org/~karsten/volatile/out3.png
https://people.torproject.org/~karsten/volatile/out4.png
I don't really understand much of that output or the produced graphs.
And of course, I'd like to include more than one similarity metric in the analysis. (I know the syntax for using glm() with more than one input variable, but I'd expect results to be even harder to understand.)
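For what it's worth, the two fitted coefficients can be read through the logistic function: the model's predicted probability that a pair is in the same family is 1/(1 + exp(-(-10.2757 + 0.2965 * prefix_bits))). A quick Python check, just plugging in the fitted values from the output:

```python
import math

def p_same_family(prefix_bits: float) -> float:
    """Fitted logistic model: intercept -10.2757, slope 0.2965 per shared prefix bit."""
    return 1.0 / (1.0 + math.exp(-(-10.2757 + 0.2965 * prefix_bits)))

print(p_same_family(0))   # ~0.00003: no common prefix -> essentially never same family
print(p_same_family(32))  # ~0.31: identical address -> roughly 1-in-3 odds
```

So each additional shared prefix bit multiplies the odds of "same family" by exp(0.2965), about 1.35.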
All the best, Karsten
There are a couple of relevant posts here:
https://lists.torproject.org/pipermail/tor-relays/2014-July/005034.html
I'd also include uptimes.
A larger byproduct: you can identify which ISPs/ASes currently have no relays, with an eye toward getting some set up there.
On Tue, 05 Aug 2014 18:00:32 +0200 Karsten Loesing karsten@torproject.org wrote:
On 05/08/14 17:24, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go through the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours tomorrow to clean it up and put it online.
I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
It can be O(n^2 log n).
If you concatenate the many vectors into a single one, the many cosine dot products are a single convolution operation, which is O(n log n) with the FFT. Correlate the big vector (haystack) with the search one (needle) -- indices of peaks give matches. This is cross-correlation.
(As Andrea mentioned, there are conditions. I think they're met.)
So if you want to find matches for some vector X = x1, x2, ... in the set of vectors A, B, ...
A = a1, a2, a3, ... B = b1, b2, ... etc.
then let Z = a1, a2, a3, b1, b2, ... and compute xcorr(Z, X).
Hopefully the result is almost entirely noise, with a few peaks. If there is a positive peak at some index i, then X matches the vector corresponding to the one with index i in Z.
The most useful vectors for correlation are ones which assign zero for "no measurement" (like inactivity) and which have a large variance, because that will bring down false positives significantly. Bandwidth per hour/day, for example, might be useful.
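A small numpy sketch of the scheme, with synthetic "bandwidth per hour" vectors. All data here is made up; the only point is the mechanics of the FFT cross-correlation:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 168                                          # one week of hourly samples (hypothetical)
relays = rng.normal(size=(50, m))                # 50 unrelated relay vectors
x = relays[17] + rng.normal(scale=0.1, size=m)   # noisy copy of relay 17 (the "needle")

z = relays.reshape(-1)                           # concatenate all vectors into the "haystack"
n = len(z) + m - 1                               # zero-pad so circular correlation acts linear
corr = np.fft.irfft(np.fft.rfft(z, n) * np.conj(np.fft.rfft(x, n)), n)
peak = int(np.argmax(corr[: len(z) - m + 1]))    # strongest alignment of x against z
print(peak // m)  # index of the matching relay: 17
```

One FFT pass finds the match instead of 50 explicit dot products; for real data the vectors would want normalizing first so high-bandwidth relays don't dominate the peaks.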
This is basically a kind of passive traffic confirmation attack. It would be funny to use it to detect traffic confirmation attacks.
Sounds great. I'm good at transforming relay data into numerical vectors (well, .csv files), but I have little to no experience with the subsequent analysis. Help much appreciated!
I could try what I described above, with the data "bandwidth over time" for some small set of relays, two of which were known to be colluding. (I apologize if that's already available, it's been a while since I used the Tor metrics.)
Mansour
On 05/08/14 18:00, Karsten Loesing wrote:
On 05/08/14 17:24, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go through the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that?
Details, yes, see below. Code, not really, but give me a few hours tomorrow to clean it up and put it online.
https://github.com/kloesing/SAD
All the best, Karsten
On Tue, Aug 05, 2014 at 11:24:32AM -0400, Philipp Winter wrote:
On Tue, Aug 05, 2014 at 11:37:45AM +0200, Karsten Loesing wrote:
Started looking into better algorithms to detect Sybil attacks on the Tor network. Current thinking is that we should define relay similarity metrics like common IP address prefix length or time between first seen in a consensus, go through the consensus archives, and see which relays look similar but are not defined to be in the same family.
Do you already have some code or more details on that? I'm quite interested in this topic and I'm wondering if it wouldn't be best to start with something simple like cosine similarity [0]. We would have to transform a relay descriptor into a numerical vector and then calculate the cosine similarity to other relay vectors. However, this might scale poorly as we would need (n^2)/2 comparisons.
We might also want to weigh the vector's elements differently as some of them are easy to control for an attacker (advertised bandwidth, uptime, flags, exit policy) and others require more effort (IP address, ASN, observed bandwidth). Like you mentioned, the key thing to look at might be time, i.e., uptime and derived features such as "total online time in last month" etc.
You only need O(n*log(n)) if you can define a similarity metric that respects the triangle inequality. There's a lot of research on data structures for this:
https://en.wikipedia.org/wiki/Metric_tree
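As a concrete (if toy) illustration of such a structure, here is a minimal vantage-point tree in Python -- nothing Tor-specific, just Euclidean vectors -- where the triangle inequality is exactly what lets whole subtrees be skipped:

```python
import math
import random

class VPTree:
    """Minimal vantage-point tree for nearest-neighbor search in a metric space."""
    def __init__(self, points):
        self.vp, rest = points[0], points[1:]
        self.inside = self.outside = None
        self.mu = 0.0
        if rest:
            ds = [math.dist(self.vp, p) for p in rest]
            self.mu = sorted(ds)[len(ds) // 2]   # median distance to the vantage point
            inner = [p for p, d in zip(rest, ds) if d < self.mu]
            outer = [p for p, d in zip(rest, ds) if d >= self.mu]
            if inner: self.inside = VPTree(inner)
            if outer: self.outside = VPTree(outer)

    def nearest(self, q, best=None):
        d = math.dist(self.vp, q)
        if best is None or d < best[0]:
            best = (d, self.vp)
        # triangle inequality: skip subtrees that cannot contain anything closer
        if self.inside and d - best[0] < self.mu:
            best = self.inside.nearest(q, best)
        if self.outside and d + best[0] >= self.mu:
            best = self.outside.nearest(q, best)
        return best

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
tree = VPTree(pts)
d, p = tree.nearest((0.5, 0.5))
print(d, p)
```

For relay similarity the distance function would be whatever metric we settle on, as long as it actually satisfies the triangle inequality (cosine similarity itself does not, though angular distance derived from it does).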