commit 22700d31144c1b8f5c3cc954634f4db9ceffec30
Author: Karsten Loesing <karsten.loesing@gmx.net>
Date:   Tue Mar 15 14:49:20 2011 +0100

    Add an example analysis and fix a minor bug.
---
 task-2680/ProcessSanitizedBridges.java |    2 +-
 task-2680/README                       |  138 +++++++++++++++++++++++++++-----
 task-2680/analysis.R                   |   50 ++++++++++++
 3 files changed, 169 insertions(+), 21 deletions(-)
diff --git a/task-2680/ProcessSanitizedBridges.java b/task-2680/ProcessSanitizedBridges.java
index 1f0e00e..c3ab6c8 100644
--- a/task-2680/ProcessSanitizedBridges.java
+++ b/task-2680/ProcessSanitizedBridges.java
@@ -84,7 +84,7 @@ public class ProcessSanitizedBridges {
           String fingerprint = Hex.encodeHexString(Base64.decodeBase64(
               parts[2] + "="));
           String descriptor = Hex.encodeHexString(Base64.decodeBase64(
-              parts[2] + "="));
+              parts[3] + "="));
           String published = parts[4] + " " + parts[5];
           String address = parts[6];
           String orPort = parts[7];
diff --git a/task-2680/README b/task-2680/README
index 65d8b85..69aec70 100644
--- a/task-2680/README
+++ b/task-2680/README
@@ -1,3 +1,22 @@
+Presenting bridge usage data so that researchers can focus on the math
+======================================================================
+
+  "Right now the process of learning how to parse bridge consensus files,
+  bridge descriptor files, match up which descriptors go with which
+  consensus line, which bridges were Running when, etc is too
+  burdensome -- researchers who want to analyze bridge reachability are
+  giving up before they even get to the part they tried to sign up for."
+  (from arma's description of this ticket in Trac)
+
+This ticket contains the code to process the data tarballs from the
+metrics website and convert them to a format that is more useful for
+researchers. This README also contains instructions for working with the
+new data formats.
+
+
+1 Processing data tarballs from metrics.tpo
+--------------------------------------------
+
 This ticket contains Java and R code to
 
 a) process bridge and relay data to convert them to a format that is more
@@ -6,13 +25,9 @@ This ticket contains Java and R code to
 
 This README has a separate section for each Java or R code snippet.
 
-The Java applications produce four output formats containing bridge
-descriptors, bridge status lines, bridge pool assignments, and hashed
-relay identities. The data formats are described below.
-
----------------------------------------------------------------------------
 
-ProcessSanitizedBridges.java
+1.1 ProcessSanitizedBridges.java
+---------------------------------
 
 - Download sanitized bridge descriptors from the metrics website, e.g.,
   https://metrics.torproject.org/data/bridge-descriptors-2011-01.tar.bz2,
@@ -31,9 +46,9 @@ ProcessSanitizedBridges.java
 - Once the Java application is done, you'll find the two files
   statuses.csv and descriptors.csv in this directory.
 
----------------------------------------------------------------------------
 
-ProcessSanitizedAssignments.java
+1.2 ProcessSanitizedAssignments.java
+-------------------------------------
 
 - Download sanitized bridge pool assignments from the metrics website,
   e.g., https://metrics.torproject.org/data/bridge-pool-assignments-2011-01.tar.bz2
@@ -48,9 +63,9 @@ ProcessSanitizedAssignments.java
 - Once the Java application is done, you'll find a file assignments.csv in
   this directory.
 
----------------------------------------------------------------------------
 
-ProcessRelayConsensuses.java
+1.3 ProcessRelayConsensuses.java
+---------------------------------
 
 - Download v3 relay consensuses from the metrics website, e.g.,
   https://metrics.torproject.org/data/consensuses-2011-01.tar.bz2, and
@@ -69,16 +84,24 @@ ProcessRelayConsensuses.java
 - Once the Java application is done, you'll find a file relays.csv in this
   directory.
 
----------------------------------------------------------------------------
 
-verify.R
+1.4 verify.R
+-------------
 
 - Run the R verification script like this:
   $ R --slave -f verify.R
 
----------------------------------------------------------------------------
 
-descriptors.csv
+2 New data formats
+-------------------
+
+The Java applications produce four output formats containing bridge
+descriptors, bridge status lines, bridge pool assignments, and hashed
+relay identities. The data formats are described below.
+
+
+2.1 descriptors.csv
+--------------------
 
 The descriptors.csv file contains one line for each bridge descriptor that
 a bridge has published. This descriptor consists of fields coming from
@@ -115,9 +138,9 @@ Bridges running early 0.2.2.x versions published faulty stats and are
 therefore removed from descriptors.csv. Bridges running 0.2.2.x or higher
 (except the faulty 0.2.2.x versions) collect stats in 24-hour intervals.
 
----------------------------------------------------------------------------
 
-statuses.csv
+2.2 statuses.csv
+-----------------
 
 The statuses.csv file contains one line for every bridge that is
 referenced in a bridge network status. Note that if a bridge is running
@@ -145,9 +168,16 @@ The columns in statuses.csv are:
 - valid: TRUE if bridge has the Valid flag, FALSE otherwise
 - v2dir: TRUE if bridge has the V2Dir flag, FALSE otherwise
 
----------------------------------------------------------------------------
+Note that there is no tight relation between statuses.csv and
+descriptors.csv when it comes to bridge usage statistics (even though
+one can link them via the bridge's server descriptor identifier). A
+bridge is free to write anything in its extra-info descriptor, including a
+few days old bridge statistics. That is in no way related to the bridge
+authority thinking that a bridge is running at a later time.
+
 
-assignments.csv
+2.3 assignments.csv
+--------------------
 
 The assignments.csv file contains one line for every running bridge and
 the rings, subrings, and buckets that BridgeDB assigned it to.
@@ -162,9 +192,9 @@ The columns in assignments.csv are:
 - flag: Flag subring
 - bucket: File bucket, only for distributor "unallocated"
 
----------------------------------------------------------------------------
 
-relays.csv
+2.4 relays.csv
+---------------
 
 The relays.csv file contains SHA-1 hashes of identity fingerprints of
 normal relays. If a bridge uses the same identity key that it also used
@@ -177,3 +207,71 @@ The columns in relays.csv are:
 - consensus: ISO-formatted consensus publication time
 - fingerprint: Hex-formatted SHA-1 hash of identity fingerprint
 
+
+3 Working with the new data formats
+------------------------------------
+
+The new data formats are plain CSV files that can be processed by many
+statistics tools, including R. For some analyses it may be sufficient to
+evaluate a single CSV file and be done. But most analyses would require
+combining two or more of the CSV files.
+
+See analysis.R for an example analysis. Run it like this:
+
+  $ R --slave -f analysis.R
+
+Below is the output in case you don't have R installed but want to know
+what kind of results to expect:
+
+Reading descriptors.csv.
+Read 97394 rows from descriptors.csv.
+28429 of these rows have bridge stats.
+Here are the first 10 rows, sorted by fingerprint and bridge stats
+interval end, and only displaying German and French users:
+                                   fingerprint      bridgestatsend de fr
+45933 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:32:47  0  0
+21782 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:33:53  0  0
+18869 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:53:07  0  0
+5182  0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 19:23:52  0  0
+48686 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 09:38:20  0  0
+33774 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 19:30:08  0  0
+67666 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 22:11:47  0  0
+31329 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-06 09:14:07  0  0
+31668 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-07 11:23:26  0  0
+16943 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-08 11:49:26  0  0
+Reading relays.csv
+Read 1606208 rows from relays.csv.
+Filtering out bridges that have been seen as relays.
+26425 descriptors remain. Again, here are the first 10 rows, sorted by
+fingerprint and bridge stats interval end, and only displaying German
+and French users:
+                                   fingerprint      bridgestatsend de fr
+45933 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:32:47  0  0
+21782 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:33:53  0  0
+18869 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 11:53:07  0  0
+5182  0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-02 19:23:52  0  0
+48686 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 09:38:20  0  0
+33774 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 19:30:08  0  0
+67666 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-03 22:11:47  0  0
+31329 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-06 09:14:07  0  0
+31668 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-07 11:23:26  0  0
+16943 0008b101e9dcbcfa11ba638b86d71afdef54a4b5 2011-01-08 11:49:26  0  0
+Reading assignments.csv
+Read 778561 rows from assignments.csv.
+Filtering out bridges that have not been distributed via email.
+14684 descriptors remain. Again, Here are the first 10 rows, sorted by
+fingerprint and bridge stats interval end, and only displaying German
+and French users:
+                                   fingerprint      bridgestatsend de fr
+66036 003817328def77002ff276a9af54bc4326a86d1c 2011-01-01 05:53:12 32  8
+61891 003817328def77002ff276a9af54bc4326a86d1c 2011-01-01 11:46:58 32  8
+54391 003817328def77002ff276a9af54bc4326a86d1c 2011-01-02 03:32:30 40  8
+73165 003817328def77002ff276a9af54bc4326a86d1c 2011-01-02 21:33:14 48  8
+82707 003817328def77002ff276a9af54bc4326a86d1c 2011-01-03 03:47:23 48  8
+5300  003817328def77002ff276a9af54bc4326a86d1c 2011-01-03 21:48:10 32  8
+23940 003817328def77002ff276a9af54bc4326a86d1c 2011-01-04 15:48:56 32  8
+2706  003817328def77002ff276a9af54bc4326a86d1c 2011-01-05 09:49:39 40  8
+17273 003817328def77002ff276a9af54bc4326a86d1c 2011-01-06 03:50:23 24  8
+72380 003817328def77002ff276a9af54bc4326a86d1c 2011-01-06 21:51:09 24  8
+Terminating.
+
diff --git a/task-2680/analysis.R b/task-2680/analysis.R
new file mode 100644
index 0000000..fbe3199
--- /dev/null
+++ b/task-2680/analysis.R
@@ -0,0 +1,50 @@
+# Read descriptors.csv.
+cat("Reading descriptors.csv.\n")
+data <- read.csv("descriptors.csv", stringsAsFactors = FALSE)
+cat("Read", length(data$fingerprint), "rows from descriptors.csv.\n")
+
+# We're interested in bridge stats. Let's filter out all descriptors that
+# don't have any bridge stats.
+data <- data[!is.na(data$bridgestatsend), ]
+cat(length(data$fingerprint), "of these rows have bridge stats.\n")
+
+# Sort data first by bridge fingeprint, then by bridge stats interval end.
+data <- data[order(data$fingerprint, data$bridgestatsend), ]
+cat("Here are the first 10 rows, sorted by fingerprint and bridge",
+    "stats\ninterval end, and only displaying German and French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# Looks good, but we should exclude all bridges that have been seen as
+# relays, or they will skew our results. Read relays.csv.
+cat("Reading relays.csv\n")
+relays <- read.csv("relays.csv", stringsAsFactors = FALSE)
+cat("Read", length(relays$fingerprint), "rows from relays.csv.\n")
+
+# Filter out all descriptors of bridges that have been seen as relays.
+cat("Filtering out bridges that have been seen as relays.\n")
+data <- data[!data$fingerprint %in% relays$fingerprint, ]
+cat(length(data$fingerprint), "descriptors remain. Again, here are the",
+    "first 10 rows, sorted by\nfingerprint and bridge stats interval",
+    "end, and only displaying German\nand French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# And finally, we only want to know bridge statistics of the bridges that
+# were distributed via email. Read assignments.csv.
+cat("Reading assignments.csv\n")
+assignments <- read.csv("assignments.csv", stringsAsFactors = FALSE)
+cat("Read", length(assignments$fingerprint), "rows from",
+    "assignments.csv.\n")
+
+# Filter out all descriptors of bridges that were not assigned to the
+# email distributor.
+cat("Filtering out bridges that have not been distributed via email.\n")
+data <- data[!data$fingerprint %in%
+    assignments[assignments$type == 'email', "fingerprint"], ]
+cat(length(data$fingerprint), "descriptors remain. Again, Here are the",
+    "first 10 rows, sorted by\nfingerprint and bridge stats interval",
+    "end, and only displaying German\nand French users:\n")
+data[1:10, c("fingerprint", "bridgestatsend", "de", "fr")]
+
+# That's it.
+cat("Terminating.\n")
+
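Note on the one-line Java fix: the sketch below illustrates why reading parts[3] instead of parts[2] matters. It assumes the status entries follow the usual "r" line layout, with the base64 identity in field 2 and the base64 descriptor digest in field 3, both without trailing padding. The class name RLineDecoder, the helper base64ToHex, and the sample values are hypothetical. It uses java.util.Base64 rather than the Hex and Base64 helpers that the patched file calls, so it is an illustration of the decoding step, not the code from ProcessSanitizedBridges.java.

```java
import java.util.Arrays;
import java.util.Base64;

class RLineDecoder {

  /** Decodes an unpadded base64 value that needs one '=' of padding to hex. */
  static String base64ToHex(String unpadded) {
    // A 20-byte SHA-1 value encodes to 27 base64 characters plus one '=',
    // which status entries omit; add it back before decoding.
    byte[] bytes = Base64.getDecoder().decode(unpadded + "=");
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      sb.append(String.format("%02x", b & 0xff));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Made-up 20-byte values, not real bridge data.
    byte[] identity = new byte[20];          // all 0x00
    byte[] digest = new byte[20];
    Arrays.fill(digest, (byte) 0xff);        // all 0xff
    Base64.Encoder enc = Base64.getEncoder().withoutPadding();

    // Hypothetical "r" line: r nickname identity digest date time address orport dirport
    String line = "r Unnamed " + enc.encodeToString(identity) + " "
        + enc.encodeToString(digest) + " 2011-01-01 00:00:00 10.0.0.1 443 0";
    String[] parts = line.split(" ");

    // Before the fix both values were decoded from parts[2], so the
    // descriptor column silently repeated the fingerprint.
    String fingerprint = base64ToHex(parts[2]);
    String descriptor = base64ToHex(parts[3]);
    System.out.println("fingerprint = " + fingerprint);
    System.out.println("descriptor  = " + descriptor);
  }
}
```

With the pre-fix indexing both printed values would be identical, which is the kind of symptom to look for in an older descriptors.csv or statuses.csv.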