[tor-commits] [torperf/master] Add scripts for filtering and visualizing Torperf data.

karsten at torproject.org
Tue Mar 8 16:42:37 UTC 2011


commit a257846fda8e01261817f9c286de43fa49c9f1c2
Author: tomb <tomb at torproject.org>
Date:   Wed Mar 2 10:39:21 2011 -0500

    Add scripts for filtering and visualizing Torperf data.
    
    Implements #2563.
---
 README               |   14 ++++-
 metrics/HOWTO        |  109 ++++++++++++++++++++++++++++++++++++
 metrics/filter.R     |  149 ++++++++++++++++++++++++++++++++++++++++++++++++++
 metrics/timematrix.R |   36 ++++++++++++
 4 files changed, 305 insertions(+), 3 deletions(-)

diff --git a/README b/README
index c4769d4..48b9617 100644
--- a/README
+++ b/README
@@ -8,9 +8,9 @@ Contents
   via SOCKS 4a/5 and outputs timing information
  util.c: Utility functions for trivsocks-client.c
  util.h: Utility function declarations for trivsocks-client.c
- 
+
  Makefile: Builds and tests trivsocks-client
- 
+
 [run_test.py: Script to automate running of trivsocks-client -- defect]
 [plot_results.R: Plot the results from run_test.py -- defect]
 
@@ -27,5 +27,13 @@ Contents
   performance data and path data
 
  LICENSE: The Tor license (3-clause BSD)
- README: This file 
+ README: This file
+
+Subdirectory /metrics
+---------------------
+
+A set of utilities for filtering and graphing Tor performance data.
 
+ filter.R: filters Torperf data and prepares it for graphing
+ timematrix.R: graphs Torperf data for interpretation and visualization
+ HOWTO: documentation and examples
diff --git a/metrics/HOWTO b/metrics/HOWTO
new file mode 100644
index 0000000..ce6f7eb
--- /dev/null
+++ b/metrics/HOWTO
@@ -0,0 +1,109 @@
+HOWTO -- How to generate nifty graphs of Tor performance
+
+Welcome traveler!  You have reached the HOWTO for some Tor performance
+and metrics stuff.  You will find here some techniques and scripts
+developed during several tasks, including:
+#1919; in which we examine Torperf runs with fixed entry guards
+#2543; in which we create graphs of #1919 data
+#2563; in which we generalize techniques from #2543 for the future
+
+The remainder of this HOWTO will walk you through what you need to do
+to use the generalized techniques to generate graphs from performance
+data.  We will use #2543 as an example, because it is from this
+example that the generalized technique was derived.  This is intended
+to be a living document.  If something is unclear, or if you wish to
+request a feature, please open a ticket:
+https://trac.torproject.org/projects/tor/newticket
+
+As far as I know, this document was written by Karsten, Mike Perry,
+and Tom Benjamin.  If you are also an author of this document, please
+add yourself to this list.
+
+Step 1: Download Torperf request files
+--------------------------------------
+
+The 15 Torperf request files are available here:
+
+  https://metrics.torproject.org/data.html#performance
+
+The wget commands to download all of them are:
+
+  wget https://metrics.torproject.org/data/torperf-50kb.data
+  wget https://metrics.torproject.org/data/torperf-1mb.data
+  wget https://metrics.torproject.org/data/torperf-5mb.data
+  wget https://metrics.torproject.org/data/torperffastratio-50kb.data
+  wget https://metrics.torproject.org/data/torperffastratio-1mb.data
+  wget https://metrics.torproject.org/data/torperffastratio-5mb.data
+  wget https://metrics.torproject.org/data/torperffast-50kb.data
+  wget https://metrics.torproject.org/data/torperffast-1mb.data
+  wget https://metrics.torproject.org/data/torperffast-5mb.data
+  wget https://metrics.torproject.org/data/torperfslow-50kb.data
+  wget https://metrics.torproject.org/data/torperfslow-1mb.data
+  wget https://metrics.torproject.org/data/torperfslow-5mb.data
+  wget https://metrics.torproject.org/data/torperfslowratio-50kb.data
+  wget https://metrics.torproject.org/data/torperfslowratio-1mb.data
+  wget https://metrics.torproject.org/data/torperfslowratio-5mb.data
+
+Note that the torperf-*.data files are quite big already (25M+).
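The file names follow a regular guard/size pattern, so the list can also be
built programmatically, e.g. to feed R's download.file().  A sketch (the
guard and size labels are taken from the wget list above; the actual fetch
is left commented out because it needs network access):

```r
## Build the 15 Torperf data URLs from the guard/size naming pattern.
guards <- c("torperf", "torperffastratio", "torperffast",
            "torperfslow", "torperfslowratio")
sizes <- c("50kb", "1mb", "5mb")
urls <- as.vector(outer(guards, sizes, function(g, s)
  paste("https://metrics.torproject.org/data/", g, "-", s, ".data", sep = "")))
length(urls)  # 15, one URL per guard set and file size
## for (u in urls) download.file(u, basename(u))  # uncomment to fetch
```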
+
+
+Step 2: Install R and ggplot2
+-----------------------------
+
+Install R 2.8 or higher.
+
+Run R as a normal user and install ggplot2, then quit R, start R again,
+and try to load ggplot2:
+
+  $ R
+  > install.packages("ggplot2")
+  > q()     # No need to save the workspace image, ever.
+  $ R
+  > library(ggplot2)
+  > q()
+
+
+Step 3: Filter the data
+-----------------------
+
+Before actually graphing the Torperf data, we should filter it to avoid
+reading 29M of data for each graph.  filter.R is a script that
+accomplishes this task, writing its output to filtered.csv.
+It is used as follows:
+
+1) Decide which files you are interested in.  If you only want graphs
+based on the fast guard nodes, you only need to crunch those files.
+
+2) Decide what date range you are interested in.  The default is to
+include all data from 2011-02-01 until 2099-12-31, by which time I
+expect this script may be obsolete.
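The cutoff works by comparing each observation's start timestamp (the first
field of a Torperf line, in seconds since the epoch) against the two dates.
A minimal sketch of that check, using a made-up timestamp:

```r
## Date-window check as filter.R applies it (the timestamp is hypothetical).
start <- as.POSIXct("2011-02-01", origin = "1970-01-01")
end   <- as.POSIXct("2099-12-31", origin = "1970-01-01")
ts <- 1299024000  # 2011-03-02 00:00:00 UTC, seconds since the epoch
keep <- as.POSIXct(ts, origin = "1970-01-01") >= start &
        as.POSIXct(ts, origin = "1970-01-01") <= end
keep  # TRUE: the observation falls inside the default window
```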
+
+usage: R --slave -f filter.R --args [-start=DATE] [-end=DATE] FILENAME(S)
+
+Filenames must be of the form guardname-basesizeSUFFIX.data,
+where SUFFIX is one of kb, mb, gb, tb:
+ eg: R --slave -f filter.R --args -start=2011-02-01 -end=2099-12-31 *.data
+ eg: R --slave -f filter.R --args torperf-50kb.data
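To see what the script extracts from such a name, here is a condensed
sketch of the parsing done in filter.R (equivalent logic, not the literal
code; "torperffast-50kb.data" is just an example name):

```r
## Derive guard label, size label, and byte count from a Torperf filename.
fname <- "torperffast-50kb.data"
parts <- unlist(strsplit(basename(fname), "-"))             # "torperffast" "50kb.data"
guard <- parts[1]
sizeLabel <- tolower(unlist(strsplit(parts[2], "\\."))[1])  # "50kb"
base <- as.integer(unlist(strsplit(sizeLabel, "[a-z]"))[1]) # 50
mult <- switch(sub("^[0-9]+", "", sizeLabel),
               kb = 1024, mb = 1024^2, gb = 1024^3, tb = 1024^4, 1)
cat(guard, sizeLabel, base * mult, "\n")
```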
+
+So, to filter all data from #1919 you would execute:
+  $ R --slave -f filter.R --args *.data
+
+The script may take some time to run if the data files are large.
+
+
+Step 4: Visualize the data
+--------------------------
+
+Let's start with plotting a matrix of completion-time graphs for every
+file size and guard selection.
+
+  $ R --slave -f timematrix.R
+
+This execution may take around 15 seconds.
+
+
+Step 5: Find a more useful visualization of the data
+----------------------------------------------------
+
+... TODO ...
+
diff --git a/metrics/filter.R b/metrics/filter.R
new file mode 100644
index 0000000..f069856
--- /dev/null
+++ b/metrics/filter.R
@@ -0,0 +1,149 @@
+## A new and "improved" genericised version of the old filter script
+## This version was created for task 2563
+## See HOWTO to put this in context
+##
+## usage: R -f filter.R --args [-start=DATE] [-end=DATE] FILENAME(S)
+## filename must be of the form guardname-basesizeSUFFIX.data
+## where SUFFIX is one of kb, mb, gb, tb
+##
+## eg: R -f filter.R --args -start=2011-02-01 -end=2099-12-31 *.data
+## eg: R -f filter.R --args torperf-50kb.data
+##
+## This R script reads in Torperf files as specified on the command line
+## and writes a filtered version to filtered.csv for later processing.
+
+FilterMain <- function(ARGV) {
+  kDebug <- FALSE # set TRUE for debugging output
+  kVersion <- 0.3
+  if (kDebug) { cat("filter.R version ", kVersion, "\n\n") }
+  files <- NULL # files is a list of torperfFiles as defined below
+  setClass("torperfFile",
+           representation(
+                          filename = "character",
+                          guardLabel = "character",
+                          filesizeLabel = "character",
+                          filesize = "numeric"
+                          )
+           )
+
+  ## default values
+  ## cutoff dates for observations
+  start <- as.POSIXct("2011-02-01", origin = "1970-01-01")
+  end   <- as.POSIXct("2099-12-31", origin = "1970-01-01")
+
+  ## process command line arguments
+  args <- unlist(strsplit(ARGV, " "))
+
+  ## there are better ways to process command line args, but this works for me :-)
+  for (arg in args) {
+    if (kDebug) { cat('arg: ', arg, "\n") }
+    ## if start date specified
+    if (length(splitArgL <- unlist(strsplit(arg, "-start="))) == 2) {
+      if (kDebug) { cat('Starting from ', splitArgL[2], '\n') }
+      start <- as.POSIXct(splitArgL[2], origin = "1970-01-01")
+      next
+    }
+    ## if end date specified
+    if (length(splitArgL <- unlist(strsplit(arg, "-end="))) == 2) {
+      if (kDebug) { cat('Ending at ', splitArgL[2], '\n') }
+      end <- as.POSIXct(splitArgL[2], origin = "1970-01-01")
+      next
+    }
+    ## if the argument is -start= or -end= we will not reach this line
+    ## now, if it isn't a parameter add it to the file list
+    ## parse filename for metadata...
+    ## examples:
+    ##   "torperf-50kb.data" should result in
+    ##     filename = "torperf-50kb.data"
+    ##     guardLabel = "torperf"
+    ##     filesizeLabel = "50kb"
+    ##     filesize = 50 * 1024
+    my.file <- new("torperfFile", filename = arg)
+
+    ## get base filename (strip out leading parts of filename such as dirname)
+    baseFilename <- basename(my.file@filename)
+    parseFileStr <- unlist(strsplit(baseFilename, "-")) ## split the two parts of the filename string
+    if (length(parseFileStr) != 2) {
+      cat("error: filenames must be of the form guard-filesize.data, you said \"", baseFilename, "\"\n")
+      quit("no", 1)
+    }
+    my.file@guardLabel <- parseFileStr[1]
+    cdr <- parseFileStr[2]
+    parseFilesize <- unlist(strsplit(cdr, "\\."))
+    if (length(parseFilesize) != 2) {
+      cat("error: tail of filename must be filesize.data, you said \"", cdr, "\"\n")
+      quit("no", 1)
+    }
+    my.file@filesizeLabel <- tolower(parseFilesize[1]) ## smash case to make our life easier
+
+    fileBaseSize <- as.integer(unlist(strsplit(my.file@filesizeLabel, "[a-z]"))[1])
+    fileSizeMultiplierStr <- unlist(strsplit(my.file@filesizeLabel, '[0-9]'))
+    fileSizeMultiplierStr <- fileSizeMultiplierStr[length(fileSizeMultiplierStr)]
+    fileSizeMultiplier <- 1 ## assume no suffix
+    if (fileSizeMultiplierStr == "kb") { fileSizeMultiplier <- 1024 }
+    if (fileSizeMultiplierStr == "mb") { fileSizeMultiplier <- 1024 * 1024 }
+    if (fileSizeMultiplierStr == "gb") { fileSizeMultiplier <- 1024 * 1024 * 1024}
+    ## yeah right, like we are really pushing TB of data
+    if (fileSizeMultiplierStr == "tb") { fileSizeMultiplier <- 1024 * 1024 * 1024 * 1024 }
+    my.file@filesize <- fileBaseSize * fileSizeMultiplier
+
+    if (kDebug) {
+      cat("i will read file: ", my.file@filename, ' ',
+        my.file@guardLabel, ' ',
+        my.file@filesizeLabel, ' ',
+        my.file@filesize, "\n")
+    }
+
+    files <- c(files, my.file)
+  }
+
+  ## sanity check arguments
+  if (start >= end) {
+    cat("error: start date must be before end date\n");
+    quit("no", 1)
+  }
+  if (length(files) == 0) {
+    cat("error: input files must be specified as arguments\n")
+    quit("no", 1) ## terminate with non-zero errlev
+  }
+
+  if (kDebug) {
+    cat("filtering from ", as.character.POSIXt(start), " to ",
+        as.character.POSIXt(end), "\n")
+  }
+
+  ## Turn a given Torperf file into a data frame with the information we care
+  ## about.
+  read <- function(filename, guards, filesize, bytes) {
+    x <- read.table(filename)
+    x <- x[as.POSIXct(x$V1, origin = "1970-01-01") >= start &
+           as.POSIXct(x$V1, origin = "1970-01-01") <= end, ]
+    if (length(x$V1) == 0)
+        NULL
+    else
+        data.frame(
+                   started = as.POSIXct(x$V1, origin = "1970-01-01"),
+                   timeout = x$V17 == 0,
+                   failure = x$V17 > 0 & x$V20 < bytes,
+                   completemillis = ifelse(x$V17 > 0 & x$V20 >= bytes,
+                     round((x$V17 * 1000 + x$V18 / 1000) -
+                           (x$V1  * 1000 + x$V19 / 1000), 0), NA),
+                   guards = guards,
+                   filesize = filesize)
+  }
+
+  ## Read in files and bind them to a single data frame.
+  filtered <- NULL
+  for (file in files) {
+    if (kDebug) { cat('Processing ', file@filename, "...\n") }
+    filtered <- rbind(filtered,
+                      read(file@filename, file@guardLabel, file@filesizeLabel, file@filesize)
+                      )
+  }
+
+  ## Write data frame to a csv file for later processing.
+  write.csv(filtered, "filtered.csv", quote = FALSE, row.names = FALSE)
+
+}
+
+FilterMain(commandArgs(TRUE))
diff --git a/metrics/timematrix.R b/metrics/timematrix.R
new file mode 100644
index 0000000..ec01a25
--- /dev/null
+++ b/metrics/timematrix.R
@@ -0,0 +1,36 @@
+# Load ggplot library without printing out stupid warnings.
+options(warn = -1)
+suppressPackageStartupMessages(library("ggplot2"))
+
+# Read in filtered data.
+data <- read.csv("filtered.csv", stringsAsFactors = FALSE)
+
+# Remove NA's
+data <- na.omit(data)
+
+# Remove "outliers"
+data <- data[(data$filesize == "50kb" & data$completemillis < 60000) |
+             (data$filesize == "1mb" & data$completemillis < 120000) |
+             (data$filesize == "5mb" & data$completemillis < 300000), ]
+
+# Plot a matrix of scatter plots; the first step is to define which data
+# we want to plot (here: data) and what to put on x and y axis.
+ggplot(data, aes(x = as.POSIXct(started), y = completemillis / 1000)) +
+
+# Draw a point for every observation, but with an alpha value of 1/10 to
+# reduce overplotting
+geom_point(alpha = 1/10) +
+
+# Draw a matrix of these graphs with different filesizes and different
+# guards.
+facet_grid(filesize ~ guards, scales = "free_y") +
+
+# Rename y axis.
+scale_y_continuous(name = "Completion time in seconds") +
+
+# Rename x axis.
+scale_x_datetime(name = "Starting time")
+
+# Save the result to a large PNG file.
+ggsave("timematrix.png", width = 10, height = 10, dpi = 150)
+


