[tor-bugs] #2687 [Torperf]: Update filter.R to parse Torperf's new .mergedata format

Sun Apr 24 10:32:11 UTC 2011

#2687: Update filter.R to parse Torperf's new .mergedata format
-------------------------+--------------------------------------------------
 Reporter:  karsten      |          Owner:  karsten     
     Type:  enhancement  |         Status:  needs_review
 Priority:  major        |      Milestone:              
Component:  Torperf      |        Version:              
 Keywords:               |         Parent:              
   Points:  4            |   Actualpoints:              
-------------------------+--------------------------------------------------

Comment(by karsten):

 Thanks for attaching your code.  It's very interesting to read someone
 else's R code.  I learn a lot by doing so. :)

 However, I'm afraid neither of our attempts is sufficient yet.  Here are
 a few comments:

  - I'd rather avoid adding another dependency on the "stringr" package
 unless we have to.  I replaced the str_* functions with standard R
 functions, e.g., unlist(strsplit()), which seemed to work.

  - Writing to CSV doesn't work yet.  I'd like to know if we can export the
 parsed data easily.

  - The major issue is that parsing takes much too long.  I parsed 1 week
 of 50 KB downloads containing 4247 rows, 2013 of which are measurements.
 Your script took 2:54 minutes for this task, or 1:47 minutes when using
 the standard R functions instead of the stringr stuff.  My script takes
 0:25 minutes for this task, which is still far too long.  For comparison,
 reading the output CSV file takes only 314 milliseconds.  People will want
 to parse months or even years of data coming from a dozen Torperf runs or
 more.  This should take minutes, not hours.  So, we should aim for at
 most a few seconds for the week of data.  Plus, the script should scale
 linearly with the amount of data, which I'm not sure is the case for our
 attempts.
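
 One direction that might be worth trying is to avoid per-line loops and
 split all lines in a single vectorized pass using only base R.  The
 following is a minimal sketch, not a tested replacement for filter.R: the
 function name parse_kv, the sample lines, and the key names in them are
 made up for illustration, and it assumes that every token has the form
 KEY=value with no spaces inside values.  I have not timed it on real
 .mergedata files.

```r
# Hypothetical sample input; real .mergedata keys and values may differ.
lines <- c("START=1303603200.12 DATACOMPLETE=1303603202.50 FILESIZE=51200",
           "START=1303603500.07 FILESIZE=51200 DIDTIMEOUT=0")

# parse_kv is an illustrative helper, not part of Torperf or filter.R.
parse_kv <- function(lines) {
  # Split every line into its space-separated tokens in one call.
  tokens <- strsplit(lines, " ", fixed = TRUE)
  # Remember which input line each token came from.
  row <- rep(seq_along(tokens), lengths(tokens))
  flat <- unlist(tokens)

  # Position of the first "=" in each token; -1 means no "=" was found.
  eq <- regexpr("=", flat, fixed = TRUE)
  keep <- eq > 0
  flat <- flat[keep]
  row <- row[keep]
  eq <- eq[keep]

  key <- substr(flat, 1, eq - 1)
  value <- substr(flat, eq + 1, nchar(flat))

  # One column per distinct key; keys missing from a line stay NA.
  keys <- unique(key)
  m <- matrix(NA_character_, nrow = length(lines), ncol = length(keys),
              dimnames = list(NULL, keys))
  m[cbind(row, match(key, keys))] <- value
  as.data.frame(m, stringsAsFactors = FALSE)
}

parsed <- parse_kv(lines)

# Export to CSV and read it back to check the round trip.
out <- tempfile(fileext = ".csv")
write.csv(parsed, out, row.names = FALSE)
reread <- read.csv(out, stringsAsFactors = FALSE)
```

 All values come out as character strings here, so numeric columns would
 still need converting, e.g., with as.numeric().  Since each step operates
 on whole vectors, I would expect this to scale roughly linearly with the
 number of lines, but that needs to be measured.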

  If we cannot find an efficient way to parse these files, let's take one
 step back.  What data formats are there that allow us to add or remove
 columns easily and that can be parsed efficiently in R?  CSV is fast, but
 inflexible.  Space-separated key-value pairs are flexible, but apparently
 slow.  What else is there?  XML?

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2687#comment:9>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online