[tor-bugs] #2687 [Torperf]: Update filter.R to parse Torperf's new .mergedata format

Tue Apr 26 07:11:09 UTC 2011

#2687: Update filter.R to parse Torperf's new .mergedata format
-------------------------+--------------------------------------------------
 Reporter:  karsten      |          Owner:  karsten     
     Type:  enhancement  |         Status:  needs_review
 Priority:  major        |      Milestone:              
Component:  Torperf      |        Version:              
 Keywords:               |         Parent:              
   Points:  4            |   Actualpoints:              
-------------------------+--------------------------------------------------

Comment(by karsten):

 Pasting your email and replying to it here:

 > What I am saying is, maybe R is just the wrong tool for the really
 > string-heavy stuff.  I could write a small parser in C, using lex and
 > yacc so that the parser can be an efficient state machine.  This
 > parser could then be called from an R script.  The parser does the
 > front-end string processing and dumps the result into a CSV file.  We
 > then read the CSV into the R code to crunch the stats.
 >
 > Seems like this would use the best features of each tool.  I can
 > certainly make my current R approach output to CSV; that is just a few
 > lines of code at the bottom.  I was focusing on testing the data
 > structure before producing text output.
 >
 > Since the input language is so simple and has a regular grammar, the
 > state machine will be super efficient: there is no need for lookahead
 > or LR parsing the way there would be with a context-free grammar.  The
 > advantage is that the state machine would run byte by byte over the
 > input in a single pass.  Memory requirements are very low, since you
 > only need to buffer on an as-needed basis.  You don't have to read in
 > the characters as a large matrix, which may be what R does.  I don't
 > know how many lines R buffers at once, but with lex and yacc you know
 > the buffer is a small constant size.  That way we really know that our
 > O(n) single pass through the text doesn't have any hidden side costs.
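
 On the buffering question: base R's readLines() takes an n argument, so
 a file can be read through an open connection in fixed-size chunks of
 lines rather than all at once.  A minimal sketch, where
 process_in_chunks, handler and chunk_size are hypothetical names and
 the default chunk size is an arbitrary choice:

```r
# Read a text file in fixed-size chunks of lines so that only one chunk
# is held in memory at a time. All names here are illustrative.
process_in_chunks <- function(path, handler, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))          # always release the connection
  total <- 0L
  repeat {
    # On an open connection, readLines() resumes where the last call stopped.
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break
    handler(chunk)             # caller-supplied work on this chunk
    total <- total + length(chunk)
  }
  total                        # number of lines processed
}
```

 The handler receives a character vector of at most chunk_size lines, so
 it can still use vectorized string functions on each chunk.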

 I haven't given up hope of using R as the tool to parse Torperf output
 yet.  After all, we are free to choose whatever data format we want, so
 we can still make it fit our parsing capabilities.  What I'd really like
 to avoid is adding yet another technology to the Torperf zoo.  And
 parsing lines with key-value pairs shouldn't be a hard problem for R.

 Do you know which part of our R approaches slows us down so horribly?  I
 think R is just mad at us for using a for loop and wants us to use a real
 R thing instead.  But I haven't figured out a better way to parse the
 data.
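
 One candidate for a more R-like approach is to replace the per-line for
 loop with vectorized string operations.  The sketch below is only an
 illustration, not filter.R itself: it assumes each line consists of
 space-separated KEY=VALUE tokens, and parse_kv_lines and the sample keys
 are made-up names.

```r
# Parse lines of space-separated KEY=VALUE tokens without an explicit
# per-line loop. The input layout is an assumption, not the actual
# .mergedata specification.
parse_kv_lines <- function(lines) {
  # Split all lines into tokens in one vectorized call.
  tokens <- strsplit(lines, " ", fixed = TRUE)
  # Record which input line each token came from.
  row <- rep(seq_along(tokens), vapply(tokens, length, integer(1)))
  flat <- unlist(tokens)
  # Locate the first "=" in each token; drop tokens that have none.
  eq <- regexpr("=", flat, fixed = TRUE)
  keep <- eq > 0
  row <- row[keep]; flat <- flat[keep]; eq <- eq[keep]
  key <- substr(flat, 1, eq - 1)
  value <- substr(flat, eq + 1, nchar(flat))
  # Fill a character matrix by (line, key) position; missing keys stay NA.
  keys <- unique(key)
  out <- matrix(NA_character_, nrow = length(lines), ncol = length(keys),
                dimnames = list(NULL, keys))
  out[cbind(row, match(key, keys))] <- value
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Hypothetical sample input.
sample_lines <- c("START=1.5 DATACOMPLETE=2.25 SOURCE=a",
                  "START=3.0 SOURCE=b")
parsed <- parse_kv_lines(sample_lines)
```

 All columns come back as character, so numeric fields would still need
 an as.numeric() conversion before crunching the stats.  Whether this is
 actually faster on real Torperf data would have to be measured.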

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2687#comment:11>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online