[tor-bugs] #2687 [Torperf]: Update filter.R to parse Torperf's new .mergedata format

Tue Apr 26 09:03:57 UTC 2011

#2687: Update filter.R to parse Torperf's new .mergedata format
-------------------------+--------------------------------------------------
 Reporter:  karsten      |          Owner:  karsten     
     Type:  enhancement  |         Status:  needs_review
 Priority:  major        |      Milestone:              
Component:  Torperf      |        Version:              
 Keywords:               |         Parent:              
   Points:  4            |   Actualpoints:              
-------------------------+--------------------------------------------------

Comment(by rransom):

 Replying to [comment:11 karsten]:
 > Pasting your email and replying to it here:
 >
 > > What I am saying is, maybe R is just the wrong tool for the really
 > > string-heavy stuff.  I could write a small parser in C, using lex and
 > > yacc so that the parser can be an efficient state machine.  This
 > > parser could then be called from an R script.  The parser does the
 > > front end string processing and can dump it into the csv.  We then
 > > read the csv into the R code to crunch the stats.
 > >
 > > Seems like this would use the best features of each tool.  I can
 > > certainly make my current R approach output to csv, that is just a few
 > > lines of code at the bottom.  I was focusing on testing the data
 > > structure before producing text output.

 Yes, it would be a good approach if R really couldn't parse your file
 efficiently.

 > > Since the input language is so simple and has a regular level grammar
 > > the state machine will be super efficient since there is no need for a
 > > lookahead or LR parsing the way there would be with a context free
 > > grammar.  The advantage is that the state machine would be run byte by
 > > byte over the input in a single pass.  Very low memory requirement
 > > since you only need to buffer on an as needed basis.  You don't have
 > > to read in the characters as a large matrix which may be what R does.

 ''Not'' likely.  It looks very much like R just reads in a line at a time
 using whatever buffering stdio provides.

 > > I don't know how many lines R buffers at once, but with lex and yacc
 > > you know the buffer is a small constant size.  That way we really know
 > > that our O(n) single pass through the text doesn't have any hidden
 > > side costs.

 It's not a single pass through the text.  Each time you process an input
 line, you copy all of the preceding lines:

 {{{
 mergedata_vector <- c(mergedata_vector, my.mergedata)
 }}}

 That's O(n^2^) right there.
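
 A minimal sketch of the difference, not taken from filter.R: `parse_line`
 is a hypothetical stand-in for the per-line parsing, and both function
 names are made up for illustration.  Growing a vector with `c()` copies
 everything accumulated so far on each iteration, while filling a
 preallocated vector writes each slot once (R can usually do that update
 in place).

 {{{
 # Hypothetical stand-in for whatever is parsed out of each input line.
 parse_line <- function(line) nchar(line)

 # Quadratic: every c() call allocates a new vector and copies all of the
 # elements collected so far.
 grow_by_append <- function(lines) {
   result <- integer(0)
   for (line in lines) {
     result <- c(result, parse_line(line))
   }
   result
 }

 # Linear: allocate the result once, then fill it slot by slot.
 fill_preallocated <- function(lines) {
   result <- integer(length(lines))
   for (i in seq_along(lines)) {
     result[i] <- parse_line(lines[i])
   }
   result
 }

 # Both variants produce the same vector; only the cost differs.
 lines <- rep("example line", 20000)
 stopifnot(identical(grow_by_append(lines), fill_preallocated(lines)))
 }}}

 In base R, `vapply(lines, parse_line, integer(1), USE.NAMES = FALSE)`
 would be another way to avoid the repeated copying.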

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/2687#comment:12>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online