[tor-bugs] #3036 [Torperf]: Tweak Torperf's .mergedata format and make it the new default

Thu Apr 28 20:00:36 UTC 2011

#3036: Tweak Torperf's .mergedata format and make it the new default
-------------------------+--------------------------------------------------
 Reporter:  karsten      |          Owner:  karsten
     Type:  enhancement  |         Status:  new    
 Priority:  normal       |      Milestone:         
Component:  Torperf      |        Version:         
 Keywords:               |         Parent:         
   Points:               |   Actualpoints:         
-------------------------+--------------------------------------------------
 Right now, we have three Torperf data formats: the .data files containing
 the output of trivsocks-client.c, the .extradata files containing the
 output of the Python script attached to Tor's control port, and the
 .mergedata files containing the consolidation of the two formats.

 I'd like to tweak the .mergedata format to make it easier to process, and
 I want to make it the new default Torperf output format.

 Here's what I'd like to change:

  - Every data point in the new .mergedata format should contain the meta
 data that are necessary to generate Torperf graphs.  These meta data
 include the file size, the source (moria, siv, ferrinii, etc.), and
 possibly a custom guard choice and/or custom circuit build timeout.  I
 could imagine adding these meta data as `FILESIZE=51200, SOURCE=ferrinii,
 GUARDS=slowratio, CBT=75`.

  One motivation for this change is to remove the dependency on the
 filename, which is how we currently encode these meta data, e.g.,
 `slowratio75cbt-50kb.mergedata`.

  Also, I'd like to be able to concatenate multiple Torperf files and have
 a single file for a) the standard Torperf runs of a given month and b) the
 Torperf runs from a given experiment.  This makes it easier for people to
 download and process our Torperf data.

  - We should combine the SEC and USEC fields and simply write timestamps
 as floats with a precision of, say, two decimal places, like we do in
 `LAUNCH=1302523261.18`.  For example, `STARTSEC=1302523501
 STARTUSEC=703442` would become `START=1302523501.70`.  This saves a lot of
 bytes and maybe even a few CPU cycles when parsing the individual fields
 of a data point.

  - When measuring hidden service performance as in #1944, we should add
 custom fields for the various hidden service substeps, e.g.,
 `START_RENDCIRC`, `GOT_INTROCIRC`, etc.
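
 The first two changes above could be sketched roughly as follows.  This is
 only an illustration, not the actual `consolidate_stats.py` code.  It
 assumes that a data point is a single line of space-separated `KEY=VALUE`
 pairs, and it truncates (rather than rounds) microseconds to two decimal
 places so that no carry into the seconds is needed.  The function names
 `parse_data_point`, `merge_sec_usec`, and `convert_line` are hypothetical.

```python
def parse_data_point(line):
    """Split a line of space-separated KEY=VALUE pairs into a dict.

    Assumption: values never contain spaces.
    """
    return dict(field.split("=", 1) for field in line.split())


def merge_sec_usec(fields):
    """Replace each FOOSEC/FOOUSEC pair with a single FOO field.

    The result is a string with two decimal places; microseconds are
    truncated, not rounded, so the seconds part never changes.
    """
    merged = {}
    for key, value in fields.items():
        if key.endswith("USEC") and key[:-4] + "SEC" in fields:
            continue  # folded into the matching SEC field below
        name = key[:-3]
        if key.endswith("SEC") and name + "USEC" in fields:
            usec = int(fields[name + "USEC"])
            merged[name] = "%d.%02d" % (int(value), usec // 10000)
        else:
            merged[key] = value
    return merged


def convert_line(line, meta):
    """Prepend meta data fields and merge SEC/USEC pairs of one data point."""
    out = dict(meta)  # meta data first, e.g. FILESIZE and SOURCE
    out.update(merge_sec_usec(parse_data_point(line)))
    return " ".join("%s=%s" % item for item in out.items())


if __name__ == "__main__":
    old = "STARTSEC=1302523501 STARTUSEC=703442 LAUNCH=1302523261.18"
    print(convert_line(old, {"FILESIZE": "51200", "SOURCE": "ferrinii"}))
```

 Building the timestamp from the two integers avoids any floating-point
 rounding; fields that are not part of a SEC/USEC pair, such as `LAUNCH`,
 are passed through unchanged.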

 What do you think?  Do these changes make sense?  If so, here are the next
 steps:

  - The first step in this endeavor is to wait for the results of #2687
 where we try to implement an efficient .mergedata parser in R.

  - The next step would be to change `consolidate_stats.py` to add the new
 meta data fields and combine SEC and USEC fields for us.

  - As soon as we have the new .mergedata format, I'll update metrics-db to
 aggregate the various Torperf files and prepare them for the metrics
 website.  I'll also update metrics-web to parse the .mergedata format
 instead of the .data format.  And of course, I'll update the
 [https://metrics.torproject.org/papers/data-2011-03-14.pdf Overview of
 Statistical Data in the Tor Network] to describe the new format.

  - Once we start working on #2565, we might want to dump the .data and
 .extradata formats entirely and have Torperf only output the .mergedata
 format.
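
 To illustrate why self-describing data points make concatenated files
 easy to process, here is a minimal sketch that splits the lines of a
 combined file back into per-source, per-file-size groups.  It assumes
 space-separated `KEY=VALUE` pairs and the `SOURCE` and `FILESIZE` field
 names proposed above; `group_by_metadata` is a hypothetical helper, not
 part of metrics-db or metrics-web.

```python
from collections import defaultdict


def group_by_metadata(lines, keys=("SOURCE", "FILESIZE")):
    """Group data points by the values of the given meta data fields.

    Returns a dict mapping a tuple of meta data values to the list of
    parsed data points (dicts) that carry those values.  A data point
    lacking one of the fields gets None in that position.
    """
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip empty lines between concatenated files
        fields = dict(field.split("=", 1) for field in line.split())
        groups[tuple(fields.get(key) for key in keys)].append(fields)
    return dict(groups)


if __name__ == "__main__":
    combined = [
        "SOURCE=moria FILESIZE=51200 START=1302523501.70",
        "SOURCE=siv FILESIZE=51200 START=1302523561.12",
        "SOURCE=moria FILESIZE=51200 START=1302523801.45",
    ]
    for group, points in group_by_metadata(combined).items():
        print(group, len(points))
```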

-- 
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/3036>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online