[tor-bugs] #20331 [Metrics/Metrics website]: Speed up Metrics graphs by up to 4 seconds

Tor Bug Tracker & Wiki blackhole at torproject.org
Mon Oct 10 09:25:15 UTC 2016


#20331: Speed up Metrics graphs by up to 4 seconds
-----------------------------------------+-----------------
     Reporter:  karsten                  |      Owner:
         Type:  enhancement              |     Status:  new
     Priority:  Medium                   |  Milestone:
    Component:  Metrics/Metrics website  |    Version:
     Severity:  Normal                   |   Keywords:
Actual Points:                           |  Parent ID:
       Points:                           |   Reviewer:
      Sponsor:                           |
-----------------------------------------+-----------------
 Some of the .csv files on Metrics have grown over time, with `clients.csv`
 being the largest at a current size of 25M.  Still, we're reading these
 files every time we draw a graph, even when the request differs only
 slightly in parameters like start or end date.  This approach was fine
 when files were small, but now it makes graph updates slower than they
 have to be.  On my laptop with an SSD it takes 4 seconds to read that file
 into memory.  I'd guess that it takes a few seconds on the server, too,
 and that's waiting time for the user between hitting the "Update graph"
 button and receiving a new graph image back.  In other words, graphs could
 be even more interactive if we resolved this issue.

 I've been aware of this issue for a while now, but somehow I only thought
 of fixing it by putting the .csv files into a database, which would
 require a fair bit of code.  Today I came up with a simpler idea: we could
 simply cache .csv file contents in memory.  All .csv files together
 require 89M on disk, and even if they double in size, that still seems
 feasible to keep in memory.
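 To make the idea concrete, here's a minimal sketch of such an in-memory
 cache, written as a plain wrapper around `utils::read.csv()` rather than
 using `memoise` (the function and variable names here are made up for
 illustration, not part of any existing code):

 {{{
 # Hypothetical in-memory cache for parsed .csv files.  Entries expire
 # after max_age seconds, so stale data is re-read from disk.
 csv_cache <- new.env()

 cached_read_csv <- function(path, max_age = 3600) {
   entry <- csv_cache[[path]]
   age <- if (is.null(entry)) Inf else
     as.numeric(difftime(Sys.time(), entry$loaded, units = "secs"))
   if (age < max_age) {
     return(entry$data)  # cache hit: no disk read
   }
   data <- utils::read.csv(path)
   csv_cache[[path]] <- list(data = data, loaded = Sys.time())
   data
 }
 }}}

 The advantage of rolling our own wrapper would be full control over
 invalidation; the advantage of `memoise` is that it's already written.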

 I briefly looked into the `memoise` package that's also available in
 Debian stable (`r-cran-memoise`), but the version in Debian doesn't
 support the `timeout` parameter, and I'm also not sure whether that
 mechanism would clutter memory over time by never invalidating older
 copies.  But assuming we can get it running, here's a minimal example with
 a recent version of the package that does support the `timeout` parameter
 (with `***` marking where `read.csv()` returns file contents from memory):

 {{{
 > require(memoise)
 Loading required package: memoise
 > read.csv <- memoise(utils::read.csv, ~timeout(10))
 >
 > while (TRUE) {
 +   print(paste(Sys.time(), " Reading clients.csv"))
 +   read.csv("clients.csv")
 +   print(paste(Sys.time(), " Sleeping for a second"))
 +   Sys.sleep(1)
 + }
 [1] "2016-10-10 11:18:11  Reading clients.csv"
 [1] "2016-10-10 11:18:15  Sleeping for a second"
 [1] "2016-10-10 11:18:16  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:16  Sleeping for a second"
 [1] "2016-10-10 11:18:17  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:17  Sleeping for a second"
 [1] "2016-10-10 11:18:18  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:18  Sleeping for a second"
 [1] "2016-10-10 11:18:19  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:19  Sleeping for a second"
 [1] "2016-10-10 11:18:20  Reading clients.csv"
 [1] "2016-10-10 11:18:24  Sleeping for a second"
 [1] "2016-10-10 11:18:25  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:25  Sleeping for a second"
 [1] "2016-10-10 11:18:26  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:26  Sleeping for a second"
 [1] "2016-10-10 11:18:27  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:27  Sleeping for a second"
 [1] "2016-10-10 11:18:28  Reading clients.csv"     ***
 [1] "2016-10-10 11:18:28  Sleeping for a second"
 }}}

 We could set the timeout to 1 hour or so, to ensure that processed .csv
 file contents are at most 1 hour old.  These files are only updated twice
 per day anyway, so that should be fine.
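 Assuming the newer `memoise` API from the transcript above, the one-hour
 variant would just change the timeout value (3600 seconds here stands in
 for "about an hour"; untested sketch):

 {{{
 # Cache read.csv() results for one hour using the recent memoise API
 # that supports timeout-based invalidation.
 require(memoise)
 read.csv <- memoise(utils::read.csv, ~timeout(3600))
 }}}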

 A minor downside is that the first user in a given hour will experience a
 delay of a few seconds while the .csv file is read into memory.  This
 could be avoided by periodically checking whether file contents have
 changed and refreshing the cache proactively.
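 One way to check whether contents changed would be to key the cache on the
 file's modification time, so a re-read only happens after the twice-daily
 update actually lands on disk.  A sketch (again with made-up names, not
 `memoise`):

 {{{
 # Hypothetical mtime-based cache: re-read a .csv file only when its
 # modification time on disk differs from the cached one.
 csv_cache <- new.env()

 read_csv_if_changed <- function(path) {
   mtime <- file.mtime(path)
   entry <- csv_cache[[path]]
   if (!is.null(entry) && identical(entry$mtime, mtime)) {
     return(entry$data)  # file unchanged since last read
   }
   data <- utils::read.csv(path)
   csv_cache[[path]] <- list(data = data, mtime = mtime)
   data
 }
 }}}

 Combined with a periodic background call per file, this would also remove
 the first-user delay, at the cost of a bit more moving machinery.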

 Something else I didn't evaluate is race conditions: what happens if we're
 reading a .csv file into memory and a second request comes in that needs
 the same file?

 Final note: let's not overengineer this.  If we switch to another graphing
 engine we'll lose this effort.  Still, it seems like low-hanging fruit to
 eliminate this delay of a few seconds.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/20331>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

