[tor-bugs] #29388 [Internal Services/Tor Sysadmin Team]: Find out requirements for running Prometheus

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Mar 7 19:59:28 UTC 2019


#29388: Find out requirements for running Prometheus
-------------------------------------------------+-------------------------
 Reporter:  ln5                                  |          Owner:  anarcat
     Type:  task                                 |         Status:
                                                 |  assigned
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:  #29389                               |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by anarcat):

 okay, I had an interesting conversation with folks in #prometheus on
 freenode about the topic of downsampling. The prom folks argue that
 downsampling is not necessary because the TSDB (time-series database)
 compresses samples very efficiently: apparently, the "worst-case
 compression is 1.3 bytes per sample", which means that, for a year of
 samples taken every minute, you get:


 {{{
 > 1.3byte/minute * year

   (1.3 * (byte / minute)) * year = 683.748 kilobytes

 }}}

 ... that is, about 683KB per metric per year. A typical "node exporter"
 target exposes about 2500 metrics, which, times our current ~80 host setup,
 means an entire year of samples would take up 136GB (127GiB):

 {{{
 > 1.3byte/minute * year * 2500 * 80

   (1.3 * (byte / minute)) * year * 2500 * 80 = 136.7496 gigabytes

 > 1.3byte/minute * year * 2500 * 80 to Gibyte

   (1.3 * (byte / minute)) * year * 2500 * 80 = approx. 127.35799 gibibytes

 }}}
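
 Or, as a quick Python sketch of the same back-of-the-envelope math (the
 1.3 bytes/sample figure is the worst case quoted above; 2500 metrics and
 80 hosts are our assumptions, not measurements):

 {{{
#!python
# Worst-case storage estimate: 1.3 bytes/sample, 2500 metrics/host,
# 80 hosts, one sample per metric per minute, for a year.
BYTES_PER_SAMPLE = 1.3
METRICS_PER_HOST = 2500
HOSTS = 80
SCRAPE_INTERVAL = 60              # seconds
YEAR = 365.25 * 24 * 3600         # seconds

samples_per_metric = YEAR / SCRAPE_INTERVAL
total = BYTES_PER_SAMPLE * samples_per_metric * METRICS_PER_HOST * HOSTS
print(total / 1e9, "GB")          # ~136.7 GB
print(total / 2**30, "GiB")       # ~127.4 GiB
 }}}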

 This is actually not bad at all in terms of total amount of storage. The
 problem they identified is more the performance impact of doing historical
 queries. One of them (SuperQ) said that queries can take as much as 20-30s
 when the cache is cold and ~500ms when hot. But I guess this is something
 that can also be figured out later.
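
 For the record, this is the kind of year-long query where that cold-cache
 latency shows up. A minimal sketch against the standard Prometheus HTTP API
 (the server name is made up, and node_load1 is just an example metric):

 {{{
#!python
# Query a full year of a single metric through the Prometheus HTTP API.
# /api/v1/query_range with query/start/end/step is the standard
# range-query endpoint; the hostname is hypothetical.
import time
import requests

now = time.time()
resp = requests.get(
    "http://prometheus.example.org:9090/api/v1/query_range",
    params={
        "query": "node_load1",
        "start": now - 365 * 24 * 3600,   # one year ago
        "end": now,
        "step": "1h",                     # ~8760 points per series
    },
)
print(resp.elapsed)                        # wall-clock time of the query
print(len(resp.json()["data"]["result"]))  # number of series returned
 }}}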

 Still, the disk usage difference with Munin is kind of dramatic. Here's
 the disk usage on my home server, running with three targets. You can see
 Prometheus (yellow line) slowly filling up the disk up to its retention
 limit (~300 days, using around 12GiB) while Munin (green line) stays
 steadily at 380MB.

 [[Image(prometheus-2-0-stats.png, 700px)]]

 That's with about four targets. If we extrapolate that to Tor's setup
 with 80 targets, that would give us 240GiB of disk use, about double the
 above estimate. That is probably because I didn't change the sample rate:
 I stuck to the 15-second scrape interval, while the above calculations
 used 60-second intervals. At that rate I would have expected around 25GiB
 of disk used for my four targets (127GiB / 20 * 4), whereas I only see
 about 12GiB, which goes to show Prometheus is actually pretty good at
 compressing those samples.
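
 Spelling that out (a small sketch, using the ~12GiB observed above for my
 four home targets):

 {{{
#!python
# Extrapolate the home server's observed usage (4 targets, 15s interval,
# ~12GiB) to 80 targets, and compare with the worst-case estimate.
observed_gib = 12
targets_home, targets_tpo = 4, 80

extrapolated = observed_gib * targets_tpo / targets_home
print(extrapolated, "GiB for 80 targets")            # ~240 GiB

# Worst-case estimate scaled down to 4 targets, then up 4x for the
# higher sample rate (15s instead of 60s): 127GiB / 20 * 4
expected = 127 / (targets_tpo / targets_home) * (60 / 15)
print(expected, "GiB expected vs", observed_gib, "GiB observed")  # ~25 GiB
 }}}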

 Server memory would also go a long way towards generating responsive
 graphs: if the "hot" part of the database can be cached in memory, queries
 will go much faster. Hopefully we'll rarely do queries over an entire year
 and so will not need hundreds of GiB of memory.

 Now, this brings us back to downsampling: if we *do* want to have year-
 long queries happening frequently, then we'll stumble upon those slow
 queries from time to time, so we'll need to find a solution to that
 problem, which, for the record, was determined to be
 [https://github.com/prometheus/tsdb/issues/313 out of scope] by the
 Prometheus team.

 Traditionally, the solution in Prom land is
 [https://github.com/prometheus/prometheus/blob/master/docs/federation.md
 Federation]: simply have a second server that pulls from the first at a
 different sampling frequency. So we can have a first server that scrapes
 everyone every 15 seconds and keeps two weeks of samples, then a second
 server that pulls from the first every day, a third that pulls from the
 second every month, etc. This complicates things as it requires multiple
 servers to be set up, but it also means there are now multiple data sources
 to query. Grafana *does* support querying multiple data sources, but it
 makes panels more complicated and most won't work out of the box with such
 a setup.
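
 To illustrate the mechanism, the second-level server simply scrapes the
 first one's /federate endpoint, which exposes already-collected series
 filtered by match[] selectors. A minimal sketch of what that scrape
 returns (server name is made up):

 {{{
#!python
# Fetch the series a downstream Prometheus would scrape from an upstream
# server's /federate endpoint (standard federation endpoint; the hostname
# is hypothetical).
import requests

resp = requests.get(
    "http://prometheus1.example.org:9090/federate",
    params={"match[]": '{job="node"}'},
)
# Plain text exposition format, ready to be stored by another Prometheus
# configured with a slower scrape_interval and a longer retention.
print(resp.text[:500])
 }}}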

 Others have come up with solutions for this:

  * Digital Ocean wrote a tool called
 [https://github.com/digitalocean/vulcan Vulcan] for this (now abandoned)
  * some Prometheus folks started the [https://github.com/improbable-
 eng/thanos Thanos project] ([https://improbable.io/games/blog/thanos-
 prometheus-at-scale good introduction]) that builds on top of Prometheus
 to enable downsampling
  * another project that was mentioned in the chat is
 [https://github.com/cortexproject/cortex Cortex], a "multitenant,
 horizontally scalable Prometheus as a Service" that seems to be
 specifically designed for Kubernetes
  * finally, there's a tool called [https://github.com/rapidloop/sop sop]
 ([https://www.rapidloop.com/blog/prometheus-metrics-archiving.html
 introduction]) that can extract samples from a Prometheus instance and
 archive them into "something else" like OpenTSDB, InfluxDB, or another
 Prometheus server, after downsampling

 All those solutions add more complexity to a tool that's already not very
 familiar to the team, so I would suggest we first start deploying
 Prometheus with its normal retention period (15 days) and scraping
 interval (15 seconds) and see what that brings us. This would give us a
 fairly reasonable 20GiB of disk usage to start with:

 {{{
 > (1.3byte/(15s)) * 15 d * 2500 * 80  to Gibyte

   ((1.3 * byte) / (15 * second)) * (15 * day) * 2500 * 80 =
   approx. 20.92123 gibibytes
 }}}

 Obviously, this would be extremely fast if it all lived in memory, but I
 think we could also get away with a 1:10 (2GB) or 1:5 (4GB) memory-to-disk
 ratio.
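
 For completeness, the same 15-day estimate and the memory ratios as a
 Python sketch (same assumptions as above):

 {{{
#!python
# Disk estimate for 15 days of retention at a 15s scrape interval, plus
# the 1:10 and 1:5 memory-to-disk figures mentioned above.
BYTES_PER_SAMPLE = 1.3
METRICS_PER_HOST = 2500
HOSTS = 80
RETENTION = 15 * 24 * 3600        # 15 days, in seconds
SCRAPE_INTERVAL = 15              # seconds

disk = BYTES_PER_SAMPLE * (RETENTION / SCRAPE_INTERVAL) * METRICS_PER_HOST * HOSTS
disk_gib = disk / 2**30
print(round(disk_gib, 1), "GiB of disk")          # ~20.9 GiB
print(round(disk_gib / 10, 1), "GB RAM at 1:10")  # ~2 GB
print(round(disk_gib / 5, 1), "GB RAM at 1:5")    # ~4 GB
 }}}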

 So, long story short, we should expect Prometheus to use:

  * 2GB of RAM
  * 30GB of disk

 to start with, with possibly much more disk usage (~10x growth) over time.
 Latency is of course critical, so it would be preferable to run this on SSD
 drives at least.

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/29388#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

