[tor-bugs] #31244 [Internal Services/Tor Sysadmin Team]: long term prometheus metrics

Tor Bug Tracker & Wiki blackhole at torproject.org
Tue Sep 10 15:03:41 UTC 2019


#31244: long term prometheus metrics
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  enhancement                          |         Status:  new
 Priority:  Medium                               |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Normal                               |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 In #29388, I said:

 {{{
 > (1.3byte/(15s)) * 15 d * 2500 * 80  to Gibyte

   ((1.3 * byte) / (15 * second)) * (15 * day) * 2500 * 80 =
   approx. 20.92123 gibibytes
 }}}

 If we expand this to 30d (the current retention policy), we get:

 {{{
 > 30d×1.3byte/(15s)×2500×80 to Gibyte

   (((30 * day) * (1.3 * byte)) / (15 * second)) * 2500 * 80 = approx.
 41.842461 gibibytes
 }}}
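
 The same arithmetic can be sketched in Python, which makes it easier to
 play with the parameters. This is only a back-of-the-envelope helper: the
 function name `storage_gib` is made up for illustration, and the 1.3
 bytes per sample figure is simply the one used in the calculations above.

```python
# Rough Prometheus disk estimate:
#   bytes per sample * samples per series * metrics * hosts
# The 1.3 bytes/sample default is the figure used in the ticket,
# not something measured by this script.

GIB = 2 ** 30
SECONDS_PER_DAY = 86400


def storage_gib(retention_days, interval_seconds, metrics, hosts,
                bytes_per_sample=1.3):
    """Return the estimated on-disk size in GiB."""
    samples_per_series = retention_days * SECONDS_PER_DAY / interval_seconds
    return bytes_per_sample * samples_per_series * metrics * hosts / GIB


if __name__ == "__main__":
    # 15-second scrapes, 2500 metrics, 80 hosts
    print(f"15 days: {storage_gib(15, 15, 2500, 80):.2f} GiB")
    print(f"30 days: {storage_gib(30, 15, 2500, 80):.2f} GiB")
```

 It assumes every host exposes every metric, so it should be read as an
 upper bound for a given metric count.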

 In other words, the current server should take about 40 GiB of storage.
 It's actually taking about half that:

 {{{
 21G     /var/lib/prometheus/metrics2/
 }}}

 There are a few reasons for this:

  1. we don't have 2500 metrics, we have 1289
  2. we don't have 80 hosts, we have 75
  3. each host doesn't necessarily expose all metrics

 Even ignoring point 3, scaling down to 1300 metrics over 75 hosts gives an
 estimate that roughly matches the current consumption:

 {{{
 > 30d×1.3byte/(15s)×1300×75 to Gibyte

   (((30 * day) * (1.3 * byte)) / (15 * second)) * 1300 * 75 = approx.
 20.3982 gibibytes

 }}}
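
 As a sanity check, the calculation can also be run backwards: given the
 disk usage reported above, what per-sample cost does it imply? The sketch
 below is only indicative. The helper name is hypothetical, the `21G`
 figure from `du` is rounded, and it assumes all 75 hosts expose all 1300
 metrics, which point 3 says is not quite true.

```python
# Work backwards from observed disk usage to an implied bytes/sample.
# Assumes every host exposes every metric, so the real per-sample cost
# is probably somewhat higher than what this prints.

GIB = 2 ** 30
SECONDS_PER_DAY = 86400


def implied_bytes_per_sample(disk_gib, retention_days, interval_seconds,
                             metrics, hosts):
    """Return the bytes per sample implied by the observed disk usage."""
    total_samples = (retention_days * SECONDS_PER_DAY / interval_seconds
                     * metrics * hosts)
    return disk_gib * GIB / total_samples


if __name__ == "__main__":
    # 21 GiB on disk, 30 days of 15-second scrapes, 1300 metrics, 75 hosts
    value = implied_bytes_per_sample(21, 30, 15, 1300, 75)
    print(f"{value:.2f} bytes/sample")
```

 With those inputs the result lands close to the 1.3 bytes per sample used
 in the estimates, which is why the numbers line up.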

 So let's play with those schedules a bit. Here's the same data, but with
 hourly pulls for a year:

 {{{
 > 365d×1.3byte/(1h)×1300×75 to Gibyte

   (((365 * day) * (1.3 * byte)) / (1 * hour)) * 1300 * 75 = approx.
 1.0340754 gibibytes

 }}}

 Holy macaroni! Only 1 GiB per year! We could keep 20 years of data in
 about the space we use now!

 Let's try 15-minute intervals:

 {{{
 > 365d×1.3byte/(15min)×1300×75 to Gibyte

   (((365 * day) * (1.3 * byte)) / (15 * minute)) * 1300 * 75 = approx.
 4.1363016 gibibytes

 }}}

 Still very reasonable! And a 5-minute interval will, of course, give us:

 {{{
 > 365d×1.3byte/(5min)×1300×75 to Gibyte

   (((365 * day) * (1.3 * byte)) / (5 * minute)) * 1300 * 75 = approx.
 12.408905 gibibytes

 }}}

 So, basically, we have this:

 || Frequency || Retention period || Storage used ||
 || 15 seconds||           30 days||        20 GiB||
 ||      5 min||          10 years||       120 GiB||
 ||      5 min||           5 years||        60 GiB||
 ||      5 min||            1 year||        12 GiB||
 ||     15 min||          10 years||        40 GiB||
 ||     15 min||           5 years||        20 GiB||
 ||     15 min||            1 year||         4 GiB||
 ||     1 hour||          10 years||        10 GiB||
 ||     1 hour||           5 years||         5 GiB||
 ||     1 hour||            1 year||         1 GiB||
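
 The long-term rows of that table can be regenerated with a short script,
 which is handy if the metric or host counts change. This is a sketch
 under the same assumptions as the calculations above (1.3 bytes per
 sample, 1300 metrics, 75 hosts, 365-day years); the names are
 illustrative. The table rounds down to one or two significant figures,
 so the script prints slightly larger values for the multi-year rows.

```python
# Regenerate the interval/retention grid from the ticket's parameters.

GIB = 2 ** 30
BYTES_PER_SAMPLE = 1.3   # figure used in the estimates above
METRICS = 1300           # rounded up from 1289
HOSTS = 75

INTERVALS = {"5 min": 300, "15 min": 900, "1 hour": 3600}
RETENTION_YEARS = (10, 5, 1)


def storage_gib(years, interval_seconds):
    """Estimated storage in GiB for one interval/retention combination."""
    samples_per_series = years * 365 * 86400 / interval_seconds
    return BYTES_PER_SAMPLE * samples_per_series * METRICS * HOSTS / GIB


def table_rows():
    """Return (interval label, years, GiB) tuples in table order."""
    return [(label, years, storage_gib(years, seconds))
            for label, seconds in INTERVALS.items()
            for years in RETENTION_YEARS]


if __name__ == "__main__":
    print("|| Frequency || Retention period || Storage used ||")
    for label, years, gib in table_rows():
        unit = "year" if years == 1 else "years"
        period = f"{years} {unit}"
        print(f"||{label:>11}||{period:>18}||{gib:>10.1f} GiB||")
```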

 So how long do we want to keep that stuff anyway? I like the 15-minute,
 5-year plan, personally (20 GiB), although I *also* like the idea of just
 shoving in samples every 5 minutes like we were doing with Munin, which
 costs us 12 GiB per year, or 60 GiB over five years...

 Thoughts?

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/31244#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

