[tor-bugs] #33785 [Internal Services/Tor Sysadmin Team]: cannot create new machines in ganeti cluster

Tor Bug Tracker & Wiki blackhole at torproject.org
Thu Apr 2 14:54:21 UTC 2020


#33785: cannot create new machines in ganeti cluster
-------------------------------------------------+-------------------------
 Reporter:  anarcat                              |          Owner:  anarcat
     Type:  defect                               |         Status:  assigned
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+-------------------------

Comment (by anarcat):

 some feedback from a ganeti maintainer:

 {{{
 03:40:48 <apoikos> failure reasons: FailMem: 1, FailN1: 4
 03:41:18 <apoikos> part indicates that there's no N+1 redundancy, probably
 due to not enough memory being available on the cluster to accommodate it
 03:42:05 <apoikos> You can try a manual allocation, or passing flags like
 --ignore-soft-errors and --no-capacity-checks to hail
 [...]
 10:36:12 <apoikos> I doubt rebalancing will fix it
 10:36:31 <apoikos> The thing is, the whole htools logic was built around
 Xen which does hard commit on memory
 [...]
 10:37:08 <apoikos> That's the -14GB of RAM you're seeing
 10:37:11 <anarchat> so what you're saying is that i *am* effectively using
 too much memory
 10:37:13 <anarchat> oh weird
 10:37:24 <anarchat> like the memory use from /proc doesn't match what
 ganeti expects?
 10:37:28 <apoikos> no, I'm saying you're using less memory than Ganeti
 thinks
 10:37:34 <apoikos> exactly
 10:37:41 <apoikos> because KVM's VSZ != RSS
 [...]
 10:38:17 <apoikos> Let's say it computes the worst-case scenario
 10:38:46 <apoikos> And in the worst-case scenario, where each instance
 will indeed use all of its configured memory and KSM won't save you, you
 don't have N+1
 10:39:08 <apoikos> As for the 162GB of disk, these are probably your root
 LVs, if they live on the same LVM VG as the Ganeti instance disks
 10:39:39 <anarchat> well there's also a secondary VG (vg_ganeti_hdd) for
 spinning rust that we don't see in gnt-node-list
 10:39:48 <anarchat> i wonder if that's related
 10:39:52 <apoikos> nope
 10:40:08 <apoikos> If your primary VG has anything else than Ganeti VMs on
 it, you'll see that message
 10:40:20 <anarchat> darn
 10:40:27 <anarchat> so i'd need to rebuild my nodes to fix this
 10:40:34 <apoikos> the good news is, you can tell ganeti to ignore
 specific LVs using gnt-cluster modify --reserved-lvs
 10:40:41 <anarchat> oh cool
 10:41:19 <anarchat> so i'd ignore what... vg_ganeti/root and
 vg_ganeti/swap i guess
 10:41:29 <apoikos> I guess
 10:41:50 <apoikos> The option --reserved-lvs specifies a list
 (comma-separated) of logical volume group names (regular expressions) that
 will be ignored by the cluster verify operation
 10:41:53 <anarchat> i already have   lvm reserved volumes: vg_ganeti/root,
 vg_ganeti/swap
 10:42:19 <anarchat> oh but maybe i have extra LVs on those nodes, that's
 true
 10:43:47 <anarchat> on fsn-node-03 and fsn-node-05, but not fsn-node-04
 }}}
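
To make the worst-case reasoning in the log concrete, here is a minimal Python sketch. It is not the actual htools/hail algorithm, only a simplified model under stated assumptions: every instance is charged its full configured memory (hard commit, as on Xen), and a node passes the N+1 check only if its free memory can absorb the instances it is secondary for when any single peer node fails. The function name, node names and memory sizes are all hypothetical.

```python
from collections import defaultdict


def n1_failures(nodes, instances):
    """Return the nodes that could not absorb the failure of a single peer.

    nodes:     {node_name: total_memory_mb}
    instances: list of (primary_node, secondary_node, configured_memory_mb)

    Simplified model (an assumption, not Ganeti's real logic): configured
    memory is counted in full, regardless of what the guests actually use.
    """
    committed = defaultdict(int)  # memory charged to each node for its primaries
    # failover[node][peer]: memory that lands on `node` if `peer` goes down
    failover = defaultdict(lambda: defaultdict(int))

    for primary, secondary, mem in instances:
        committed[primary] += mem
        failover[secondary][primary] += mem

    failing = []
    for name, total in nodes.items():
        free = total - committed[name]
        # Worst single-peer failure this node would have to absorb.
        worst = max(failover[name].values(), default=0)
        if free < worst:
            failing.append(name)
    return sorted(failing)


# Hypothetical three-node cluster, sizes in MB.
example_nodes = {"node-a": 64000, "node-b": 64000, "node-c": 64000}
example_instances = [
    ("node-a", "node-b", 40000),
    ("node-b", "node-c", 40000),
    ("node-c", "node-a", 20000),
]
print(n1_failures(example_nodes, example_instances))  # ['node-b']
```

In this toy cluster node-b has 24000 MB free on paper but would need 40000 MB if node-a failed, so it fails the check even if the guests' real (RSS) usage is far lower than their configured memory.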

 they also pointed to [https://github.com/ganeti/ganeti/issues/1399 upstream
 issue 1399], which reports that the Sinst field is incorrect in
 `gnt-node list`.
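
The `--reserved-lvs` part of the log can be sketched the same way: given the reserved patterns and the logical volumes present on a node, list the volumes that are neither reserved nor instance disks. This is only an illustration, not Ganeti's code. It assumes each pattern must match the whole `vg/lv` name, and the `scratch` and `disk0` volume names are hypothetical.

```python
import re


def unexpected_lvs(present, reserved_patterns, instance_disks):
    """Return LVs that are neither reserved nor known instance disks.

    present:           iterable of "vg/lv" names found on a node
    reserved_patterns: regular expressions, as passed to --reserved-lvs
    instance_disks:    set of "vg/lv" names that belong to instances

    Assumption: a pattern has to match the full name (re.fullmatch).
    """
    reserved = [re.compile(pattern) for pattern in reserved_patterns]
    return sorted(
        lv
        for lv in present
        if lv not in instance_disks
        and not any(regex.fullmatch(lv) for regex in reserved)
    )


# Reserved list taken from the log; the other LV names are made up.
reserved_list = ["vg_ganeti/root", "vg_ganeti/swap"]
node_lvs = [
    "vg_ganeti/root",
    "vg_ganeti/swap",
    "vg_ganeti/scratch",  # hypothetical extra LV
    "vg_ganeti/disk0",    # hypothetical instance disk
]
print(unexpected_lvs(node_lvs, reserved_list, {"vg_ganeti/disk0"}))
# ['vg_ganeti/scratch']
```

An extra LV like the hypothetical `vg_ganeti/scratch` is the kind of volume that would still show up even with `root` and `swap` already reserved.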

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33785#comment:3>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

