[tor-bugs] #33785 [Internal Services/Tor Sysadmin Team]: cannot create new machines in ganeti cluster



#33785: cannot create new machines in ganeti cluster
-------------------------------------------------+---------------------
 Reporter:  anarcat                              |          Owner:  tpa
     Type:  defect                               |         Status:  new
 Priority:  High                                 |      Milestone:
Component:  Internal Services/Tor Sysadmin Team  |        Version:
 Severity:  Major                                |     Resolution:
 Keywords:                                       |  Actual Points:
Parent ID:                                       |         Points:
 Reviewer:                                       |        Sponsor:
-------------------------------------------------+---------------------

Comment (by anarcat):

 note that allocating the instance to specific nodes (naming them
 explicitly with `-n`, which bypasses the allocator) works properly:

 {{{
 root@fsn-node-01:~# gnt-instance add -o debootstrap+buster -t drbd --no-wait-for-sync --disk 0:size=10G --disk 1:size=2G,name=swap --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn --no-name-check --no-ip-check -n fsn-node-05.torproject.org:fsn-node-04.torproject.org test-01.torproject.org
 Wed Apr  1 20:11:54 2020  - INFO: NIC/0 inherits netparams ['br0',
 'openvswitch', '4000']
 Wed Apr  1 20:11:54 2020  - INFO: Chose IP 116.202.120.188 from network
 gnt-fsn
 Wed Apr  1 20:11:55 2020 * creating instance disks...
 Wed Apr  1 20:12:07 2020 adding instance test-01.torproject.org to cluster
 config
 Wed Apr  1 20:12:07 2020 adding disks to cluster config
 Wed Apr  1 20:12:07 2020 * checking mirrors status
 Wed Apr  1 20:12:07 2020  - INFO: - device disk/0:  2.20% done, 3m 47s
 remaining (estimated)
 Wed Apr  1 20:12:07 2020  - INFO: - device disk/1:  1.00% done, 2m 6s
 remaining (estimated)
 Wed Apr  1 20:12:07 2020 * checking mirrors status
 Wed Apr  1 20:12:08 2020  - INFO: - device disk/0:  2.40% done, 4m 16s
 remaining (estimated)
 Wed Apr  1 20:12:08 2020  - INFO: - device disk/1:  1.80% done, 1m 8s
 remaining (estimated)
 Wed Apr  1 20:12:08 2020 * pausing disk sync to install instance OS
 Wed Apr  1 20:12:08 2020 * running the instance OS create scripts...
 }}}

 creating a standalone (non-DRBD, `plain` disk template) instance on the
 new network also works fine:

 {{{
 root@fsn-node-01:~# gnt-instance add -o debootstrap+buster -t plain --no-wait-for-sync --disk 0:size=10G --disk 1:size=2G,name=swap --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn13-02 --no-name-check --no-ip-check -n fsn-node-05.torproject.org test-02.torproject.org
 Wed Apr  1 20:17:03 2020  - INFO: NIC/0 inherits netparams ['br0',
 'openvswitch', '4000']
 Wed Apr  1 20:17:03 2020  - INFO: Chose IP 49.12.57.130 from network gnt-fsn13-02
 Wed Apr  1 20:17:04 2020 * disk 0, size 10.0G
 Wed Apr  1 20:17:04 2020 * disk 1, size 2.0G
 Wed Apr  1 20:17:04 2020 * creating instance disks...
 Wed Apr  1 20:17:05 2020 adding instance test-02.torproject.org to cluster
 config
 Wed Apr  1 20:17:05 2020 adding disks to cluster config
 Wed Apr  1 20:17:05 2020 * running the instance OS create scripts...
 Wed Apr  1 20:17:18 2020 * starting instance...
 }}}

 so this is strictly a problem with the instance allocator (`hail`).

 It also seems that there are ways of debugging the allocator, as explained
 here:

 https://github.com/ganeti/ganeti/wiki/Common-Issues#htools-debugging-hailhbal

 most notably, it suggests using the `hspace -L` command, which, in our
 case, gives us worrisome warnings:

 {{{
 root@fsn-node-01:~# hspace -L
 Warning: cluster has inconsistent data:
   - node fsn-node-05.torproject.org is missing -3049 MB ram and 470 GB
 disk
   - node fsn-node-04.torproject.org is missing -5797 MB ram and 2 GB disk
   - node fsn-node-03.torproject.org is missing -14155 MB ram and 162 GB
 disk

 The cluster has 5 nodes and the following resources:
   MEM 321400, DSK 4574256, CPU 60, VCPU 240.
 There are 27 initial instances on the cluster.
 Tiered (initial size) instance spec is:
   MEM 32768, DSK 1048576, CPU 8, using disk template 'drbd'.
 Tiered allocation results:
   -   1 instances of spec MEM 19200, DSK 460800, CPU 8
   -   1 instances of spec MEM 19200, DSK 154880, CPU 8
   - most likely failure reason: FailDisk
   - initial cluster score: 7.92595903
   -   final cluster score: 7.26099873
   - memory usage efficiency: 50.50%
   -   disk usage efficiency: 85.56%
   -   vcpu usage efficiency: 57.08%
 Standard (fixed-size) instance spec is:
   MEM 128, DSK 1024, CPU 1, using disk template 'drbd'.
 Normal (fixed-size) allocation results:
   -  44 instances allocated
   - most likely failure reason: FailDisk
   - initial cluster score: 7.92595903
   -   final cluster score: 20.56542169
   - memory usage efficiency: 40.30%
   -   disk usage efficiency: 60.61%
   -   vcpu usage efficiency: 68.75%
 }}}
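
 one way to dig into those "inconsistent data" warnings (just an idea, not
 from the wiki page) might be to look at the live per-node memory and disk
 figures and compare them by hand with what the instances on each node are
 supposed to be using:

 {{{
 # live memory and disk figures per node
 gnt-node list -o name,mtotal,mnode,mfree,dtotal,dfree
 # and re-run cluster verification to see if it flags the same mismatch
 gnt-cluster verify
 }}}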

 i also tried creating a tracing allocator wrapper that saves a copy of
 its input, in `/usr/lib/ganeti/iallocators/hail-trace`:

 {{{
 #!/bin/sh
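 # save a copy of the allocator request for later inspection,
 # then run the real hail allocator on it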

 cp "$1" /tmp/allocator-input.json
 /usr/lib/ganeti/iallocators/hail "$1"
 }}}

 then it can be used with the `-I hail-trace` parameter:

 {{{
 gnt-instance add -o debootstrap+buster -t drbd --no-wait-for-sync --disk 0:size=10G --disk 1:size=2G,name=swap --backend-parameters memory=2g,vcpus=2 --net 0:ip=pool,network=gnt-fsn --no-name-check --no-ip-check -I hail-trace test-01.torproject.org
 }}}
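
 (not tried yet, but might be useful) the captured request can also be
 inspected directly to see which per-node memory and disk numbers hail is
 working from:

 {{{
 # pretty-print the saved allocator request
 python3 -m json.tool /tmp/allocator-input.json | less
 }}}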

 the saved input then allows us to run the allocator by hand:

 {{{
 root@fsn-node-01:~# /usr/lib/ganeti/iallocators/hail --verbose /tmp/allocator-input.json
 Warning: cluster has inconsistent data:
   - node fsn-node-05.torproject.org is missing -3046 MB ram and 470 GB
 disk
   - node fsn-node-04.torproject.org is missing -5801 MB ram and 2 GB disk
   - node fsn-node-03.torproject.org is missing -14158 MB ram and 162 GB
 disk

 Received request: Allocate (Instance {name = "test-01.torproject.org",
 alias = "test-01.torproject.org", mem = 2048, dsk = 12544, disks = [Disk
 {dskSize = 10240, dskSpindles = Nothing},Disk {dskSize = 2048, dskSpindles
 = Nothing}], vcpus = 2, runSt = Running, pNode = 0, sNode = 0, idx = -1,
 util = DynUtil {cpuWeight = 1.0, memWeight = 1.0, dskWeight = 1.0,
 netWeight = 1.0}, movable = True, autoBalance = True, diskTemplate =
 DTDrbd8, spindleUse = 1, allTags = [], exclTags = [], dsrdLocTags =
 fromList [], locationScore = 0, arPolicy = ArNotEnabled, nics = [Nic {mac
 = Just "00:66:37:8b:0a:ba", ip = Just "pool", mode = Nothing, link =
 Nothing, bridge = Nothing, network = Just "f96e8644-a473-43db-874b-99f90e20af7b"}], forthcoming = False}) (AllocDetails 2 Nothing) Nothing
 {"success":false,"info":"Request failed: Group default (preferred): No
 valid allocation solutions, failure reasons: FailMem: 8, FailN1:
 12","result":[]}
 }}}

 which, interestingly, gives us the same warning, and the allocation fails
 with `FailMem` and `FailN1`.

 still not sure where that warning is coming from, but i can't help but
 wonder if the problem would go away after re-balancing the cluster.
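
 (untested here, just a sketch) a dry run of the balancer would at least
 show whether it can come up with a feasible set of moves:

 {{{
 # dry run against the local cluster (Luxi backend): only print the moves
 hbal -L
 # if the plan looks sane, execute the moves through the master daemon
 hbal -L -X
 }}}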

--
Ticket URL: <https://trac.torproject.org/projects/tor/ticket/33785#comment:1>
Tor Bug Tracker & Wiki <https://trac.torproject.org/>
The Tor Project: anonymity online

