Rebooting XCP VMs the hard way
A machine can become unresponsive: services appear down, and sometimes ping still works while ssh does not. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.
This guide strictly discusses the XCP cluster setup.
Step 1: find the VM
Production machines run on the XCP pool 'piet', and logging in to the pool is done with
The XCP commands all start with 'xe'. On-line help is available by typing
xe help --all
xe help <command>
To find the unresponsive VM (called 'laars' for the sake of an example), type
xe vm-list name-label=laars.nikhef.nl
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
          name-label ( RW): laars.nikhef.nl
         power-state ( RO): running
Take note of the uuid; some commands require the uuid for reference.
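Since later steps need the uuid, it can be handy to capture it in a shell variable instead of copying it by hand. A minimal sketch, reusing the sed pattern from Step 5; the function name vm_uuid_from_list is made up for this example.

```shell
# Hypothetical helper: read 'xe vm-list' output on stdin and print the
# value of every line that starts with 'uuid'.
vm_uuid_from_list() {
    sed -n 's/^uuid.*: \(.*\)$/\1/p'
}

# Usage on the pool (not run here):
#   vmuuid=$(xe vm-list name-label=laars.nikhef.nl | vm_uuid_from_list)
#   echo "$vmuuid"
```

xe also accepts params=uuid together with --minimal on the list commands, which prints just the value and avoids the parsing altogether.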
Step 2: try the console
Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.
xe console name-label=laars.nikhef.nl
If this does not help, try a shutdown of the machine
xe vm-shutdown name-label=laars.nikhef.nl
followed (later) by a vm-start command.
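The shutdown and start can be wrapped in one small function, so that a failed shutdown is noticed before a start is attempted. This is only a sketch: the function name restart_vm and the XE variable are assumptions made for this example, and a VM that hangs during shutdown (rather than failing) will still block it.

```shell
# XE is the command to run; overriding it (e.g. XE=echo) gives a dry run.
XE=${XE:-xe}

# Hypothetical helper: clean shutdown, then start again.
# Returns non-zero if the shutdown fails, so the caller knows to move on
# to the forced shutdown described in the next steps.
restart_vm() {
    name=$1
    if ! $XE vm-shutdown name-label="$name"; then
        echo "clean shutdown of $name failed; a forced shutdown is needed" >&2
        return 1
    fi
    $XE vm-start name-label="$name"
}

# Usage (not run here):
#   restart_vm laars.nikhef.nl
```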
In some cases the OOM killer has such a stranglehold on the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command, run on the host where the VM is running.
Step 3: find the host of the VM
List the parameters of the vm (use the UUID here):
xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.
resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
dom-id ( RO): 20
Find the host with the host-list command
xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW): vms-piet-15.inst.ipmi.nikhef.nl
    name-description ( RW): Default install of XenServer
So in this example vms-piet-15.inst.ipmi.nikhef.nl is where we have to log on.
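The vm-param-list output is long, so picking the two values out with a filter saves some scrolling. A minimal sketch; the function name vm_param is made up for this example, and it assumes the 'name ( RO): value' layout shown above.

```shell
# Hypothetical helper: read 'xe vm-param-list' output on stdin and print
# the value of the parameter named in $1 (e.g. resident-on or dom-id).
vm_param() {
    sed -n 's/^ *'"$1"' (.*): \(.*\)$/\1/p'
}

# Usage on the pool (not run here):
#   hostuuid=$(xe vm-param-list uuid=$vmuuid | vm_param resident-on)
#   domid=$(xe vm-param-list uuid=$vmuuid | vm_param dom-id)
#   xe host-list uuid=$hostuuid
```

The same values can probably be fetched one at a time with xe vm-param-get and param-name=resident-on or param-name=dom-id.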
Step 4: kill and resurrect the VM
Ssh to the host found in step 3 (vms-piet-15.inst.ipmi.nikhef.nl in this example) and find the VM in the list of domains. The second column is the dom-id.
xn list
erf.nikhef.nl 17 8192 2 Running
tbn08.nikhef.nl 19 2048 4 Running
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl 0 744 0 Running
laars.nikhef.nl 20 8192 2 Running
bosui.nikhef.nl 18 2048 2 Running
gasbel.nikhef.nl 15 2048 1 Running
Kill the VM, using its dom-id
/opt/xensource/debug/destroy_domain -domid 20
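Destroying the wrong domain takes down a healthy machine, so it may be worth looking up the dom-id by name rather than reading it off the list. A minimal sketch; the function name domid_of is made up for this example, and it assumes the VM name is the first column and the dom-id the second.

```shell
# Hypothetical helper: read 'xn list' output on stdin and print the domain
# id (second column) of the VM whose name (first column) matches exactly.
domid_of() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage on the host (not run here):
#   domid=$(xn list | domid_of laars.nikhef.nl)
#   [ -n "$domid" ] && /opt/xensource/debug/destroy_domain -domid "$domid"
```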
To resurrect it, the VM can be started again with the high-level xe commands. Sometimes, however, the virtual block device remains locked and the VM won't start.
Step 5: forget the virtual disk, and find it again
In case the VM won't restart, the VDIs (virtual disk images) that were attached to the machine need to be reset. The command 'vdi-forget' will make XCP forget all about a VDI, including that it was ever associated with a VM! So note the UUID of the VDI and of the storage repository (SR) before forgetting it.
vmname=tbn05.nikhef.nl
xe vm-disk-list name-label=$vmname
Note the VDI(s), and their SRs. Store this information for later reference.
vmuuid=`xe vm-list name-label=$vmname | sed -n 's/^uuid.*: \(.*\)$/\1/p'`
vbduuid=`xe vbd-list vm-name-label=$vmname | sed -n 's/^uuid.*: \(.*\)$/\1/p'`
vdiuuid=`xe vbd-list vm-name-label=$vmname | sed -n 's/.*vdi-uuid.*: \(.*\)$/\1/p'`
sruuid=`xe vdi-list uuid=$vdiuuid | sed -n 's/.*sr-uuid.*: \(.*\)$/\1/p'`
xe vdi-forget uuid=$vdiuuid
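Because vdi-forget cannot easily be undone, it may be worth checking that each variable really holds exactly one uuid before running it; a VM with more than one block device would put several lines into vdiuuid. A minimal sketch; the function name is_single_uuid is made up for this example.

```shell
# Hypothetical helper: succeed only if $1 is exactly one lowercase UUID.
is_single_uuid() {
    # Reject empty values.
    [ -n "$1" ] || return 1
    # Reject values that span several lines (more than one match).
    case $1 in
        *'
'*) return 1 ;;
    esac
    printf '%s\n' "$1" |
        grep -Eqx '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}'
}

# Usage, before the vdi-forget (not run here):
#   for v in "$vmuuid" "$vdiuuid" "$sruuid"; do
#       is_single_uuid "$v" || { echo "unexpected value: $v" >&2; exit 1; }
#   done
```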
Perform a rescan of the SR to make the VDI available again.
xe sr-scan uuid=<UUID-of-SR>
This step may complain about not being able to deactivate the SR because it is shared. But the next step seems to work anyway.
Re-add the VDI to the VM.
vbduuid=`xe vbd-create vdi-uuid=$vdiuuid device=1 vm-uuid=$vmuuid`
The 'device' is one of the allowed vbd device numbers which can be obtained from
xe vm-param-get param-name=allowed-VBD-devices uuid=$vmuuid
but unless the machine has more than one disk, this is usually just '1'.
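To avoid hard-coding the device number, the first allowed device can be picked from that output. A minimal sketch; the function name first_allowed_device is made up for this example, and it assumes the devices are printed on one line separated by semicolons (e.g. '1; 2; 3'), which should be checked against the real output.

```shell
# Hypothetical helper: read the allowed-VBD-devices value on stdin and
# print the first device number on the first line.
first_allowed_device() {
    sed -n '1s/^ *\([0-9][0-9]*\).*/\1/p'
}

# Usage (not run here):
#   device=$(xe vm-param-get param-name=allowed-VBD-devices uuid=$vmuuid |
#            first_allowed_device)
#   vbduuid=$(xe vbd-create vdi-uuid=$vdiuuid device=$device vm-uuid=$vmuuid)
```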
It might be necessary to set the bootable flag on the block device.
xe vbd-param-set uuid=$vbduuid bootable=true
Then start the machine and cross your fingers
xe vm-start uuid=$vmuuid
Note that there is also a vdi-unlock command; it is not known whether it works in this situation.