Rebooting XCP VMs the hard way
A machine can become unresponsive: services appear down, and sometimes ping still works but ssh doesn't. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.
This guide covers only the XCP cluster setup.
Step 1: find the VM
Production machines run on the XCP pool 'piet'. Log in to the pool with
ssh root@pool-piet.inst.ipmi.nikhef.nl
The XCP commands all start with 'xe'. On-line help is available by typing
xe help --all
and
xe help <command>
To find the unresponsive VM (called 'laars' for the sake of an example) type
xe vm-list name-label=laars.nikhef.nl
which prints
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
     name-label ( RW): laars.nikhef.nl
    power-state ( RO): running
Take note of the uuid; some commands require the uuid for reference.
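For scripting, the uuid can be captured directly instead of copied by hand. The following is a minimal sketch, assuming it runs on the pool master; vm_uuid is a hypothetical helper name, not part of xe.

```shell
# Print the uuid of the VM with the given name-label.
# vm_uuid is a hypothetical helper; it only wraps 'xe vm-list'.
vm_uuid() {
    # --minimal prints just the requested values, comma-separated,
    # without the "uuid ( RO) :" decoration.
    uuid=$(xe vm-list name-label="$1" params=uuid --minimal) || return 1
    # An empty result means no VM carries that name-label.
    [ -n "$uuid" ] || { echo "no VM named $1" >&2; return 1; }
    echo "$uuid"
}

# Usage: uuid=$(vm_uuid laars.nikhef.nl)
```

If several VMs share the same name-label, the result contains several comma-separated uuids, so it is worth checking the value before passing it on.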
Step 2: try the console
Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.
xe console name-label=laars.nikhef.nl
If this does not help, try a shutdown of the machine
xe vm-shutdown name-label=laars.nikhef.nl
followed (later) by a vm-start command.
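The shutdown and the later vm-start can be combined into one step that waits until the pool reports the VM as halted. This is only a sketch: restart_vm is a hypothetical helper name, and the poll interval and retry limit are arbitrary choices, not values taken from this setup.

```shell
# Cleanly shut down a VM, wait until it is reported halted, then start it.
# restart_vm is a hypothetical helper; $1 is the name-label,
# $2 the maximum number of polls (default 60).
restart_vm() {
    name=$1
    tries=${2:-60}
    xe vm-shutdown name-label="$name" || return 1
    while [ "$tries" -gt 0 ]; do
        state=$(xe vm-list name-label="$name" params=power-state --minimal)
        if [ "$state" = "halted" ]; then
            # The exit status of vm-start becomes the function's status.
            xe vm-start name-label="$name"
            return
        fi
        tries=$((tries - 1))
        # POLL_INTERVAL (seconds) is an assumed knob, default 5.
        sleep "${POLL_INTERVAL:-5}"
    done
    echo "$name did not reach power-state halted" >&2
    return 1
}

# Usage: restart_vm laars.nikhef.nl
```

If the function gives up, the VM is probably in the state described next, and the forced shutdown is needed.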
In some cases the OOM killer has such a stranglehold on the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command from the host where the VM is running.
Step 3: find the host of the VM
List the parameters of the vm (use the UUID here):
xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.
resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
     dom-id ( RO): 20
Find the host with the host-list command
xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW): vms-piet-15.inst.ipmi.nikhef.nl
    name-description ( RW): Default install of XenServer
So in this example vms-piet-15.inst.ipmi.nikhef.nl is where we have to log on.
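The lookups in this step can also be done with the param-get commands, which return a single value each. A minimal sketch, assuming it runs on the pool master; vm_location is a hypothetical helper name.

```shell
# Print "<host name-label> <dom-id>" for the VM with the given uuid.
# vm_location is a hypothetical helper built on xe's param-get commands.
vm_location() {
    # uuid of the host the VM is currently resident on
    host_uuid=$(xe vm-param-get uuid="$1" param-name=resident-on) || return 1
    # domain id of the VM on that host
    dom_id=$(xe vm-param-get uuid="$1" param-name=dom-id) || return 1
    # translate the host uuid into a name that can be used with ssh
    host_name=$(xe host-param-get uuid="$host_uuid" param-name=name-label) || return 1
    echo "$host_name $dom_id"
}

# Usage: vm_location bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
```

For the example VM this should report the same host name and dom-id as found by hand above.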
Step 4: kill and resurrect the VM
Ssh to root@vms-piet-15.inst.ipmi.nikhef.nl and find the VM in the list of domains. The number after the name is the dom-id noted in step 3.
xn list
erf.nikhef.nl                                           17 8192 2 Running
tbn08.nikhef.nl                                         19 2048 4 Running
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl  0  744 0 Running
laars.nikhef.nl                                         20 8192 2 Running
bosui.nikhef.nl                                         18 2048 2 Running
gasbel.nikhef.nl                                        15 2048 1 Running
Kill the VM
xn shutdown 20
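Shutting down the wrong domain ID takes out a healthy VM, so it can help to read the ID from the listing by name and compare it with the dom-id from step 3. The sketch below assumes the column layout shown above (name first, domain ID second); domid_from_xn_list is a hypothetical helper name.

```shell
# Read 'xn list' output on stdin and print the domain ID of the named VM.
# domid_from_xn_list is a hypothetical helper; it assumes the VM name is
# the first column and the domain ID the second, as in the listing above.
domid_from_xn_list() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Usage on the host: xn list | domid_from_xn_list laars.nikhef.nl
```

The control domain line starts with "Control", so it never matches a VM name.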
To resurrect it, the VM can be started again with the high-level xe commands. However, sometimes the virtual block device remains locked and the VM won't restart. In that case, the vbd needs to be removed with the command 'vdi-forget' and re-attached to the same VM (note the UUIDs!).
(Sorry, no examples yet.)
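Until worked examples are added, one preparatory step that changes nothing is to record the UUIDs before anything is forgotten. This is a minimal sketch, assuming it runs on the pool master; vm_disks is a hypothetical helper name, and it does not perform the forget or the re-attach itself.

```shell
# List the block devices of a VM together with the disk each one points to.
# vm_disks is a hypothetical helper; it is read-only.
vm_disks() {
    # uuid     = the VBD (the attachment of a disk to the VM)
    # vdi-uuid = the VDI (the virtual disk itself)
    # device   = the device name the disk is attached as
    xe vbd-list vm-uuid="$1" params=uuid,vdi-uuid,device
}

# Usage: vm_disks bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea > laars-disks.txt
```

Saving the output to a file keeps the VDI and device pairing available for re-attaching the disk afterwards.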