Rebooting XCP VMs the hard way

From PDP/Grid Wiki

A machine can become unresponsive: services appear down, and sometimes ping still works while ssh does not. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.

This guide strictly discusses the XCP cluster setup.

Step 1: find the VM

Production machines run on the XCP pool 'piet', and logging in to the pool is done with


The XCP commands all start with 'xe'. Online help is available by typing

xe help --all

for a list of all commands, or

xe help <command>

for help on a single command.

To find the unresponsive machine (called 'laars' for the sake of an example), type

xe vm-list
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
     name-label ( RW):
    power-state ( RO): running

Take note of the uuid; some commands require it to identify the VM.
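On a pool with many VMs the full listing is long. A minimal sketch of a lookup by name follows; vm_uuid_by_name is a hypothetical helper name, and it assumes the name-label is known exactly as XCP stores it (which may be a fully qualified name rather than the short 'laars').

```shell
# Sketch: print the uuid of the VM with the given name-label.
# vm_uuid_by_name is a hypothetical helper, not an xe command.
vm_uuid_by_name() {
    # params=uuid restricts the output to the uuid field;
    # --minimal prints only the value (comma-separated if several VMs match).
    xe vm-list name-label="$1" params=uuid --minimal
}
```

Usage would be along the lines of vmuuid=$(vm_uuid_by_name laars). An empty result means no VM carries exactly that name-label.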

Step 2: try the console

Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.

xe console

If this does not help, try a shutdown of the machine

xe vm-shutdown

followed (later) by a vm-start command.
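The shutdown-then-start sequence could be wrapped as below. This is only a sketch: clean_restart is a hypothetical helper name, and it assumes the VM is addressed by the uuid noted in step 1. It starts the VM only if XCP reports it as halted after the shutdown.

```shell
# Sketch: clean shutdown followed by a start, using the VM's uuid.
# clean_restart is a hypothetical helper, not an xe command.
clean_restart() {
    local uuid=$1 state
    # Ask the guest to shut down; give up if xe reports an error.
    xe vm-shutdown uuid="$uuid" || return 1
    # Only start the VM again once it is really halted.
    state=$(xe vm-param-get uuid="$uuid" param-name=power-state)
    if [ "$state" = "halted" ]; then
        xe vm-start uuid="$uuid"
    else
        echo "VM $uuid is still '$state'; not starting" >&2
        return 1
    fi
}
```

If the helper fails, the VM did not shut down cleanly and the steps below apply.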

In some cases the OOM killer has such a stranglehold over the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command, issued on the host where the VM is running.

Step 3: find the host of the VM

List the parameters of the vm (use the UUID here):

xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea

Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.

resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
dom-id ( RO): 20

Find the host with the host-list command

xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW):
    name-description ( RW): Default install of XenServer

So in this example, the host named in the name-label field is where we have to log on.
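Step 3 can also be done without reading through the full parameter list. The following is a sketch under the assumption that the VM's uuid is known; vm_location is a hypothetical helper name.

```shell
# Sketch: print "<host uuid> <host name-label> <dom-id>" for a VM.
# vm_location is a hypothetical helper, not an xe command.
vm_location() {
    local uuid=$1 host name domid
    # resident-on is the uuid of the host the VM currently runs on.
    host=$(xe vm-param-get uuid="$uuid" param-name=resident-on) || return 1
    # dom-id is the domain number of the VM on that host.
    domid=$(xe vm-param-get uuid="$uuid" param-name=dom-id) || return 1
    # Translate the host uuid into its name-label.
    name=$(xe host-param-get uuid="$host" param-name=name-label) || return 1
    echo "$host $name $domid"
}
```

Called as vm_location bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea, it prints the three values on one line.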

Step 4: kill and resurrect the VM

Ssh to that host and look up the VM in the domain list by its dom-id

xn list
                                                         17   8192  2  Running
                                                         19   2048  4  Running
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl  0    744   0  Running
                                                         20   8192  2  Running
                                                         18   2048  2  Running
                                                         15   2048  1  Running

Kill the VM's domain, using the dom-id noted in step 3 (never domain 0, which is the control domain)

/opt/xensource/debug/destroy_domain -domid 20
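Because a typo in the domain id would take down the wrong machine, a small guard around the command may be worthwhile. The sketch below is an assumption-laden wrapper: destroy_vm_domain and the DESTROY_DOMAIN override variable are hypothetical names, introduced here only so the wrapper can be tried out without touching a real domain.

```shell
# Sketch: refuse obviously wrong dom-ids before calling destroy_domain.
# destroy_vm_domain and DESTROY_DOMAIN are hypothetical names.
destroy_vm_domain() {
    local domid=$1
    case "$domid" in
        ''|*[!0-9]*)
            echo "dom-id must be a number" >&2
            return 1 ;;
        0)
            echo "refusing to destroy domain 0 (the control domain)" >&2
            return 1 ;;
    esac
    # Default to the tool named in this guide unless overridden.
    "${DESTROY_DOMAIN:-/opt/xensource/debug/destroy_domain}" -domid "$domid"
}
```

For a dry run, DESTROY_DOMAIN=echo destroy_vm_domain 20 only prints the arguments that would be passed.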

To resurrect it, the VM can be started again with the high-level xe commands. However, sometimes the virtual block device remains locked and the VM won't start.

Step 5: forget the virtual disk, and find it again

In case the VM won't restart, the VDIs (virtual disk images) that were attached to the machine need to be reset. The command 'vdi-forget' will make XCP forget all about a VDI, including that it was ever associated with a VM! So note the UUID of the VDI and of the storage repository (SR) before forgetting it. In the commands below, $vmname holds the name-label of the VM.

xe vm-disk-list name-label=$vmname

Note the VDI(s), and their SRs. Store this information for later reference.

vmuuid=`xe vm-list name-label=$vmname | sed -n 's/^uuid.*: \(.*\)$/\1/p'`
vbduuid=`xe vbd-list vm-name-label=$vmname | sed -n 's/^uuid.*: \(.*\)$/\1/p'`
vdiuuid=`xe vbd-list vm-name-label=$vmname | sed -n 's/.*vdi-uuid.*: \(.*\)$/\1/p'`
sruuid=`xe vdi-list uuid=$vdiuuid | sed -n 's/.*sr-uuid.*: \(.*\)$/\1/p'`
xe vdi-forget uuid=$vdiuuid
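Since the information is lost once vdi-forget has run, it may help to make the forget conditional on the record having been written. A sketch, assuming a single VDI and a writable record file; forget_vdi_with_record is a hypothetical helper name and the record format is arbitrary.

```shell
# Sketch: write VM, VDI and SR uuids to a file, then forget the VDI.
# forget_vdi_with_record is a hypothetical helper, not an xe command.
forget_vdi_with_record() {
    local vmuuid=$1 vdiuuid=$2 record=$3 sruuid
    # Look up the SR that holds the VDI while XCP still knows about it.
    sruuid=$(xe vdi-param-get uuid="$vdiuuid" param-name=sr-uuid) || return 1
    [ -n "$sruuid" ] || return 1
    # Append the record; do not forget anything if the write fails.
    printf 'vm=%s vdi=%s sr=%s\n' "$vmuuid" "$vdiuuid" "$sruuid" >> "$record" || return 1
    xe vdi-forget uuid="$vdiuuid"
}
```

A call such as forget_vdi_with_record $vmuuid $vdiuuid /root/forgotten-vdis.txt (the path is only an example) leaves one line per forgotten VDI in the file.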

Perform a rescan of the SR (its UUID is in $sruuid from the commands above) to make the VDI available again.

xe sr-scan uuid=<UUID-of-SR>

This step may complain about not being able to deactivate the SR because it is shared, but the next step seems to work anyway.
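Given that the scan may print a complaint, one way to tell whether it did its job is to check that the VDI is listed again afterwards. A sketch, assuming the VDI reappears under the same uuid (as the vbd-create step further on also assumes); rescan_and_check is a hypothetical helper name.

```shell
# Sketch: rescan the SR, then succeed only if the VDI is known again.
# rescan_and_check is a hypothetical helper, not an xe command.
rescan_and_check() {
    local sruuid=$1 vdiuuid=$2
    # The scan may complain (see text); its exit status is ignored here.
    xe sr-scan uuid="$sruuid" || true
    # An empty listing means XCP still does not know the VDI.
    [ -n "$(xe vdi-list uuid="$vdiuuid" params=uuid --minimal)" ]
}
```

Used as rescan_and_check $sruuid $vdiuuid && echo "VDI is back".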

Re-add the VDI to the VM.

vbduuid=`xe vbd-create vdi-uuid=$vdiuuid device=1 vm-uuid=$vmuuid`

The 'device' is one of the allowed vbd device numbers which can be obtained from

xe vm-param-get param-name=allowed-VBD-devices uuid=$vmuuid

but unless the machine has more than one disk, this is usually just '1'.
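Instead of hard-coding the device number, the first allowed one can be picked from that output. This sketch assumes the value is printed as a semicolon-separated list such as "1; 2; 3"; first_allowed_device is a hypothetical helper name.

```shell
# Sketch: print the first entry of the VM's allowed-VBD-devices list.
# first_allowed_device is a hypothetical helper, not an xe command.
first_allowed_device() {
    # Assumed output format: "1; 2; 3". Drop everything from the first ';'.
    xe vm-param-get param-name=allowed-VBD-devices uuid="$1" | sed 's/;.*//'
}
```

The result could then be passed on, for example as device=$(first_allowed_device $vmuuid) in the vbd-create call.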

It might be necessary to set the bootable flag on the block device.

xe vbd-param-set uuid=$vbduuid bootable=true

Then start the machine and cross fingers

xe vm-start uuid=$vmuuid

Note: there is also a command vdi-unlock; it is not known whether it works in this situation.