Rebooting XCP VMs the hard way

From PDP/Grid Wiki

A machine may become unresponsive: services appear to be down, and sometimes ping still works but ssh does not. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.

This guide strictly discusses the XCP cluster setup.

Step 1: find the VM

Production machines run on the XCP pool 'piet', and logging in to the pool is done with

ssh root@pool-piet.inst.ipmi.nikhef.nl

The XCP commands all start with 'xe'. On-line help is available by typing

xe help --all

and

xe help <command>

To find the unresponsive VM (called 'laars' for the sake of an example) type

xe vm-list name-label=laars.nikhef.nl
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
     name-label ( RW): laars.nikhef.nl
    power-state ( RO): running

Take note of the uuid; some commands require it to refer to the VM.
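The uuid can also be captured in a shell variable rather than copied by hand. The following is a minimal sketch; the function name vm_uuid_from_list is made up for illustration, and it only parses the 'xe vm-list' output format shown above.

```shell
# Hypothetical helper: read `xe vm-list` output on stdin and print the uuid
# value(s), one per line. It matches lines such as
#   uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
vm_uuid_from_list() {
    sed -n 's/^uuid[^:]*: *\(.*\)$/\1/p'
}

# Intended use on the pool master (not executed here):
#   vmuuid=$(xe vm-list name-label=laars.nikhef.nl | vm_uuid_from_list)
```

As an alternative, xe can print just the value itself when called with 'params=uuid --minimal'.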

Step 2: try the console

Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.

xe console name-label=laars.nikhef.nl

If this does not help, try a shutdown of the machine

xe vm-shutdown name-label=laars.nikhef.nl

followed (later) by a vm-start command.
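How long "later" is depends on the machine. One way to avoid guessing is to poll the power-state until the VM reports 'halted'. This is only a sketch: the function name wait_for_halted, the number of attempts and the interval are all assumptions, not part of the XCP tooling.

```shell
# Hypothetical helper: poll the power-state of a VM until it reads "halted".
# Arguments: VM name-label, number of attempts, seconds between attempts.
# Returns 0 once the VM is halted, 1 if it never got there.
wait_for_halted() {
    vm=$1; tries=$2; interval=$3
    while [ "$tries" -gt 0 ]; do
        state=$(xe vm-list name-label="$vm" params=power-state --minimal)
        [ "$state" = "halted" ] && return 0
        tries=$((tries - 1))
        sleep "$interval"
    done
    return 1
}

# Intended use on the pool master (not executed here):
#   xe vm-shutdown name-label=laars.nikhef.nl &
#   if wait_for_halted laars.nikhef.nl 30 10; then
#       xe vm-start name-label=laars.nikhef.nl
#   else
#       echo "shutdown did not complete; a forced shutdown is needed" >&2
#   fi
```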

In some cases the OOM killer has such a stranglehold on the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command, issued on the host where the VM is running.

Step 3: find the host of the VM

List the parameters of the vm (use the UUID here):

xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea

Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.

resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
dom-id ( RO): 20

Find the host with the host-list command

xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW): vms-piet-15.inst.ipmi.nikhef.nl
    name-description ( RW): Default install of XenServer

So in this example vms-piet-15.inst.ipmi.nikhef.nl is where we have to log on.
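The same lookup can be done without reading through the full parameter list, using 'xe vm-param-get' for the two values and 'xe host-list' for the name. A minimal sketch, in which the function name vm_location is hypothetical:

```shell
# Hypothetical helper: print "<host name-label> <dom-id>" for a VM uuid.
vm_location() {
    vmuuid=$1
    # uuid of the host the VM is currently resident on
    hostuuid=$(xe vm-param-get uuid="$vmuuid" param-name=resident-on)
    # domain id of the VM on that host
    domid=$(xe vm-param-get uuid="$vmuuid" param-name=dom-id)
    # translate the host uuid into its name-label
    hostlabel=$(xe host-list uuid="$hostuuid" params=name-label --minimal)
    echo "$hostlabel $domid"
}

# Intended use on the pool master (not executed here):
#   vm_location bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
```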

Step 4: kill and resurrect the VM

Ssh to root@vms-piet-15.inst.ipmi.nikhef.nl and find the VM in the list of running domains

xn list

erf.nikhef.nl                                17   8192  2         Running 
tbn08.nikhef.nl                              19   2048  4         Running 
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl0    744   0          Running  
laars.nikhef.nl                              20   8192  2         Running 
bosui.nikhef.nl                              18   2048  2         Running 
gasbel.nikhef.nl                             15   2048  1         Running 

Kill the VM, using its domain id (the second column)

/opt/xensource/debug/destroy_domain -domid 20
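Since destroy_domain acts on a bare number, it may be worth checking that the domain id still belongs to the intended VM before running it. The sketch below parses the 'xn list' output shown above; the function name domid_from_xn_list is made up for illustration, and the example only prints the destroy command instead of executing it.

```shell
# Hypothetical helper: read `xn list` output on stdin and print the domain id
# (second column) of the line whose first column equals the given VM name.
domid_from_xn_list() {
    awk -v name="$1" '$1 == name { print $2 }'
}

# Intended use on the host (not executed here): print the destroy command
# only when the dom-id reported by xe and the one from `xn list` agree.
#   expected=20
#   actual=$(xn list | domid_from_xn_list laars.nikhef.nl)
#   [ "$actual" = "$expected" ] && \
#       echo /opt/xensource/debug/destroy_domain -domid "$actual"
```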

To resurrect the VM, it can be started again with the high-level xe commands. However, sometimes the virtual block device remains locked, and the VM won't start.

Step 5: forget the virtual disk, and find it again

In case the VM won't restart, the VDIs (virtual disk images) that were attached to the machine need to be reset. The command 'vdi-forget' will make XCP forget all about a VDI, including that it was ever associated with a VM! So note the UUID of the VDI and of the storage repository (SR) before forgetting it.

vmname=tbn05.nikhef.nl
xe vm-disk-list name-label=$vmname

Note the VDI(s), and their SRs. Store this information for later reference.
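One simple way to store the information is to write the 'xe vm-disk-list' output to a file before touching anything. A minimal sketch; the function name save_disk_info and the file naming scheme are assumptions:

```shell
# Hypothetical helper: write the `xe vm-disk-list` output for a VM to a
# timestamped file in the given directory and print the file name.
save_disk_info() {
    vm=$1; dir=$2
    file="$dir/$vm.disks.$(date +%Y%m%d-%H%M%S)"
    xe vm-disk-list name-label="$vm" > "$file" || return 1
    echo "$file"
}

# Intended use on the pool master (not executed here):
#   save_disk_info tbn05.nikhef.nl /root
```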

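# These extractions assume the VM has exactly one virtual block device;
# with more than one, each variable would hold several uuids.
vmuuid=$(xe vm-list name-label="$vmname" | sed -n 's/^uuid.*: \(.*\)$/\1/p')
vbduuid=$(xe vbd-list vm-name-label="$vmname" | sed -n 's/^uuid.*: \(.*\)$/\1/p')
vdiuuid=$(xe vbd-list vm-name-label="$vmname" | sed -n 's/.*vdi-uuid.*: \(.*\)$/\1/p')
sruuid=$(xe vdi-list uuid="$vdiuuid" | sed -n 's/.*sr-uuid.*: \(.*\)$/\1/p')
# Forget the VDI; the uuids captured above are needed to re-attach it.
xe vdi-forget uuid="$vdiuuid"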

Perform a rescan of the SR to make the VDI available again.

xe sr-scan uuid=$sruuid

This step may complain about not being able to deactivate the SR because it is shared. But the next step seems to work anyway.

Re-add the VDI to the VM.

vbduuid=`xe vbd-create vdi-uuid=$vdiuuid device=1 vm-uuid=$vmuuid`

The 'device' is one of the allowed vbd device numbers which can be obtained from

xe vm-param-get param-name=allowed-VBD-devices uuid=$vmuuid

but unless the machine has more than one disk, this is usually just '1'.
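Instead of hard-coding the device number, the first allowed value can be picked from the command output. The sketch below assumes that xe prints the list as semicolon-separated numbers (for example '1; 2; 3'); that format and the function name first_allowed_device are assumptions to check against the actual output.

```shell
# Hypothetical helper: read the allowed-VBD-devices value on stdin and print
# its first entry. Assumes a semicolon-separated list such as "1; 2; 3".
first_allowed_device() {
    sed -n '1s/^ *\([^; ]*\).*$/\1/p'
}

# Intended use on the pool master (not executed here):
#   device=$(xe vm-param-get param-name=allowed-VBD-devices uuid=$vmuuid \
#            | first_allowed_device)
#   vbduuid=$(xe vbd-create vdi-uuid=$vdiuuid device=$device vm-uuid=$vmuuid)
```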

It might be necessary to set the bootable flag on the block device.

xe vbd-param-set uuid=$vbduuid bootable=true

Then start the machine and cross fingers

xe vm-start uuid=$vmuuid


Note: there is also a command 'xe vdi-unlock'. It has not been tried here, so it is not known whether it helps in this situation.