Rebooting XCP VMs the hard way

From PDP/Grid Wiki
Revision as of 15:14, 8 July 2014 by Dennisvd@nikhef.nl (rudimentary guide to rebooting VMs)

A machine can become unresponsive: services appear down, and sometimes ping still works but ssh doesn't. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.

This guide strictly discusses the XCP cluster setup.

Step 1: find the VM

Production machines run on the XCP pool 'piet'. Log in to the pool with

ssh root@pool-piet.inst.ipmi.nikhef.nl

The XCP commands all start with 'xe'. On-line help is available by typing

xe help --all

and

xe help <command>

To find the unresponsive VM (called 'laars' in this example), type

xe vm-list name-label=laars.nikhef.nl
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
     name-label ( RW): laars.nikhef.nl
    power-state ( RO): running

Take note of the uuid; some commands require it to identify the VM.
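
To avoid copying the uuid by hand, the lookup can be wrapped in a small shell function. This is only a sketch: the function name vm_uuid is made up for this page, and it relies on the 'params=' and '--minimal' options of xe to print just the value.

```shell
# Print the uuid of the VM with the given name-label.
# Fails if there is not exactly one match.
# (vm_uuid is a hypothetical helper name, not an xe command.)
vm_uuid() {
    name="$1"
    # --minimal prints only the values, comma-separated if there are several
    uuid=$(xe vm-list name-label="$name" params=uuid --minimal) || return 1
    case "$uuid" in
        "")  echo "no VM named $name" >&2; return 1 ;;
        *,*) echo "more than one VM named $name: $uuid" >&2; return 1 ;;
    esac
    echo "$uuid"
}
```

Usage on the pool master would be something like: uuid=$(vm_uuid laars.nikhef.nl)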

Step 2: try the console

Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.

xe console name-label=laars.nikhef.nl

If this does not help, try a shutdown of the machine

xe vm-shutdown name-label=laars.nikhef.nl

followed (later) by a vm-start command.
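
The shutdown and the later start can be combined, with a check in between that the VM really is halted. This is a sketch only: restart_vm is a hypothetical name, and it assumes that xe vm-shutdown returns only once the shutdown attempt has finished.

```shell
# Cleanly shut down a VM and start it again, but only if it really halted.
# (restart_vm is a hypothetical helper name; takes the VM uuid as argument.)
restart_vm() {
    uuid="$1"
    xe vm-shutdown uuid="$uuid" || return 1
    # Check the power-state before starting, in case the shutdown did not take
    state=$(xe vm-param-get uuid="$uuid" param-name=power-state)
    if [ "$state" != "halted" ]; then
        echo "VM is $state, not halted; not starting it" >&2
        return 1
    fi
    xe vm-start uuid="$uuid"
}
```

If the function reports that the VM is not halted, continue with the forced shutdown described in the following steps.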

In some cases the OOM killer has such a stranglehold on the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command, issued on the host where the VM is running.

Step 3: find the host of the VM

List the parameters of the vm (use the UUID here):

xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea

Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.

resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
dom-id ( RO): 20

Find the host with the host-list command

xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW): vms-piet-15.inst.ipmi.nikhef.nl
    name-description ( RW): Default install of XenServer

So in this example vms-piet-15.inst.ipmi.nikhef.nl is where we have to log on.
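
The two lookups in this step can also be done without reading through the full parameter list, using vm-param-get and host-param-get. A minimal sketch, in which vm_location is a hypothetical helper name:

```shell
# Print "<host name-label> <dom-id>" for the VM with the given uuid.
# (vm_location is a hypothetical helper name, not an xe command.)
vm_location() {
    uuid="$1"
    # uuid of the host the VM is currently running on
    host_uuid=$(xe vm-param-get uuid="$uuid" param-name=resident-on) || return 1
    # domain ID of the VM on that host
    dom_id=$(xe vm-param-get uuid="$uuid" param-name=dom-id) || return 1
    # translate the host uuid into a name to ssh to
    host_name=$(xe host-param-get uuid="$host_uuid" param-name=name-label) || return 1
    echo "$host_name $dom_id"
}
```

With the uuid from step 1, vm_location bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea should print the host name and dom-id found by hand above.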

Step 4: kill and resurrect the VM

Ssh to root@vms-piet-15.inst.ipmi.nikhef.nl and find the VM in the list of domains. The number after the name should match the dom-id found in step 3.

xn list

erf.nikhef.nl                                17   8192  2         Running 
tbn08.nikhef.nl                              19   2048  4         Running 
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl0    744   0          Running  
laars.nikhef.nl                              20   8192  2         Running 
bosui.nikhef.nl                              18   2048  2         Running 
gasbel.nikhef.nl                             15   2048  1         Running 
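
As a cross-check against the dom-id from step 3, the number can be picked out of the listing with awk. This is a sketch under assumptions: the columns are taken to be name, domain ID, memory, vCPUs and state, as the listing above suggests, and xn_domid is a hypothetical helper name.

```shell
# Read 'xn list' output on stdin and print the domain ID of the named VM.
# Assumes the name is the first column and the domain ID the second, and
# that the VM name contains no spaces. Exits non-zero if the name is absent.
# (xn_domid is a hypothetical helper name.)
xn_domid() {
    awk -v name="$1" '$1 == name { print $2; found = 1 } END { exit !found }'
}
```

Usage on the host would be: xn list | xn_domid laars.nikhef.nl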

Kill the VM, using its domain ID

xn shutdown 20

To resurrect the VM, start it again with the high-level xe commands. However, sometimes the virtual block device remains locked and the VM won't start. In that case, the virtual disk behind it needs to be removed from the database with the command 'vdi-forget' and then re-attached to the same VM (note the UUIDs first!).

(Sorry, no examples yet.)
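
Until a full example is written, one read-only step can at least be sketched: recording which virtual block devices and disks belong to the VM before anything is forgotten. The helper name vm_disks is hypothetical; it only lists and changes nothing.

```shell
# Print the VBDs of a VM together with the VDI each one points to.
# Read-only: this only lists, it does not forget or detach anything.
# (vm_disks is a hypothetical helper name; takes the VM uuid as argument.)
vm_disks() {
    xe vbd-list vm-uuid="$1" params=uuid,vdi-uuid,device,currently-attached
}
```

Saving the output somewhere, for example with vm_disks bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea > laars-disks.txt, keeps the UUIDs at hand for the re-attach.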