Rebooting XCP VMs the hard way

From PDP/Grid Wiki

Revision as of 14:27, 1 August 2014

A machine can become unresponsive: services appear down, and sometimes ping still works while ssh does not. If the machine in question is a virtual machine, this guide explains where the virtual power switch is and how to toggle it.

This guide covers only the XCP cluster setup.

Step 1: find the VM

Production machines run on the XCP pool 'piet'. Log in to the pool with

ssh root@pool-piet.inst.ipmi.nikhef.nl

The XCP commands all start with 'xe'. On-line help is available by typing

xe help --all

and

xe help <command>

To find the unresponsive VM (called 'laars' in this example), type

xe vm-list name-label=laars.nikhef.nl
uuid ( RO)           : bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
     name-label ( RW): laars.nikhef.nl
    power-state ( RO): running

Take note of the uuid; some commands require it to identify the VM.
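Instead of copying the uuid by hand, it can be captured in a shell variable. The following is a minimal sketch: the function name vm_uuid is made up for this example, and it assumes the standard 'params=' and '--minimal' options of the xe list commands, which restrict the output to the bare value.

```shell
# Hypothetical helper (not part of XCP): print the uuid of the VM
# with the given name-label.
vm_uuid() {
    # params=uuid selects one field; --minimal prints only its value
    xe vm-list name-label="$1" params=uuid --minimal
}

# Usage on the pool master:
#   VM_UUID=$(vm_uuid laars.nikhef.nl)
```

If more than one VM matches, --minimal prints the uuids separated by commas, so it is worth checking the result before using it.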

Step 2: try the console

Machines are configured with a serial console, and sometimes it is possible to log in even when other services fail.

xe console name-label=laars.nikhef.nl

If this does not help, try a shutdown of the machine

xe vm-shutdown name-label=laars.nikhef.nl

followed (later) by a vm-start command.
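The shutdown and start can be combined into one small function. This is only a sketch: restart_vm is a hypothetical name, and it assumes that vm-start accepts the same name-label selector as vm-shutdown does above.

```shell
# Hypothetical helper: clean shutdown of a VM followed by a start.
restart_vm() {
    name="$1"
    xe vm-shutdown name-label="$name" || return 1
    # Start the VM only if it really reached the halted state
    state=$(xe vm-list name-label="$name" params=power-state --minimal)
    if [ "$state" = "halted" ]; then
        xe vm-start name-label="$name"
    else
        echo "VM $name is in state '$state', not starting it" >&2
        return 1
    fi
}

# Usage:
#   restart_vm laars.nikhef.nl
```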

In some cases the OOM killer has such a stranglehold on the system that not even a shutdown comes through. In that case, the only way out is a forced shutdown with a low-level command on the host where the VM is running.

Step 3: find the host of the VM

List the parameters of the vm (use the UUID here):

xe vm-param-list uuid=bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea

Take note of the 'resident-on' value (the uuid of the host) and the dom-id, e.g.

resident-on ( RO): 7e6fe2e5-6d63-4865-8739-b50608a3e37a
dom-id ( RO): 20

Find the host with the host-list command

xe host-list uuid=7e6fe2e5-6d63-4865-8739-b50608a3e37a
uuid ( RO)                : 7e6fe2e5-6d63-4865-8739-b50608a3e37a
          name-label ( RW): vms-piet-15.inst.ipmi.nikhef.nl
    name-description ( RW): Default install of XenServer

So in this example vms-piet-15.inst.ipmi.nikhef.nl is where we have to log on.
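The two lookups in this step can also be scripted. The sketch below uses vm-param-get and host-param-get to read single parameters; the function name vm_location is invented for this example.

```shell
# Hypothetical helper: given the uuid of a VM, print the name of the
# host it runs on, followed by its dom-id.
vm_location() {
    vm_uuid="$1"
    host_uuid=$(xe vm-param-get uuid="$vm_uuid" param-name=resident-on)
    dom_id=$(xe vm-param-get uuid="$vm_uuid" param-name=dom-id)
    host_name=$(xe host-param-get uuid="$host_uuid" param-name=name-label)
    echo "$host_name $dom_id"
}

# Usage:
#   vm_location bc60eb6c-2d1c-4d24-18c5-38c1d525d5ea
```

With the values from this example it would print the host name vms-piet-15.inst.ipmi.nikhef.nl and the dom-id 20.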

Step 4: kill and resurrect the VM

Log in with ssh to root@vms-piet-15.inst.ipmi.nikhef.nl and find the VM in the list of domains

xn list

erf.nikhef.nl                                            17  8192  2  Running
tbn08.nikhef.nl                                          19  2048  4  Running
Control domain on host: vms-piet-15.inst.ipmi.nikhef.nl   0   744  0  Running
laars.nikhef.nl                                          20  8192  2  Running
bosui.nikhef.nl                                          18  2048  2  Running
gasbel.nikhef.nl                                         15  2048  1  Running

Kill the VM's domain, using its dom-id (20 for laars in this example)

/opt/xensource/debug/destroy_domain -domid 20
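Because destroy_domain acts immediately, a small guard against typing the wrong dom-id may be useful. This is a sketch under assumptions: kill_domain and the DESTROY_DOMAIN variable are invented for this example and only wrap the command shown above.

```shell
# Hypothetical wrapper around destroy_domain that rejects anything
# that is not a number, and refuses to touch domain 0.
kill_domain() {
    domid="$1"
    # Path as used in this guide; overridable for a dry run
    destroy="${DESTROY_DOMAIN:-/opt/xensource/debug/destroy_domain}"
    case "$domid" in
        ''|*[!0-9]*)
            echo "dom-id must be a number, got '$domid'" >&2
            return 1 ;;
    esac
    if [ "$domid" -eq 0 ]; then
        echo "refusing to destroy domain 0 (the control domain)" >&2
        return 1
    fi
    "$destroy" -domid "$domid"
}

# Usage:
#   kill_domain 20
# Dry run that only prints the arguments:
#   DESTROY_DOMAIN=echo kill_domain 20
```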

To resurrect it, the VM can be started again with the high-level xe commands. However, sometimes the virtual block device remains locked, and the VM won't start.

Step 5: forget the virtual disk, and find it again

In case the VM won't restart, the VDIs (virtual disk images) that were attached to the machine need to be reset. The command 'vdi-forget' will make XCP forget all about a VDI, including that it was ever associated with a VM! So note the UUID of the VDI and of the storage repository (SR) before forgetting it.

xe vm-disk-list name-label=laars.nikhef.nl

Note the VDI(s), and their SRs.

xe vdi-forget uuid=b93fe358-ec12-420b-b59b-e7de2cfa6dfe

Perform a rescan of the SR to make the VDI available again.

xe sr-scan uuid=<UUID-of-SR>

Re-add the VDI to the VM.

xe vbd-create vdi-uuid=<UUID-of-VDI> device=1 vm-uuid=<UUID-of-VM>

The 'device' value must be one of the allowed VBD device numbers, which can be obtained with

xe vm-param-get param-name=allowed-VBD-devices uuid=<UUID-of-VM>
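The whole sequence of step 5 could be sketched as a single function. Like the rest of this step it has not been tried on a real pool. The name reattach_vdi is made up, and the sketch assumes that the VDI comes back under the same uuid after the rescan and that the VDI's 'sr-uuid' parameter can be read with vdi-param-get.

```shell
# Hypothetical helper, untested on a real pool: forget a VDI, rescan
# its SR, and attach the VDI to the VM again.
reattach_vdi() {
    vm_uuid="$1"
    vdi_uuid="$2"
    device="$3"
    # Record the SR first: after vdi-forget XCP knows nothing about the VDI
    sr_uuid=$(xe vdi-param-get uuid="$vdi_uuid" param-name=sr-uuid) || return 1
    [ -n "$sr_uuid" ] || return 1
    xe vdi-forget uuid="$vdi_uuid" || return 1
    xe sr-scan uuid="$sr_uuid" || return 1
    # Assumption: the rescan rediscovers the VDI under the same uuid
    xe vbd-create vm-uuid="$vm_uuid" vdi-uuid="$vdi_uuid" device="$device"
}

# Usage:
#   reattach_vdi <UUID-of-VM> <UUID-of-VDI> 1
```

The function stops at the first failing command, so a failed vdi-forget does not lead to a rescan or a new VBD.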


Note: everything in step 5 is untested.

Note: there is also a command 'vdi-unlock'; it is not known whether it works.