Testbed Update Plan
Planning the update of the middleware/development testbed
The upgrade has taken place; the remaining information on this page is to be merged into Agile testbed.
=== Migration to a cloud infrastructure ===
Previous testbed cloud experiences are reported here.
Currently, using plain libvirt seems to fit most of our needs.
The machines blade13 and blade14 are set up with Debian squeeze, libvirt and KVM, with a Fiber Channel link to the Compellent storage over which clustered LVM is defined to share access to the pool.
The machines arrone and aulnes are set up with Debian squeeze, libvirt and KVM, with an NFS storage backend on put.testbed (storage.testbed) over which clustered LVM is defined to share access.
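For reference, a storage pool like this could be registered with libvirt along the following lines. This is a minimal sketch: the pool and volume group names are taken from the examples below, and the exact source options depend on the local setup.
# define a libvirt storage pool backed by the clustered volume group and start it
virsh pool-define-as vmachines logical --source-name vmachines --target /dev/vmachines
virsh pool-start vmachines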
It can happen that, on reboot of a machine, you need to start the pool manually and/or run vgscan; it is unclear at the moment why this happens.
virsh pool-start vmachines
error: Failed to start pool vmachines
error: internal error '/sbin/vgchange -ay vmachines' exited with non-zero status 5 and signal 0: Error locking on node arrone.testbed: Volume group for uuid not found: hI7udF9MvGpKkvkympcNG42glpwtwDeqhv7xV9wKHHdv9tDxQ9j8Lhgqem1esMcA
The following sequence of commands can be useful:
vgscan
Reading all physical volumes.  This may take a while...
  Found volume group "vmachines" using metadata type lvm2
virsh pool-start vmachines
Pool vmachines started
virsh pool-list --all
Name                 State      Autostart
-----------------------------------------
default              active     yes
put.testbed          active     yes
vmachines            active     yes
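Since the pool does not always come back by itself after a reboot, it may also be worth checking whether libvirt is set to autostart it. A small sketch, assuming the pool is named vmachines as above:
# mark the pool for automatic start and verify its state
virsh pool-autostart vmachines
virsh pool-info vmachines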
Also check the state of the LVM cluster; logs are in /var/log/cluster/. In the (rare) case the fence daemon has blocked the use of the cluster, the other machine in the cluster must be told that it is ok to release the block. Which machine that is can be found in /etc/cluster/cluster.conf.
fence_ack_manual other-machine
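Before acknowledging the fence it can help to look at the cluster membership as cman sees it; something along these lines (standard cman tools, the output will differ per node):
# show quorum/cluster state and the list of cluster nodes
cman_tool status
cman_tool nodes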
==== Installing clustered LVM for live migration ====
The machines arrone.testbed and aulnes.testbed, running Debian stable, are now equipped with a clustered LVM setup with an iSCSI device as the backend storage. The iSCSI device is served by put.testbed running FreeNAS.
Documentation to set up CLVM is available for both Red Hat and Debian; the two should be comparable.
First, a cluster needs to be defined, and all systems in the cluster need to use the same definition. Put the following in /etc/cluster/cluster.conf:
<cluster name="vmachines" config_version="1">
  <cman expected_votes="1" two_node="1">
  </cman>
  <clusternodes>
    <clusternode name="arrone.testbed" votes="1" nodeid="1">
      <fence>
      </fence>
    </clusternode>
    <clusternode name="aulnes.testbed" votes="1" nodeid="2">
      <fence>
      </fence>
    </clusternode>
  </clusternodes>
  <logging to_syslog="yes" to_logfile="yes" syslog_facility="daemon" syslog_priority="info">
  </logging>
  <fence_daemon post_join_delay="30" />
  <totem rrp_mode="none" secauth="off"/>
</cluster>
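Since both nodes must use the same definition, the file can simply be copied to the other node, for example (hostnames as used above):
# copy the cluster definition to the second node
scp /etc/cluster/cluster.conf aulnes.testbed:/etc/cluster/cluster.conf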
The setting 'two_node' is a special case for two-node clusters, because there is no sensible way to do majority voting with only two votes. If one of the machines fails, the other will block until the failed machine has been fenced (a manual operation in our case), but the cluster can carry on with just one machine if need be.
The machines keep an eye on one another through multicast, and therefore it is important to remove the following line from /etc/hosts:
127.0.1.1 arrone.testbed
which the Debian installation inserted. This entry makes the cluster manager daemon bind to the wrong device for multicasts (the loopback device).
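A quick way to verify that the hostname now resolves to the real interface rather than to 127.0.1.1 (run on each node):
# should return the machine's LAN address, not 127.0.1.1
getent hosts arrone.testbed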
Another snag found on installation is the missing directory /var/run/lvm, which causes the startup script of clvm to fail. Once this is fixed, run
/etc/init.d/cman start
/etc/init.d/clvm start
Finally, the file
/etc/lvm/lvm.conf
needs to be edited to set
locking_type = 3
in order to use clustered locking.
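Putting the above together, the fix on each node amounted to roughly the following. This is a sketch only; editing lvm.conf by hand works just as well as the sed shown here.
# create the missing runtime directory for clvmd
mkdir -p /var/run/lvm
# switch LVM to clustered locking
sed -i 's/^ *locking_type = .*/    locking_type = 3/' /etc/lvm/lvm.conf
# start the cluster manager and the clustered LVM daemon
/etc/init.d/cman start
/etc/init.d/clvm start
# the shared volume group should now be visible
vgs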
Through this shared storage it is possible to do live migration of virtual machines between arrone and aulnes.
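With the storage visible on both nodes, a live migration can be done with virsh; for example (the domain name 'somevm' is just a placeholder):
# move a running VM from arrone to aulnes without shutting it down
virsh migrate --live somevm qemu+ssh://aulnes.testbed/system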
==== Installing Debian on blade 13 and 14 with Fiber Channel ====
This is a quick note to record a recent quirk. Although FC support on Debian works fine, the multipath-tools-boot package is a bit tricky: it updates the initrd to include the multipath libraries and tools, so that multipath is available at boot time.
On blade-13 this meant that, on reboot, the system was unable to mount the root partition (the message was 'device or resource busy') because the device mapper had taken hold of the SCSI disk. By changing the root=UUID=xxxx stanza in the GRUB menu to root=/dev/dm-2 (this was guesswork) I managed to boot the system. There were several possible remedies to resolve the issue:
- rerun update-grub; this should replace the UUID= entry with a link to /dev/mapper/xxxx-part1
- blacklist the disk in the device mapper (and rerun mkinitramfs)
- remove the multipath-tools-boot package altogether.
I opted for blacklisting; this is what's in /etc/multipath.conf:
blacklist {
    wwid 3600508e000000000d6c6de44c0416105
}
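The wwid to blacklist can be read from the disk itself, and the initramfs must be rebuilt afterwards for the blacklist to take effect at boot. A sketch; the device name (/dev/sda here) depends on the machine:
# obtain the wwid of the local system disk
/lib/udev/scsi_id -g -u -d /dev/sda
# rebuild the initramfs so the updated multipath.conf is included
update-initramfs -u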
A bonnie test of a VM with its disk on local storage vs. a VM with its disk on FC:
- lofarwn.testbed has its disk locally
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
lofarwn          4G 20869  35 30172   5 23198   6 45957  85 510971  84 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 16759  99 +++++ +++ +++++ +++ 16810 100 +++++ +++ +++++ +++
lofarwn,4G,20869,35,30172,5,23198,6,45957,85,510971,84,+++++,+++,16,16759,99,+++++,+++,+++++,+++,16810,100,+++++,+++,+++++,+++
- ige-cert.testbed has its disk on LVM via FC.
Version 1.03e       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ige-cert         4G 53384  96 216611 37 102283 24 51060  95 689474  79 +++++ +++
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 12676 100 +++++ +++ +++++ +++ 12761  99 +++++ +++ +++++ +++
ige-cert,4G,53384,96,216611,37,102283,24,51060,95,689474,79,+++++,+++,16,12676,100,+++++,+++,+++++,+++,12761,99,+++++,+++,+++++,+++
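For completeness, a run like the above can be reproduced with something along these lines inside the VM; the exact flags used for the results shown here were not recorded, so these parameters are an assumption:
# 4 GB test size, writing to /tmp, run as root
bonnie++ -d /tmp -s 4096 -u root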