Active/Passive and Active/Active Clusters
Generic active/passive clusters
This document describes the configuration of simple active/passive clusters using the Linux HA stack (corosync, heartbeat, pacemaker).
An active/passive cluster for a single service that must be highly available consists of two or more (virtual) hosts. At any one time, one of the hosts actually runs the service, while the others are in hot standby.
For services that by their nature need a consistent data source you also need data mirroring techniques such as drbd or a shared background datastore.
This document covers clusters for services that do not need a shared background datastore.
If you have n services, you can make them highly available by running them on a cluster of m hosts (m >= 2). A smaller m gives you a shared HA cluster at lower cost. The higher m, the smaller the impact of a failover, but the higher the cost. When m >= n+1 you have at least one node as hot standby.
Before you start
You'll need the following:
- Two (virtual) machines with IP addresses in the same subnet, i.e. sharing the same broadcast domain, and
- a third IP address on the same subnet that will be tied to the service provided by the cluster, and
- DNS A records for all three.
Resource management
Configuring a cluster using corosync and heartbeat requires you to write a start/stop and monitoring script for the service you are building the cluster for.
This script is very much like an "init.d" script, but you can't use an init.d script directly, because heartbeat scripts use tri-state logic instead of two-state logic. That is, heartbeat controlled services are "running", "stopped" or "failed", whereas services controlled by init that fail are simply stopped and must be restarted. Heartbeat uses the third state, "failed", as the trigger to migrate the service to another node in your services pool.
For a simple service consisting of one process, monitoring is easy and adapting an existing init.d script is straightforward. Hint: use the sample
/usr/lib/ocf/resource.d/heartbeat/Dummy
as a starting-point.
The Dummy sample lives in the same directory as do all real resource manager scripts.
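The tri-state logic for a single-process service can be sketched as follows. This is a minimal illustration, not the Dummy agent itself: the function name monitor_pidfile is hypothetical, and the sketch assumes the service writes a pid file. The exit codes 0 (running), 7 (not running) and 1 (generic error, used here for "failed") follow the OCF resource agent convention.

```shell
#!/bin/sh
# OCF exit codes as used by resource agents.
OCF_SUCCESS=0        # service is running
OCF_ERR_GENERIC=1    # service has failed
OCF_NOT_RUNNING=7    # service is cleanly stopped

# monitor_pidfile PIDFILE  (hypothetical helper, not part of any package)
monitor_pidfile() {
    pidfile=$1
    # No pid file: the service was never started or was stopped cleanly.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pid file present and the process answers signal 0: running.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    # Pid file present but no such process: the service died.
    return $OCF_ERR_GENERIC
}
```

An init.d style "status" check would only distinguish the first two cases. The stale pid file case is what lets the cluster decide to migrate the service.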
For services comprised of two or more processes, you'll have to loop over all processes and their pid and lock files to see whether the processes are running and correspond to the lock and pid files. This assumes that all processes are well behaved and store their pid and lock files in the standard locations.
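For a multi-process service, the loop could look like the sketch below. It only inspects pid files (lock files are left out for brevity), and the helper name monitor_all is hypothetical. The service counts as running only if every daemon is alive, as stopped only if no pid file exists at all, and as failed in every other case.

```shell
#!/bin/sh
# monitor_all PIDFILE...  (hypothetical helper)
# Returns the combined state of several daemons:
#   0 = all running, 7 = all stopped, 1 = anything else (failed).
monitor_all() {
    running=0
    stopped=0
    for pidfile in "$@"; do
        if [ ! -f "$pidfile" ]; then
            stopped=$((stopped + 1))      # no pid file: daemon is stopped
        elif kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            running=$((running + 1))      # pid file matches a live process
        fi                                # stale pid file: counted as neither
    done
    [ "$running" -eq "$#" ] && return 0   # every daemon is up
    [ "$stopped" -eq "$#" ] && return 7   # every daemon is cleanly down
    return 1                              # partial or stale state: failed
}
```

For the bdii service this would be called with the slapd and bdii-update pid files as arguments.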
BDII setup on active/passive failover cluster
The bdii services are not entirely well behaved. The pid files for the slapd daemon and bdii-update daemon are not in the same place (/var/run and /var/run/bdii/db, respectively), and the init.d script for the slapd daemon contains lots of cruft that shouldn't be part of an init script to begin with (such as initialization of the database).
Therefore the heartbeat script for the bdii service is a bit of a kludge. It stays as close as possible to the init.d script it is derived from, so that it can be adapted easily as the init.d script evolves.
Click for the BDII heartbeat script.
Install the cluster engine & resource manager
You need to perform the installation on each cluster node.
Add the EPEL repo to /etc/yum.repos.d:
# rpm -Uhv http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
Add the Clusterlabs repo to /etc/yum.repos.d:
# wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo
Now have yum install the cluster engine and resource managers. This will install some 20 to 30 dependencies:
# yum -y install pacemaker
Configure the cluster engine
You need to do this on each cluster node.
First, copy the sample configuration for corosync to the default configuration:
    # cp /etc/corosync/corosync.conf{.example,}
Then, change the "bindnetaddr" to the network that your cluster nodes are in:
# perl -p -i -e 's|(bindnetaddr:).*|$1 194.171.X.0|' /etc/corosync/corosync.conf
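The value for bindnetaddr is the network address, not a host address. If you only know a node's IP address and netmask, the network address can be derived by AND-ing the two octet by octet. The helper below is a sketch; the name network_addr and the example addresses (taken from the documentation ranges) are illustrative only.

```shell
#!/bin/sh
# network_addr ADDRESS NETMASK  (hypothetical helper)
# Prints the IPv4 network address for a dotted address and dotted netmask.
network_addr() {
    old_ifs=$IFS
    IFS=.
    # Split both arguments on dots: $1..$4 = address, $5..$8 = netmask.
    set -- $1 $2
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_addr 192.0.2.77 255.255.255.0    # prints 192.0.2.0
```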
Also, append the following:
    # cat >>/etc/corosync/corosync.conf <<UFO
    aisexec {
            user: root
            group: root
    }
    service {
            name: pacemaker
            ver: 0
    }
    UFO
This tells corosync to run as root and to use the pacemaker resource manager.
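Appending with ">>" is not idempotent: running the step twice leaves duplicate stanzas in the file. A guarded variant might look like this sketch, where the function name append_pacemaker_stanza is hypothetical and the presence of a "name: pacemaker" line is taken as the sign that the step was already done.

```shell
#!/bin/sh
# append_pacemaker_stanza FILE  (hypothetical helper)
# Appends the aisexec and service stanzas only if FILE does not already
# contain a "name: pacemaker" line, so the step can be repeated safely.
append_pacemaker_stanza() {
    conf=$1
    if grep -Eq '^[[:space:]]*name:[[:space:]]*pacemaker' "$conf"; then
        return 0    # already configured, nothing to do
    fi
    cat >>"$conf" <<'EOF'
aisexec {
        user: root
        group: root
}
service {
        name: pacemaker
        ver: 0
}
EOF
}
```

It would be called as: append_pacemaker_stanza /etc/corosync/corosync.conf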
Now start the cluster on one of the nodes:
# /etc/init.d/corosync start
RHCTs and RHCEs may prefer the equivalent:
# service corosync start
Then check for errors by inspecting /var/log/corosync.log.
If there are no errors, start corosync on all other cluster nodes. Once you have checked that all nodes are running correctly, enable corosync to start at boot time on all nodes:
# chkconfig corosync on
STONITH
STONITH is the acronym for Shoot The Other Node In The Head. It is used during a failover/recovery to take the failing node offline (or turn it off). We don't need it at this point, so disable it:
    # crm configure property stonith-enabled=false
    # crm_verify -L
If corosync is running on all nodes, you need to do this on one node only.
The HA service IP address
The IP address of the service we want to offer using the cluster must be separate from the nodes' own addresses, so it can migrate freely from one node to the other without users being aware of it. The service IP address must be in the same subnet as the cluster nodes.
Make sure that the DNS record for the service IP address is a true A record. Once you have the IP address, assign the ClusterIP resource:
     # crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
       params \
           ip=a.b.c.d \
           cidr_netmask=32 \
       op monitor interval=10s
Again, if corosync is running on all nodes, you need to do this on one node only.
Quorum
If your cluster consists of just two nodes, you need to tell the cluster to ignore the lack of quorum. Otherwise a node that is on its own never has quorum, and the cluster will never deem itself online:
# crm configure property no-quorum-policy=ignore
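The arithmetic behind this can be sketched in a few lines. The sketch assumes one vote per node and that quorum means a strict majority of all votes; the function name has_quorum is hypothetical.

```shell
#!/bin/sh
# has_quorum TOTAL ALIVE  (hypothetical helper)
# Succeeds when ALIVE nodes form a strict majority of TOTAL nodes.
has_quorum() {
    total=$1
    alive=$2
    [ $((alive * 2)) -gt "$total" ]
}

# In a two-node cluster the single survivor holds 1 of 2 votes: no majority.
if has_quorum 2 1; then echo "quorum"; else echo "no quorum"; fi    # prints "no quorum"
```

With three nodes, two survivors still hold a majority, which is why the policy only needs to be relaxed for two-node clusters.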
Stickiness
Stickiness determines the service migration cost. Set it to something high to prevent the service from being moved around the cluster for no reason:
# crm configure rsc_defaults resource-stickiness=100
Install BDII
This section describes how to install bdii from rpms using yum. If you want to have quattor install bdii for you, refer to your local quattor guru.
Install the rpms:
# yum install bdii bdii-config-site glite-yaim-bdii
This will pull in all required rpms.
Now, you'll need to get hold of a glite/quattor config file. Assuming you are building the cluster to replace a non-failover site bdii server, you can find it there. Otherwise you'll need to handcraft it, which is beyond the scope of this document. The config file is in:
/etc/siteinfo/lcg-quattor-site-info.def
Configure site bdii
Check where your bdii configuration lives. Glite assumes it is in /opt/bdii, but the rpms have tied it to /etc/bdii. To circumvent problems use "env" to override the glite settings. Now use glite/yaim to configure the site bdii:
    # env BDII_CONF=/etc/bdii/bdii.conf \
         /opt/glite/yaim/bin/yaim -c -s /etc/siteinfo/lcg-quattor-site-info.def -n BDII_site
Tie bdii to the cluster ip address
By default the site bdii slapd daemon will run on the IP address for "hostname -f". But that's not what you want since clients would then have to know all IP addresses that bdii could be running on and try them in succession (and that makes the clustering approach rather pointless).
Instead, edit /etc/sysconfig/bdii, and add the IP address the bdii should be tied to as:
BDII_IP=a.b.c.d
where a.b.c.d is the third IP address mentioned in the section "Before you start". Instead of an IP address you may also use a resolvable hostname.
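If you want to script this edit, a sketch such as the following replaces an existing BDII_IP line or appends one when none exists. The function name set_bdii_ip is hypothetical. The rewrite goes through a temporary file, so check the ownership and permissions of the file afterwards.

```shell
#!/bin/sh
# set_bdii_ip FILE ADDRESS  (hypothetical helper)
# Replaces an existing BDII_IP= line in FILE, or appends one if absent.
set_bdii_ip() {
    file=$1
    addr=$2
    if grep -q '^BDII_IP=' "$file" 2>/dev/null; then
        # Rewrite via a temporary file to stay portable (no sed -i).
        sed "s|^BDII_IP=.*|BDII_IP=$addr|" "$file" >"$file.tmp" &&
            mv "$file.tmp" "$file"
    else
        echo "BDII_IP=$addr" >>"$file"
    fi
}
```

It would be called as: set_bdii_ip /etc/sysconfig/bdii a.b.c.d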
Add site bdii to the cluster
The site bdii is now configured and ready for deployment. So add it to the cluster:
# crm configure primitive bdii ocf:heartbeat:bdii
Wait a few seconds, then check whether the bdii is running:
# crm_mon -1
You should see the resource bdii listed as "Started" on one of the nodes.
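The check can also be automated by filtering the crm_mon output. The sketch below assumes the one-line-per-resource layout that crm_mon prints for plain (non-cloned) resources, with the resource name in the first column and the word "Started" on the same line. The function name resource_started is hypothetical.

```shell
#!/bin/sh
# resource_started NAME  (hypothetical helper)
# Reads "crm_mon -1" output on stdin and succeeds if a line for
# resource NAME reports the state "Started".
resource_started() {
    awk -v name="$1" '
        $1 == name && /Started/ { found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

Usage: crm_mon -1 | resource_started bdii && echo "bdii is up"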
Tie it together
The configuration so far does not guarantee that both resources (bdii and ClusterIP) stay on the same node in the cluster. Since bdii uses the IP address provided by the ClusterIP resource, they must stay on the same node. So tie them together:
# crm configure colocation bdii-on-cluster INFINITY: bdii ClusterIP
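To verify that the constraint has the intended effect, you can compare the nodes reported for both resources. This sketch assumes the non-cloned crm_mon layout, in which a resource line ends in "Started" followed by the node name. The function names node_of and colocated are hypothetical.

```shell
#!/bin/sh
# node_of NAME  (hypothetical helper)
# Reads "crm_mon -1" output on stdin and prints the node on which
# resource NAME is started.
node_of() {
    awk -v name="$1" '$1 == name && $(NF - 1) == "Started" { print $NF }'
}

# colocated STATUS_TEXT RES1 RES2  (hypothetical helper)
# Succeeds if both resources are started on the same node.
colocated() {
    n1=$(printf '%s\n' "$1" | node_of "$2")
    n2=$(printf '%s\n' "$1" | node_of "$3")
    [ -n "$n1" ] && [ "$n1" = "$n2" ]
}
```

Usage: colocated "$(crm_mon -1)" bdii ClusterIP || echo "resources are split"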
Disable site bdii on individual nodes
Now that the site bdii is configured in the cluster manager, you have to make sure that the bdii software will never be started by init since that would create an instance that is not cluster aware.
# chkconfig bdii off
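To confirm that init will no longer start bdii, you can inspect the output of "chkconfig --list bdii", which prints one "runlevel:on" or "runlevel:off" field per runlevel. A small filter, with the hypothetical name init_disabled, could check that no runlevel is still switched on:

```shell
#!/bin/sh
# init_disabled  (hypothetical helper)
# Reads "chkconfig --list SERVICE" output on stdin and succeeds
# if no runlevel is marked "on".
init_disabled() {
    ! grep -q ':on'
}
```

Usage: chkconfig --list bdii | init_disabled || echo "bdii is still enabled in init"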
STONITH revisited
When the bdii service fails (i.e. slapd or the bdii updater fails), the service and the service IP address are moved to another cluster member.
Since the cluster members are in fact VMs, there is no hardware, such as an IPMI capable LOM, that you can use to bring the failed node into a known state.
Therefore we configure STONITH using the suicide driver. The suicide driver reboots the failed node. Once corosync is running again, the node will detect that the shared IP address and the service processes have been moved over to the other cluster node. The failed node will then become the hot standby.
Configuration is fairly simple:
    # crm configure primitive fence-reboot stonith::suicide     
    # crm configure clone fencing fence-reboot 
And enable stonith (remember we disabled it earlier):
# crm configure property stonith-enabled="true"
After a while, "crm status" should show:
Online: [ pakken.nikhef.nl krijgen.nikhef.nl ]
   ClusterIP      (ocf::heartbeat:IPaddr2):       Started pakken.nikhef.nl
   bdii   (ocf::heartbeat:bdii):  Started pakken.nikhef.nl
   Clone Set: fencing
       Started: [ krijgen.nikhef.nl pakken.nikhef.nl ]
The fencing measures (stonith) are present on both cluster nodes.
Note: the suicide driver is a crude "pull the plug" approach to fencing and should normally be avoided. Since this cluster has no shared storage that could be corrupted by pulling the power cord, it is acceptable here.
Proceeding to an active/active cluster
The cluster we created so far is an active/passive one. The service runs on one node, and as soon as it fails the other node takes over. Bear in mind: this setup only works without shared storage (drbd/gfs/nfs) as long as:
- the service only sources information (it does not store data that would have to be kept in sync between nodes), and
- the service is quick in rebuilding the information, so users do not experience a long delay during failover.
As long as the above two prerequisites are met, we can safely convert the cluster to an active/active, i.e., loadbalancing configuration.
To do this, proceed as follows (on one of the cluster-nodes):
- clone the clusterip resource so we have an extra one using the same ipaddress,
- add a clusterip_hash function based on the source IP address, so requests from different clients are handled by different server processes. This step involves editing the ClusterIP resource and, it seems, cannot be scripted.
- clone the bdii resource
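Clone the ClusterIP resource:
# crm configure clone ClusterIP_clone ClusterIP meta globally-unique=true clone-max=2 clone-node-max=2
Now edit ClusterIP:
# crm configure edit ClusterIP
In the editor, add the following after the params keyword, then save and quit (:wq):
clusterip_hash="sourceip"
Now clone bdii:
# crm configure clone bdii_clone bdii
If you enter these commands inside the interactive crm shell (started with "crm configure") instead of as one-shot commands, finish each step with commit.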
Use crm_mon to see two bdii services, both handling requests.
# crm_mon
============
Last updated: Mon May 30 14:27:07 2011
Stack: openais
Current DC: krijgen.nikhef.nl - partition with quorum
Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
2 Nodes configured, 2 expected votes
3 Resources configured.
============
Online: [ pakken.nikhef.nl krijgen.nikhef.nl ]
    Clone Set: fencing
        Started: [ krijgen.nikhef.nl pakken.nikhef.nl ]
    Clone Set: ClusterIP_clone (unique)
        ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started pakken.nikhef.nl
        ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started krijgen.nikhef.nl
    Clone Set: bdii_clone
        Started: [ pakken.nikhef.nl krijgen.nikhef.nl ]
Active/Active Cluster network internals
Clustered servers share the same IP address. To accomplish this they use the iptables module "CLUSTERIP". CLUSTERIP uses a so-called MAC level multicast address on all nodes (an Ethernet address that has the least significant bit of the most significant byte set).
Note that your network devices must support this (some C and J brand routers seem not to like MAC level multicasting).
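Because of the multicast MAC address, every node receives every packet sent to the shared IP address. Each node then hashes the client's source address and only answers if the result falls into a bucket it owns. The sketch below illustrates the idea for the sourceip hash mode. It uses cksum purely as a stand-in hash (the kernel module uses its own hash function), and the function name node_for_client is hypothetical.

```shell
#!/bin/sh
# node_for_client SOURCE_IP NODES  (hypothetical helper)
# Maps a client address to a node number in 1..NODES. cksum (a CRC) is
# only a stand-in; the CLUSTERIP kernel module uses its own hash function.
node_for_client() {
    hash=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
    echo $(( hash % $2 + 1 ))
}

# The same client always lands on the same node; different clients are
# spread over the nodes.
node_for_client 192.0.2.1 2
node_for_client 198.51.100.7 2
```

Since the mapping depends only on the source address, a given client keeps talking to the same node for as long as the set of nodes does not change.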
Scripted setup
To simplify setup, you may use the scripts from this section.
First, run setup_corosync on both clusternodes-to-be:
# ./setup_corosync X.Y.Z.0 clusternumber
X.Y.Z.0 is the network the cluster is in. The clusternumber counts sequentially from 0 (inclusive). It is used to pick the ports for the totem ring protocol, so different clusters can coexist on the same multicast network without interference.
This script is generic; it does not depend on the service being clustered.
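How setup_corosync derives the ports from the cluster number is not shown here. One plausible scheme is sketched below, purely as an assumption: it starts at corosync's default mcastport of 5405 and reserves two ports per cluster, because corosync uses both mcastport and mcastport - 1. The function name totem_port is hypothetical.

```shell
#!/bin/sh
# totem_port CLUSTERNUMBER  (hypothetical; not taken from setup_corosync)
# Assumption: base port 5405, two ports reserved per cluster.
totem_port() {
    echo $(( 5405 + 2 * $1 ))
}

totem_port 0    # prints 5405
totem_port 3    # prints 5411
```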
Then, run setup_heartbeat_bdii:
# ./setup_heartbeat_bdii X.Y.Z.N
This installs the bdii startup and monitor script for use by heartbeat, and configures bdii to use the shared IP address instead of the local host address. This script depends on bdii and should be adapted for each service you want to cluster.
Then run the setup_crm script on one (any) of the nodes:
    # ./setup_crm bdii X.Y.Z.N {activeactive|activepassive}
The first argument is the service to be configured, and X.Y.Z.N is the shared ip address where the service is located. The third argument sets the cluster mode. This script is generic.
The scripts are available here:  setup_corosync; setup_heartbeat_bdii; setup_crm.
You also need the BDII heartbeat script and it must be in the same directory.
