Active/Passive and Active/Active Clusters

Revision as of 14:10, 30 May 2011

Generic active/passive clusters

This document describes the configuration of simple active/passive clusters using the Linux HA stack (corosync, heartbeat, pacemaker).

An active/passive cluster for a single service that we want to be highly available consists of two or more (virtual) hosts. At any one time, one of the hosts actually runs the service, while the others are in hot standby.

For services that by their nature need a consistent data source you also need data mirroring techniques such as drbd or a shared background datastore.

This document covers clusters for services that do not need a shared background datastore.

If you have n services, you can make them HA by running them on a cluster of m hosts (m >= 2). With a smaller m, you create shared HA clusters (lower cost). The higher m is, the smaller the impact of a failover, but the higher the cost. When m >= n+1 you have at least one node as hot standby.

Before you start

You'll need the following:

Two (virtual) machines with IP addresses in the same subnet, i.e. sharing the same broadcast domain, and
a third IP address on the same subnet that will be tied to the service provided by the cluster, and
DNS A records for all three.

Resource management

Configuring a cluster using corosync and heartbeat requires you to write a start/stop and monitoring script for the service you are building the cluster for.

This script is very much like an "init.d" script, but you can't use an init.d script directly, because heartbeat scripts use tri-state logic instead of two-state logic. That is, heartbeat-controlled services are "running", "stopped" or "failed", whereas services controlled by init that fail are simply stopped and must be restarted. Heartbeat uses the third state, "failed", as the trigger to migrate the service to another node in your services pool.
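
As a minimal sketch of the tri-state idea, the function below maps one daemon's pid file onto three exit codes. It assumes the OCF resource agent convention (0 for running, 7 for cleanly stopped, 1 for a generic error); `service_state` and the pid file path in the usage comment are hypothetical names, not part of heartbeat.

```shell
# Exit codes following the OCF resource agent convention (assumed here).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# service_state PIDFILE: report the tri-state status of one daemon.
service_state() {
    pidfile=$1
    # No pid file: treat the service as cleanly stopped.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pid file present and the process accepts signal 0: running.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    # Pid file present but no such process: the service has failed.
    return $OCF_ERR_GENERIC
}

# Hypothetical usage inside a monitor action:
#   service_state /var/run/myservice.pid
```

Note that `kill -0` only works reliably for processes the caller may signal, so the sketch assumes the script runs as root.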

For a simple service consisting of one process, monitoring is easy and adapting an existing init.d script is straightforward. Hint: use the sample

    /usr/lib/ocf/resource.d/heartbeat/Dummy

as a starting-point.

The Dummy sample lives in the same directory as do all real resource manager scripts.

For services consisting of two or more processes, you'll have to loop over all processes and their pid and lock files to see whether the processes are running and correspond to the lock and pid files. This assumes that all processes are well behaved and store their pid and lock files in the standard locations.
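
The loop over several processes could look roughly like the sketch below. `all_running` is a hypothetical helper, and the paths in the usage comment are placeholders; it assumes one lock file for the service and one pid file per daemon.

```shell
# all_running LOCKFILE PIDFILE...: succeed only if the lock file exists
# and every pid file names a live process.
all_running() {
    lockfile=$1
    shift
    [ -f "$lockfile" ] || return 1
    for pidfile in "$@"; do
        # A missing pid file means this daemon is not running.
        [ -f "$pidfile" ] || return 1
        pid=$(cat "$pidfile")
        # An empty pid file or a dead process counts as not running.
        [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null || return 1
    done
    return 0
}

# Hypothetical usage:
#   all_running /var/lock/subsys/myservice /var/run/a.pid /var/run/b.pid
```

A real resource script would additionally distinguish "stopped" from "failed", for instance by treating "no lock file and no pid files" as stopped and any partial state as failed.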

BDII setup on active/passive failover cluster

The bdii services are not entirely well behaved. The pid files for the slapd daemon and bdii-update daemon are not in the same place (/var/run and /var/run/bdii/db, respectively), and the init.d script for the slapd daemon contains lots of cruft that shouldn't be part of an init script to begin with (such as initialization of the database).

Therefore the heartbeat script for the bdii service is a bit of a kludge. It remains as close as possible to the init.d script it is derived from, so that adaptations stay doable as the init.d script evolves.

Click for the BDII heartbeat script.

Install the cluster engine & resource manager

You need to perform the installation on each cluster node.

Add the EPEL repo to /etc/yum.repos.d:

    # rpm -Uhv http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

Add the Clusterlabs repo to /etc/yum.repos.d:

    # wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo

Now have yum install the cluster engine and resource managers. This will install some 20 to 30 dependencies:

    # yum -y install pacemaker

Configure the cluster engine

You need to do this on each cluster node.

First, copy the sample configuration for corosync to the default configuration:

    # cp /etc/corosync/corosync.conf{.example,}

Then, change the "bindnetaddr" to the network that your cluster nodes are in:

    # perl -p -i -e 's|(bindnetaddr:).*|$1  194.171.X.0|' /etc/corosync/corosync.conf
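
The value of bindnetaddr is the network address, not a host address. If you only know a node's IP address and prefix length, the network address can be derived with a little shell arithmetic. This is a sketch only: `network_addr` is a hypothetical helper, the addresses in the usage comment come from the documentation range, and it assumes a shell with 64-bit arithmetic.

```shell
# network_addr IP PREFIX: print the IPv4 network address for IP/PREFIX.
# The function body runs in a subshell so the IFS change stays local.
network_addr() (
    prefix=$2
    IFS=.
    set -- $1                     # split the dotted quad into $1..$4
    addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    net=$(( addr & mask ))
    printf '%d.%d.%d.%d\n' \
        $(( (net >> 24) & 255 )) $(( (net >> 16) & 255 )) \
        $(( (net >> 8) & 255 ))  $(( net & 255 ))
)

# Example: network_addr 192.0.2.77 24
```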

Also, append the following:

    # cat >>/etc/corosync/corosync.conf <<UFO
    aisexec {
            user: root
            group: root
    }
    service {
            name: pacemaker
            ver: 0
    }
    UFO

This tells corosync to run as root and to use the pacemaker resource manager.
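
Running the append twice would duplicate the stanzas. A guarded variant, sketched below, only appends when no pacemaker service entry is present yet. `add_pacemaker_service` is a hypothetical helper name; the stanza contents are the ones shown above.

```shell
# add_pacemaker_service CONF: append the aisexec and service stanzas
# to CONF unless a pacemaker service entry already exists.
add_pacemaker_service() {
    conf=$1
    if grep -q 'name:[[:space:]]*pacemaker' "$conf"; then
        return 0                  # already configured, nothing to do
    fi
    cat >>"$conf" <<'EOF'
aisexec {
        user: root
        group: root
}
service {
        name: pacemaker
        ver: 0
}
EOF
}

# Example: add_pacemaker_service /etc/corosync/corosync.conf
```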

Now start the cluster on one of the nodes:

   # /etc/init.d/corosync start

For RHC[TE]s, the equivalent is:

   # service corosync start

Then check for errors by inspecting /var/log/corosync.log.
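
A quick way to scan the log is to filter it for suspicious lines. The sketch below assumes that problem lines contain the word "error" or "crit" (in any case), which may not match every corosync version; `log_problems` is a hypothetical helper.

```shell
# log_problems LOGFILE: print lines that look like errors.
# Exit status is 0 if any such line was found, 1 if none.
log_problems() {
    grep -iE 'error|crit' "$1"
}

# Example:
#   log_problems /var/log/corosync.log || echo "no errors found"
```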

If there are no errors, start corosync on all other cluster nodes. Once you have checked that all nodes are running OK, enable corosync to start at boot time on all nodes:

   # chkconfig corosync on

STONITH

STONITH is the acronym for Shoot The Other Node In The Head. It is used during a failover/recovery to take the failing node offline (or to turn it off). At this moment we don't need it, so disable it:

    # crm configure property stonith-enabled=false
    # crm_verify -L

If corosync is running on all nodes, you need to do this on one node only.

The HA service IP address

The IP address of the service we want to offer using the cluster must be separate from the nodes' own addresses, so that it can migrate freely from one node to the other without users being aware of it. The service IP address must be in the same subnet as the cluster nodes.

Make sure that the DNS record for the service IP address is a true A record. Once you have the IP address, assign the ClusterIP resource:

     # crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
       params \
           ip=a.b.c.d \
           cidr_netmask=32 \
       op monitor interval=10s

Again, if corosync is running on all nodes, you need to do this on one node only.

Quorum

If your cluster consists of just two nodes, you need to disable quorum. Otherwise the cluster will never deem itself online:

    # crm configure property no-quorum-policy=ignore

Stickiness

Stickiness determines the service migration cost. Set it to something high to prevent the service from being moved around the cluster for no reason:

   # crm configure rsc_defaults resource-stickiness=100

Install BDII

This paragraph describes how to install bdii from rpms using yum. If you want to have quattor install bdii for you, refer to your local quattor guru.

Install the rpms:

    # yum install bdii bdii-config-site glite-yaim-bdii

This will pull in all required rpms.

Now, you'll need to get hold of a glite/quattor config file. Assuming you are building the cluster to replace a non-failover site bdii server, you can find it there. Otherwise you'll need to handcraft it, which is beyond the scope of this document. The config file is in:

    /etc/siteinfo/lcg-quattor-site-info.def

Configure site bdii

Check where your bdii configuration lives. Glite assumes it is in /opt/bdii, but the rpms have tied it to /etc/bdii. To circumvent problems use "env" to override the glite settings. Now use glite/yaim to configure the site bdii:

    # env BDII_CONF=/etc/bdii/bdii.conf \
         /opt/glite/yaim/bin/yaim -c -s /etc/siteinfo/lcg-quattor-site-info.def -n BDII_site

Tie bdii to the cluster ip address

By default the site bdii slapd daemon will listen on the IP address that "hostname -f" resolves to. That's not what you want, since clients would then have to know all IP addresses that bdii could be running on and try them in succession, which makes the clustering approach rather pointless.

Instead, edit /etc/sysconfig/bdii, and add the IP address the bdii should be tied to as:

    BDII_IP=a.b.c.d

where a.b.c.d is the third IP address mentioned in the paragraph "Before you start". Instead of an IP address you may also use a resolvable hostname.
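
Since the same setting has to be made on every node, it can be convenient to script it. The sketch below replaces an existing BDII_IP line or appends one if there is none. `set_bdii_ip` is a hypothetical helper, and the address in the usage comment is a placeholder from the documentation range.

```shell
# set_bdii_ip FILE IP: set BDII_IP in FILE, replacing any existing value.
set_bdii_ip() {
    file=$1
    ip=$2
    if grep -q '^BDII_IP=' "$file" 2>/dev/null; then
        # Rewrite through a temp file, then copy back so that the
        # original file keeps its ownership and permissions.
        tmp=$(mktemp) || return 1
        sed "s|^BDII_IP=.*|BDII_IP=$ip|" "$file" >"$tmp" && cat "$tmp" >"$file"
        rm -f "$tmp"
    else
        echo "BDII_IP=$ip" >>"$file"
    fi
}

# Example: set_bdii_ip /etc/sysconfig/bdii 192.0.2.10
```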

Add site bdii to the cluster

The site bdii is now configured and ready for deployment. So add it to the cluster:

    # crm configure primitive bdii ocf:heartbeat:bdii

Wait a few seconds, then check whether the bdii is running:

    # crm_mon -1

You should see the resource bdii listed as started.
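
Instead of checking by eye, the output of crm_mon can be filtered in a script. The sketch below assumes the output format shown in this document, with the resource name as the first field of a line that contains "Started"; `resource_started` is a hypothetical helper.

```shell
# resource_started NAME: read "crm_mon -1" output on stdin and succeed
# if the resource NAME is reported as Started.
resource_started() {
    awk -v name="$1" '
        $1 == name && /Started/ { found = 1 }
        END { exit !found }
    '
}

# Example: poll for up to 30 seconds until bdii is started.
#   for i in 1 2 3 4 5 6; do
#       crm_mon -1 | resource_started bdii && break
#       sleep 5
#   done
```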

Tie it together

The configuration so far does not guarantee that both resources (bdii and IPaddr2) stay on the same node in the cluster. Since bdii uses the IP address from the IPaddr2 resource, they must stay on the same node. So tie them together:

    # crm configure colocation bdii-on-cluster INFINITY: bdii ClusterIP

Disable site bdii on individual nodes

Now that the site bdii is configured in the cluster manager, you have to make sure that the bdii software will never be started by init since that would create an instance that is not cluster aware.

    # chkconfig bdii off

STONITH revisited

When the bdii service fails (i.e. slapd or the bdii updater fails), the service and the service IP address are moved to another cluster member.

Since the cluster members are in fact VMs, there is no hardware, such as an IPMI-capable LOM, that you can use to bring the failed node into a known state.

Therefore we configure STONITH using the suicide driver. The suicide driver reboots the failed node. Once corosync is running again, the node detects that the shared IP address and the service processes have been moved to the other cluster node. The failed node then becomes the hot standby.

Configuration is fairly simple:

    # crm configure primitive fence-reboot stonith:suicide
    # crm configure clone fencing fence-reboot

And enable stonith (remember we disabled it earlier):

    # crm configure property stonith-enabled="true"

After a while, "crm status" should show:

  Online: [ pakken.nikhef.nl krijgen.nikhef.nl ]

   ClusterIP      (ocf::heartbeat:IPaddr2):       Started pakken.nikhef.nl
   bdii   (ocf::heartbeat:bdii):  Started pakken.nikhef.nl
   Clone Set: fencing
       Started: [ krijgen.nikhef.nl pakken.nikhef.nl ]

The fencing measures (stonith) are present on both cluster nodes.

Note: the suicide driver is a crude "pull the plug" approach to fencing and should generally be avoided. It is acceptable here because the cluster has no shared storage that could be corrupted by pulling the power cord.

Further reading

"Clusters from Scratch" by Andrew Beekhof: http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html/Clusters_from_Scratch/