Active/Passive and Active/Active Clusters

Revision as of 16:20, 31 May 2011

Generic active/passive clusters

This document describes the configuration of simple active/passive clusters using the Linux HA stack (corosync, heartbeat, pacemaker).

An active/passive cluster for a single service we want to be highly available consists of two or more (virtual) hosts. At any one time, one of the hosts actually runs the service, while the others are in hot standby.

For services that by their nature need a consistent data source, you also need data mirroring techniques such as DRBD or a shared background datastore.

This document covers clusters for services that do not need a shared background datastore.

If you have n services, you can make them highly available by running them on a cluster of m hosts (m >= 2). A smaller m gives you a shared HA cluster at lower cost. The higher m, the smaller the impact of a failover, but the higher the cost. When m >= n+1 you have at least one node as hot standby.

Before you start

You'll need the following:

Two (virtual) machines with IP addresses in the same subnet, i.e. sharing the same broadcast domain, and
a third IP address on the same subnet that will be tied to the service provided by the cluster, and
DNS A records for all three.

Resource management

Configuring a cluster using corosync and heartbeat requires you to write a start/stop and monitoring script for the service you are building the cluster for.

This script is very much like an "init.d" script, but you can't use an init.d script directly, because heartbeat scripts use tri-state logic instead of two-state logic. That is, heartbeat-controlled services are "running", "stopped" or "failed", whereas services controlled by init that fail are simply stopped and must be restarted. Heartbeat uses the third state, "failed", as the trigger to migrate the service to another node in your services pool.

For a simple service consisting of one process, monitoring is easy and adapting an existing init.d script is straightforward. Hint: use the sample

    /usr/lib/ocf/resource.d/heartbeat/Dummy

as a starting-point.

The Dummy sample lives in the same directory as all real resource manager scripts.

For services consisting of two or more processes, you'll have to loop over all processes and their pid and lock files to see whether the processes are running and correspond to the lock and pid files. This assumes that all processes are well behaved and store their pid and lock files in the standard locations.
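
A minimal sketch of such a monitoring loop, in shell, might look as follows. It maps the three states onto the standard OCF exit codes (0 for running, 7 for cleanly stopped, 1 for a generic failure). The helper names pidfile_status and service_status are made up for this illustration, and lock files are left out for brevity.

```shell
# Exit codes as defined by the OCF resource agent API.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Status of one daemon, judged by its pid file (hypothetical helper).
pidfile_status() {
    pidfile=$1
    # No pid file: the daemon was stopped cleanly.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pid file present and the process answers signal 0: running.
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    # Pid file present but no such process: the daemon has failed.
    return $OCF_ERR_GENERIC
}

# Status of a service made up of several daemons (hypothetical helper).
# All running -> running, all stopped -> stopped, anything else -> failed.
service_status() {
    running=0
    stopped=0
    for f in "$@"; do
        rc=0
        pidfile_status "$f" || rc=$?
        case $rc in
            "$OCF_SUCCESS")     running=$((running + 1)) ;;
            "$OCF_NOT_RUNNING") stopped=$((stopped + 1)) ;;
            *)                  return $OCF_ERR_GENERIC ;;
        esac
    done
    [ "$stopped" -eq 0 ] && return $OCF_SUCCESS
    [ "$running" -eq 0 ] && return $OCF_NOT_RUNNING
    return $OCF_ERR_GENERIC
}
```

A monitor action would call service_status with the pid files of all daemons belonging to the service and pass the result on as its own exit code.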

BDII setup on active/passive failover cluster

The bdii services are not entirely well behaved. The pid files for the slapd daemon and bdii-update daemon are not in the same place (/var/run and /var/run/bdii/db, respectively), and the init.d script for the slapd daemon contains lots of cruft that shouldn't be part of an init script to begin with (such as initialization of the database).

Therefore the heartbeat script for the bdii service is a bit of a kludge, remaining as close as possible to the init.d script it is derived from (so as to make adaptations doable as the init.d script evolves).

Click for the BDII heartbeat script.

Install the cluster engine & resource manager

You need to perform the installation on each cluster node.

Add the EPEL repo to /etc/yum.repos.d:

    # rpm -Uhv http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm

Add the Clusterlabs repo to /etc/yum.repos.d:

    # wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo

Now have yum install the cluster engine and resource managers. This will install some 20 to 30 dependencies:

    # yum -y install pacemaker

Configure the cluster engine

You need to do this on each cluster node.

First, copy the sample configuration for corosync to the default configuration:

    # cp /etc/corosync/corosync.conf{.example,}

Then change "bindnetaddr" to the address of the network that your cluster nodes are in:

    # perl -p -i -e 's|(bindnetaddr:).*|\1  194.171.X.0|' /etc/corosync/corosync.conf
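
If you are unsure which value to use for bindnetaddr, the network address can be derived by AND-ing a node's IP address with its netmask. The following shell sketch does this; the function name network_addr and the example addresses (taken from the 192.0.2.0/24 documentation range) are illustrative only.

```shell
# Print the network address for an IPv4 address and a dotted netmask.
# The function body runs in a subshell, so changing IFS stays local.
network_addr() (
    IFS=.
    # Split both arguments on dots: $1..$4 are the address octets,
    # $5..$8 the netmask octets.
    set -- $1 $2
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# Example:
#   network_addr 192.0.2.130 255.255.255.128    prints 192.0.2.128
```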

Also, append the following:

    # cat >>/etc/corosync/corosync.conf <<UFO
    aisexec {
            user: root
            group: root
    }
    service {
            name: pacemaker
            ver: 0
    }
    UFO

This tells corosync to run as root and to use the pacemaker resource manager.
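
Note that the here-document above appends the stanzas every time it is run. If the step may be repeated, for instance from a setup script, a guard along the following lines avoids duplicates. This is only a sketch: the function name add_pacemaker_service is an assumption, and the check simply looks for an existing "name: pacemaker" line.

```shell
# Append the aisexec and pacemaker service stanzas to a corosync
# configuration file, unless a "name: pacemaker" line is already there.
add_pacemaker_service() {
    conf=$1
    if grep -q '^[[:space:]]*name:[[:space:]]*pacemaker' "$conf"; then
        return 0    # already configured, nothing to do
    fi
    printf 'aisexec {\n        user: root\n        group: root\n}\n' >> "$conf"
    printf 'service {\n        name: pacemaker\n        ver: 0\n}\n' >> "$conf"
}

# Example:
#   add_pacemaker_service /etc/corosync/corosync.conf
```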

Now start the cluster on one of the nodes:

    # /etc/init.d/corosync start

For RHC[TE]s, the equivalent is:

    # service corosync start

Then check for errors by inspecting /var/log/corosync.log.

If there are no errors, start corosync on all other cluster nodes. Once you have checked that all nodes are running OK, you can enable corosync to start at boot time on all nodes:

    # chkconfig corosync on

STONITH

STONITH is the acronym for Shoot The Other Node In The Head. It is used during a failover/recovery to take the failing node offline (or turn it off). At this moment we don't need it, so disable it:

    # crm configure property stonith-enabled=false
    # crm_verify -L

If corosync is running on all nodes, you need to do this on one node only.

The HA service IP address

The IP address of the service we want to offer using the cluster must be separate from the nodes' own addresses, so that the service can migrate freely from one node to another without users being aware of it. The service IP address must be in the same subnet as the cluster nodes.

Make sure that the DNS record for the service IP address is a true A record. Once you have the IP address, define the ClusterIP resource:

     # crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
       params \
           ip=a.b.c.d \
           cidr_netmask=32 \
       op monitor interval=10s

Again, if corosync is running on all nodes, you need to do this on one node only.

Quorum

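If your cluster consists of just two nodes, you need to tell pacemaker to ignore loss of quorum. A two-node cluster only has quorum while both nodes are online, so otherwise a surviving node would stop all resources when its peer fails instead of taking over: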

    # crm configure property no-quorum-policy=ignore
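
The quorum rule behind this is simple arithmetic: a partition has quorum only when it holds more than half of all votes. A sketch in shell, using a hypothetical helper has_quorum and assuming one vote per node:

```shell
# Succeed if ONLINE nodes out of TOTAL nodes form a quorum,
# i.e. hold strictly more than half of all votes (one vote per node).
has_quorum() {
    total=$1
    online=$2
    [ $(( online * 2 )) -gt "$total" ]
}

# Examples:
#   has_quorum 2 1    fails: one of two nodes is exactly half, not more
#   has_quorum 3 2    succeeds: two of three nodes is a majority
```

With two nodes, losing one always leaves the survivor without a majority, which is why the policy has to be relaxed there.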

Stickiness

Stickiness determines the service migration cost. Set it to something high to prevent the service from being moved around the cluster for no reason:

   # crm configure rsc_defaults resource-stickiness=100

Install BDII

This section describes how to install bdii from rpms using yum. If you want to have quattor install bdii for you, refer to your local quattor guru.

Install the rpms:

    # yum install bdii bdii-config-site glite-yaim-bdii

This will pull in all required rpms.

Now, you'll need to get hold of a glite/quattor config file. Assuming you are building the cluster to replace a non-failover site bdii server, you can find it there. Otherwise you'll need to handcraft it, which is beyond the scope of this document. The config file is in:

    /etc/siteinfo/lcg-quattor-site-info.def

Configure site bdii

Check where your bdii configuration lives. Glite assumes it is in /opt/bdii, but the rpms put it in /etc/bdii. To avoid problems, use "env" to override the glite setting. Now use glite/yaim to configure the site bdii:

    # env BDII_CONF=/etc/bdii/bdii.conf \
         /opt/glite/yaim/bin/yaim -c -s /etc/siteinfo/lcg-quattor-site-info.def -n BDII_site

Tie bdii to the cluster ip address

By default the site bdii slapd daemon will run on the IP address for "hostname -f". But that's not what you want since clients would then have to know all IP addresses that bdii could be running on and try them in succession (and that makes the clustering approach rather pointless).

Instead, edit /etc/sysconfig/bdii, and add the IP address the bdii should be tied to as:

    BDII_IP=a.b.c.d

where a.b.c.d is the third IP address mentioned in the section "Before you start". Instead of an IP address you may also use a resolvable hostname.
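
To make this edit repeatable, for instance from a setup script, a small shell helper can replace an existing BDII_IP line or append a new one. This is a sketch under the assumption that the file holds plain KEY=value lines; the function name set_sysconfig_var is made up and not part of bdii.

```shell
# Set KEY=VALUE in a sysconfig-style file, replacing any earlier
# assignment of the same key and leaving all other lines untouched.
set_sysconfig_var() {
    file=$1
    key=$2
    value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Copy everything except old assignments of this key.
    # grep exits non-zero when nothing is left over; that is fine here.
    grep -v "^${key}=" "$file" > "$tmp" || true
    printf '%s=%s\n' "$key" "$value" >> "$tmp"
    # Write back through cat so the original file keeps its permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example:
#   set_sysconfig_var /etc/sysconfig/bdii BDII_IP a.b.c.d
```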

Add site bdii to the cluster

The site bdii is now configured and ready for deployment. So add it to the cluster:

    # crm configure primitive bdii ocf:heartbeat:bdii

Wait a few seconds, then check whether the bdii is running:

    # crm_mon -1

You should see the bdii resource listed as started.
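
Instead of reading the output by eye, a script can extract the node a resource runs on. The sketch below assumes that crm_mon -1 prints one line per resource consisting of the resource name, the agent, the word "Started" and the node name; other pacemaker versions may lay out their output differently. The function name resource_node is illustrative.

```shell
# Read crm_mon -1 output on stdin and print the node on which the
# given resource is reported as Started (prints nothing otherwise).
resource_node() {
    awk -v rsc="$1" \
        'NF >= 3 && $1 == rsc && $(NF-1) == "Started" { print $NF }'
}

# Example:
#   crm_mon -1 | resource_node bdii
```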

Tie it together

The configuration so far does not guarantee that both resources (bdii and IPaddr2) stay on the same node in the cluster. Since bdii uses the IP address from the IPaddr2 resource, they must stay on the same node. So tie them together:

    # crm configure colocation bdii-on-cluster INFINITY: bdii ClusterIP

Disable site bdii on individual nodes

Now that the site bdii is configured in the cluster manager, you have to make sure that the bdii software will never be started by init, since that would create an instance that is not cluster aware. On each node, run:

    # chkconfig bdii off

STONITH revisited

When the bdii service fails (i.e. slapd or the bdii updater fails), the service and the service IP address are moved to another cluster member.

Since the cluster members are in fact VMs, there is no hardware, such as an IPMI-capable LOM, that you can use to bring the failed node into a known state.

Therefore we configure STONITH using the suicide driver. The suicide driver reboots the failed node, and once corosync is running again it will detect that the shared IP address and service processes have been moved over to the other cluster node. The failed node will then become the hot standby.

Configuration is fairly simple:

    # crm configure primitive fence-reboot stonith:suicide
    # crm configure clone fencing fence-reboot

And enable stonith (remember we disabled it earlier):

    # crm configure property stonith-enabled="true"

After a while, "crm status" should show:

  Online: [ pakken.nikhef.nl krijgen.nikhef.nl ]

   ClusterIP      (ocf::heartbeat:IPaddr2):       Started pakken.nikhef.nl
   bdii   (ocf::heartbeat:bdii):  Started pakken.nikhef.nl
   Clone Set: fencing
       Started: [ krijgen.nikhef.nl pakken.nikhef.nl ]

The fencing measures (stonith) are present on both cluster nodes.

Note: the suicide driver is a crude "pull the plug" approach to fencing and should generally be avoided. Since our cluster has no shared storage that could be corrupted by pulling the power cord, it is acceptable here.

Proceeding to an active/active cluster

The cluster we created so far is an active/passive one. The service runs on one node, and as soon as it fails the other node takes over. Bear in mind: this setup only works without shared storage (DRBD/GFS/NFS) as long as:

the service only sources information (it does not sink it, i.e. store it), and
the service is quick in rebuilding the information, so users do not experience a long delay during failover.

As long as the above two prerequisites are met, we can safely convert the cluster to an active/active, i.e. load-balancing, configuration.

To do this, proceed as follows (on one of the cluster-nodes):

clone the ClusterIP resource so we have an extra one using the same IP address,
add a clusterip_hash function based on the source IP address, so requests from different clients are handled by different server processes. This step involves editing the ClusterIP resource and cannot be scripted, it seems.
clone the bdii resource
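
To give a feel for what hashing on the source IP address achieves, the shell sketch below assigns each client address to one of the clone instances. The hash used here (the sum of the octets modulo the number of instances) is a deliberately simple stand-in chosen for this example; it is not the hash the CLUSTERIP module actually uses. The function name toy_bucket is likewise made up.

```shell
# Toy illustration of source-IP hashing: print the index (0-based) of
# the instance that would handle a client with the given IPv4 address.
# NOT the real CLUSTERIP hash, just a stand-in to show the principle.
toy_bucket() (
    instances=$2
    IFS=.
    # Split the address on dots into $1..$4.
    set -- $1
    echo "$(( ($1 + $2 + $3 + $4) % instances ))"
)

# Examples with two instances:
#   toy_bucket 192.0.2.10 2    prints 0
#   toy_bucket 192.0.2.11 2    prints 1
```

The point is that every node can compute the same answer independently, so each incoming request is picked up by exactly one instance, and a given client always ends up at the same instance.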

Clone the ClusterIP resource:

   # crm configure
   clone ClusterIP_clone ClusterIP meta globally-unique=true clone-max=2 clone-node-max=2
   commit

Now edit ClusterIP:

   # crm configure
   edit ClusterIP
   (in the editor, add after the params keyword:  clusterip_hash="sourceip")
   :wq
   commit

Now clone bdii:

   # crm configure
   clone bdii_clone bdii
   commit

Use crm_mon to verify that two bdii instances are running, both handling requests:

   # crm_mon
   ============
   Last updated: Mon May 30 14:27:07 2011
   Stack: openais
   Current DC: krijgen.nikhef.nl - partition with quorum
   Version: 1.0.10-da7075976b5ff0bee71074385f8fd02f296ec8a3
   2 Nodes configured, 2 expected votes
   3 Resources configured.
   ============

   Online: [ pakken.nikhef.nl krijgen.nikhef.nl ]

    Clone Set: fencing
        Started: [ krijgen.nikhef.nl pakken.nikhef.nl ]
    Clone Set: ClusterIP_clone (unique)
        ClusterIP:0        (ocf::heartbeat:IPaddr2):       Started pakken.nikhef.nl
        ClusterIP:1        (ocf::heartbeat:IPaddr2):       Started krijgen.nikhef.nl
    Clone Set: bdii_clone
        Started: [ pakken.nikhef.nl krijgen.nikhef.nl ]

Active/Active Cluster network internals

Clustered servers share the same IP address. To accomplish this they use the iptables module "CLUSTERIP". CLUSTERIP uses a so-called MAC-level multicast address on all nodes (an Ethernet address that has the least significant bit of the most significant byte set).

Note that your network devices must support this (some C and J brand routers seem not to like MAC-level multicasting).
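
Whether a given Ethernet address is a MAC-level multicast address can be checked by testing that bit in the first octet. A small shell sketch, with an illustrative function name and made-up example addresses:

```shell
# Succeed if the colon-separated MAC address is a multicast address,
# i.e. the least significant bit of its first octet is set.
is_multicast_mac() {
    first_octet=${1%%:*}
    [ $(( 0x$first_octet & 1 )) -eq 1 ]
}

# Examples:
#   is_multicast_mac 01:00:5e:7f:00:01    succeeds (first octet 0x01)
#   is_multicast_mac 00:16:3e:12:34:56    fails    (first octet 0x00)
```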

Scripted setup

To simplify setup, you may use the scripts from this section.

First, run setup_corosync_bdii on both clusternodes-to-be:

    # ./setup_corosync_bdii X.Y.Z.0

where X.Y.Z.0 is the network the cluster is in.

Then run the setup_bdii script:

    # ./setup_bdii X.Y.Z.N {activeactive|activepassive}

where X.Y.Z.N is the shared IP address on which the service is offered. The second argument sets the cluster mode.