Difference between revisions of "User talk:Gertp"

From PDP/Grid Wiki
 
== Generic active/passive clusters ==
 
Configuring a cluster with corosync and heartbeat requires you to write a start/stop and monitoring script for the service the cluster is built for.
 
  
This script is very much like an "init.d" script, but you can't use an init.d script directly, because heartbeat scripts use tri-state logic instead of two-state logic. That is, heartbeat-controlled services are "running", "stopped" or "failed", whereas services controlled by init that fail are simply stopped and must be restarted. Heartbeat uses the third state, "failed", as the trigger to migrate the service to another node in your services pool.
 
 
For a simple service consisting of one process, monitoring is easy and adapting an existing init.d script is straightforward. Hint: use the sample
 
    /usr/lib/ocf/resource.d/heartbeat/Dummy
 
as a starting-point.
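To make the three states concrete, here is a minimal sketch of a monitor function for a single-process service. The numeric values are the standard OCF return codes; the helper name monitor_pidfile and the rule "stale pid file means failed" are assumptions made for illustration, not something taken from the Dummy agent.

```shell
#!/bin/sh
# Sketch only, not a complete resource agent.
# Standard OCF return codes used by heartbeat/pacemaker resource agents:
OCF_SUCCESS=0        # service is running
OCF_ERR_GENERIC=1    # service has failed
OCF_NOT_RUNNING=7    # service is cleanly stopped

# monitor_pidfile PIDFILE (hypothetical helper)
# Maps the state of one daemon onto the three OCF states.
monitor_pidfile() {
    pidfile=$1
    # No pid file at all: treat the service as cleanly stopped.
    [ -f "$pidfile" ] || return $OCF_NOT_RUNNING
    pid=$(cat "$pidfile")
    # Pid file present and the process answers signal 0: running.
    # (kill -0 needs permission to signal the process; agents run as root.)
    if [ -n "$pid" ] && kill -0 "$pid" 2>/dev/null; then
        return $OCF_SUCCESS
    fi
    # Pid file left behind by a dead process: report a failure.
    return $OCF_ERR_GENERIC
}
```

An init.d "status" action would lump the last two cases together; keeping them apart is what lets the cluster decide to migrate the service.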
 
 
For services comprising two or more processes, you'll have to loop over all processes and their pid and lock files to see whether the processes are running and correspond to the lock and pid files. This assumes that all processes are well behaved and store their pid and lock files in the standard locations.
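The loop over several processes could look roughly like the sketch below. It assumes one pid file per daemon and leaves the lock-file checks out (they would follow the same pattern); service_state is a hypothetical helper name, and the way partial states are folded into "failed" is a design choice of this sketch.

```shell
#!/bin/sh
# Sketch only: fold the state of several daemons into one OCF-style code.
# service_state PIDFILE... (hypothetical helper)
#   returns 0 if every daemon is running,
#           7 if no pid file exists at all (cleanly stopped),
#           1 for anything in between (failed).
service_state() {
    [ "$#" -gt 0 ] || return 1
    alive=0
    absent=0
    for pidfile in "$@"; do
        if [ ! -f "$pidfile" ]; then
            absent=$((absent + 1))
        elif kill -0 "$(cat "$pidfile")" 2>/dev/null; then
            alive=$((alive + 1))
        fi
        # A pid file whose process is gone counts as neither.
    done
    [ "$alive" -eq "$#" ] && return 0
    [ "$absent" -eq "$#" ] && return 7
    return 1
}
```

Called as, say, `service_state /var/run/first.pid /var/run/second.pid` (placeholder paths), a single dead daemon is enough to report the whole service as failed.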
 
 
=== Before you start ===
 
You'll need the following:
 
* Two (virtual) machines with IP addresses in the same subnet, i.e. sharing the same broadcast domain, and
 
* a third IP address on the same subnet that will be tied to the service provided by the cluster, and
 
* DNS A records for all three.
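Whether three addresses really share a subnet can be checked with a little integer arithmetic. The sketch below is an illustration only: the function names are made up, the addresses in the usage note come from the documentation range 192.0.2.0/24, and the prefix length has to be taken from your own network setup.

```shell
#!/bin/sh
# Sketch only: test whether two IPv4 addresses fall in the same subnet.

# ip_to_int A.B.C.D: print the address as one integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the dotted quad into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# same_subnet IP1 IP2 PREFIXLEN: succeed if both addresses share the network.
same_subnet() {
    mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
    a=$(ip_to_int "$1")
    b=$(ip_to_int "$2")
    [ $(( a & mask )) -eq $(( b & mask )) ]
}
```

For example, `same_subnet 192.0.2.10 192.0.2.37 24 && echo ok` succeeds, while the same test against 192.0.3.37 fails.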
 
 
== BDII setup on active/passive failover cluster ==
 
 
The bdii services are not entirely well behaved. The pid files for the slapd daemon and bdii-update daemon are not in the same place (/var/run and /var/run/bdii/db, respectively), and the init.d script for the slapd daemon contains lots of cruft that shouldn't be part of an init script to begin with (such as initialization of the database).
 
 
Therefore the heartbeat script for the bdii service is a bit of a kludge, remaining as close as possible to the init.d script it is derived from (so as to make adaptations doable as the init.d script evolves).
 
 
Click for the [[BDII heartbeat script]].
 
 
== Install the cluster engine & resource manager ==
 
You need to perform the installation on each cluster node.
 
 
Add the EPEL repo to /etc/yum.repos.d:
 
    # rpm -Uhv http://download.fedora.redhat.com/pub/epel/5/x86_64/epel-release-5-4.noarch.rpm
 
 
Add the Clusterlabs repo to /etc/yum.repos.d:
 
    # wget -O /etc/yum.repos.d/pacemaker.repo http://clusterlabs.org/rpm/epel-5/clusterlabs.repo
 
 
Now have yum install the cluster engine and resource managers. This will install loads of dependencies:
 
    # yum -y install pacemaker
 
 
== Configure the cluster engine ==
 
You need to do this on each cluster node.
 
 
First, copy the sample configuration for corosync to the default configuration:
 
    # cp /etc/corosync/corosync.conf{.example,}
 
Then, change the "bindnetaddr" to the address of the network that your cluster nodes are in:
 
    # perl -p -i -e 's|(bindnetaddr:).*|$1 194.171.X.0|' /etc/corosync/corosync.conf
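The value wanted here is the network address, not the address of a node: the node's IP address with the host bits cleared. As a sketch of how to derive it (network_address is a hypothetical helper, and the example address below is from the documentation range, not from this site):

```shell
#!/bin/sh
# Sketch only: derive the network address for bindnetaddr.
# network_address IP PREFIXLEN: print IP with its host bits set to zero.
network_address() {
    prefix=$2
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the dotted quad into $1..$4
    IFS=$old_ifs
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
    net=$(( ip & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}
```

A node at 192.0.2.37 with a 24-bit netmask gives `network_address 192.0.2.37 24`, which prints 192.0.2.0.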
 
Also, append the following:
 
    # cat >>/etc/corosync/corosync.conf <<UFO
    aisexec {
            user: root
            group: root
    }
    service {
            name: pacemaker
            ver: 0
    }
    UFO
 
 
This tells corosync to run as root and to use the pacemaker resource manager.
 
 
Now start the cluster on one of the nodes:
 
    # /etc/init.d/corosync start
 
For RHC[TE]s, that is:
 
    # service corosync start
 
Then check for errors by inspecting /var/log/corosync.log.
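Reading the whole log by eye gets tedious, so a keyword scan can serve as a first pass. The sketch below is only that: scan_log is a hypothetical helper and the keyword list is a guess at what a problem looks like, not a documented corosync log format.

```shell
#!/bin/sh
# Sketch only: list suspicious lines in a log file.
# scan_log LOGFILE (hypothetical helper)
#   prints matching lines with their line numbers;
#   exit status 0 means something matched, 1 means nothing did.
scan_log() {
    # The keywords are an assumption; widen or narrow them as needed.
    grep -n -i -E 'error|crit|fail' "$1"
}
```

For instance, `scan_log /var/log/corosync.log || echo "nothing suspicious found"`. A clean scan is no proof of a healthy cluster, so still look at the log itself if anything misbehaves.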
 
 
If there are no errors, start corosync on all other cluster nodes. Once you have checked that all nodes are running OK, enable corosync to start at boot time on all nodes:
 
    # chkconfig corosync on
 
 
=== STONITH ===
 
STONITH is the acronym for Shoot The Other Node In The Head. It is used during a failover/recovery to take the failing node offline (or to turn it off). At this moment we don't need it, so turn it off:
 
    # crm configure property stonith-enabled=false
 
    # crm_verify -L
 
If corosync is running on all nodes, you need to do this on one node only.
 
 
=== The HA service IP address ===
 
The service we want to offer using the cluster must have an address separate from the nodes' own addresses, so it can migrate freely from one node to the other without users being aware of it. The service IP address must be in the same subnet as the cluster nodes.
 
 
Make sure that the DNS record for the service IP address is a true A record. Once you have the IP address, assign the ClusterIP resource:
 
    # crm configure primitive ClusterIP ocf:heartbeat:IPaddr2 \
        params \
            ip=194.171.79.37 \
            cidr_netmask=32 \
        op monitor interval=10s
 
Again, if corosync is running on all nodes, you need to do this on one node only.
 
 
=== Quorum ===
 
If your cluster consists of just two nodes, you need to tell the cluster to ignore loss of quorum. Otherwise, when one node fails, the surviving node has no quorum and will not take over the service:
 
    # crm configure property no-quorum-policy=ignore
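The reason lies in the majority rule: a partition has quorum only when it holds more than half of the nodes. The arithmetic can be sketched as follows (has_quorum is a hypothetical helper for illustration, not a pacemaker command):

```shell
#!/bin/sh
# Sketch only: the majority rule behind quorum.
# has_quorum TOTAL ALIVE: succeed if ALIVE is more than half of TOTAL.
has_quorum() {
    [ $(( 2 * $2 )) -gt "$1" ]
}
```

With two nodes, `has_quorum 2 1` fails: the single surviving node holds exactly half of the cluster, which is not a majority. With three nodes, `has_quorum 3 2` succeeds, so a three-node cluster survives the loss of one node without this setting.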
 
 
 
=== Stickiness ===
 
Stickiness determines the service migration cost. Set it to something high to prevent the service from being moved around the cluster for no reason:
 
    # crm configure rsc_defaults resource-stickiness=100
 
 
== Install BDII ==
 
This section describes how to install bdii from rpms using yum. If you want to have quattor install bdii for you, refer to your local quattor guru.
 
 
Install the rpms:
 
    # yum install bdii bdii-config-site glite-yaim-bdii
 
This will pull in all required rpms.
 
 
Now, you'll need to get hold of a glite/quattor config file. Assuming you are building the cluster to replace a non-failover site bdii server, you can find it there. Otherwise you'll need to handcraft it, which is beyond the scope of this document. The config file is in:
 
    /etc/siteinfo/lcg-quattor-site-info.def
 

Latest revision as of 13:12, 5 April 2011