Difference between revisions of "Nagios Monitoring Setup"

From PDP/Grid Wiki

Latest revision as of 13:31, 15 November 2013

This article describes the current Nagios monitoring setup. Note that this setup is being modified, so the information given here may soon be outdated.

Hosts:

  • spade: Nagios master server. It collects information from the active Nagios servers via NSCA, presents it via the web interface, and actively schedules checks for grid monitoring (either via the UI horige, or by fetching results from SAM, GOC, GGUS and gStat).
  • riek: Nagios server for production grid servers (including storage servers) and generic servers.
  • eg: Nagios server for the worker nodes. The check results are published via NSCA to spade.
  • tbn06: Nagios server for the ITB. The check results are published via NSCA to spade.
  • horige: gLite 3.2 UI dedicated to grid monitoring. The Nagios master (spade) schedules the grid checks on it via NRPE. To run the grid checks, a valid grid proxy must exist on host horige. This UI does not run a Nagios server.


Nagios configuration

At present, all of the Nagios configuration is handled by the Quattor setup. The grid-specific configuration is also defined in the Quattor setup, but it is applied on the Nagios master server spade and the NRPE UI horige via Yaim.

On the Nagios master spade, Yaim runs the script ncg.pl to regenerate the hosts, services and commands for grid checks. When new (SAM) tests appear, it may be necessary to regenerate the configuration by hand:

/usr/sbin/ncg.pl

The configuration files for ncg.pl are present in /etc/ncg and they are generated by Yaim. The output is now written to /etc/nagios/wlcg.d (Nagios server configuration) and /etc/nagios/nrpe (NRPE configuration for the UI). The generated configuration files require some editing before they can be used:

  • wlcg.d/commands-edited.cfg: this is a copy of the generated file wlcg.d/commands.cfg in which the two commands for pnp4nagios are commented out, since they are already defined by the Quattor setup (and cannot be removed there).
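
The edit described above could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, and it assumes that the two commands can be recognised because the string pnp4nagios occurs somewhere inside their "define command" block, which should be checked against the real generated file.

```shell
# Sketch: copy the generated commands.cfg and comment out every
# "define command" block that mentions pnp4nagios.
# Assumption: the string "pnp4nagios" identifies the blocks to disable.
comment_out_pnp_commands() {
    src=$1   # generated file, e.g. /etc/nagios/wlcg.d/commands.cfg
    dst=$2   # edited copy,   e.g. /etc/nagios/wlcg.d/commands-edited.cfg
    awk '
        # A command definition opens: start buffering its lines.
        /^[ \t]*define[ \t]+command/ { inblock = 1; n = 0; hit = 0 }
        inblock {
            buf[++n] = $0
            if ($0 ~ /pnp4nagios/) hit = 1
            # The closing brace ends the block: print it, commented out on a match.
            if ($0 ~ /^[ \t]*}/) {
                prefix = hit ? "#" : ""
                for (i = 1; i <= n; i++) print prefix buf[i]
                inblock = 0
            }
            next
        }
        { print }
        # Do not lose lines of a block that is never closed.
        END { if (inblock) for (i = 1; i <= n; i++) print buf[i] }
    ' "$src" > "$dst"
}
```

Writing to a separate file keeps the generated commands.cfg untouched, so the result can be compared with diff before Nagios is pointed at the edited copy.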

After making the changes, run

nagios -v /etc/nagios/nagios.cfg

to verify that there are no errors. Then, the Nagios service can be restarted.
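
The verify-then-restart step can be combined so that a broken configuration never reaches the running service. A minimal sketch, assuming that nagios -v exits non-zero when it finds errors and that the service is restarted with "service nagios restart"; the variables NAGIOS_BIN, NAGIOS_CFG and RESTART_CMD are hypothetical overrides introduced here for illustration:

```shell
# Sketch: restart Nagios only when the configuration check succeeds.
# The restart command is an assumption about the local init setup.
NAGIOS_BIN=${NAGIOS_BIN:-nagios}
NAGIOS_CFG=${NAGIOS_CFG:-/etc/nagios/nagios.cfg}
RESTART_CMD=${RESTART_CMD:-service nagios restart}

verify_and_restart() {
    if "$NAGIOS_BIN" -v "$NAGIOS_CFG" >/dev/null; then
        # Left unquoted on purpose so the command and its arguments are split.
        $RESTART_CMD
    else
        echo "configuration check failed, Nagios not restarted" >&2
        return 1
    fi
}
```

When the check fails, run nagios -v by hand to see the full error report, which the function discards.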

No changes are needed on the UI horige. Changes on spade are automatically propagated to horige.

Note: the Quattor setup defines checks for the hosts spade.nikhef.nl and horige.nikhef.nl (with domain name) for load, disk usage etc. The Yaim-generated setup uses the short host names (spade and horige) to prevent doubly defined host names. Do not mix these naming schemes, otherwise Nagios will refuse to start!
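
A quick way to spot doubly defined hosts before running the verification is to count the host_name entries inside "define host" blocks. The sketch below is a hypothetical helper, not part of the setup; it assumes the usual object syntax with one directive per line and the closing brace on its own line.

```shell
# Sketch: print every host_name that is defined in more than one
# "define host" block. Reads Nagios object configuration on stdin.
find_duplicate_hosts() {
    awk '
        /^[ \t]*define[ \t]+host[ \t]*[{]/ { inhost = 1; next }
        /^[ \t]*}/                          { inhost = 0; next }
        # Only count host_name lines inside host definitions, not services.
        inhost && $1 == "host_name"         { count[$2]++ }
        END { for (h in count) if (count[h] > 1) print h }
    ' | sort
}
```

It can be fed with the concatenated configuration files, for example the files under /etc/nagios/wlcg.d together with the Quattor-generated ones. Empty output means no host name is defined twice.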

Nagios grid proxy

To monitor the availability of grid services, some checks need a valid grid proxy. For this purpose, the setup uses a robot proxy that is generated from an eToken on host kudde (note: access control via Dennis or Jan Just). The generated VOMS proxy is for the VO ops.biggrid.nl and is stored on the MyProxy server wierde. More information about the eToken setup can be found in the dedicated article "Using an Aladdin eToken PRO to store grid certificates".

The checks that need this proxy are scheduled from the master Nagios server (spade), although they are executed via NRPE on the restricted User Interface horige. The path to the active proxy at horige is /etc/nagios/globus/userproxy.pem-ops.biggrid.nl. This proxy is periodically renewed via the check hr.srce.GridProxy-Get-ops.biggrid.nl running at the Nagios master 'spade'. Renewal is based on a valid host certificate and requires the UI to be whitelisted in the MyProxy server 'wierde'.



Nagios grid proxy - OBSOLETED

The information below is no longer current. It is kept only for reference.

The goal here is to get a valid dteam grid proxy registered in the MyProxy server on wierde. There is a script which does this (don't try to execute it yet, please keep reading):

~ronalds/bin/nagios-refresh 

This script relies on a valid dteam proxy being present at the UI machine "horige" in the location

/etc/nagios/globus/userproxy.pem-dteam

There are four cases:

  1. No proxy exists at the MyProxy server on wierde.
  2. A valid proxy exists at wierde and you want to renew this proxy (with the same grid credentials).
  3. You want to use a proxy at wierde with different credentials (the "Ronald is leaving soon for vacation" use case).
  4. Same as number 3, but Ronald has already left for Hugharda.

No proxy exists

If no valid proxy exists (anymore), create a regular dteam proxy on any UI:

voms-proxy-init -voms dteam

and copy it to the above location:

scp /tmp/x509up_u500 root@horige:/etc/nagios/globus/userproxy.pem-dteam

Log in as root on horige and change the owner and group to nagios:nagios:

chown nagios:nagios /etc/nagios/globus/userproxy.pem-dteam

Then (no longer as root, but as yourself on any UI machine) register a dteam proxy in the MyProxy server wierde using the script:

~ronalds/bin/nagios-refresh
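
The scp command above hard-codes /tmp/x509up_u500, which is only correct for a user with UID 500. voms-proxy-init writes to /tmp/x509up_u<uid> by default, unless X509_USER_PROXY points elsewhere. A small sketch (the function name is hypothetical) that derives the path instead:

```shell
# Sketch: print the proxy file of the current user.
# X509_USER_PROXY, when set, overrides the /tmp/x509up_u<uid> default.
default_proxy_path() {
    echo "${X509_USER_PROXY:-/tmp/x509up_u$(id -u)}"
}

# Possible use in the copy step:
#   scp "$(default_proxy_path)" root@horige:/etc/nagios/globus/userproxy.pem-dteam
```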

Renewing an existing proxy, same credentials

Follow exactly the same steps as in the 'no proxy exists' case.


Using a proxy based on other credentials

The user whose proxy is currently at wierde (the one you no longer wish to use) will have to, using his/her credentials, remove that proxy. This can be done by using the command:

GT_PROXY_MODE=old myproxy-destroy -s wierde -l nagios -k NagiosRetrieve-horige-dteam

After that has been done, follow the procedure for the 'no proxy exists' case.

Using a proxy based on other credentials, original user unreachable

In this case, one has to forcibly remove the proxy registered at wierde, and once that is done, follow the 'no proxy exists' case.

Only apply the emergency procedure if the proxy registered at the MyProxy server has expired and the owner of that proxy is not available to renew it!

Log in as root on the MyProxy server wierde and move the existing credentials out of the way:

cd /var/myproxy/
ls -l nagios-NagiosRetrieve-horige-dteam.*
-rw-------  1 root root 4256 Apr 11 10:02 nagios-NagiosRetrieve-horige-dteam.creds
-rw-------  1 root root  203 Apr 11 10:02 nagios-NagiosRetrieve-horige-dteam.data
mkdir OLD
mv nagios-NagiosRetrieve-horige-dteam.* OLD/
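
The same move can be wrapped in a function that refuses to act when nothing matches and that does not fail when OLD already exists from an earlier cleanup. This is a sketch with a hypothetical function name; the directory and prefix are passed in as arguments.

```shell
# Sketch: move the MyProxy credential files with a given prefix into OLD/.
# Returns non-zero and moves nothing when no file matches.
stash_myproxy_creds() {
    dir=$1      # storage directory, /var/myproxy in the text above
    prefix=$2   # e.g. nagios-NagiosRetrieve-horige-dteam
    # Expand the glob into the positional parameters of this function.
    set -- "$dir/$prefix".*
    # Without a match the pattern stays literal and does not exist as a file.
    [ -e "$1" ] || { echo "no credentials matching $prefix" >&2; return 1; }
    mkdir -p "$dir/OLD" && mv -- "$@" "$dir/OLD/"
}

# Possible use:
#   stash_myproxy_creds /var/myproxy nagios-NagiosRetrieve-horige-dteam
```

Note that a second run overwrites files of the same name that are already in OLD.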

At this point you're ready to use the 'no proxy exists' procedure above.

Verification of the validity of the currently registered proxy

  • Point your browser to the location of the Nagios server at spade
  • Show host wierde
  • Check service hr.srce.MyProxy-ProxyLifetime-dteam
  • The check result shows the date and time of expiration and an estimate of the time left, as measured at the moment the check was executed.
  • It is possible to force the check of service hr.srce.MyProxy-ProxyLifetime-dteam to get an update (e.g. after renewing the registration of the proxy) by selecting "Re-schedule the next check of this service" in the panel on the right.

When the proxy expires in less than 24 hours, the check result for hr.srce.MyProxy-ProxyLifetime-dteam is a warning.
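
The threshold logic can be illustrated with a few lines of shell. This is not the implementation of hr.srce.MyProxy-ProxyLifetime-dteam, only a sketch of the rule stated above using the standard Nagios plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL). Reporting an expired proxy as CRITICAL is an assumption, and the function name is hypothetical.

```shell
# Sketch: map the remaining proxy lifetime in seconds to a Nagios status.
# WARNING below 24 hours follows the text; CRITICAL at expiry is an assumption.
proxy_status() {
    left=$1
    if [ "$left" -le 0 ]; then
        echo "CRITICAL: proxy expired"; return 2
    elif [ "$left" -lt 86400 ]; then
        echo "WARNING: proxy expires in $((left / 3600)) h"; return 1
    fi
    echo "OK: proxy valid for $((left / 3600)) h"; return 0
}

# Possible use on a local proxy file (voms-proxy-info prints seconds):
#   proxy_status "$(voms-proxy-info -file /etc/nagios/globus/userproxy.pem-dteam -timeleft)"
```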

External documentation:

Yaim Based Installation of Nagios & NCG: https://twiki.cern.ch/twiki/bin/view/EGEE/GridMonitoringNcgYaim