Nagios Monitoring Setup

From PDP/Grid Wiki
Revision as of 11:11, 1 March 2010

This article describes the current Nagios monitoring setup. Note that the setup is being modified, so the information here may soon be outdated.

Hosts:

  • spade: Nagios master server. It collects information from the active Nagios servers via NSCA, presents it via the web interface, and actively schedules checks for grid monitoring (either via the UI horige, or by fetching results from SAM, GOC, GGUS and gStat).
  • riek: Nagios server for production grid servers (including storage servers) and generic servers.
  • eg: Nagios server for the worker nodes. The check results are published via NSCA to spade.
  • tbn06: Nagios server for the ITB. The check results are published via NSCA to spade.
  • horige: gLite 3.2 UI dedicated to grid monitoring. The Nagios server on spade schedules the grid checks on this host via NRPE. To run the grid checks, a valid grid proxy must exist on horige. This UI does not run a Nagios server itself.


Nagios configuration

At present, all of the Nagios configuration is handled by the Quattor setup, except for the grid-specific checks. These checks are generated by the script ncg.pl on host spade with the following command:

ncg.pl

The configuration files for ncg.pl are in /etc/ncg and are generated by YAIM. The output is now written to /etc/nagios/wlcg.d (Nagios server configuration) and /etc/nagios/nrpe (NRPE configuration for the UI). The generated configuration files require some editing before they can be used:

  • wlcg.d/commands-edited.cfg: this is a copy of the generated file wlcg.d/commands.cfg in which the two commands for pnp4nagios are commented out, since they are already defined by the Quattor setup (and cannot be removed there).
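
The edit to commands.cfg could be scripted roughly as follows. This is only a sketch: the function name comment_out_commands is made up for illustration, and it assumes that the two pnp4nagios commands can be recognised by the string "pnp" in their command_name, which should be checked against the generated file first.

```shell
#!/bin/sh
# Sketch: comment out every "define command { ... }" block whose
# command_name matches a pattern, and print the result on stdout.
# ASSUMPTION: the pnp4nagios commands have "pnp" in their command_name.
comment_out_commands() {
    pattern=$1   # regular expression matched against the command_name value
    infile=$2    # generated file, e.g. /etc/nagios/wlcg.d/commands.cfg
    awk -v pat="$pattern" '
        # Print the buffered block, prefixed with "#" if it matched.
        function flush_block(   i, prefix) {
            prefix = hit ? "#" : ""
            for (i = 1; i <= n; i++)
                print prefix buf[i]
            n = 0; hit = 0; inblock = 0
        }
        # Start buffering at the beginning of a command definition.
        /^[ \t]*define[ \t]+command/ { inblock = 1 }
        inblock {
            buf[++n] = $0
            if ($1 == "command_name" && $2 ~ pat) hit = 1
            if ($0 ~ /}/) flush_block()
            next
        }
        # Everything outside a command definition is copied unchanged.
        { print }
        END { if (n) flush_block() }
    ' "$infile"
}
```

A possible invocation, writing the edited copy next to the generated file:

  comment_out_commands 'pnp' /etc/nagios/wlcg.d/commands.cfg > /etc/nagios/wlcg.d/commands-edited.cfg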

After making the changes, run

nagios -v /etc/nagios/nagios.cfg

to verify that there are no errors. Then, the Nagios service can be restarted.
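
The verify-then-restart step can be combined so that a broken configuration never reaches a restart. A minimal sketch, assuming that nagios -v exits with a non-zero status when it finds errors and that the service is called nagios; the function name verify_and_restart is hypothetical:

```shell
#!/bin/sh
# Sketch: restart Nagios only if the configuration check passes.
# ASSUMPTION: "nagios -v" returns non-zero on configuration errors.
verify_and_restart() {
    cfg=${1:-/etc/nagios/nagios.cfg}   # main configuration file
    if nagios -v "$cfg"; then
        service nagios restart
    else
        echo "configuration check failed, not restarting" >&2
        return 1
    fi
}
```

Calling verify_and_restart without arguments checks /etc/nagios/nagios.cfg.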

Nagios grid proxy

~ronalds/bin/nagios-refresh

Log in to horige as root.

service httpd start

Then look at https://tbn12.nikhef.nl/nagios (don't forget to set up an OpenVPN tunnel first).

Show host wierde, find the myproxylifetime check and reschedule the next check of this service.

It should then show a valid proxy ... then you can stop the httpd daemon on tbn12.

Make a dteam VOMS proxy and copy it to the monitoring host with scp:

scp /tmp/x509up_u500 root@tbn12:/etc/nagios/globus/userproxy.pem

A warning that the proxy will expire comes 24 hours beforehand. When it appears, run this:

~ronalds/bin/nagios-refresh

If the proxy has expired completely, this does not work and the scp has to be redone.
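
The choice between refreshing and copying a new proxy depends only on the remaining lifetime, which can be sketched as a small helper. The function name proxy_action and its output words are made up for illustration; the 24-hour threshold is taken from the warning mentioned above.

```shell
#!/bin/sh
# Sketch: decide what to do from the remaining proxy lifetime in seconds.
#   recopy  - proxy has expired, a new proxy has to be copied with scp
#   refresh - inside the warning window, nagios-refresh is enough
#   ok      - nothing to do yet
proxy_action() {
    timeleft=$1          # remaining lifetime in seconds
    warn=${2:-86400}     # warning window, default 24 hours
    if [ "$timeleft" -le 0 ]; then
        echo "recopy"
    elif [ "$timeleft" -le "$warn" ]; then
        echo "refresh"
    else
        echo "ok"
    fi
}
```

Assuming voms-proxy-info is available on the host, the remaining lifetime could be obtained with something like:

  proxy_action "$(voms-proxy-info -file /etc/nagios/globus/userproxy.pem -timeleft)"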

Note: to get your proxy out of the server:

GT_PROXY_MODE=old myproxy-destroy -s wierde -l nagios -k NagiosRefresh

(Run this with the proxy of the person whose proxy is in wierde ...)