Difference between revisions of "Monitoring Script"

From PDP/Grid Wiki
Jump to navigationJump to search
m
 
(2 intermediate revisions by the same user not shown)
Line 1: Line 1:
The following script checks all the WNs, in order to look for jobs that are not running anymore. So far the jobs are not deleted, but an email is sent as an alert. After knowing the job id, it gives extra information about it via 'tracejob':
+
The following script checks all the WNs, in order to look for jobs that are not running anymore.
 +
 
 +
So far only those jobs that have been sent a SIGKILL are deleted; and an email is still sent. If the result of a tracejob is showed in the email, then there could be another reason to mark a job as eligible to be deleted:
  
 
   #!/bin/bash
 
   #!/bin/bash
 
   wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '`                            #List of WNs
 
   wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '`                            #List of WNs
   FILE="/tmp/monitoring_script/results/Monitoring_Results_`date +%k%M%d%m%y`"    #Results file, sent by email
+
   FILE="/tmp/Monitoring_Results_`date +%k%M%d%m%y`"    #Results file, sent by email
 
   wn_result=""                                                                    #Diagnosis of a WN
 
   wn_result=""                                                                    #Diagnosis of a WN
 
   jobidlist=""                                                                    #List of jobs to delete
 
   jobidlist=""                                                                    #List of jobs to delete
Line 26: Line 28:
 
                           echo $job_id
 
                           echo $job_id
 
                           jobresult=`tracejob -n 8 -q $job_id`
 
                           jobresult=`tracejob -n 8 -q $job_id`
                           if echo ${jobresult} | awk '/SIGKILL/' > /dev/null      #A SIGKILL signal was sent
+
                           if echo ${jobresult} | grep "SIGKILL" > /dev/null      #A SIGKILL signal was sent
 
                           then
 
                           then
 
                                   job_id="$job_id.stro.nikhef.nl"
 
                                   job_id="$job_id.stro.nikhef.nl"
Line 45: Line 47:
 
           echo "There are no jobs to delete" >> "$FILE"
 
           echo "There are no jobs to delete" >> "$FILE"
 
   fi
 
   fi
 +
mail -s "Monitoring Results" <EMAIL_ADDRESS> < "$FILE"                          #<EMAIL_ADDRESS> has to be edited
 +
rm $FILE

Latest revision as of 16:08, 25 January 2010

The following script checks all the WNs, in order to look for jobs that are not running anymore.

So far only those jobs that have been sent a SIGKILL are deleted; and an email is still sent. If the result of a tracejob is showed in the email, then there could be another reason to mark a job as eligible to be deleted:

 #!/bin/bash
 wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '`                             #List of WNs
 FILE="/tmp/Monitoring_Results_`date +%k%M%d%m%y`"     #Results file, sent by email
 wn_result=""                                                                    #Diagnosis of a WN
 jobidlist=""                                                                    #List of jobs to delete
 i=0                                                                             #Counter of WN with jobs to delete
 jobresult=""                                                                    #Result of tracejob
 
 echo "===============================================================================" >> "$FILE"
 echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE"
 echo "===============================================================================" >> "$FILE"
 for wn in ${wn_list[@]}                                                         #Exploring the wn list
 do
         wn_result=`momctl -d2 -h $wn`
         jobidlist=`echo $wn_result | grep "sidlist= " `
         if [ ${#jobidlist} -ne 0 ]                                              #Is there any job to be deleted?
         then
                 echo $wn_result >> "$FILE"
                 echo "_______________________________________________________________________________" >> "$FILE"
                 job_id_list=`echo $jobidlist | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'`
                 i=`expr $i + 1`
                 for job_id in ${job_id_list[@]}                                 #Exploring the jobs
                 do
                         echo $job_id
                         jobresult=`tracejob -n 8 -q $job_id`
                         if echo ${jobresult} | grep "SIGKILL" > /dev/null      #A SIGKILL signal was sent
                         then
                                 job_id="$job_id.stro.nikhef.nl"
                                 echo "The job $job_id can be deleted" >> "$FILE"
                                 momctl -h $wn -c $job_id                        #Deleting the job...
                                 echo "The job $job_id has been deleted from the queue and the WN $wn" >> "$FILE"
                         else
                                 echo $jobresult >> "$FILE"                      #This could give another reason to delete the job
                         fi
                         echo "_______________________________________________________________________________" >> "$FILE"
                 done
                 echo "_______________________________________________________________________________" >> "$FILE"
                 echo "_______________________________________________________________________________" >> "$FILE"
         fi
 done
 if [ $i -eq 0 ]
 then
         echo "There are no jobs to delete" >> "$FILE"
 fi
mail -s "Monitoring Results" <EMAIL_ADDRESS> < "$FILE"                           #<EMAIL_ADDRESS> has to be edited
rm $FILE