Difference between revisions of "Monitoring Script"

From PDP/Grid Wiki
Previous revision:

The following script checks all the WNs, in order to look for jobs that are not running anymore. So far the jobs are not deleted, but an email is sent as an alert. After knowing the job id, it gives extra information about it via 'tracejob':

 FILE="results/Monitoring_Results_`date +%k%M%d%m%y`"
 y=""
 job_id_list_of_list=""
 i=0
 
 echo "===============================================================================" >> "$FILE"
 echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE"
 echo "===============================================================================" >> "$FILE"
 for wn in ${wn_list[@]}
 do
         y="`momctl -d0 -h $wn | grep "sidlist" | grep -v "RUNNING" | sed -e 's/job\[//g' | cut -f1 -d'.'`"
         len=${#y}
         if [ $len -ne 0 ]
         then
                 momctl -d2 -h $wn >> "$FILE"
                 echo "_______________________________________________________________________________" >> "$FILE"
                 job_id_list_of_list[i]="${y}"
                 i=$i+1
         fi
 done
 j=0
 echo "===============================================================================" >> "$FILE"
 echo "=========================== JOBS ELIGIBLE TO BE DELETED =======================" >> "$FILE"
 echo "===============================================================================" >> "$FILE"
 for job_id_list in ${job_id_list_of_list[@]}
 do
         for job_id in ${job_id_list_of_list[$j][@]}
         do
                 echo $job_id
                 tracejob -n 8 -q $job_id >> "$FILE"
                 echo "_______________________________________________________________________________" >> "$FILE"
         done
         j=$j+1
 done
 mail -s "Monitoring Results" fbernabe<-AT->nikhef<-DOT->nl < "$FILE"

Revision as of 18:13, 3 December 2009

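The following script checks all the worker nodes (WNs) for jobs that are no longer running. For each such job id it collects extra information via 'tracejob'. If the trace shows that a SIGKILL signal was sent, the job is cleared from the WN with 'momctl -c'; otherwise the trace is written to the results file, since it may give another reason to delete the job. The results file is meant to be sent by email, although this revision contains no 'mail' command: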

 #!/bin/bash
 wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '`                             #List of WNs
 FILE="/tmp/monitoring_script/results/Monitoring_Results_`date +%H%M%d%m%y`"     #Results file, sent by email (%H, not %k, so the name has no space before 10:00)
 wn_result=""                                                                    #Diagnosis of a WN
 jobidlist=""                                                                    #List of jobs to delete
 i=0                                                                             #Counter of WN with jobs to delete
 jobresult=""                                                                    #Result of tracejob
 
 echo "===============================================================================" >> "$FILE"
 echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE"
 echo "===============================================================================" >> "$FILE"
 for wn in ${wn_list[@]}                                                         #Exploring the wn list
 do
         wn_result=`momctl -d2 -h $wn`
         jobidlist=`echo $wn_result | grep "sidlist= " `                          #Unquoted echo on purpose: flattens the output to one line
         if [ ${#jobidlist} -ne 0 ]                                              #Is there any job to be deleted?
         then
                 echo "$wn_result" >> "$FILE"                                    #Quoted, to keep the line breaks in the report
                 echo "_______________________________________________________________________________" >> "$FILE"
                 job_id_list=`echo $jobidlist | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'`
                 i=$((i + 1))
                 for job_id in ${job_id_list[@]}                                 #Exploring the jobs
                 do
                         echo $job_id
                         jobresult=`tracejob -n 8 -q $job_id`
                         if echo "$jobresult" | grep -q "SIGKILL"                #A SIGKILL signal was sent (awk '/SIGKILL/' always exits 0)
                         then
                                 job_id="$job_id.stro.nikhef.nl"
                                 echo "The job $job_id can be deleted" >> "$FILE"
                                 momctl -h $wn -c $job_id                        #Deleting the job...
                                 echo "The job $job_id has been deleted from the queue and the WN $wn" >> "$FILE"
                         else
                                 echo "$jobresult" >> "$FILE"                    #This could give another reason to delete the job
                         fi
                         echo "_______________________________________________________________________________" >> "$FILE"
                 done
                 echo "_______________________________________________________________________________" >> "$FILE"
                 echo "_______________________________________________________________________________" >> "$FILE"
         fi
 done
 if [ $i -eq 0 ]
 then
         echo "There are no jobs to delete" >> "$FILE"
 fi
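
The job id extraction and the SIGKILL check above can be tried without a batch system. The following is only a minimal sketch: the function names extract_stale_job_ids and trace_shows_sigkill are hypothetical, the sample text is made up rather than real 'momctl' output, and its layout is assumed from the filters the script itself uses (an empty "sidlist=" marks a job with no running sessions). It uses 'tr' instead of the 'sed' newline substitution, which is a swapped-in but equivalent way to split the flattened text.

```shell
#!/bin/bash
# Hypothetical helper: print the numeric id of every job whose sidlist is empty.
# Reads momctl-style diagnostic text on stdin (layout assumed, see above).
extract_stale_job_ids() {
    tr '\n' ' ' |          # flatten, so an empty "sidlist=" is always followed by a space
    tr '[' '\n' |          # start a new line at every "job[" entry
    grep 'sidlist= ' |     # keep only the entries without session ids
    cut -f1 -d'.'          # "1235.stro.nikhef.nl] ..." becomes "1235"
}

# Hypothetical helper: succeed only if the tracejob text mentions SIGKILL.
# grep -q sets the exit status by match, which awk '/SIGKILL/' does not.
trace_shows_sigkill() {
    printf '%s\n' "$1" | grep -q 'SIGKILL'
}

# Made-up sample input, not real momctl output.
sample='Host: wn-001
job[1234.stro.nikhef.nl]  state=RUNNING  sidlist=4321
job[1235.stro.nikhef.nl]  state=EXITING  sidlist=
job[1236.stro.nikhef.nl]  state=RUNNING  sidlist='

stale_ids=$(printf '%s\n' "$sample" | extract_stale_job_ids)
echo "$stale_ids"
```

For the sample text this prints 1235 and 1236, one per line; job 1234 is skipped because its sidlist is not empty. Keeping the two steps in small functions makes it possible to check the parsing against saved 'momctl' output before letting the script clear any job.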