Monitoring Script
From PDP/Grid Wiki
The following script checks all the worker nodes (WNs) for jobs that are no longer running.
So far, only jobs that have been sent a SIGKILL are deleted; an email report is sent in every case. For any other job, the tracejob output is included in the email, since it may reveal another reason to mark the job as eligible for deletion:
#!/bin/bash

wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '`    #List of WNs
FILE="/tmp/Monitoring_Results_`date +%k%M%d%m%y`"       #Results file, sent by email
wn_result=""    #Diagnosis of a WN
jobidlist=""    #List of jobs to delete
i=0             #Counter of WN with jobs to delete
jobresult=""    #Result of tracejob

echo "===============================================================================" >> "$FILE"
echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE"
echo "===============================================================================" >> "$FILE"

for wn in ${wn_list[@]}    #Exploring the wn list
do
    wn_result=`momctl -d2 -h $wn`
    jobidlist=`echo $wn_result | grep "sidlist= " `
    if [ ${#jobidlist} -ne 0 ]    #Is there any job to be deleted?
    then
        echo $wn_result >> "$FILE"
        echo "_______________________________________________________________________________" >> "$FILE"
        job_id_list=`echo $jobidlist | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'`
        i=`expr $i + 1`
        for job_id in ${job_id_list[@]}    #Exploring the jobs
        do
            echo $job_id
            jobresult=`tracejob -n 8 -q $job_id`
            if echo ${jobresult} | grep "SIGKILL" > /dev/null    #A SIGKILL signal was sent
            then
                job_id="$job_id.stro.nikhef.nl"
                echo "The job $job_id can be deleted" >> "$FILE"
                momctl -h $wn -c $job_id    #Deleting the job...
                echo "The job $job_id has been deleted from the queue and the WN $wn" >> "$FILE"
            else
                echo $jobresult >> "$FILE"    #This could give another reason to delete the job
            fi
            echo "_______________________________________________________________________________" >> "$FILE"
        done
        echo "_______________________________________________________________________________" >> "$FILE"
        echo "_______________________________________________________________________________" >> "$FILE"
    fi
done

if [ $i -eq 0 ]
then
    echo "There are no jobs to delete" >> "$FILE"
fi

mail -s "Monitoring Results" <EMAIL_ADDRESS> < "$FILE"    #<EMAIL_ADDRESS> has to be edited
rm "$FILE"    #Quoted: %k pads single-digit hours with a space, so the name may contain one
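
The job-ID extraction used in the script can be tried out without touching a live cluster by running the same pipeline on a hand-written sample. The following is only a sketch under stated assumptions: the function name extract_stale_jobs and the sample text are made up for illustration, and the sample only mimics the shape the script relies on (a job[<id>.<server>] entry followed by an empty "sidlist= " field, with more text after it). Real momctl -d2 output contains more fields and may look different. The sketch also assumes GNU sed, which understands \n in the replacement text.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script): prints the numeric
# IDs of all job entries whose session ID list ("sidlist") is empty.
extract_stale_jobs() {
    (
        # Disable globbing in this subshell so that "job[...]" is never
        # expanded as a filename pattern by the unquoted echo below.
        set -f
        # The unquoted echo flattens the text onto one line, as the script
        # does; sed then starts a new line at every "[" so that each line
        # begins with a job ID, and cut keeps the part before the first ".".
        echo $1 | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'
    )
}

# Made-up sample that only mimics the assumed shape of the diagnostics.
sample='job[101.stro.nikhef.nl] state=RUNNING sidlist=
job[102.stro.nikhef.nl] state=RUNNING sidlist=4242
job[103.stro.nikhef.nl] state=RUNNING sidlist=
end of sample'

extract_stale_jobs "$sample"
```

With this sample the function prints 101 and 103, one per line; job 102 is skipped because its sidlist is not empty. Note that the pattern "sidlist= " includes a trailing space, so an entry is only matched if some text follows the empty field once the output has been flattened. That is why the sample ends with an extra line.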