Difference between revisions of "Monitoring Script"
From PDP/Grid Wiki
Jump to navigationJump to searchm |
|||
| Line 1: | Line 1: | ||
The following script checks all the WNs, in order to look for jobs that are not running anymore. So far the jobs are not deleted, but an email is sent as an alert. After knowing the job id, it gives extra information about it via 'tracejob': | The following script checks all the WNs, in order to look for jobs that are not running anymore. So far the jobs are not deleted, but an email is sent as an alert. After knowing the job id, it gives extra information about it via 'tracejob': | ||
| − | FILE="results/Monitoring_Results_`date +%k%M%d%m%y`" | + | #!/bin/bash |
| − | + | ||
| − | + | wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '` #List of WNs | |
| − | i=0 | + | FILE="/tmp/monitoring_script/results/Monitoring_Results_`date +%k%M%d%m%y`" #Results file, sent by email |
| + | wn_result="" #Diagnosis of a WN | ||
| + | jobidlist="" #List of jobs to delete | ||
| + | i=0 #Counter of WN with jobs to delete | ||
| + | jobresult="" #Result of tracejob | ||
| + | |||
echo "===============================================================================" >> "$FILE" | echo "===============================================================================" >> "$FILE" | ||
echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE" | echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE" | ||
echo "===============================================================================" >> "$FILE" | echo "===============================================================================" >> "$FILE" | ||
| − | for wn in ${wn_list[@]} | + | for wn in ${wn_list[@]} #Exploring the wn list |
do | do | ||
| − | + | wn_result=`momctl -d2 -h $wn` | |
| − | + | jobidlist=`echo $wn_result | grep "sidlist= " ` | |
| − | + | if [ ${#jobidlist} -ne 0 ] #Is there any job to be deleted? | |
| − | + | then | |
| − | + | echo $wn_result >> "$FILE" | |
| − | + | echo "_______________________________________________________________________________" >> "$FILE" | |
| − | + | job_id_list=`echo $jobidlist | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'` | |
| − | + | i=`expr $i + 1` | |
| − | + | for job_id in ${job_id_list[@]} #Exploring the jobs | |
| + | do | ||
| + | echo $job_id | ||
| + | jobresult=`tracejob -n 8 -q $job_id` | ||
| + | if echo ${jobresult} | awk '/SIGKILL/' > /dev/null #A SIGKILL signal was sent | ||
| + | then | ||
| + | job_id="$job_id.stro.nikhef.nl" | ||
| + | echo "The job $job_id can be deleted" >> "$FILE" | ||
| + | momctl -h $wn -c $job_id #Deleting the job... | ||
| + | echo "The job $job_id has been deleted from the queue and the WN $wn" >> "$FILE" | ||
| + | else | ||
| + | echo $jobresult >> "$FILE" #This could give another reason to delete the job | ||
| + | fi | ||
| + | echo "_______________________________________________________________________________" >> "$FILE" | ||
| + | done | ||
| + | echo "_______________________________________________________________________________" >> "$FILE" | ||
| + | echo "_______________________________________________________________________________" >> "$FILE" | ||
| + | fi | ||
done | done | ||
| − | + | if [ $i -eq 0 ] | |
| − | + | then | |
| − | + | echo "There are no jobs to delete" >> "$FILE" | |
| − | + | fi | |
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
| − | |||
Revision as of 16:13, 3 December 2009
The following script checks all the WNs, in order to look for jobs that are not running anymore. So far the jobs are not deleted, but an email is sent as an alert. After knowing the job id, it gives extra information about it via 'tracejob':
#!/bin/bash
wn_list=`pbsnodes -a | grep " wn-" | cut -f9 -d' '` #List of WNs
FILE="/tmp/monitoring_script/results/Monitoring_Results_`date +%k%M%d%m%y`" #Results file, sent by email
wn_result="" #Diagnosis of a WN
jobidlist="" #List of jobs to delete
i=0 #Counter of WN with jobs to delete
jobresult="" #Result of tracejob
echo "===============================================================================" >> "$FILE"
echo "=========================== AFFECTED WN DIAGNOSIS =============================" >> "$FILE"
echo "===============================================================================" >> "$FILE"
for wn in ${wn_list[@]} #Exploring the wn list
do
wn_result=`momctl -d2 -h $wn`
jobidlist=`echo $wn_result | grep "sidlist= " `
if [ ${#jobidlist} -ne 0 ] #Is there any job to be deleted?
then
echo $wn_result >> "$FILE"
echo "_______________________________________________________________________________" >> "$FILE"
job_id_list=`echo $jobidlist | sed -e 's/\[/\n/g' | grep "sidlist= " | cut -f1 -d'.'`
i=`expr $i + 1`
for job_id in ${job_id_list[@]} #Exploring the jobs
do
echo $job_id
jobresult=`tracejob -n 8 -q $job_id`
if echo ${jobresult} | awk '/SIGKILL/' > /dev/null #A SIGKILL signal was sent
then
job_id="$job_id.stro.nikhef.nl"
echo "The job $job_id can be deleted" >> "$FILE"
momctl -h $wn -c $job_id #Deleting the job...
echo "The job $job_id has been deleted from the queue and the WN $wn" >> "$FILE"
else
echo $jobresult >> "$FILE" #This could give another reason to delete the job
fi
echo "_______________________________________________________________________________" >> "$FILE"
done
echo "_______________________________________________________________________________" >> "$FILE"
echo "_______________________________________________________________________________" >> "$FILE"
fi
done
if [ $i -eq 0 ]
then
echo "There are no jobs to delete" >> "$FILE"
fi