Various local tools
This article presents some local tools for use with the Torque resource manager. These tools work with Torque version 2.3.8. Other versions may require modifications.
when_idle
Purpose: execute a command when a node has drained.
Description: this script is executed from the system epilogue script. At the end of each job, the script checks whether the following conditions are met:
- The node is offline and the offline comment matches a certain pattern (default: when_idle)
- The node is idle (contains no running batch jobs)
If the node is offline and idle, a script provided by the administrator is executed, unless its md5sum is already present in an archive. This ensures that the script is executed only once. After the administrator's script has run, its md5sum is added to the archive and a message is written to syslog. After a delay (15s), the node's status is cleared (i.e., it is no longer offline).
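The run-once guard described above can be sketched as follows. This is a minimal illustration, not the actual when_idle code: the function name run_once, its return codes, and the archive layout (one checksum per line) are assumptions made for the example.

```shell
#!/bin/sh
# Sketch of a run-once guard keyed on the script's md5sum.
# run_once and the one-checksum-per-line archive format are hypothetical.

# run_once SCRIPT ARCHIVE
# Runs SCRIPT unless its checksum is already listed in ARCHIVE.
# Returns 0 if the script ran, 1 if it was skipped, 2 if it is unreadable.
run_once() {
    script=$1
    archive=$2

    [ -r "$script" ] || return 2

    # md5sum prints "<hash>  <file>"; keep only the hash.
    sum=$(md5sum "$script" | cut -d' ' -f1)

    # Create the archive on first use so grep has a file to read.
    touch "$archive"

    # -q quiet, -x whole-line match, -F fixed string (no regex).
    if grep -qxF "$sum" "$archive"; then
        return 1
    fi

    # Same order as in the description: run first, then record and log.
    sh "$script"
    echo "$sum" >> "$archive"
    logger -t when_idle "executed $script (md5sum $sum)" 2>/dev/null || true
    return 0
}
```

Because the guard compares checksums rather than file names, changing any byte of the script (for instance a date in a comment) makes it eligible to run again.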
When the node boots into run level 3, a check is made whether the node was offline with the tag "when_idle". If so, the node state is cleared after a delay (600s) and a message is sent to syslog. This handles the case where the administrator's script reboots the node before its state could be cleared, which would otherwise leave the node offline.
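The offline-and-tagged test and the idle test could look roughly like the sketch below. It assumes that the output of pbsnodes for a single node contains lines of the form "state = ...", "note = ..." and, for nodes with running jobs, "jobs = ..."; check this against your Torque version. The function names are hypothetical and are not taken from when_idle.

```shell
#!/bin/sh
# Sketch: decide from "pbsnodes <node>" output (read on stdin) whether
# a node is offline with a given tag, and whether it is idle.
# Function names and the assumed output layout are illustrative only.

# node_is_tagged_offline TAG
# Succeeds when the state includes "offline" and the note contains TAG.
node_is_tagged_offline() {
    awk -v tag="$1" '
        $1 == "state" && $2 == "=" && $3 ~ /offline/ { offline = 1 }
        $1 == "note"  && $2 == "=" && index($0, tag) { tagged = 1 }
        END { exit !(offline && tagged) }
    '
}

# node_is_idle
# Succeeds when there is no non-empty "jobs = ..." attribute line.
node_is_idle() {
    ! grep -Eq '^[[:space:]]*jobs = .'
}

# Possible use at boot time (not executed here):
#   info=$(pbsnodes "$(hostname)")
#   if printf '%s\n' "$info" | node_is_tagged_offline when_idle; then
#       sleep 600
#       pbsnodes -c "$(hostname)"   # -c clears the offline state
#   fi
```

Reading the node record from stdin keeps the parsing separate from the pbsnodes call, so the checks can be tried against saved output before being used on a live node.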
Example: reboot the worker node when all batch jobs have finished, for example to load a new kernel.
# 2013-09-04
# Substitute <DATE> to force a different checksum
# Put the action here
reboot
prune_userprocs
Purpose: remove daemonized user processes after a batch job ends.
Description: this script is executed from the system epilogue script. It searches for user processes that are left behind after their parent batch job has ended. User processes started from a login or ssh session are excluded, as are processes running under a system account (uid below a certain level). This prevents stray processes from using resources meant for legitimate batch processes.
Example: send signal 9 (-k) to all processes that belong to users with uid > 999 (-u) and are not part of a MOM job (-a).
prune_userprocs -a -k 9 -u 999
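The selection step can be illustrated with a small filter. This is a sketch under stated assumptions, not the prune_userprocs implementation: it assumes that processes to keep (batch jobs, login and ssh sessions) can be recognised by their session ID, and the function name select_stray and its arguments are hypothetical.

```shell
#!/bin/sh
# Sketch: pick stray user processes from a "PID UID SID" listing.
# select_stray and the session-ID based exclusion are assumptions.

# select_stray MIN_UID KEEP_SIDS
# Reads "PID UID SID" lines on stdin and prints the PIDs whose uid is
# above MIN_UID and whose session ID is not in the space-separated
# list KEEP_SIDS.
select_stray() {
    awk -v min="$1" -v keep="$2" '
        BEGIN {
            n = split(keep, k, " ")
            for (i = 1; i <= n; i++) protected[k[i]] = 1
        }
        $2 + 0 > min + 0 && !($3 in protected) { print $1 }
    '
}

# Possible use (not executed here); job_sids would hold the session IDs
# of running jobs and interactive sessions:
#   ps -eo pid=,uid=,sid= | select_stray 999 "$job_sids"
```

Printing the candidate PIDs instead of killing them directly makes it possible to review the selection before any signal is sent.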
mom-taskset
Purpose:
Description:
Example: