Batch System Interoperability

From PDP/Grid Wiki
Jump to navigationJump to search

When used on a worker node (in a late binding pilot job scenario), gLExec attempts really hard to be neutral to its OS environment. In particular, gLExec will not break the process tree, and will accumulate CPU and system usage times from the child processes it spawns. We recognize that this is particularly important in the gLExec-on-WN scenario, where the entire process (pilot job and target user processes) should be managed as a whole by the node-local batch system daemon.

You are encouraged to verify OS and batch system interoperability. In order to do that, you have two options:

  • Comprehensive testing: Ulrich Schwickerath has defined a series of (partially CERN-specific) tests to verify that glExec does not break the batch system setup of a site. He has extensively documented his efforts on the Wiki at Note that the Local Tools section is CERN-specific. If you use other tools to clean up the user's work area (such as the $tmpdir facility of PBSPro and Troque), or use the PruneUserproc utility to remove stray processes, you are not affected by this.
  • Basic OS and batch-system testing can be done even without installing glExec, but just compiling a simple C program with one hard-coded uid for testing. This is the fastest solution for testing, but only verifies that your batch system reacts correctly, not that your other grid-aware system script will work as you expect.

The following batch systems are known to be compatible with gLExec-on-the-Worker-Node:

  • Torque, all versions
  • OpenPBS, all versions
  • Platform LSF, all versions
  • BQS, all versions
  • Condor, all versions

If you notice any anomalies after testing, i.e. the job will not die, please notify the developers at grid dash mw dash security at nikhef dot nl.