Difference between revisions of "DANS Job Scripts"

From BiGGrid Wiki

Revision as of 17:32, 8 November 2012

The DANS Job Scripts

Each job that is submitted by the 'compress-tar' and 'check-tar' scripts is registered in the DANS job directory. A set of scripts is available to query the status of these jobs or to display information about previous jobs.

job-status script

After jobs have been submitted to the grid, you can track their status using the 'job-status' script. It scans the DANS job directory for all active jobs and queries the status of each of them. If a job has finished, the 'job-status' script retrieves the output automatically and records the job logging info in the corresponding DANS job directory. If there are no active jobs, the 'job-status' script performs no actions.

$ ./job-status -h
./job-status - check the status of all DANS 'RACM' jobs in /home/janjust/dans/gridjobs
Usage: ./job-status [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--jobdir dir]
Where:
  --keepgoing    tells ./job-status to keep going after an error
  --jobdir dir   overrules the default value of the JOBDIR variable

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

job-info script

The 'job-info' script displays the status of all jobs in the DANS job directory. It only displays information and does not query active jobs. This script is intended mostly for troubleshooting.

$ ./job-info -h
./job-info - print info on all DANS 'RACM' jobs in /home/janjust/dans/gridjobs
Usage: ./job-info [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--jobdir dir]
Where:
  --keepgoing    tells ./job-info to keep going after an error
  --jobdir dir   overrules the default value of the JOBDIR variable

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The DANS job directory

The default location of the DANS job directory is $HOME/dans/gridjobs. Underneath this directory you will find one directory per job, named after the DANS jobid. These jobids are currently 5-digit numbers, e.g. 00128. In each of the job directories the following information is recorded:

  • the job status
  • the job's JDL file
  • the jobid as seen by the grid WMS

For jobs that have completed, the following files are also stored:

  • the log of the 'glite-wms-job-output' command which was used to retrieve the job output
  • the entire job log that was retrieved using the 'glite-wms-job-logging-info' command
  • a directory 'output' where the output files of that job are stored.
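
The layout described above can be inspected with a few lines of Python. This is only a minimal sketch, not part of the DANS scripts: the function names are hypothetical, and the file names 'job-get-output.log' and 'job-logging-info.log' are taken from the gridjob 00128 example on this page.

```python
import re
from pathlib import Path

# DANS jobids are currently 5-digit numbers, e.g. 00128.
JOBID_PATTERN = re.compile(r"[0-9]{5}")

# Entries that are expected once a job has completed
# (names taken from the gridjob 00128 example).
COMPLETED_ENTRIES = ("job-get-output.log", "job-logging-info.log", "output")


def list_job_ids(jobdir):
    """Return the sorted DANS jobids found as subdirectories of jobdir."""
    return sorted(
        entry.name
        for entry in Path(jobdir).iterdir()
        if entry.is_dir() and JOBID_PATTERN.fullmatch(entry.name)
    )


def missing_completed_entries(job_dir):
    """Return the completed-job entries that are absent from one job directory."""
    return [name for name in COMPLETED_ENTRIES
            if not (Path(job_dir) / name).exists()]
```

For a job that is still running, 'missing_completed_entries' would be expected to return all three names; for a cleared job it should return an empty list.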

An example: gridjob 00128

The DANS grid job #00128 was run on June 12th 2012. It was a job to verify the MD5 checksums of an archive which was uploaded previously. The job completed successfully. This can all be seen by looking at the contents of the job directory:

$HOME/dans/gridjobs/00128:
  Status=Cleared
  jdl
  job-get-output.log
  job-logging-info.log
  jobid
  output/adler32sums.txt
  output/md5sums.tar.gz
  output/stderror
  output/stdout

The first entry, 'Status=Cleared', is actually an empty file whose name reflects the state of the job. 'Status=Cleared' means that the job has completed and that its output was downloaded ("cleared") from the WMS. The following entries are possible for the 'Status=...' file:

  • Status=Submitted : means the job has been submitted but has not been scheduled yet
  • Status=Scheduled : means the job has been accepted by a grid site and is scheduled for execution
  • Status=Running, Status=ReallyRunning : means the job is now actively running. The difference between 'Running' and 'ReallyRunning' is mostly a historical artefact.
  • Status=Done : means the job has completed; the script that was run might have returned an error, but the WMS and batch system now consider the job 'done'.
  • Status=Cleared : means the job has completed and that its output was downloaded ("cleared") from the WMS.
  • Status=Aborted : means the job did not run successfully and was aborted by the grid WMS and/or batch system.

Normally a job directory contains only a single 'Status=...' entry.
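
Because the state is encoded in the name of an empty file, it can be read without opening anything. The following Python sketch shows one way to do that; the function names are hypothetical, and treating 'Done' as still active (its output has yet to be retrieved) is an assumption, not something the DANS scripts are known to do.

```python
from pathlib import Path

# Assumption: every state except 'Cleared' and 'Aborted' still needs attention.
ACTIVE_STATES = {"Submitted", "Scheduled", "Running", "ReallyRunning", "Done"}


def read_states(job_dir):
    """Return the sorted states found as 'Status=...' entries in job_dir."""
    return sorted(
        entry.name.split("=", 1)[1]
        for entry in Path(job_dir).glob("Status=*")
    )


def is_active(job_dir):
    """Return True if any recorded state marks the job as still active."""
    return any(state in ACTIVE_STATES for state in read_states(job_dir))
```

'read_states' returns a list rather than a single value so that the unusual case of more than one 'Status=...' entry remains visible to the caller.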

The file 'jdl' is the Job Description Language file that was used during the submission of the job. The contents of this file are:

 Executable = "check-archive.sh";
 Arguments = "RACM 1156 1167";
 Stdoutput = "stdout";
 StdError = "stderror";
 InputSandbox = { "check-archive.sh", "adler32sum", "md5deep" };
 OutputSandbox = { "stdout", "stderror", "adler32sums.txt", "md5sums.tar.gz" };
 Requirements = other.GlueCEPolicyMaxCPUTime >= 300;

The JDL file shows us that this was a 'check-archive' job to verify the .tar.gz files from the archive RACM, numbered RACM-1156.tar.gz up to and including RACM-1167.tar.gz. The input sandbox lists the 'check-archive.sh' script itself and two binary programs that the script needs during execution. The first binary is used to calculate the ADLER32 checksum of the .tar.gz file itself; the second is the command used to calculate the MD5 checksums of an entire directory tree. The requested output files are 'stdout', 'stderror', 'adler32sums.txt' and 'md5sums.tar.gz'. These are the files we expect to find in the job output directory.
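
To illustrate how the 'Arguments' value maps onto file names, here is a small Python sketch. The function name is hypothetical, and it assumes that the file numbers are not zero-padded, as in the RACM-1156.tar.gz example; the real 'check-archive.sh' script may build the names differently.

```python
def expand_archive_files(arguments):
    """Expand 'NAME FIRST LAST' into the list of .tar.gz file names it covers."""
    name, first, last = arguments.split()
    # The range is inclusive on both ends ("up to and including").
    return [f"{name}-{number}.tar.gz"
            for number in range(int(first), int(last) + 1)]


files = expand_archive_files("RACM 1156 1167")
print(len(files), files[0], files[-1])
```

For the arguments of job 00128 this covers twelve files, from RACM-1156.tar.gz to RACM-1167.tar.gz.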

The 'job-get-output.log' and 'job-logging-info.log' files are the log files from the commands 'glite-wms-job-output' and 'glite-wms-job-logging-info'. These commands are called by the 'job-status' script after the job has finished running. These log files are normally not needed but they contain a lot of debugging and troubleshooting information about the job itself. For example, information on when and where the job was run can be found in the 'job-logging-info.log' file.

The 'jobid' file contains the jobid as recorded by the 'glite-wms-job-submit' command. It is only needed while the job is running, as the jobid is discarded by the WMS after the job has been "cleared" using the 'glite-wms-job-output' command.

The directory 'output' contains the expected output files. The file 'adler32sums.txt' is the list of ADLER32 checksums of all .tar.gz files that have been verified. The 'md5sums.tar.gz' file is a tarball containing the MD5 checksum files of each .tar.gz file that has been verified. The reason that the RACM-*.tar.md5sum files are again stored in a single .tar.gz file is that it simplifies the JDL file used for submission.
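
If you want to recompute such checksums locally, both algorithms are available in the Python standard library. The sketch below is independent of the 'adler32sum' and 'md5deep' binaries used by the job; the function names are hypothetical, and printing the ADLER32 value as eight lowercase hex digits is an assumption about the format, which this page does not show.

```python
import hashlib
import zlib


def adler32_of_file(path, chunk_size=1 << 20):
    """Return the ADLER32 checksum of a file as eight lowercase hex digits."""
    checksum = 1  # ADLER32 starts from the value 1
    with open(path, "rb") as handle:
        # Read in chunks so that large .tar.gz files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xFFFFFFFF:08x}"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 checksum of a file as a hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Compare the results against the entries in 'adler32sums.txt' and the files inside 'md5sums.tar.gz' only after checking how those files format their values.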