Revision as of 15:38, 9 November 2012

The DANS Job Scripts

Currently there are 5 scripts for the entire DANS workflow, divided over its three phases. There are also 2 job monitoring scripts, which are used in both phase 2 and phase 3 of the workflow.

Phase 1: Data Upload
- gen-tar-list
- upload-tar
Phase 2: Data Compress
- compress-tar
Phase 3: Data Verification
- verify-tar
- compare-checksums
Job monitoring
- job-status
- job-info

Each job that is submitted by the 'compress-tar' and 'verify-tar' scripts is registered in the DANS job directory.

These scripts, as well as the layout of the DANS job directory, are described on this page.

gen-tar-list script

Before an archive can be uploaded to the grid, a listing of all entries needs to be made. This listing is split across multiple tarballs (.tar files) so that each tarball is at least 8 GB in size. The gen-tar-list script processes a full directory listing and splits it into separate '$ARCHIVE-nnnn.tar.lst' files, where 'nnnn' is a counter starting at 1. The gen-tar-list script takes a single argument

$ ./gen-tar-list ${ARCHIVE}-files.txt

but if no argument is specified then the name of the current archive is determined from the directory in which the 'gen-tar-list' script itself is located. Thus, if a copy of or symlink to the gen-tar-list script is in the directory

$HOME/dans/soundbites

then the archive name is 'soundbites'.
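The splitting step could be sketched roughly as follows. This is not the actual 'gen-tar-list' code: the function name 'split_listing', the input format (one 'size-in-bytes<TAB>path' entry per line) and the zero-padded 4-digit counter are assumptions made for illustration only. A list is closed as soon as its accumulated size reaches the threshold, so every list except possibly the last one describes at least that many bytes.

```shell
# Hypothetical sketch, not the real gen-tar-list.
# Assumed input format: "<size in bytes><TAB><path>" on every line.
# Usage: split_listing LISTING ARCHIVE MIN_BYTES
split_listing() {
    local listing="$1" archive="$2" min_bytes="$3"
    awk -F '\t' -v archive="$archive" -v min="$min_bytes" '
        BEGIN { n = 1; total = 0 }
        {
            # Name of the list currently being filled, e.g. soundbites-0001.tar.lst
            out = sprintf("%s-%04d.tar.lst", archive, n)
            print $2 > out
            total += $1
            if (total >= min) {
                # This list is large enough: close it and start the next one.
                close(out)
                n++
                total = 0
            }
        }
    ' "$listing"
}
```

With a threshold of 8 GiB the call would look like 'split_listing soundbites-files.txt soundbites 8589934592'. The list files are written to the current directory.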

upload-tar script

This script is the main script of phase 1 of the DANS workflow. It generates checksums and a tarball for each of the '$ARCHIVE-nnnn.tar.lst' files and then uploads the tarball to the grid. As this script can take a long time to complete, it is very handy to run it inside a 'screen' session.

The 'upload-tar' script (or a symlink to this script) should be located in the same directory as the '$ARCHIVE-nnnn.tar.lst' files. The name of the archive is determined by looking at the location of the script. Thus, if the 'upload-tar' script is in the directory

$HOME/dans/soundbites

then the archive name is deduced as 'soundbites'.
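One way such a lookup could work is sketched below. The function name 'archive_name_from' is hypothetical and the real scripts may do this differently. The sketch deliberately does not resolve symlinks, so a symlink placed in the archive directory yields the name of that directory and not the directory of the script it points to.

```shell
# Hypothetical sketch: derive the archive name from the directory that
# holds the script (or the symlink to it).
# Usage inside a script: ARCHIVE=$(archive_name_from "$0")
archive_name_from() {
    local script_dir
    # Turn the directory part of the path into an absolute path
    # without following a symlinked script file.
    script_dir=$(cd -- "$(dirname -- "$1")" > /dev/null && pwd) || return 1
    # The last path component is taken as the archive name.
    basename -- "$script_dir"
}
```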

The 'upload-tar' script takes the following arguments:

$ ./upload-tar - upload the .tar tarballs of the DANS archive 'scripts' 
Usage: ./upload-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--scratchdir dir]
Where:
  --keepgoing        tells ./upload-tar to keep going after an error
  --scratchdir=dir   specifies the directory where the temporary tarballs are stored 
                     (default='/data/vancis2109/grid')
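The checksum and tarball steps, together with the effect of '--keepgoing', might look like the sketch below. It is an illustration under several assumptions, not the real 'upload-tar': the function names are hypothetical, the list files are assumed to contain one path per line relative to the current directory, the checksum file is assumed to be called '$ARCHIVE-nnnn.tar.md5sum', and the upload to the grid is left out because this page does not say which command performs it.

```shell
# Hypothetical sketch of the local part of phase 1 (no grid upload).
# Usage: build_tarball LISTFILE SCRATCHDIR
build_tarball() {
    local list="$1" scratchdir="$2" base
    base=${list%.lst}                      # e.g. soundbites-0001.tar
    # Record an MD5 checksum for every file named in the list.
    (
        while IFS= read -r path; do
            md5sum -- "$path" || exit 1
        done < "$list"
    ) > "$base.md5sum" || return 1
    # Create the temporary tarball in the scratch directory.
    tar -cf "$scratchdir/$base" -T "$list" || return 1
}

# Usage: build_all SCRATCHDIR [yes|no]   (second argument mimics --keepgoing)
build_all() {
    local scratchdir="$1" keepgoing="${2:-no}" list failed=0
    for list in *.tar.lst; do
        [ -e "$list" ] || continue
        if ! build_tarball "$list" "$scratchdir"; then
            echo "error while processing $list" >&2
            failed=1
            # Without keepgoing the first error stops the run.
            [ "$keepgoing" = yes ] || return 1
        fi
    done
    return "$failed"
}
```

Even with keepgoing enabled the sketch returns a non-zero status when any list failed, so a calling script can still detect the problem.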

job-status script

After jobs have been submitted to the grid you can track the status of these jobs using the 'job-status' script. It will scan the DANS job directory for all active jobs and will query the status of each of them. If a job has finished the 'job-status' script will retrieve the output automatically and will also record the job logging info in its corresponding DANS job directory. If there are no active jobs then the 'job-status' script performs no actions.

$ ./job-status -h
./job-status - check the status of all DANS 'RACM' jobs in /home/janjust/dans/gridjobs
Usage: ./job-status [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--jobdir dir]
Where:
  --keepgoing    tells ./job-status to keep going after an error
  --jobdir dir   overrules the default value of the JOBDIR variable

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.
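The scan for active jobs can be pictured with the sketch below. It relies on the empty 'Status=...' marker files that are described further down this page, and it assumes that a job counts as active as long as its marker is neither 'Cleared' nor 'Aborted'. The function name 'list_active_jobs' is hypothetical, and the actual status query against the WMS is not shown.

```shell
# Hypothetical sketch: print "<jobid> <status>" for every job that still
# needs attention. Assumption: Cleared and Aborted jobs are no longer active.
# Usage: list_active_jobs JOBDIR
list_active_jobs() {
    local jobdir="$1" dir marker status
    for dir in "$jobdir"/*/; do
        [ -d "$dir" ] || continue
        for marker in "$dir"Status=*; do
            [ -e "$marker" ] || continue
            status=${marker##*Status=}     # text after "Status="
            case $status in
                Cleared|Aborted) ;;        # finished, nothing left to do
                *) echo "$(basename "$dir") $status" ;;
            esac
        done
    done
}
```

Called as 'list_active_jobs "$HOME/dans/gridjobs"', this prints nothing when there are no active jobs, which matches the behaviour described above.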

job-info script

The 'job-info' script can be used to display the status of all jobs in the DANS job directory. It will only display information and will not query active jobs. This script is intended for troubleshooting purposes mostly.

$ ./job-info -h
./job-info - print info on all DANS 'RACM' jobs in /home/janjust/dans/gridjobs
Usage: ./job-info [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--jobdir dir]
Where:
  --keepgoing    tells ./job-info to keep going after an error
  --jobdir dir   overrules the default value of the JOBDIR variable

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The DANS job directory

The default location of the DANS job directory is $HOME/dans/gridjobs. Underneath this directory you will find directories with the name of the DANS jobid. The jobids are currently 5-digit numbers, e.g. 00128. In each of the job directories the following information is recorded:

- the job status
- the job's JDL file
- the jobid as seen by the grid WMS

For jobs that have completed the following files are also stored:

- the log of the 'glite-wms-job-output' command which was used to retrieve the job output
- the entire job log that was retrieved using the 'glite-wms-job-logging-info' command
- a directory 'output' where the output files of that job are stored

An example: gridjob 00128

The DANS grid job #00128 was run on June 12th 2012. It was a job to verify the MD5 checksums of an archive which was uploaded previously. The job completed successfully. This can all be seen by looking at the contents of the job directory:

$HOME/dans/gridjobs/00128:
  Status=Cleared
  jdl
  job-get-output.log
  job-logging-info.log
  jobid
  output/adler32sums.txt
  output/md5sums.tar.gz
  output/stderror
  output/stdout

The first entry, 'Status=Cleared', is actually an empty file and it reflects the state of the job. The 'Status=Cleared' means that the job has completed and that its output was downloaded ("cleared") from the WMS. The following entries are possible for the 'Status=...' file:

Status=Submitted : means the job has been submitted but has not been scheduled yet
Status=Scheduled : means the job has been accepted by a grid site and is scheduled for execution
Status=Running, Status=ReallyRunning : means the job is now actively running. The difference between 'Running' and 'ReallyRunning' is mostly a historic artefact.
Status=Done : means the job has completed; the script that was run might have returned an error, but the WMS and batch system now consider the job 'done'.
Status=Cleared : means the job has completed and that its output was downloaded ("cleared") from the WMS.
Status=Aborted : means the job did not run successfully and was aborted by the grid WMS and/or batch system.

Normally a job directory contains only a single 'Status=...' entry.
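Keeping a single marker per job directory could be done along the lines of the sketch below. The function names 'set_job_status' and 'get_job_status' are hypothetical and are not taken from the DANS scripts. The idea is simply that the old marker is removed before the new empty file is created.

```shell
# Hypothetical sketch of handling the empty "Status=..." marker file.
# Usage: set_job_status JOBDIR/00128 Running
set_job_status() {
    local dir="$1" new="$2"
    # Remove any previous marker so that only one Status= entry remains.
    rm -f "$dir"/Status=*
    # The marker is an empty file; only its name carries information.
    : > "$dir/Status=$new"
}

# Usage: get_job_status JOBDIR/00128   (prints e.g. "Running")
get_job_status() {
    local dir="$1" marker
    for marker in "$dir"/Status=*; do
        [ -e "$marker" ] || continue
        echo "${marker##*/Status=}"
        return 0
    done
    return 1                               # no marker found
}
```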

The file 'jdl' is the Job Description Language file that was used during the submission of the job. The contents of this file are:

 Executable = "check-archive.sh";
 Arguments = "RACM 1156 1167";
 Stdoutput = "stdout";
 StdError = "stderror";
 InputSandbox = { "check-archive.sh", "adler32sum", "md5deep" };
 OutputSandbox = { "stdout", "stderror", "adler32sums.txt", "md5sums.tar.gz" };
 Requirements = other.GlueCEPolicyMaxCPUTime >= 300;

The JDL file shows us that this was a 'check-archive' job to verify the .tar.gz files from the archive RACM, numbered RACM-1156.tar.gz up to and including RACM-1167.tar.gz. The input sandbox lists the 'check-archive.sh' script itself and two binary programs that the script needs during execution. The first binary is used to calculate the ADLER32 checksum of the .tar.gz file itself; the second is the command used to calculate the MD5 checksums of an entire directory tree. The requested output files are 'stdout', 'stderror', 'adler32sums.txt' and 'md5sums.tar.gz'. These are the files we expect to find in the job output directory.
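How the three values in 'Arguments' map to file names can be illustrated with a small sketch. The function name 'expand_range' is hypothetical and this is not code from 'check-archive.sh'. It assumes a zero-padded 4-digit counter and that the numbers are passed without leading zeros, as in the JDL above.

```shell
# Hypothetical sketch: expand "ARCHIVE FIRST LAST" into tarball names.
# Assumption: FIRST and LAST are plain decimal numbers without leading zeros.
# Usage: expand_range RACM 1156 1167
expand_range() {
    local archive="$1" first="$2" last="$3" n
    n=$first
    while [ "$n" -le "$last" ]; do
        printf '%s-%04d.tar.gz\n' "$archive" "$n"
        n=$((n + 1))
    done
}
```

For the arguments of job 00128 this lists 12 file names, from RACM-1156.tar.gz through RACM-1167.tar.gz.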

The 'job-get-output.log' and 'job-logging-info.log' files are the log files from the commands 'glite-wms-job-output' and 'glite-wms-job-logging-info'. These commands are called by the 'job-status' script after the job has finished running. These log files are normally not needed but they contain a lot of debugging and troubleshooting information about the job itself. For example, information on when and where the job was run can be found in the 'job-logging-info.log' file.

The 'jobid' file contains the jobid as recorded by the 'glite-wms-job-submit' command. It is only needed while the job is running, as the jobid is discarded by the WMS after the job has been "cleared" using the 'glite-wms-job-output' command.

The directory 'output' contains the expected output files. The file 'adler32sums.txt' is the list of ADLER32 checksums of all .tar.gz files that have been verified. The 'md5sums.tar.gz' file is a tarball containing the MD5 checksum files of each .tar.gz file that has been verified. The reason that the RACM-*.tar.md5sum files are again stored in a single .tar.gz file is that it simplifies the JDL file used for submission.