DANS Job Scripts

Currently there are 5 scripts for the entire DANS workflow, divided over the phases of the workflow. There are also 2 job monitoring scripts, which are used in both phase 2 and phase 3 of the DANS workflow.

Each job that is submitted by the 'compress-tar' and 'check-tar' scripts is registered in the DANS job directory.

These scripts, as well as the layout of the DANS job directory, are described on this page.

The DANS Workflow Scripts

gen-tar-list

Before an archive can be uploaded to the grid, a listing of all its entries needs to be made. This listing is split across multiple tarballs (.tar files) so that each tarball is at least 8 GB in size. The 'gen-tar-list' script processes a full directory listing and splits it into separate '$ARCHIVE-nnnn.tar.lst' files, where 'nnnn' is a counter starting at 1. The 'gen-tar-list' script takes a single argument

$ ./gen-tar-list ${ARCHIVE}-files.txt

but if no argument is specified then the name of the current archive is determined from the directory in which the 'gen-tar-list' script itself is located. Thus, if a copy of or symlink to the gen-tar-list script is in the directory

$HOME/dans/soundbites

then the archive name is 'soundbites'.
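
For example, to prepare the listing for a hypothetical 'soundbites' archive (the data path, the location of the script collection and the use of 'find' are illustrative only):

$ cd $HOME/dans/soundbites
$ find /data/soundbites -type f > soundbites-files.txt   # full listing of all files in the archive
$ ln -s $HOME/dans/scripts/gen-tar-list .                # symlink so the archive name is deduced as 'soundbites'
$ ./gen-tar-list soundbites-files.txt
$ ls soundbites-*.tar.lst                                # one .tar.lst file per tarball to be created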

upload-tar

This script is the main script of phase 1 of the DANS workflow. It generates checksums and tarballs for each of the '$ARCHIVE-nnnn.tar.lst' files and then uploads the tarballs to the grid. As this script can take a long time to complete, it is very handy to run it inside a 'screen' session.
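
For example, a long-running upload can be started inside a named 'screen' session (the session name is just an illustration):

$ screen -S dans-upload      # start a named screen session
$ ./upload-tar               # run the upload inside the session; detach with Ctrl-a d, re-attach with 'screen -r dans-upload'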

The 'upload-tar' script (or a symlink to this script) should be located in the same directory as the '$ARCHIVE-nnnn.tar.lst' files. The name of the archive is determined by looking at the location of the script. Thus, if the 'upload-tar' script is in the directory '$HOME/dans/soundbites' then the default value of the archive name is 'soundbites'.

For each tarball that is successfully uploaded to the grid, the corresponding '$ARCHIVE-nnnn.tar.lst' and '$ARCHIVE-nnnn.tar.md5sum' files are moved to the 'done' directory. This allows for an easy restart: if the script should fail, the same command can be repeated and it will continue processing the remaining '$ARCHIVE-nnnn.tar.lst' files.
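
Conceptually, one iteration of the upload loop looks roughly like the sketch below. The scratch path, the storage element, the LFC path and the use of 'lcg-cr' and 'md5sum' are assumptions for illustration; they are not literal excerpts from the script.

 # hypothetical sketch of one upload-tar iteration, for chunk 0001 of the 'soundbites' archive
 tar -cf /data/scratch/soundbites-0001.tar -T soundbites-0001.tar.lst    # build the tarball from the listing
 md5sum $(cat soundbites-0001.tar.lst) > soundbites-0001.tar.md5sum      # per-file reference checksums, kept locally
 # copy-and-register the tarball on grid storage (the SE and LFN below are placeholders)
 lcg-cr --vo dans -d srm.example.org -l lfn:/grid/dans/soundbites/soundbites-0001.tar \
        file:///data/scratch/soundbites-0001.tar
 mv soundbites-0001.tar.lst soundbites-0001.tar.md5sum done/             # mark this chunk as done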

The 'upload-tar' script takes the following arguments:

$ ./upload-tar --help
./upload-tar - upload the .tar tarballs of the DANS archive 'RACM' 
Usage: ./upload-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--scratchdir dir]
Where:
  --keepgoing        tells ./upload-tar to keep going after an error
  --scratchdir=dir   specifies the directory where the temporary tarballs are stored 
                     (default='/data/vancis2109/grid')

compress-tar

In order to save space the 'raw' tarballs are compressed to so-called .tar.gz files. This compression step is done on the grid itself, hence we need to submit a set of jobs to the grid. The 'compress-tar' script submits these jobs automatically for us. It performs the following actions:

  • determine the list of tarballs (.tar files) that need to be compressed. The script uses the LFC to find the list, but the user can influence which tarballs need to be compressed using the '--start=' and '--end=' parameters
  • split the list into several grid jobs. Each job will process a fixed set of tarballs (currently set to CHUNK_SIZE=15).
  • create job directories and job description (JDL) files for each of the jobs. This is done using the template JDL file 'Compress.jdl'.
  • submit each job to the grid WMS and record the jobid in the DANS job directory.

The 'compress-tar' script requires a valid grid proxy certificate to operate; it will abort if it cannot find one.
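
The steps listed above can be pictured roughly as in the sketch below. The LFC directory, the '@TARBALLS@' placeholder in 'Compress.jdl' and the job-id bookkeeping are assumptions for illustration only.

 # hypothetical sketch of the compress-tar submission loop for the 'RACM' archive
 voms-proxy-info -exists || { echo "no valid grid proxy"; exit 1; }   # abort without a valid proxy
 lfc-ls /grid/dans/RACM | grep '\.tar$' | sort > all-tarballs.txt     # find the .tar files via the LFC
 split -l 15 all-tarballs.txt chunk.                                  # one chunk of CHUNK_SIZE=15 tarballs per grid job
 id=0
 for chunk in chunk.*; do
     id=$((id + 1))
     jobdir=$HOME/dans/gridjobs/$(printf '%05d' "$id")                # numbered DANS job directory
     mkdir -p "$jobdir"
     sed "s|@TARBALLS@|$(paste -sd' ' "$chunk")|" Compress.jdl > "$jobdir/jdl"   # fill in the JDL template
     glite-wms-job-submit -a -o "$jobdir/jobid" "$jobdir/jdl"         # submit and record the WMS jobid
     touch "$jobdir/Status=Submitted"
 done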

The 'compress-tar' script accepts the following parameters:

$ ./compress-tar --help
./compress-tar - compress the .tar tarballs for the DANS archive 'RACM' 
Usage: ./compress-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--archive=archive]
                      [--jobdir=dir] [--chunksize=N] [--start=N] [--end=N]
Where:
  --keepgoing        tells ./compress-tar to keep going after an error
  --archive=archive  specifies the name of the archive (default='RACM')
  --jobdir=dir       specifies the directory where the DANS jobs are stored 
                     (default='$HOME/dans/gridjobs')
  --chunksize=N      specifies the number of tarballs to compress per job (default=15)
  --start=N          specifies the starting index of the tarballs to check (default=1)
  --end=N            specifies the end index of the tarballs to check (default=ALL)

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The name of the archive is determined by looking at the location of the script. Thus, if the 'compress-tar' script is in the directory '$HOME/dans/soundbites' then the default value of the archive name is 'soundbites'. This can be overruled using the '--archive' parameter.

The default location of the DANS job directory is $HOME/dans/gridjobs, but this can be overruled using the '--jobdir' parameter.
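
For example, to compress only tarballs 101 up to and including 200 of the 'soundbites' archive with the default chunk size:

$ ./compress-tar --archive=soundbites --start=101 --end=200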

The grid jobs submitted by the 'compress-tar' script return an exit code, which is reported by the 'job-status' script. The following exit codes can be returned:

  • 0 - the script exited successfully
  • 1 - the script was called with the wrong number of parameters. Normally this should not happen.
  • 2 - unable to find the first archive. This happens most often when the LFC is not available.
  • 3 - unable to retrieve a tarball. This happens most often when the grid storage facilities are down.
  • 4 - there was an error compressing a tarball. When this happens there is something wrong with the worker node on which the grid job was running.
  • 5 - there was an error storing the compressed .tar.gz file. This can happen if either the grid storage facilities are down or if the .tar.gz file has been uploaded already.

check-tar

The 'check-tar' script is used to periodically verify the integrity of the DANS archives. This script submits a set of 'check-archive.sh' jobs to the grid. These jobs will go through all of the .tar.gz files for a particular archive and will recalculate the ADLER32 checksums of the .tar.gz files themselves, as well as the MD5 checksums of all contents found within each archive.

The resulting checksums are returned to the user, who can then compare them to the reference checksums locally. The script requires a valid grid proxy certificate to operate; it will abort if it cannot find one.
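
What such a job does per .tar.gz file can be sketched roughly as follows; the LFN prefix, the directory layout after unpacking and the exact tool invocations are assumptions, based on the input sandbox shown in the example JDL further down this page. The exit codes correspond to the list below.

 # hypothetical sketch of one iteration inside a check-archive.sh grid job
 lcg-cp lfn:/grid/dans/RACM/RACM-1156.tar.gz file:$PWD/RACM-1156.tar.gz || exit 3   # fetch from grid storage
 ./adler32sum RACM-1156.tar.gz >> adler32sums.txt                       || exit 4   # checksum of the .tar.gz itself
 tar -xzf RACM-1156.tar.gz                                              || exit 5   # unpack the tarball ...
 md5deep -r -l RACM-1156 > RACM-1156.tar.md5sum                         || exit 5   # ... and checksum its contents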

The 'check-tar' script accepts the following parameters:

$ ./check-tar --help
./check-tar - verify the tar.gz files for the DANS archive 'RACM' 
Usage: ./check-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--archive=archive]
                   [--jobdir=dir] [--chunksize=N] [--start=N] [--end=N]
Where:
  --keepgoing        tells ./check-tar to keep going after an error
  --archive=archive  specifies the name of the archive (default='RACM')
  --jobdir=dir       specifies the directory where the DANS jobs are stored 
                     (default='$HOME/dans/gridjobs')
  --chunksize=N      specifies the number of tar.gz files to verify per job (default=8)
  --start=N          specifies the starting index of the tarballs to check (default=1)
  --end=N            specifies the end index of the tarballs to check (default=ALL)

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The name of the archive is determined by looking at the location of the script. Thus, if the 'check-tar' script is in the directory '$HOME/dans/soundbites' then the archive name is deduced as 'soundbites'. This can be overruled using the '--archive' parameter.

The default location of the DANS job directory is $HOME/dans/gridjobs, but this can be overruled using the '--jobdir' parameter.

The grid jobs submitted by the 'check-tar' script return an exit code, which is reported by the 'job-status' script. The following exit codes can be returned:

  • 0 - the script exited successfully
  • 1 - the script was called with the wrong number of parameters. Normally this should not happen.
  • 2 - unable to find the first archive. This happens most often when the LFC is not available.
  • 3 - unable to retrieve a tar.gz file. This happens most often when the grid storage facilities are down.
  • 4 - there was an error calculating the ADLER32 checksum of the .tar.gz file itself. When this happens there is something wrong with the worker node on which the grid job was running.
  • 5 - there was an error calculating the MD5 checksums of the contents of a .tar.gz file. When this happens there is something wrong with the worker node on which the grid job was running.

compare-checksums

...

TODO

Job Monitoring Scripts

job-status

After jobs have been submitted to the grid you can track the status of these jobs using the 'job-status' script. It will scan the DANS job directory for all active jobs and will query the status of each of them. If a job has finished the 'job-status' script will retrieve the output automatically and will also record the job logging info in its corresponding DANS job directory. If there are no active jobs then the 'job-status' script performs no actions.

$ ./job-status -h
./job-status - check the status of all DANS 'soundbites' jobs in $HOME/dans/gridjobs
Usage: ./job-status [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--archive=archive]
                    [--jobdir=dir] [--id=N] [--start=N] [--end=N]
Where:
  --keepgoing        tells ./job-status to keep going after an error
  --archive=archive  specifies the name of the archive (default='soundbites')
  --jobdir=dir       specifies the directory where the DANS jobs are stored 
                     (default='$HOME/dans/gridjobs')
  --id=N             specifies the DANS job id
  --start=N          specifies the starting DANS job id (default=1)
  --end=N            specifies the end DANS job id (default=ALL)

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The name of the archive is determined by looking at the location of the script. Thus, if the 'job-status' script is in the directory '$HOME/dans/soundbites' then the archive name is deduced as 'soundbites'. This can be overruled using the '--archive' parameter.

The default location of the DANS job directory is $HOME/dans/gridjobs, but this can be overruled using the '--jobdir' parameter.

The job ID of the jobs to scan can be specified using the '--id', '--start' and '--end' parameters. The '--id' parameter specifies a single job ID and overrules the '--start' and '--end' parameters. The '--start' parameter specifies the job ID at which to start scanning, whereas the '--end' parameter specifies the job ID at which to stop.
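
For each active job, these queries boil down to something like the sketch below; the exact command options are assumptions, but the commands themselves are the 'glite-wms-job-*' tools described in the job directory section further down this page.

 # hypothetical sketch of what job-status does for one DANS job directory
 jobdir=$HOME/dans/gridjobs/00128
 glite-wms-job-status --noint -i "$jobdir/jobid"    # query the WMS for the current job state
 # once the job is Done, retrieve its output and its logging info
 glite-wms-job-output --noint -i "$jobdir/jobid" --dir "$jobdir/output" > "$jobdir/job-get-output.log"
 glite-wms-job-logging-info -v 2 --noint -i "$jobdir/jobid" > "$jobdir/job-logging-info.log"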

job-info

The 'job-info' script can be used to display the status of all jobs for a particular archive in the DANS job directory. It only displays information and does not query active jobs. This script is intended mostly for troubleshooting purposes.

$ ./job-info --help
./job-info - print info on all DANS 'soundbites' jobs in $HOME/dans/gridjobs
Usage: ./job-info [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--archive=archive]
                  [--jobdir=dir] [--id=N] [--start=N] [--end=N]
Where:
  --keepgoing        tells ./job-info to keep going after an error
  --archive=archive  specifies the name of the archive (default='soundbites')
  --jobdir=dir       specifies the directory where the DANS jobs are stored 
                     (default='$HOME/dans/gridjobs')
  --id=N             specifies the DANS job id
  --start=N          specifies the starting DANS job id (default=1)
  --end=N            specifies the end DANS job id (default=ALL)


The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

The name of the archive is determined by looking at the location of the script. Thus, if the 'job-info' script is in the directory '$HOME/dans/soundbites' then the archive name is deduced as 'soundbites'. This can be overruled using the '--archive' parameter.

The default location of the DANS job directory is $HOME/dans/gridjobs, but this can be overruled using the '--jobdir' parameter.

The job ID of the jobs to scan can be specified using the '--id', '--start' and '--end' parameters. The '--id' parameter specifies a single job ID and overrules the '--start' and '--end' parameters. The '--start' parameter specifies the job ID at which to start scanning, whereas the '--end' parameter specifies the job ID at which to stop.

The DANS job directory

The default location of the DANS job directory is $HOME/dans/gridjobs. Underneath this directory you will find directories named after the DANS jobid. These jobids are currently 5-digit numbers, e.g. 00128. In each of the job directories the following information is recorded:

  • the job status
  • the job's JDL file
  • the jobid as seen by the grid WMS

For jobs that have completed the following files are also stored:

  • the log of the 'glite-wms-job-output' command which was used to retrieve the job output
  • the entire job log that was retrieved using the 'glite-wms-job-logging-info' command
  • a directory 'output' where the output files of that job are stored.

An example: gridjob 00128

The DANS grid job #00128 was run on June 12th 2012. It was a job to verify the MD5 checksums of an archive which was uploaded previously. The job completed successfully. This can all be seen by looking at the contents of the job directory:

$HOME/dans/gridjobs/00128:
  Status=Cleared
  jdl
  job-get-output.log
  job-logging-info.log
  jobid
  output/adler32sums.txt
  output/md5sums.tar.gz
  output/stderror
  output/stdout

The first entry, 'Status=Cleared', is actually an empty file and it reflects the state of the job. The 'Status=Cleared' means that the job has completed and that its output was downloaded ("cleared") from the WMS. The following entries are possible for the 'Status=...' file:

  • Status=Submitted : means the job has been submitted but has not been scheduled yet
  • Status=Scheduled : means the job has been accepted by a grid site and is scheduled for execution
  • Status=Running, Status=ReallyRunning : means the job is now actively running. The difference between 'Running' and 'ReallyRunning' is mostly a historic artefact.
  • Status=Done : means the job has completed; the script that was run might have returned an error, but the WMS and batch system now consider the job 'done'.
  • Status=Cleared : means the job has completed and that its output was downloaded ("cleared") from the WMS.
  • Status=Aborted : means the job did not run successfully and was aborted by the grid WMS and/or batch system. The 'job-logging-info.log' file will contain details on why the job was aborted.
  • Status=Cancelled : means the job was cancelled by the user. The 'job-logging-info.log' file will contain details of who cancelled the job and when.

Normally a job directory contains only a single 'Status=...' entry.
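
Because the state is encoded in the file name, a quick overview of all jobs can be obtained with a simple shell glob (an illustrative one-liner, not part of the DANS scripts):

$ ls -d $HOME/dans/gridjobs/*/Status=*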

The file 'jdl' is the Job Description Language file that was used during the submission of the job. The contents of this file are:

 Executable = "check-archive.sh";
 Arguments = "RACM 1156 1167";
 Stdoutput = "stdout";
 StdError = "stderror";
 InputSandbox = { "check-archive.sh", "adler32sum", "md5deep" };
 OutputSandbox = { "stdout", "stderror", "adler32sums.txt", "md5sums.tar.gz" };
 Requirements = other.GlueCEPolicyMaxCPUTime >= 300;

The JDL file shows us that this was a 'check-archive' job to verify the .tar.gz files from the archive RACM, numbered RACM-1156.tar.gz up to and including RACM-1167.tar.gz. The input sandbox lists the 'check-archive.sh' script itself and two binary programs that the script needs during execution. The first binary is used to calculate the ADLER32 checksum of the .tar.gz file itself; the second is the command used to calculate the MD5 checksums of an entire directory tree. The requested output files are 'stdout', 'stderror', 'adler32sums.txt' and 'md5sums.tar.gz'. These are the files we expect in the job output directory.

The 'job-get-output.log' and 'job-logging-info.log' files are the log files from the commands 'glite-wms-job-output' and 'glite-wms-job-logging-info'. These commands are called by the 'job-status' script after the job has finished running. These log files are normally not needed but they contain a lot of debugging and troubleshooting information about the job itself. For example, information on when and where the job was run can be found in the 'job-logging-info.log' file.

The 'jobid' file contains the jobid as recorded by the 'glite-wms-job-submit' command. It is only needed while the job is running, as the jobid is discarded by the WMS after the job has been "cleared" using the 'glite-wms-job-output' command.

The directory 'output' contains the expected output files. The file 'adler32sums.txt' is the list of ADLER32 checksums of all .tar.gz files that have been verified. The 'md5sums.tar.gz' file is a tarball containing the MD5 checksum files of the contents of each .tar.gz file that has been verified. The reason that the RACM-*.tar.md5sum files are again packed into a single .tar.gz file is that this simplifies the JDL file used for submission.
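
Pending the 'compare-checksums' script, the returned MD5 checksums can be compared against the local reference files by hand. A hypothetical example, assuming the reference '.tar.md5sum' files were moved to the archive's 'done' directory by 'upload-tar' (the paths are illustrative):

$ cd $HOME/dans/gridjobs/00128/output
$ tar -xzf md5sums.tar.gz                   # unpack the per-tarball MD5 listings
$ for f in RACM-*.tar.md5sum; do diff "$f" $HOME/dans/RACM/done/"$f"; done   # no output means the checksums match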