DANS Data Compress
After the tarballs have been uploaded to the grid, the next step is to compress them into .tar.gz files to save space. The compression is done on the grid itself, so a set of jobs has to be submitted to the grid. The 'compress-tar' script does this automatically. It performs the following actions:
- determine the list of tarballs (.tar files) that need to be compressed. The script uses the LFC (LCG File Catalog) to find the list, but the user can restrict which tarballs are considered using the '--start=' and '--end=' parameters.
- split the list into several grid jobs. Each job processes a fixed number of tarballs (currently set to CHUNK_SIZE=15).
- create job directories and job description (JDL) files for each of the jobs. This is done using the template JDL file 'Compress.jdl'.
- submit each job to the grid WMS and record the jobid in the DANS job directory.
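The splitting step above can be sketched as follows. This is a minimal illustration, not the actual 'compress-tar' code: the function name split_into_chunks, the job-NNN.list file names and the use of awk are assumptions made for the example. Only the value CHUNK_SIZE=15 comes from the description above.

```shell
#!/bin/bash
# Hypothetical sketch: split a list of tarball names into per-job chunks.
CHUNK_SIZE=15   # number of tarballs per grid job, as stated in the text

# Reads tarball names from stdin (one per line) and writes them to
# <outdir>/job-001.list, <outdir>/job-002.list, ... with at most
# CHUNK_SIZE names per file. The file naming is an assumption.
split_into_chunks() {
    local outdir=$1
    mkdir -p "$outdir"
    awk -v size="$CHUNK_SIZE" -v dir="$outdir" '
        {
            file = sprintf("%s/job-%03d.list", dir, int((NR - 1) / size) + 1)
            # Close the previous chunk file once we move on to the next one.
            if (file != prev) {
                if (prev != "") close(prev)
                prev = file
            }
            print > file
        }
    '
}
```

With 32 tarball names on stdin this would produce three list files holding 15, 15 and 2 names, one file per grid job.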
The default location of the DANS job directory is $HOME/dans/gridjobs. The 'compress-tar' script requires a valid grid proxy certificate to operate; it will abort if it cannot find one.
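A proxy check of this kind could look like the sketch below. It is an assumption that the check is done with 'voms-proxy-info -exists'; the text does not say which tool 'compress-tar' actually uses, so the command is kept in a variable (PROXY_CHECK, a hypothetical name) that can be replaced, for example by 'grid-proxy-info -exists'.

```shell
#!/bin/bash
# Hypothetical sketch of an "abort without a valid proxy" check.
# Assumption: the check command exits non-zero when no valid proxy exists.
PROXY_CHECK=${PROXY_CHECK:-"voms-proxy-info -exists"}

require_proxy() {
    # Word splitting of $PROXY_CHECK is intentional: it holds command + options.
    if ! $PROXY_CHECK >/dev/null 2>&1; then
        echo "No valid grid proxy certificate found; aborting." >&2
        return 1
    fi
}
```

A script would call 'require_proxy || exit 1' before submitting any jobs.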
The 'compress-tar' script accepts the following parameters:
    Usage: ./compress-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--start=N] [--end=N]
    Where:
      --keepgoing  tells ./compress-tar to keep going after an error
      --start=N    specifies the starting index of the tarballs to check (default=1)
      --end=N      specifies the end index of the tarballs to check (default=ALL)
The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.
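To make the option syntax concrete, the sketch below parses the same flags in plain bash. It is not taken from 'compress-tar' itself: the function name parse_args and the variable names QUIET, DEBUG, KEEPGOING, START and END are assumptions chosen for the example. The defaults (start=1, end=ALL) follow the usage text.

```shell
#!/bin/bash
# Hypothetical sketch: parse the documented 'compress-tar' options.
parse_args() {
    # Defaults as given in the usage text.
    QUIET=0; DEBUG=0; KEEPGOING=0; START=1; END=ALL
    local arg
    for arg in "$@"; do
        case $arg in
            -q|--quiet)     QUIET=1 ;;
            -d|--debug)     DEBUG=1 ;;
            -k|--keepgoing) KEEPGOING=1 ;;
            --start=*)      START=${arg#--start=} ;;   # strip the '--start=' prefix
            --end=*)        END=${arg#--end=} ;;       # strip the '--end=' prefix
            *)
                echo "unknown option: $arg" >&2
                return 1
                ;;
        esac
    done
}
```

For example, assuming the indices are 1-based and inclusive, './compress-tar --quiet --start=16 --end=30' would select 15 tarballs, which fits in a single job at CHUNK_SIZE=15.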
After the jobs have been submitted to the grid you can track the status of these jobs using the 'job-status' script. See the section "DANS Job Scripts" for more details.
Important notes
- The tarballs need to be compressed only once. Compression is done on the grid because that is much faster than doing it on the DANS dataserver.
- After the .tar.gz files have been created and verified (see "Phase 3" for more details), the original tarballs need to be deleted from the grid storage. This is not done automatically. An effective command line to delete all files whose names end in '.tar' from a single directory on the LFC is:
    lfc-ls /grid/dans/$ARCHIVE | grep '\.tar$' > tarball-list
    lcg-del -a `cat tarball-list`
CHECK: this command line still needs to be verified before use.
The 'lcg-del' command can take quite some time to complete.