Dans Data Compress

From BiGGrid Wiki
Revision as of 17:21, 8 November 2012 by Janjust@nikhef.nl

After the tarballs have been uploaded to the grid, the next step is to compress them into .tar.gz files to save space. The compression is done on the grid itself, so a set of jobs has to be submitted to the grid. The 'compress-tar' script does this automatically. It performs the following actions:

- determine the list of tarballs (.tar files) that need to be compressed. The script
 uses the LFC to find the list, but the user can restrict which tarballs are
 compressed using the '--start=' and '--end=' parameters.

- split the list into several grid jobs. Each job processes a fixed number of
 tarballs (currently CHUNK_SIZE=15).

- create a job directory and a job description (JDL) file for each of the jobs.
 This is done using the template JDL file 'Compress.jdl'.

- submit each job to the grid WMS and record the job ID in the DANS job directory.
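
The selection and splitting steps above can be sketched in shell as follows. This is only an illustration of the idea, not the code of 'compress-tar': the function name 'split_into_jobs', the 'job-NNN' directory names and the 'tarballs.txt' file name are all assumptions. Only CHUNK_SIZE=15 and the meaning of the start and end indices come from the description above.

```shell
#!/bin/sh
# Sketch only: function name, "job-NNN" layout and "tarballs.txt" are
# hypothetical and not taken from the real 'compress-tar' script.

CHUNK_SIZE=15   # tarballs per grid job, as stated in the text

# split_into_jobs LIST JOBDIR [START] [END]
#   LIST    file with one tarball name per line
#   JOBDIR  directory in which the per-job directories are created
#   START   1-based index of the first tarball to use (default: 1)
#   END     1-based index of the last tarball to use (default: last line)
split_into_jobs() {
    list=$1
    jobdir=$2
    start=${3:-1}
    end=$4
    [ -n "$end" ] || end='$'      # '$' is sed's address for the last line
    mkdir -p "$jobdir" || return 1

    i=0   # tarballs handled so far
    n=0   # number of the current job
    # Keep only lines START..END, then start a new job every CHUNK_SIZE lines.
    sed -n "${start},${end}p" "$list" |
    while IFS= read -r name; do
        if [ $((i % CHUNK_SIZE)) -eq 0 ]; then
            n=$((n + 1))
            job=$(printf '%s/job-%03d' "$jobdir" "$n")
            mkdir -p "$job" || exit 1
            : > "$job/tarballs.txt"
        fi
        printf '%s\n' "$name" >> "$job/tarballs.txt"
        i=$((i + 1))
    done
}
```

With a list of 31 tarballs, 'split_into_jobs tarball-list jobs' would create three job directories holding 15, 15 and 1 names. A real script would then fill in a JDL file from the template for each directory and submit it.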

The default location of the DANS job directory is $HOME/dans/gridjobs. The 'compress-tar' script requires a valid grid proxy certificate to operate; it aborts if it cannot find one.
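
To avoid such an abort, the proxy can be tested before the script is started. The sketch below assumes that the VOMS client tools are installed and uses 'voms-proxy-info -exists -valid 1:00' to test for a proxy with at least one hour of lifetime left. The function name 'require_proxy' and the PROXY_CHECK variable are made up for this example, and the text does not say how 'compress-tar' itself performs the check.

```shell
#!/bin/sh
# Sketch only: it is an assumption that the VOMS client tools are installed;
# 'require_proxy' and PROXY_CHECK are hypothetical names.

# Command used to test for a proxy with at least one hour of lifetime left.
PROXY_CHECK=${PROXY_CHECK:-"voms-proxy-info -exists -valid 1:00"}

# require_proxy: return 0 if the check command succeeds, otherwise print a
# message on stderr and return 1.
require_proxy() {
    # PROXY_CHECK is deliberately unquoted so it splits into command + options.
    if $PROXY_CHECK >/dev/null 2>&1; then
        return 0
    fi
    echo "No valid grid proxy found; create one before running compress-tar." >&2
    return 1
}
```

A possible use is 'require_proxy && ./compress-tar --start=1 --end=30', which only starts the submission when the check succeeds.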

The 'compress-tar' script accepts the following parameters:

Usage: ./compress-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--start=N] [--end=N]
Where:
  --keepgoing    tells ./compress-tar to keep going after an error
  --start=N      specifies the starting index of the tarballs to check (default=1)
  --end=N        specifies the end index of the tarballs to check (default=ALL)

The flag '-q' or '--quiet' suppresses a lot of output. The flag '-d' or '--debug' produces a lot of extra output. You can combine '-q' and '-d'.

After the jobs have been submitted to the grid you can track the status of these jobs using the 'job-status' script. See the section "DANS Job Scripts" for more details.

Important notes
===============

- The tarballs need to be compressed only once. This is done on the grid because it is much
 faster than compressing them on the DANS dataserver.

- After the .tar.gz files have been created and verified (see "Phase 3" for more
 details), the original tarballs need to be deleted from the grid storage. This is not
 done automatically. A possible command line to delete all files ending in '.tar' from a
 single directory on the LFC is
    lfc-ls /grid/dans/$ARCHIVE | grep '\.tar$' > tarball-list
    lcg-del -a `cat tarball-list`
 CHECK this command line before use: it has not been verified.
 The 'lcg-del' command can take quite some time to complete.
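
One point to check is that 'lfc-ls' prints bare file names, while 'lcg-del' expects a full name such as an 'lfn:' path. The sketch below builds such names and leaves the deletion itself as a commented-out step, so the list can be reviewed first. The function name 'lfn_tarballs' is made up for this example. Depending on the environment, 'lcg-del' may also need the VO to be given explicitly.

```shell
#!/bin/sh
# Sketch only: 'lfn_tarballs' is a hypothetical helper, not part of the
# DANS scripts.

# lfn_tarballs DIR
#   Reads bare file names on stdin, keeps those ending in ".tar" and prints
#   each one as a full "lfn:" name under the LFC directory DIR.
lfn_tarballs() {
    awk -v dir="$1" '/\.tar$/ { print "lfn:" dir "/" $0 }'
}

# Step 1 (dry run): build the list and inspect it before deleting anything.
#   lfc-ls /grid/dans/$ARCHIVE | lfn_tarballs /grid/dans/$ARCHIVE > tarball-list
#   cat tarball-list
#
# Step 2: delete the entries one at a time, so that a single failure is
# reported and does not stop the remaining deletions.
#   while IFS= read -r lfn; do
#       lcg-del -a "$lfn" || echo "failed: $lfn" >&2
#   done < tarball-list
```

The pattern '\.tar$' matches only names that end in a literal '.tar', so '.tar.gz' files and names such as 'ctar' are left out of the list.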