Dans Data Compress

Revision as of 17:23, 8 November 2012

After the tarballs have been uploaded to the grid, the next step is to compress them into .tar.gz files to save space. The compression is done on the grid itself, so a set of jobs has to be submitted to the grid. The 'compress-tar' script does this automatically. It performs the following actions:

  • Determine the list of tarballs (.tar files) that need to be compressed. The script uses the LFC to find the list, but the user can restrict which tarballs are compressed using the '--start=' and '--end=' parameters.
  • Split the list into several grid jobs. Each job processes a fixed number of tarballs (currently set to CHUNK_SIZE=15).
  • Create a job directory and a job description (JDL) file for each job. This is done using the template JDL file 'Compress.jdl'.
  • Submit each job to the grid WMS and record the job ID in the DANS job directory.
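
The splitting step above can be sketched as follows. This is only an illustration, not the code of 'compress-tar' itself: the function name split_into_chunks and the 'job-NNNN.list' file names are hypothetical, and only CHUNK_SIZE=15 comes from the description above.

```shell
#!/bin/sh
# Hypothetical sketch: split a list of tarball names (one per line on stdin)
# into numbered list files holding at most CHUNK_SIZE names each.
CHUNK_SIZE=15

# split_into_chunks DIR
#   Reads names from stdin and writes DIR/job-0001.list, DIR/job-0002.list, ...
split_into_chunks() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    awk -v size="$CHUNK_SIZE" -v dir="$outdir" '
        NF == 0 { next }                    # skip empty lines
        {
            if (n % size == 0) {            # every size-th name starts a new chunk
                if (file != "") close(file)
                file = sprintf("%s/job-%04d.list", dir, int(n / size) + 1)
            }
            print > file                    # add the name to the current chunk
            n++
        }'
}
```

Each resulting list file would then correspond to one grid job; for 40 tarballs this gives two lists of 15 names and one list of 10.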

The default location of the DANS job directory is $HOME/dans/gridjobs. The 'compress-tar' script requires a valid grid proxy certificate to operate; it aborts if it cannot find one.
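
To avoid an aborted run, the proxy can be checked beforehand. The sketch below assumes that the VOMS client tool 'voms-proxy-info' is installed and that its '-exists' option reports via its exit status whether a valid proxy is present. How 'compress-tar' itself performs this check is not documented here, and the function name require_proxy is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-flight check, assuming 'voms-proxy-info' is on the PATH.
# require_proxy returns 0 if a valid proxy exists and 1 otherwise.
require_proxy() {
    if ! voms-proxy-info -exists >/dev/null 2>&1; then
        echo "No valid grid proxy found; create one before running compress-tar." >&2
        return 1
    fi
}
```

A wrapper could then run 'require_proxy && ./compress-tar' so that the submission is only attempted when a proxy is available.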

The 'compress-tar' script accepts the following parameters:

Usage: ./compress-tar [-q|--quiet] [-d|--debug] [-k|--keepgoing] [--start=N] [--end=N]
Where:
  --keepgoing    tells ./compress-tar to keep going after an error
  --start=N      specifies the starting index of the tarballs to check (default=1)
  --end=N        specifies the end index of the tarballs to check (default=ALL)

The flag '-q' or '--quiet' suppresses most output. The flag '-d' or '--debug' produces extra diagnostic output. You can combine '-q' and '-d'.
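
The effect of '--start=N' and '--end=N' can be pictured as selecting a range of lines from the tarball list. The sketch below assumes that the indices are 1-based and that the end index is inclusive; the usage text only states the defaults (1 and ALL), so the inclusive end is an assumption. The function name select_range is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the --start/--end selection.
# select_range START END
#   Prints lines START..END (1-based, inclusive) of stdin.
#   END may be the word ALL, meaning "up to the last line".
select_range() {
    start=${1:-1}
    end=${2:-ALL}
    case $start in
        ''|0|*[!0-9]*) echo "start must be a positive integer" >&2; return 2 ;;
    esac
    case $end in
        ALL) end='$' ;;                     # '$' is the sed address of the last line
        ''|0|*[!0-9]*) echo "end must be a positive integer or ALL" >&2; return 2 ;;
    esac
    sed -n "${start},${end}p"
}
```

Under these assumptions, and with CHUNK_SIZE=15, a call such as './compress-tar --start=1 --end=30' would select 30 tarballs and therefore produce two jobs.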

After the jobs have been submitted to the grid you can track the status of these jobs using the 'job-status' script. See the section "DANS Job Scripts" for more details.

Important notes
===============
  • The tarballs need to be compressed only once. Compression is done on the grid because that is much faster than doing it on the DANS dataserver.
  • After the .tar.gz files have been created and verified (see "Phase 3" for more details), the original tarballs need to be deleted from the grid storage. This is not done automatically. A command line to delete all files whose names end in '.tar' from a single directory on the LFC is
    lfc-ls /grid/dans/$ARCHIVE | grep '\.tar$' > tarball-list
    lcg-del -a `cat tarball-list`
 CHECK!

The 'lcg-del' command can take quite some time to complete.
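
Since the command line above is still marked as unchecked, the following sketch shows one way to prepare the deletion list so that it can be reviewed first. It rests on two assumptions that are not confirmed by this page: that 'lfc-ls' prints bare file names, and that 'lcg-del' expects logical file names carrying an 'lfn:' prefix and the full LFC path. The function name tar_lfns and the file name 'deletion-list' are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: turn a directory listing into full logical file names.
# tar_lfns DIR
#   Reads entry names (one per line) from stdin and prints "lfn:DIR/NAME"
#   for every name that ends in ".tar"; the dot is matched literally, so
#   names such as "mytar" or "x.tar.gz" are skipped.
tar_lfns() {
    dir=${1%/}                              # drop a trailing slash, if any
    awk -v prefix="lfn:${dir}/" '/\.tar$/ { print prefix $0 }'
}

# Possible usage (inspect deletion-list before deleting anything):
#   lfc-ls "/grid/dans/$ARCHIVE" | tar_lfns "/grid/dans/$ARCHIVE" > deletion-list
#   xargs -n 20 lcg-del -a < deletion-list
```

Passing the names through xargs in batches keeps the argument list short when a directory holds many tarballs.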