Difference between revisions of "Dans Data Compress"

From BiGGrid Wiki
Latest revision as of 14:26, 11 December 2012

After the tarballs have been uploaded to the grid, the next step is to compress them into .tar.gz files to save space. This compression step is done on the grid itself, so a set of jobs needs to be submitted to the grid. The 'compress-tar' script does this automatically.

Before starting either the 'compress-tar' script or the 'check-tar' script for the first time, a special gridjobs directory needs to be created. The default location for this directory is

$HOME/dans/gridjobs

so the following command is sufficient:

$ mkdir -p $HOME/dans/gridjobs

Next, start the 'compress-tar' script:

$ ./compress-tar
Found 84 tar balls in lfc.grid.sara.nl:/grid/dans/soundbites
Splitting into 6 jobs, start=1, end=84
Delegating proxy
Submitting DANS job 149: https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ
Submitting DANS job 150: https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg
Submitting DANS job 151: https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg
Submitting DANS job 152: https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ
Submitting DANS job 153: https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug
Submitting DANS job 154: https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g
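
The "Splitting into 6 jobs" line follows from dividing tarballs 1 to 84 into groups of at most 15. The following is a minimal sketch of that arithmetic only; it is not taken from the 'compress-tar' script, and the function name split_jobs is hypothetical.

```shell
#!/bin/sh
# Sketch only: reproduces the job split arithmetic, not the real compress-tar code.
# split_jobs START END CHUNK prints one "first last" index pair per grid job.
split_jobs() {
    first=$1
    end=$2
    chunk=$3
    while [ "$first" -le "$end" ]; do
        last=$((first + chunk - 1))
        # The final job may receive fewer than CHUNK tarballs.
        if [ "$last" -gt "$end" ]; then
            last=$end
        fi
        echo "$first $last"
        first=$((last + 1))
    done
}

# 84 tarballs with at most 15 per job, as in the example run
split_jobs 1 84 15
```

This prints six index pairs, the last of which is "76 84", the same range that appears in the "Job start" line of the job output for DANS job 154.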

The 'soundbites' archive consists of 84 tarballs that need to be compressed. Each grid job compresses at most 15 tarballs, so a total of 6 jobs were submitted (five jobs of 15 tarballs and one of 9).

After the jobs have been submitted to the grid, you can track their status using the 'job-status' script:

$ ./job-status
00149  https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ         Status=Running
00150  https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg         Status=Running
00151  https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg         Status=Running
00152  https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ         Status=Running
00153  https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug         Status=Running
00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Running

Notes

  • the order in which the jobs are executed on the grid is not necessarily the same as the order in which they were submitted.
  • the grid job ids, starting with https://, look like URLs and that is exactly what they are. The user who submitted a job can view its status using a web browser, provided that the user's grid certificate is installed in that browser.

See the section DANS Job Scripts for more details on both the 'compress-tar' and the 'job-status' scripts.
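
When many jobs are running, it can be convenient to summarize the 'job-status' output per status. The following is a minimal sketch, assuming every job line contains "Status=" followed by the status text, as in the output shown above. The function name count_by_status is hypothetical and not part of the DANS scripts.

```shell
#!/bin/sh
# Sketch only: count jobs per status from text in the job-status format.
# Assumes each job line contains "Status=" followed by the status text;
# all other lines are ignored.
count_by_status() {
    awk '
        /Status=/ {
            status = $0
            sub(/^.*Status=/, "", status)   # keep only the text after "Status="
            sub(/[ \t]+$/, "", status)      # drop trailing whitespace
            count[status]++
        }
        END {
            for (s in count) print count[s], s
        }
    ' | sort -k2
}

# Example with two lines copied from the job-status output
printf '%s\n' \
    '00149  https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ         Status=Running' \
    '00150  https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg         Status=Running' |
    count_by_status
```

In practice the function would be fed from the script itself, for example with "./job-status | count_by_status".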

Job output

When a grid job is finished the 'job-status' script automatically retrieves the output:

00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Done (Exit code=0)
       Retrieving job output into $HOME/dans/gridjobs/00154/output

The status message 'Done (Exit code=0)' means that the job ran to completion and returned exit code 0, which indicates success. The directory '$HOME/dans/gridjobs/00154/output' contains two files: an empty file 'stderr' and the job's output file 'stdout':

2012/11/13-14:47:45 Job start: [soundbites 76 84]
Retrieving file lfn://grid/dans/soundbites/soundbites-0076.tar
Storing file lfn://grid/dans/soundbites/soundbites-0076.tar.gz
guid:f5c00083-554a-48b5-b4c5-645b68b75402
[...]
Retrieving file lfn://grid/dans/soundbites/soundbites-0084.tar
Storing file lfn://grid/dans/soundbites/soundbites-0084.tar.gz
guid:3d90fd6c-0db5-4e65-b769-1aefaccac9dc
2012/11/13-16:21:48 Job end
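
A quick sanity check on such a 'stdout' file is to compare the tarball range in the "Job start" line with the number of "Storing file" lines. The following is a minimal sketch, assuming the output layout shown above. The function name check_job_stdout is hypothetical, and the two-tarball sample in the demonstration is made up for illustration.

```shell
#!/bin/sh
# Sketch only: compare the tarball range in the "Job start" line with the
# number of "Storing file" lines. Assumes the stdout layout shown above.
check_job_stdout() {
    awk '
        /Job start: \[/ {
            # Line looks like: <timestamp> Job start: [<archive> <first> <last>]
            line = $0
            sub(/^.*\[/, "", line)
            sub(/\].*$/, "", line)
            split(line, parts, " ")
            expected = parts[3] - parts[2] + 1
        }
        /^Storing file / { stored++ }
        /Job end/ { ended = 1 }
        END {
            printf "expected=%d stored=%d ended=%d\n", expected, stored, ended
            # Exit status 0 only if the job ended and every tarball was stored.
            exit !(ended && expected > 0 && stored == expected)
        }
    ' "$1"
}

# Hypothetical two-tarball job output, used only to demonstrate the check
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
2012/11/13-14:47:45 Job start: [soundbites 1 2]
Retrieving file lfn://grid/dans/soundbites/soundbites-0001.tar
Storing file lfn://grid/dans/soundbites/soundbites-0001.tar.gz
Retrieving file lfn://grid/dans/soundbites/soundbites-0002.tar
Storing file lfn://grid/dans/soundbites/soundbites-0002.tar.gz
2012/11/13-16:21:48 Job end
EOF
check_job_stdout "$tmp" && echo "job output looks complete"
rm -f "$tmp"
```

For the job shown above the call would be "check_job_stdout $HOME/dans/gridjobs/00154/output/stdout". This check only counts lines; it does not replace the verification step of the workflow.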

Important notes

  • The tarballs need to be compressed only once. This is done on the grid because it is much faster than compressing them on the DANS dataserver.
  • After the .tar.gz files have been created and verified (see Phase 3: Verifying data for more details), the original tarballs need to be deleted from the grid storage. This is not done automatically. An effective set of commands to delete all files ending in '.tar' from a single directory on the LFC is
$ lfcpath=/grid/dans/$ARCHIVE
$ lfc-ls "$lfcpath" | grep '\.tar$' | sed "s|^|lfn:$lfcpath/|" > tarball-list
$ lcg-del -a -f tarball-list

Note the use of the '|' character as the separator in the 'sed' command, instead of the usual '/' character. This way there is no need to escape the slashes in the LFC path. The 'sed' expression is in double quotes so that the shell expands $lfcpath; in single quotes the literal text '$lfcpath' would end up in the list. Also note the extra slash after $lfcpath in the 'sed' expression, which separates the directory from the file name.

The 'lcg-del' command can take quite some time to complete.
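
Before running 'lcg-del' it is worth checking that tarball-list contains only the intended entries. The following is a minimal sketch that prints every line that does not have the form lfn:<directory>/<name>.tar. The function name unexpected_entries is hypothetical, and the sample list in the demonstration is made up for illustration.

```shell
#!/bin/sh
# Sketch only: print every entry that does NOT look like "lfn:<dir>/<name>.tar".
# An empty result means the list contains only the expected kind of entry.
unexpected_entries() {
    # $1 = LFC directory (for example /grid/dans/soundbites), $2 = list file
    awk -v prefix="lfn:$1/" '
        {
            # Text after the prefix must be a plain file name ending in .tar
            name = substr($0, length(prefix) + 1)
            ok = (index($0, prefix) == 1) && (name ~ "^[^/]+[.]tar$")
            if (!ok) print
        }
    ' "$2"
}

# Hypothetical list with one valid entry and one entry that must not be deleted
tmp=$(mktemp)
printf '%s\n' \
    'lfn:/grid/dans/soundbites/soundbites-0001.tar' \
    'lfn:/grid/dans/soundbites/soundbites-0001.tar.gz' > "$tmp"
unexpected_entries /grid/dans/soundbites "$tmp"   # prints only the .tar.gz line
rm -f "$tmp"
```

With the commands above, the check would be run as 'unexpected_entries "$lfcpath" tarball-list', and the deletion started only if nothing is printed.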