Difference between revisions of "Dans Data Compress"

From BiGGrid Wiki

Revision as of 14:37, 15 November 2012

After the tarballs have been uploaded to the grid, the next step is to compress the tarballs to so-called .tar.gz files to save space. This compression step is done on the grid itself, hence we need to submit a set of jobs to the grid. The 'compress-tar' script does this automatically.

$ ./compress-tar
Found 84 tar balls in lfc.grid.sara.nl:/grid/dans/soundbites
Splitting into 6 jobs, start=1, end=84
Delegating proxy
Submitting DANS job 149: https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ
Submitting DANS job 150: https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg
Submitting DANS job 151: https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg
Submitting DANS job 152: https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ
Submitting DANS job 153: https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug
Submitting DANS job 154: https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g

The 'soundbites' archive consists of 84 tarballs which need to be compressed. Each grid job compresses at most 15 tarballs, so a total of 6 jobs were submitted (the last job handles the remaining 9 tarballs). After the jobs have been submitted to the grid you can track their status using the 'job-status' script:

$ ./job-status
00149  https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ         Status=Running
00150  https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg         Status=Running
00151  https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg         Status=Running
00152  https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ         Status=Running
00153  https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug         Status=Running
00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Running
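
The split into 6 jobs reported by 'compress-tar' can be reproduced with a few lines of shell. This is only a sketch of the arithmetic: the function name 'split_jobs' is hypothetical and is not taken from the DANS scripts, which may compute the ranges differently.

```shell
# Hypothetical helper, not part of the DANS job scripts: print the first and
# last tarball number handled by each job.
split_jobs() {
    total=$1      # number of tarballs, e.g. 84
    per_job=$2    # maximum number of tarballs per job, e.g. 15
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + per_job - 1))
        # The last job may get fewer tarballs than the others.
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

split_jobs 84 15
```

For 84 tarballs and 15 tarballs per job this prints six ranges, the last one being '76 84'.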

Notes

  • The order in which the jobs are executed on the grid is not necessarily the same as the order in which they were submitted.
  • The grid job ids, starting with https://, look like URLs and that is exactly what they are. The user who submitted a job can view its status in a web browser, provided that the user's grid certificate is installed in that browser.

See the section DANS Job Scripts for more details on both the 'compress-tar' and the 'job-status' scripts.

Job output

When a grid job is finished the 'job-status' script automatically retrieves the output:

00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Done (Exit code=0)
       Retrieving job output into $HOME/dans/gridjobs/00154/output

The status message 'Done (Exit code=0)' means that the job ran successfully and returned an exit code 0, which indicates success. In the directory '$HOME/dans/gridjobs/00154/output' there are two files, an empty file 'stderr' and the job's output file 'stdout':

2012/11/13-14:47:45 Job start: [soundbites 76 84]
Retrieving file lfn://grid/dans/soundbites/soundbites-0076.tar
Storing file lfn://grid/dans/soundbites/soundbites-0076.tar.gz
guid:f5c00083-554a-48b5-b4c5-645b68b75402
[...]
Retrieving file lfn://grid/dans/soundbites/soundbites-0084.tar
Storing file lfn://grid/dans/soundbites/soundbites-0084.tar.gz
guid:3d90fd6c-0db5-4e65-b769-1aefaccac9dc
2012/11/13-16:21:48 Job end
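
A quick sanity check of a retrieved output directory can be scripted. The sketch below assumes the 'stdout' format shown above (one 'Retrieving file' line and one 'Storing file' line per tarball, and a final 'Job end' line); the function name 'check_job_output' is hypothetical and not one of the DANS job scripts.

```shell
# Hypothetical helper, not part of the DANS job scripts.
# Assumes the stdout format shown above.
check_job_output() {
    dir=$1
    if [ ! -f "$dir/stdout" ]; then
        echo "missing file: $dir/stdout"
        return 1
    fi
    # A successful job leaves an empty stderr file.
    if [ -s "$dir/stderr" ]; then
        echo "stderr is not empty: $dir/stderr"
        return 1
    fi
    # Every retrieved tarball should have been stored again as .tar.gz.
    retrieved=$(grep -c '^Retrieving file' "$dir/stdout" || true)
    stored=$(grep -c '^Storing file' "$dir/stdout" || true)
    if [ "$retrieved" -ne "$stored" ]; then
        echo "retrieved $retrieved tarballs, stored $stored"
        return 1
    fi
    if ! grep -q 'Job end$' "$dir/stdout"; then
        echo "no 'Job end' line found"
        return 1
    fi
    echo "OK: $stored tarballs compressed"
}

# Example call (path taken from the job-status output above):
# check_job_output "$HOME/dans/gridjobs/00154/output"
```

This check does not replace the verification described in Phase 3; it only confirms that the log looks complete.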

Important notes

  • The tarballs need to be compressed only once. This is done on the grid because it is much faster than compressing them on the DANS dataserver.
  • After the .tar.gz files have been created and verified (see Phase 3: Verifying data for more details) the original tarballs need to be deleted from the grid storage. This is not done automatically. A set of commands to delete all files whose names end in '.tar' from a single directory on the LFC is:
$ lfcpath=/grid/dans/$ARCHIVE
$ lfc-ls $lfcpath | grep '\.tar$' | sed "s|^|$lfcpath/|" > tarball-list
$ lcg-del -a -f tarball-list

Note the use of the '|' character in the 'sed' command as the separator character, instead of the usual '/' character. This way there is no need to escape the slashes in the LFC path variable. The 'sed' expression is enclosed in double quotes so that the shell expands $lfcpath before 'sed' runs.

The 'lcg-del' command can take quite some time to complete.
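
Before deleting anything it can be worth trying out the list-building step on a few sample names, without touching the grid. The sketch below uses only 'grep' and 'sed'; the function name 'tar_paths' is hypothetical, and the sample file names stand in for the output of 'lfc-ls'.

```shell
# Hypothetical helper: read file names on stdin, keep only those ending in
# ".tar" and prefix each one with the given LFC directory.
tar_paths() {
    dir=$1
    # Double quotes around the sed expression let the shell expand $dir;
    # '|' is the separator so the slashes in $dir need no escaping.
    grep '\.tar$' | sed "s|^|$dir/|"
}

# Sample input standing in for the output of lfc-ls:
printf '%s\n' soundbites-0001.tar soundbites-0001.tar.gz | tar_paths /grid/dans/soundbites
```

Only the uncompressed tarball is listed, as '/grid/dans/soundbites/soundbites-0001.tar'; the '.tar.gz' file is left out.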