Difference between revisions of "Dans Data Compress"
Latest revision as of 12:26, 11 December 2012
After the tarballs have been uploaded to the grid, the next step is to compress the tarballs to so-called .tar.gz files to save space. This compression step is done on the grid itself, hence we need to submit a set of jobs to the grid. The 'compress-tar' script does this automatically. Before starting either the 'compress-tar' script or the 'check-tar' script for the first time a special gridjobs directory needs to be created. The default location for this directory is
$HOME/dans/gridjobs
so the following command is sufficient:
$ mkdir -p $HOME/dans/gridjobs
Next, start the 'compress-tar' script:
$ ./compress-tar
Found 84 tar balls in lfc.grid.sara.nl:/grid/dans/soundbites
Splitting into 6 jobs, start=1, end=84
Delegating proxy
Submitting DANS job 149: https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ
Submitting DANS job 150: https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg
Submitting DANS job 151: https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg
Submitting DANS job 152: https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ
Submitting DANS job 153: https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug
Submitting DANS job 154: https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g
The 'soundbites' archive consists of 84 tarballs which need to be compressed. Each grid job compresses at most 15 tarballs, so a total of 6 jobs were submitted: five jobs with 15 tarballs each and a last job with the remaining 9 (tarballs 76 to 84). After the jobs have been submitted to the grid you can track their status using the 'job-status' script:
$ ./job-status
00149  https://wms1.grid.sara.nl:9000/5To794h9GaRL-mPH9E7TpQ         Status=Running
00150  https://wms1.grid.sara.nl:9000/W9YklyQ6MFeKsvzeHBVZXg         Status=Running
00151  https://wms1.grid.sara.nl:9000/bmcj7Ja548EAHT4NmrNnPg         Status=Running
00152  https://wms1.grid.sara.nl:9000/O0BB1AQuQQ8llmYEmAdIZQ         Status=Running
00153  https://wms1.grid.sara.nl:9000/UdxeMCitwmPKKhoVXcB_ug         Status=Running
00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Running
Notes
- The order in which the jobs are executed on the grid is not necessarily the same as the order in which they were submitted.
- The grid job ids, starting with https://, look like URLs and that's exactly what they are. The user who submitted a job can view its status in a web browser, provided that the user's grid certificate is installed in that browser.
 
See the section DANS Job Scripts for more details on both the 'compress-tar' and the 'job-status' scripts.
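The split of 84 tarballs into 6 jobs mentioned earlier is plain arithmetic, and it can be reproduced with a few lines of shell. The following is only a sketch under assumptions: `split_jobs` is a hypothetical helper written for this illustration, and the way 'compress-tar' computes its ranges internally is not shown on this page.

```shell
#!/bin/sh
# Sketch: print the first and last tarball number handled by each job.
# split_jobs is a hypothetical helper, NOT part of the compress-tar script.
split_jobs() {
    total=$1      # number of tarballs in the archive, e.g. 84
    per_job=$2    # maximum number of tarballs per job, e.g. 15
    start=1
    while [ "$start" -le "$total" ]; do
        end=$((start + per_job - 1))
        # The last job takes whatever is left over.
        if [ "$end" -gt "$total" ]; then
            end=$total
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# 84 tarballs, 15 per job: prints six ranges, the last one being "76 84"
split_jobs 84 15
```

The last range agrees with the 'Job start: [soundbites 76 84]' line in the output of job 154.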
Job output
When a grid job is finished the 'job-status' script automatically retrieves the output:
00154  https://wms1.grid.sara.nl:9000/LlxU9gTVTMvxi2sSjsqr9g         Status=Done (Exit code=0)
       Retrieving job output into $HOME/dans/gridjobs/00154/output
The status message 'Done (Exit code=0)' means that the job has finished and returned exit code 0, which indicates success. The directory '$HOME/dans/gridjobs/00154/output' contains two files: 'stderr', which is empty for a successful job, and the job's output file 'stdout':
2012/11/13-14:47:45 Job start: [soundbites 76 84]
Retrieving file lfn://grid/dans/soundbites/soundbites-0076.tar
Storing file lfn://grid/dans/soundbites/soundbites-0076.tar.gz
guid:f5c00083-554a-48b5-b4c5-645b68b75402
[...]
Retrieving file lfn://grid/dans/soundbites/soundbites-0084.tar
Storing file lfn://grid/dans/soundbites/soundbites-0084.tar.gz
guid:3d90fd6c-0db5-4e65-b769-1aefaccac9dc
2012/11/13-16:21:48 Job end
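With several jobs it can be convenient to check the retrieved output directories with a small script instead of opening every file by hand. The sketch below assumes that a successful job leaves an empty 'stderr' and a 'stdout' whose last line ends in 'Job end', as in the example above. `check_job_output` is a hypothetical helper and not one of the DANS job scripts.

```shell
#!/bin/sh
# Sketch: sanity-check one retrieved job output directory.
# check_job_output is a hypothetical helper, NOT one of the DANS scripts.
check_job_output() {
    outdir=$1
    # A successful job is assumed to leave an empty stderr file.
    if [ -s "$outdir/stderr" ]; then
        echo "stderr is not empty"
        return 1
    fi
    # The last line of stdout is assumed to end in 'Job end'.
    if ! tail -n 1 "$outdir/stdout" 2>/dev/null | grep -q 'Job end$'; then
        echo "no 'Job end' line found"
        return 1
    fi
    # Count the .tar.gz files the job reports as stored.
    stored=$(grep -c 'Storing file ' "$outdir/stdout")
    echo "OK: $stored files stored"
}

# Example call (the directory name is taken from the text above):
#   check_job_output "$HOME/dans/gridjobs/00154/output"
```

This only inspects the log text; it does not replace the verification step of the workflow.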
Important notes
- The tarballs need to be compressed only once. This is done on the grid because that is much faster than compressing them on the DANS dataserver.
- After the .tar.gz files have been created and have been verified (see Phase 3: Verifying data for more details) the original tarballs need to be deleted from the grid storage. This is not done automatically. An effective set of commands to delete all files with names ending in '.tar' from a single directory on the LFC is
 
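$ lfcpath=/grid/dans/$ARCHIVE
$ lfc-ls $lfcpath | grep '\.tar$' | sed "s|^|lfn:$lfcpath/|" > tarball-list
$ lcg-del -a -f tarball-list
Note the use of the '|' character in the 'sed' command as the separator character, instead of the usual '/' character. This way there is no need to escape the slashes in the LFC path variable. The 'sed' expression is enclosed in double quotes so that the shell expands $lfcpath; inside single quotes the variable name would be inserted literally. The dot in the 'grep' pattern is escaped so that it matches only a literal '.'. Also note the extra slash after the lfcpath variable in the 'sed' expression, which separates the directory from the file name.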
The 'lcg-del' command can take quite some time to complete.
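Since the deletion cannot be undone, it may be worth previewing the list before handing it to 'lcg-del'. The sketch below wraps the filtering step in a function that reads file names from standard input, so it can be tried without any grid commands. `make_tarball_list` is a hypothetical helper, and the file names in the example are made up for illustration.

```shell
#!/bin/sh
# Sketch: turn a listing of file names into lfn: entries for .tar files only.
# make_tarball_list is a hypothetical helper, not an existing DANS script.
make_tarball_list() {
    lfcpath=$1
    # Keep only names ending in a literal ".tar", then prefix the LFC path.
    # Double quotes around the sed expression let the shell expand $lfcpath.
    grep '\.tar$' | sed "s|^|lfn:$lfcpath/|"
}

# Dry run with made-up names; only the uncompressed tarball is listed:
printf '%s\n' soundbites-0001.tar soundbites-0001.tar.gz |
    make_tarball_list /grid/dans/soundbites
```

On the grid user interface the input would come from 'lfc-ls $lfcpath' instead of 'printf', and the result can be inspected before it is written to 'tarball-list'.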