Difference between revisions of "Dans Data Upload"

From BiGGrid Wiki

Revision as of 12:54, 10 May 2012

How to upload a DANS archive to the grid

  • Create a new directory named after the archive. As an example we use the 'Crome' archive. We refer to the name of the archive through the environment variable '${ARCHIVE}':
 $ export ARCHIVE=Crome
 $ mkdir -p ~/dans/${ARCHIVE}
  • In this directory, create another directory with the same name; it will contain the list of files and directories that need to be uploaded:
 $ cd ~/dans/${ARCHIVE}
 $ mkdir ${ARCHIVE}
  • Copy over the scripts from the repository:
 $ cp -a ~/dans/scripts/* .
  • Generate a sorted list of files. Note: all further actions are based on this list!
 $ find -L ${ARCHIVE} -type f | sort > ${ARCHIVE}-files.txt
  • Check the list of files and, if desired, remove unwanted entries such as '.Trash' folders.
  • Generate the tarball list files. Each '.tar.lst' file contains a subset of the entries in ${ARCHIVE}-files.txt that, when tarred up into a single .tar file, is roughly 8 GB in size. The output files are named '${ARCHIVE}-<N>.tar.lst', where <N> is a 4-digit counter starting at 1:
 $ ./gen-tar-list ${ARCHIVE}-files.txt
 Crome-0001.tar.lst
 Crome-0002.tar.lst
 ...
 Crome-0072.tar.lst
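
The splitting step can be pictured with the following minimal sketch. It is not the actual 'gen-tar-list' script, whose internals are not shown on this page; the function name 'split_list' and its arguments are hypothetical. It assumes one file name per line, that every listed file is readable, and it counts plain file sizes only, ignoring tar overhead.

```shell
#!/bin/bash
# Hypothetical sketch, NOT the actual gen-tar-list script.
# Usage: split_list LIST PREFIX LIMIT_BYTES
# Reads file names from LIST (one per line) and writes PREFIX-0001.tar.lst,
# PREFIX-0002.tar.lst, ... so that the files named in each list add up to
# roughly LIMIT_BYTES. Assumes every listed file exists and is readable.
split_list() {
    local list=$1 prefix=$2 limit=$3
    local n=1 total=0 size out file
    out=$(printf '%s-%04d.tar.lst' "$prefix" "$n")
    : > "$out"
    while IFS= read -r file; do
        size=$(wc -c < "$file")
        # Start a new list when adding this file would exceed the limit,
        # unless the current list is still empty (a single oversized file
        # gets a list of its own).
        if [ "$total" -gt 0 ] && [ $((total + size)) -gt "$limit" ]; then
            n=$((n + 1))
            total=0
            out=$(printf '%s-%04d.tar.lst' "$prefix" "$n")
            : > "$out"
        fi
        printf '%s\n' "$file" >> "$out"
        total=$((total + size))
    done < "$list"
}
```

Under these assumptions, a call such as 'split_list ${ARCHIVE}-files.txt ${ARCHIVE} 8589934592' would produce lists of at most about 8 GB each. Because the input list is sorted, files from the same directory tend to end up in the same tarball.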
  • Now the final big step: run the 'upload-tar' script, which will
    • generate the tarballs
    • generate md5 checksums for all files in each tarball
    • upload each tarball to the grid.

This script can take a long time to run; the running time grows with the number of tarballs. Note: a valid grid proxy is required for this step!

 $ ./upload-tar
 Checksumming tarball contents
 Generating ${ARCHIVE}-0001.tar
 ...

For each tarball that is processed successfully, the corresponding '${ARCHIVE}-<N>.tar.lst' file is moved to a separate directory named 'done'. This way the 'upload-tar' script can be stopped and restarted at will: it keeps processing the remaining '${ARCHIVE}-<N>.tar.lst' files until all of them have been moved to the 'done' directory.
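
The restart behaviour can be illustrated with a minimal sketch of such a loop. This is not the actual 'upload-tar' script; the function name 'process_lists' and its 'UPLOAD_CMD' argument are hypothetical, and the real grid upload command is deliberately left as a placeholder. The sketch assumes GNU 'md5sum' and a 'tar' that supports '-T', and it follows the order of the messages shown above (checksum first, then tarball).

```shell
#!/bin/bash
# Hypothetical sketch of a restartable loop, NOT the actual upload-tar script.
# Usage: process_lists UPLOAD_CMD
# UPLOAD_CMD is a placeholder for the real grid upload: any command that
# takes a tarball name as its only argument and returns 0 on success.
process_lists() {
    local upload_cmd=$1 lst tarball f
    mkdir -p done
    for lst in *.tar.lst; do
        [ -e "$lst" ] || continue        # glob matched nothing: all done
        tarball=${lst%.lst}
        echo "Checksumming tarball contents"
        # One md5 line per file named in the list.
        while IFS= read -r f; do
            md5sum "$f" || return 1
        done < "$lst" > "${tarball}.md5"
        echo "Generating ${tarball}"
        tar -cf "$tarball" -T "$lst" || return 1
        "$upload_cmd" "$tarball" || return 1
        # Only a fully processed list is moved, so a rerun skips it.
        mv "$lst" done/
    done
}
```

If the loop is interrupted or the upload fails, the current list file stays in place and is picked up again on the next run, while lists already in 'done' are never revisited.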