Difference between revisions of "Dans Data Upload"

From BiGGrid Wiki
Latest revision as of 11:26, 20 November 2012

How to upload a DANS archive to the grid

Note: as a convention, commands that need to be typed in are preceded by a UNIX prompt sign '$'. Output is shown without a preceding prompt sign.

  • create a new directory with the name of the archive. As an example we use the 'Crome' archive. We refer to the name of the archive using the environment variable '${ARCHIVE}' :
 $ export ARCHIVE=Crome
 $ mkdir -p ~/dans/${ARCHIVE}
  • copy over the scripts from the repository
 $ cd ~/dans/${ARCHIVE}
 $ cp -a ~/dans/scripts/* .
  • In this directory create another directory with the same name; this directory will contain the list of files and directories that need to be uploaded
 $ mkdir ${ARCHIVE}
  • In this second ${ARCHIVE} directory, create an entry for every file and directory that needs to be uploaded. If an entire directory needs to be uploaded then create a symlink to it. For the 'Crome' archive the following command, run from inside the second ${ARCHIVE} directory, was used to create the symlinks. As written, the 'echo' only prints the 'ln' commands; remove it to actually create the links:
 $ for i in <fullpath>/Crome/crome*/ ; do echo ln -fs "$i" ; done
  • generate a sorted list of files. Note: All further actions are done based on this list!
 $ find -L ${ARCHIVE} -type f | sort > ${ARCHIVE}-files.txt
  • Check the list of files and, if desired, remove any unwanted entries such as the contents of '.Trash' folders.
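
The clean-up of the file list can also be scripted. The following is a minimal sketch, not part of the DANS scripts: the function name 'filter_trash' is hypothetical, and the pattern assumes that unwanted entries can be recognised by a path component starting with '.Trash'.

```shell
# Hypothetical helper, not one of the scripts from the repository.
# Removes every entry whose path contains a component starting with
# '.Trash' (e.g. '.Trash', '.Trashes', '.Trash-1000') from the list in $1.
filter_trash() {
    # Write the filtered list to a temporary file first, then replace
    # the original, so the list is never read and written at once.
    grep -v '/\.Trash' "$1" > "$1.tmp"
    mv "$1.tmp" "$1"
}
```

It would be called as 'filter_trash ${ARCHIVE}-files.txt'. Because the remaining lines keep their order, the list stays sorted.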
  • Use the gen-tar-list script to generate a set of tarball .lst files. Each .lst file contains a subset of the entries from the ${ARCHIVE}-files.txt file that, when tarred up into a single .tar file, is at least 8 GB in size. The output files are named '${ARCHIVE}-<N>.tar.lst', where <N> is a 4-digit counter starting at 1:
 $ ./gen-tar-list ${ARCHIVE}-files.txt
 Crome-0001.tar.lst
 Crome-0002.tar.lst
 ...
 Crome-0073.tar.lst
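
The gen-tar-list script itself is not shown on this page. To illustrate the idea of splitting a file list into chunks of at least a given size, here is a rough sketch under stated assumptions: the function name 'sketch_tar_lists' and its arguments are hypothetical, the list is assumed to hold one path per line, and the real script may work differently.

```shell
# Hypothetical sketch; this is NOT the real gen-tar-list script.
# Usage: sketch_tar_lists FILELIST PREFIX LIMIT_BYTES
# Appends paths from FILELIST to PREFIX-0001.tar.lst, PREFIX-0002.tar.lst, ...
# and starts a new list as soon as the current one reaches LIMIT_BYTES.
sketch_tar_lists() {
    filelist=$1
    prefix=$2
    limit=$3
    n=1
    total=0
    while IFS= read -r f; do
        # Size of the file in bytes (symlinks are followed).
        size=$(wc -c < "$f")
        printf '%s\n' "$f" >> "$(printf '%s-%04d.tar.lst' "$prefix" "$n")"
        total=$((total + size))
        # Close the current list once it has reached the limit.
        if [ "$total" -ge "$limit" ]; then
            n=$((n + 1))
            total=0
        fi
    done < "$filelist"
}
```

With a limit of 8589934592 bytes (8 GiB) every list except possibly the last one would describe at least 8 GB of data, which matches the behaviour described above.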
  • Now the final big step: run the 'upload-tar' script, which will
    • generate the tarballs
    • generate md5 checksums for all files in each tarball
    • upload each tarball to the grid.

This script will take a long time to run, depending on how many tarballs there are. It is often useful to run the next command inside a screen session.

Note: For this step a valid grid proxy is required!

 $ ./upload-tar
 Checksumming tarball contents
 Generating ${ARCHIVE}-0001.tar
 Uploading ${ARCHIVE}-0001.tar
 guid:86f11fb4-9b57-4fae-a787-8019663e248c
 Moving ${ARCHIVE}-0001.tar.lst and ${ARCHIVE}-0001.tar.md5sum to directory "done"
 Checksumming tarball contents
 Generating ${ARCHIVE}-0002.tar
 Uploading ${ARCHIVE}-0002.tar
 ...
 Generating ${ARCHIVE}-0073.tar
 Uploading ${ARCHIVE}-0073.tar
 guid:e6781de9-9709-4bf6-a019-b5fa4a1fb3c8
 Moving ${ARCHIVE}-0073.tar.lst and ${ARCHIVE}-0073.tar.md5sum to directory "done"

For each tarball that is successfully processed, the '${ARCHIVE}-<N>.tar.lst' file is moved to a separate directory 'done', together with its '${ARCHIVE}-<N>.tar.md5sum' file. This way the 'upload-tar' script can be stopped and restarted at will, as it will continue processing '${ARCHIVE}-<N>.tar.lst' files until all have been moved to the 'done' directory.

For more details on the usage of the 'upload-tar' script, see DANS Job Scripts.
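
Because processed lists are moved to 'done', the progress of a long upload can be estimated by counting the .lst files in both places. A small sketch, assuming it is run from the directory that contains the 'done' directory; the function name 'upload_progress' is hypothetical:

```shell
# Hypothetical helper, not one of the scripts from the repository.
# $1 = archive name, e.g. Crome
upload_progress() {
    # Lists still waiting in the current directory.
    remaining=$(find . -maxdepth 1 -name "$1-*.tar.lst" | wc -l)
    # Lists already moved to 'done' (0 if the directory does not exist yet).
    finished=$(find done -maxdepth 1 -name "$1-*.tar.lst" 2>/dev/null | wc -l)
    echo "remaining: $((remaining)) finished: $((finished))"
}
```

It would be called as 'upload_progress ${ARCHIVE}', for instance from a second terminal while 'upload-tar' is running.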

  • Check the contents of the 'done' directory, especially the contents of the '${ARCHIVE}-<N>.tar.md5sum' files:
 $ cat done/${ARCHIVE}-0001.tar.md5sum
 # Crome-0001.tar START
 939131fac0d40184b5681e18f7b9856c  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_01/407_0001_01.MP4
 03a1c56e97923f083bd554981a555a0f  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_01/407_0001_01.SMI
 462e8ebbfe70127e46a9b447a14706f4  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_01/407_0001_01I01.PPN
 581c66b2da8852111d449300926c1d52  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_01/407_0001_01M01.XML
 9d5c4d62b4d410e4e1f44372cefe2132  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_01/407_0001_01R01.BIM
 b0ca419552a5bb9004ff1849c7edbb3e  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_02/407_0002_01.MP4
 68c36464994cebe84df5ff2b38320a32  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_02/407_0002_01.SMI
 6d72650cd3cfd3481a97af6d1aacfef7  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_02/407_0002_01I01.PPN
 1c1be7c0bd0d11302536663079c0fd69  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_02/407_0002_01M01.XML
 0924e3973a329a0824a400d8ac78d0b8  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_02/407_0002_01R01.BIM
 d1d08d0818ffe4c5ae3ebb0d0ea349a0  Crome/crome_0045_zeljko_obradovic/crome_0045_zeljko_obradovic_03/407_0002_02.MP4
 # Crome-0001.tar END
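
One simple consistency check is that every .md5sum file contains exactly one checksum per entry of the matching .lst file. The sketch below assumes that each .lst file holds one path per line and that the only non-checksum lines in a .md5sum file are the '#' START and END markers shown above; the function name 'check_md5sum_counts' is hypothetical.

```shell
# Hypothetical helper, not one of the scripts from the repository.
# $1 = directory holding the .tar.lst and .tar.md5sum pairs, e.g. done
check_md5sum_counts() {
    status=0
    for lst in "$1"/*.tar.lst; do
        # Skip the unexpanded pattern when no .lst file is present.
        [ -f "$lst" ] || continue
        sums="${lst%.lst}.md5sum"
        if [ ! -f "$sums" ]; then
            echo "MISSING: $sums"
            status=1
            continue
        fi
        # Number of files listed versus number of checksum lines.
        expected=$(( $(wc -l < "$lst") ))
        actual=$(( $(grep -vc '^#' "$sums") ))
        if [ "$expected" -ne "$actual" ]; then
            echo "MISMATCH: $lst lists $expected files, $sums has $actual checksums"
            status=1
        fi
    done
    return $status
}
```

It would be called as 'check_md5sum_counts done'; it prints nothing and returns 0 when all pairs agree.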

By combining all '${ARCHIVE}-<N>.tar.md5sum' files, a full list of all md5sums can be generated and compared to the output of an 'md5deep' command. Only the comment lines, which start with '#', are filtered out:

 $ for i in done/${ARCHIVE}-*.tar.md5sum ; do grep -v '^#' "$i" ; done | sort > total.md5sum
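
The comparison itself can be done with standard tools. The sketch below assumes that an independently generated checksum list is available in the same 'checksum  path' format and with the same relative paths as 'total.md5sum'; the function name 'compare_checksum_lists' and the file name 'reference.md5sum' used in the usage note are hypothetical.

```shell
# Hypothetical helper, not one of the scripts from the repository.
# $1 = combined list from the tarballs, $2 = independently generated list
compare_checksum_lists() {
    a=$(mktemp)
    b=$(mktemp)
    # Sort both lists the same way so that comm can compare them.
    sort "$1" > "$a"
    sort "$2" > "$b"
    # comm -3 prints only the lines that occur in exactly one of the files.
    differences=$(comm -3 "$a" "$b")
    rm -f "$a" "$b"
    if [ -z "$differences" ]; then
        echo "checksum lists match"
    else
        printf '%s\n' "$differences"
        return 1
    fi
}
```

It would be called as 'compare_checksum_lists total.md5sum reference.md5sum'. Any line that is printed points to a file that is missing from one list or whose checksum differs.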
  • After verifying that all '${ARCHIVE}-<N>.tar.md5sum' files are correct, copy them (or symlink them) to another directory 'checksums'. This directory is used by the 'compare-checksums' script from Phase 3: Verifying data.
 $ mkdir checksums
 $ cp done/*.tar.md5sum checksums
  • Once the upload has completed successfully you can proceed to Phase 2: Compressing Data.