Latest revision as of 19:22, 11 September 2010
Introduction to the FileStager package
The File Stager was created to speed up LHCb analysis jobs that run on an offline batch cluster, such as Stoomboot. It is based on the idea of caching the input files at a closer location, to reduce the latency of remote access (typically using the rfio or dcap protocol) to these files by the analysis jobs. It works in a streamlined (pipelined) fashion: processing of the event data inside a "staged" local file happens concurrently with staging the next input file from the Grid. Copying of a file starts at the same time that the previous file is opened for processing. For example: file_2 is staged while file_1 is processed, file_3 is staged while file_2 is processed, and so on. Under the hood, the files are copied from Grid storage with lcg-cp, a command from LCG_Utils that is typically a wrapper around a grid copy middleware command.
Under normal circumstances (network not saturated, event processing time ~10ms, file sizes ~2GB) staging of a file finishes earlier than processing of the previous file, so the stager effectively performs as if the files were on a local disk. If staging of a file has not finished by the time the file is needed for processing, the FileStager blocks until the transfer is completed. The only unavoidable waiting time is during staging of the very first file.
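The overlap between staging and processing can be sketched in plain Python. This is only an illustration under stated assumptions, not the FileStager implementation: stage, process and run_pipeline are hypothetical names, and both the copy (where the real stager would call lcg-cp) and the event loop are simulated with a short sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def stage(remote_name):
    """Pretend to copy a Grid file to local disk (hypothetical stand-in for lcg-cp)."""
    time.sleep(0.01)  # simulated transfer time
    return "/tmp/staged_" + remote_name


def process(local_path):
    """Pretend to run the event loop over one staged file."""
    time.sleep(0.01)  # simulated processing time
    return local_path


def run_pipeline(remote_files, pipe_size=1):
    """Process files in order while up to pipe_size later files are staged ahead."""
    ahead = max(1, pipe_size)
    processed = []
    if not remote_files:
        return processed
    with ThreadPoolExecutor(max_workers=ahead) as pool:
        # The very first file has nothing to overlap with: the job waits for it.
        current = pool.submit(stage, remote_files[0])
        pending = []  # futures of files that are being staged ahead
        next_index = 1
        while current is not None:
            local_path = current.result()  # blocks if staging has not finished yet
            # Opening a file triggers staging of the following file(s).
            while len(pending) < ahead and next_index < len(remote_files):
                pending.append(pool.submit(stage, remote_files[next_index]))
                next_index += 1
            processed.append(process(local_path))
            current = pending.pop(0) if pending else None
    return processed
```

The call to current.result() is where the job would block if a transfer were slower than the processing of the previous file. Removal of staged files after processing is left out of the sketch.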
A staged file is kept on the local storage only for the time required by the job to process that file, after which it is automatically removed. In case the job fails/crashes, any orphan staged files associated with a job will be removed by a separate independent garbage-collection process.
The stager also works with Event Tag Collections, in which case the .root file is scanned to obtain the original Grid files that contain the event data.
The input files specified in the job options should be picked up by the FileStager package and translated to local URLs which the EventSelector will use for the job.
Requirements for using the File Stager
- A valid grid certificate
- Correct setup of the LHCb software environment:
source /project/bfys/lhcb/sw/LbLogin.sh
- Grid environment set up correctly:
source /global/ices/lcg/glite3.2.6/external/etc/profile.d/grid-env.sh
(check if the environment variable LCG_GFAL_INFOSYS is correctly set after this)
Setting up the File Stager
The latest version of the FileStager package is available in a git repository under /project/bfys/dremensk/FileStager
To pick it up in your local project directory, in a subdirectory FileStager, use:
git clone /project/bfys/dremensk/FileStager FileStager
Then go to FileStager/cmt and type:
cmt make
This will compile the package. There should be no errors under normal circumstances.
You need a working grid certificate to use the FileStager. To activate a grid proxy certificate, do:
voms-proxy-init -voms lhcb -out $HOME/.globus/gridproxy.cert
export X509_USER_PROXY=${HOME}/.globus/gridproxy.cert
This way, the certificate will NOT be stored in the local /tmp directory, but instead in a location reachable from Stoomboot.
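Before submitting a job, it can help to confirm that the proxy really lives outside /tmp. The following is a minimal Python sketch: proxy_problem is a hypothetical helper name, and it only inspects the X509_USER_PROXY variable and the file system. It does not check whether the proxy is still valid.

```python
import os


def proxy_problem(environ=None):
    """Return a description of what is wrong with X509_USER_PROXY, or None if it looks fine."""
    environ = os.environ if environ is None else environ
    path = environ.get("X509_USER_PROXY", "")
    if not path:
        return "X509_USER_PROXY is not set"
    if os.path.abspath(path).startswith("/tmp/"):
        return "proxy is stored under /tmp, which is local to one machine"
    if not os.path.isfile(path):
        return "proxy file does not exist: " + path
    return None
```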
Running interactive jobs with GaudiPython
To start an interactive job on Stoomboot, use a command similar to:
qsub -q stbcq -I
This will start an interactive job on the (currently available) stbcq queue. To see the available queues and their wall-times, use
qstat -Q
Save the following lines of code in a file (FileStager.py), or better yet, copy it from http://www.nikhef.nl/pub/projects/grid/gridwiki/images/8/85/FileStager.example2 :
from Gaudi.Configuration import *
from Configurables import FileStagerSvc
if 'Gaudi:IODataManager/IODataManager' in ApplicationMgr().ExtSvc :
    ApplicationMgr().ExtSvc.remove('Gaudi:IODataManager/IODataManager')
ApplicationMgr().ExtSvc += [ "EventSelector", "Gaudi::StagedIODataManager/IODataManager", 'FileStagerSvc' ]
EventSelector().StreamManager = "StagedDataStreamTool"
and add the following line at the end of your python job options file:
importOptions('/directory/location/to/FileStager.py')
Make sure that you have met the requirements listed under "Requirements for using the File Stager". Then simply use:
gaudirun.py YourOptionsFile.py
Running batch jobs
Running batch jobs requires preparing a shell script to submit to the batch system on Stoomboot. Your script may contain commands that call scripts to set up the LHCb software correctly, or set additional environment variables, depending on your goals. In addition, use the following lines in the script to re-activate the grid proxy certificate on the Stoomboot worker node:
export X509_USER_PROXY=${HOME}/.globus/gridproxy.cert
voms-proxy-init -voms lhcb -noregen
At the end of your script, you will probably have a line similar to the following:
gaudirun.py YourOptionsFile.py
It is often the case that you want to run the same job using different input stream specifications. Use the PBS_ARRAYID environment variable set by Torque (the current resource manager on Stoomboot) for this purpose. For example, if you have 3 different job-option files:
CERN_LFN_01.py
CERN_LFN_02.py
CERN_LFN_03.py
you can submit them with a command such as:
qsub -q stbcq -V -t 1-3 yourJobScript.sh
where yourJobScript.sh will have a line:
gaudirun.py jobDescription.py CERN_LFN_0$PBS_ARRAYID.py
This will submit 3 separate jobs with the same job description but different input stream specifications.
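Note that the pattern CERN_LFN_0$PBS_ARRAYID.py only produces the intended name for array indices 1 to 9. If more files are needed, the name can be built with zero-padding instead. A minimal Python sketch, where options_file_for is a hypothetical helper and the CERN_LFN_ naming simply follows the example above:

```python
import os


def options_file_for(array_id, prefix="CERN_LFN_", width=2):
    """Build the job-options file name for one array job, zero-padding the index."""
    return "{0}{1:0{2}d}.py".format(prefix, int(array_id), width)


if __name__ == "__main__":
    # Torque sets PBS_ARRAYID inside each job of the array; fall back to 1 when run by hand.
    print(options_file_for(os.environ.get("PBS_ARRAYID", "1")))
```

If this were saved as pick_options.py (a hypothetical file name), the job script could call gaudirun.py jobDescription.py $(python pick_options.py).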
Tip: To pass all environment variables of the submitting shell to the batch job (with the exception of $PATH), add the option -V to the qsub command.
Using the FileStager in Ganga
Using the File Stager in Ganga is straightforward and does not require significant changes in your original Ganga script. Make sure you provide YourOptionsFile.py in a manner similar to this:
dv = DaVinci( version = 'v25r6', user_release_area='/path/to/local/DaVinciProject/containing/FileStager' )
dv.optsfile = "DaVinci-Staging.py"
If your input data list is provided in a separate file, add it to the job's inputdata field, like the example below:
ds = dv.readInputData('/project/bfys/dremensk/DaVinci/100_CERN_01_03.py')
j.inputdata = ds
To make sure that the proper LFN -> PFN translation is made, an XML catalogue containing the mappings should be provided, similar to this:
j.inputdata.XMLCatalogueSlice = File('/project/bfys/dremensk/DaVinci/100_CERN_01.xml')
If you use Event Tag Collections, you need to add the following:
j.inputdata.depth=2
An example of a Ganga script for submission is available at http://www.nikhef.nl/pub/projects/grid/gridwiki/images/a/af/GangaSubmitter.example2
Configuration Settings
By default, the input file prefix is "gfal:". This applies to input files specified as, for example:
gfal:lfn:/grid/lhcb/production/DC06/phys-v3-lumi2/00001857/RDST/0000/00001857_00000024_1.rdst
You can change it to a different prefix (depending on your input file specification) in your job options file by using:
FileStagerSvc().InfilePrefix="PFN:"
Attention:
- For LFN handles (when no gfal is used), do not set the InfilePrefix to "LFN:". Change the prefix only if the file name carries an additional prefix besides the default protocol, as in, for example:
PFN:srm://tbn18.nikhef.nl/dpm/nikhef.nl/home/lhcb/RDST/0000/00001857_00000024_1.rdst
In that case, set the prefix to "PFN:". This is the prefix that the stager removes before calling lcg-cp. For Event Tag Collections, please keep the prefix set to "gfal:".
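To illustrate what removing the prefix amounts to, here is a small Python sketch. It is not the stager's own code: strip_infile_prefix is a hypothetical name, and the case-sensitive matching is an assumption.

```python
def strip_infile_prefix(file_spec, prefix="gfal:"):
    """Remove the configured prefix, if present, leaving the name handed to the copy command."""
    if prefix and file_spec.startswith(prefix):
        return file_spec[len(prefix):]
    return file_spec
```

With the prefix set to "PFN:", the example name above is reduced to the srm:// URL. A name that does not start with the configured prefix is passed through unchanged.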
If you want more verbose output from the FileStager, set one of the standard Gaudi output levels (1-5; a smaller number gives more detail) in the same way:
FileStagerSvc().OutputLevel = 2
By default, one file is staged upfront. If the network is fast and reliable, you might save more time by staging multiple files upfront, while processing a single file. This can be changed with the command:
FileStagerSvc().PipeSize = 2
With this setting, two files are staged in parallel each time a file is opened for processing.
Troubleshooting
Error with lcg_cp utility!
Error message: Some error message from the Lcg_util middleware
Checking file size:File does not exist, or problems with gfal library.
This usually indicates that no valid certificate exists, or that the grid environment is not set up correctly. Try:
echo $LCG_LOCATION
to see if this environment variable is set correctly. It should point to some directory like
/global/ices/lcg/glite3.2.4/lcg
Also try:
lcg-cp -v --vo lhcb srm://aFiletoStage file:/tmp/localFileCopy
If this isn't working correctly, the FileStager can't do much about it: ask a grid expert to solve the problem. It may be a problem with the Storage Element, your certificate, the network etc.
Checking file size for: LFN:/lhcb/data/2009/DST/00005727/0000/00005727_00000054_1.dst
ERROR Checking file size:File does not exist, or problems with gfal library
DEBUG Can't use local disk space. /*Switching to replicating...*/
This is again an indication that the stager couldn't get information about the remote file (gfal_stat, a POSIX call similar to stat). It may be that the file is not on the storage, or you don't have privileges for accessing it (proxy certificate, VO...). Try the same lcg-cp command as before.
No permission to write in temporary directory <somedirectory>. Switching back to default <$TMPDIR>, or else </tmp>.
Default tmpdir set to <directory>
The temporary directory is where the staged Grid files are stored during job execution. Its name has the format BaseTmpdir/<username>_pIDXXXX
By default, the following sequence of locations is tried for the BaseTmpdir:
1. If the BaseTmpdir property of the FileStagerSvc is set (see Configuration Settings), AND the job has permission to write in that directory, then this location will be used.
Else
1.1 If any of the following environment variables are set, the first available one will be used: TMPDIR, EDG_WL_SCRATCH, OSG_WN_TMP, WORKDIR
Else
1.2 /tmp directory is used (be careful, as it is normally just a small partition with less than a few Gigabytes of disk space)
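The selection order above can be written down as a short Python sketch. It is an illustration rather than the stager's code: choose_base_tmpdir is a hypothetical name, and "first available" is interpreted here as "first variable that is set and non-empty", which is an assumption.

```python
import os

# Environment variables consulted, in the order given in the text.
ENV_CANDIDATES = ("TMPDIR", "EDG_WL_SCRATCH", "OSG_WN_TMP", "WORKDIR")


def _is_writable_dir(path):
    """Default check: the directory exists and the job may write into it."""
    return os.path.isdir(path) and os.access(path, os.W_OK)


def choose_base_tmpdir(configured=None, environ=None, writable=_is_writable_dir):
    """Pick the base temporary directory following the documented order."""
    environ = os.environ if environ is None else environ
    # 1. Explicit BaseTmpdir setting, if the job can write there.
    if configured and writable(configured):
        return configured
    # 1.1 First environment variable from the list that is set.
    for name in ENV_CANDIDATES:
        value = environ.get(name)
        if value:
            return value
    # 1.2 Last resort.
    return "/tmp"
```

The writable argument exists only so that the order can be tried out without touching the file system.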
You can override the default location of the BaseTmpdir by including the following line in the jobOptions file
FileStagerSvc().BaseTmpdir="/some/temp/directory/with/sufficient/space"
or by explicitly setting the environment variable TMPDIR to a location on Stoomboot:
export TMPDIR=/tmpdir
Make sure that the directory has sufficient (free) disk space and write privileges.
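Both conditions can be checked up front with a few lines of Python. This is a minimal sketch assuming Python 3.3 or later (for shutil.disk_usage); has_room is a hypothetical helper name.

```python
import os
import shutil


def has_room(directory, required_bytes):
    """True if directory exists, is writable, and has at least required_bytes free."""
    if not (os.path.isdir(directory) and os.access(directory, os.W_OK)):
        return False
    return shutil.disk_usage(directory).free >= required_bytes
```

With file sizes of about 2 GB, as mentioned in the introduction, a call such as has_room("/some/temp/directory", 4 * 1024**3) would ask for room for roughly two staged files.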
FATAL Error with lcg_cp utility!
ERROR Error message: file:/nfs/path/to/file: No such file or directory
If there is insufficient local disk space on the worker node where the job is running, the stager attempts to use a shared disk space to store the temporary files. By default, the fallback directory is the NFS partition /project/bfys/<username>, where <username> is taken from the $USER environment variable of the user running the analysis job. If this subdirectory doesn't exist, or the job has no write privileges there, the stager will fail to stage the file. You can change the fallback directory to a different path:
FileStagerSvc().FallbackDir = "/other/directory/accessible/from/Stoomboot"
If you still can't diagnose the problem, set the File Stager to a verbose mode:
FileStagerSvc().OutputLevel = 1
and send the log to Daniela Remenska
Useful links
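- File Stager doxygen: http://www.nikhef.nl/~danielar/FileStager/doc/code/classes.html
- Atlas FileStager: https://twiki.cern.ch/twiki/bin/view/Main/FileStager
- Stoomboot performance monitoring: http://ploeg.nikhef.nl/ganglia/?c=Stoomboot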