CSC Ntuple production

Setting up Grid tools

This wiki describes how to set things up for ntuple production on the Grid. You need a Grid certificate [1] and some patience ;-)

Use "voms-proxy-init --voms atlas" instead of "grid-proxy-init". NB: you need this file: ~/.glite/vomses
Use bash for shell scripts.

Oh, and eh... keep an eye on Ganga, as this is the ultimate ATLAS Distributed Analysis tool. Unfortunately it is still being developed heavily and it has not been so successful for our ntuple production until now. Folkert has been trying it for a while, see the Wiki page: Using ganga at NIKHEF

You also need to set two environment variables (tip: put these in your .cshrc or grid-setup script):

setenv LFC_CATALOG_TYPE lfc
setenv LFC_HOST lfc03.nikhef.nl
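
The two lines above are csh syntax. Since bash is recommended for the shell scripts on this page, the same settings in bash/sh syntax would look roughly like this (same values as above, only the syntax differs):

```shell
# bash/sh equivalent of the two setenv lines above
export LFC_CATALOG_TYPE=lfc
export LFC_HOST=lfc03.nikhef.nl
```

Put these in a bash grid-setup script rather than in .cshrc.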

GridTools

Wouter wrote some very nice tools to submit/retrieve jobs from the Grid. They can be obtained from the Nikhef CVS:

$ cvs -d /project/atlas/cvs co GridTools

The GridTools package contains a few shell scripts:

dmgr: to manage datasets on the Grid
gpm: to set up packages on the SE for Grid jobs (so that the job itself stays below the input sandbox size limit)
gridmgr: main tool to manage Grid jobs (submit/retrieve/cancel/monitor); NOTE: only run one instance of gridmgr at a time, otherwise the job database gets messed up!
gridinfo: a funny tool (lcg-infosites seems more useful)
jobmgr: to define Grid jobs (with the possibility to run them locally for testing)

The following settings in dmgr and gpm may need to be adjusted:

LFC_HOST = lfc03.nikhef.nl
GRID_SE  = tbn18.nikhef.nl

GridModules

GridModules are packages which will be installed in /grid/atlas/users/${USER}/gpm. These packages reside on the SE (tbn18.nikhef.nl) and can be used by jobs. This avoids submitting jobs that are too large: there is a limit on the input sandbox of ~ 20-50 MB.

GridModules are available from CVS:

$ cvs -d /project/atlas/cvs co GridModules

A few examples of modules are present (interesting study material).

make_package: With this tool, a directory is tarred and stored on the SE (using gpm from the GridTools). Note that the file ~/.gpmrc keeps track of which modules have been installed. When a job is run locally, it uses the locally available package instead of the one installed on the SE. Be careful not to include a trailing slash '/' after the directory name when making the package!

DataManager: This package is used to copy/move/delete and define datasets on the SE. You will definitely need this package.
First make a datasets directory in your Grid space: $ lfc-mkdir /grid/atlas/users/<username>/datasets
Then, in make_package, change the directory from which you 'register package on grid' from /home/<username>/gpm to <working directory>/gpm.
Before doing a make_package DataManager, edit DataManager/run to set LFC_HOST and GRID_SE to the desired addresses.

AtlasRelease: To use the Athena software on the Grid, the desired release has to be set up. This module contains only one file, setup.sh, which is similar to the setup script you source when setting up Athena locally. (Part of this script does in fact set up Athena locally, as you can also run/test jobs locally.) A few changes have taken place since release 12; my 'AtlasRelease12.0.6/setup.sh':

#!/bin/sh
#
# Setup the environment for ATLAS release 12.0.6
#
# gossie@nikhef.nl
#
LOCAL_AREA=/data/atlas/offline/12.0.6

# --- Clear command line to avoid CMT confusion ---
# ("set --" unsets the positional parameters; a bare "set -" leaves them unchanged)
set --

if [ "$VO_ATLAS_SW_DIR" != "" ] ; then

  # --- Follow the GRID approach ---
  echo "Setting up ATLAS release 12.0.6 from VO_ATLAS_SW_DIR=$VO_ATLAS_SW_DIR"
  . $VO_ATLAS_SW_DIR/software/12.0.6/setup.sh
  . ${SITEROOT}/AtlasOffline/12.0.6/AtlasOfflineRunTime/cmt/setup.sh
  CMTPATH="${PWD}:${CMTPATH}"

elif [ -d "$LOCAL_AREA" ] ; then

  # --- Follow the local approach ---
  echo "Setting up ATLAS release 12.0.6 from LOCAL_AREA=$LOCAL_AREA"
  . $LOCAL_AREA/setup.sh
  . ${SITEROOT}/AtlasOffline/12.0.6/AtlasOfflineRunTime/cmt/setup.sh
  CMTPATH="${PWD}:${CMTPATH}"

else

  # --- ERROR: Don't know where release is!
  echo "ERROR setting up ATLAS release 12.0.6, cannot find release"
  echo "Release_12.0.6_Not_Found" > errorcode


fi

As for the other modules, you need to make a directory for AtlasRelease12.0.6 (mkdir AtlasRelease12.0.6) and copy the above setup script into it. Then make this new package. (Remember: no slash "/" at the end of the directory name when making the package.)

Note that release 12.0.7 is the final release of the 12 series (but it is not installed on all Grid machines yet).
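
Tab completion tends to add the trailing slash for you. A small guard like the one below avoids the problem. This is only a sketch: strip_slash is a hypothetical helper, not part of GridTools or GridModules, and the make_package command is printed rather than run.

```shell
# Hypothetical helper: remove one trailing slash from a directory name.
strip_slash() {
  printf '%s\n' "${1%/}"
}

pkg=$(strip_slash "AtlasRelease12.0.6/")
echo "make_package $pkg"   # prints the command instead of running it
```

Names without a trailing slash pass through unchanged.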

TopView module: Then it's time for an 'Analysis' package (OK, dumping AODs in ntuples is not really 'analysis', though eh... what's the difference?). For CSC ntuple production here at Nikhef, the TopViewAODtoNtuple-00-12-13-03 package can be used. Again you have to make a separate directory for this package, untar the .tgz file in this directory and remove the .tgz-file before making the package.

If you need to package a new TopView version, these are the steps to create the package (for reference):

check the latest version of the EventView group area:

http://atlas-computing.web.cern.ch/atlas-computing/links/kitsDirectory/PAT/EventView/

copy the tar file to a temporary directory

$ wget http://atlas-computing.web.cern.ch/atlas-computing/links/kitsDirectory/PAT/EventView/EventView-12.0.6.8.tar.gz

strip unnecessary files/directories

only InstallArea and PhysicsAnalysis are needed (note: this will be your 'testarea')

A little complication: if the latest TopView version is not in the package, compile the desired TopView libraries locally and copy them to the InstallArea that will be used in the Grid module.

the InstallArea and PhysicsAnalysis directories should be right in your working directory (not in the subdirectory EVTags-12.0.6.8/, as this subdirectory is not in the $CMTPATH and Athena won't find the libraries then).
put the needed files in the module (e.g. TopViewAODtoNtuple-00-12-13-03):

$ cd EVTags-12.0.6.8/

$ tar -cvzf EventView-12.0.6.8_nikhef.tar.gz InstallArea/ PhysicsAnalysis/

$ cp EventView-12.0.6.8_nikhef.tar.gz ${USER}/GridModules/TopViewAODtoNtuple-00-12-13-03

check the run scripts run and AODtoTVNtuple.py (adjust version numbers!) in the TopViewAOD module
check LocalOverride_Nikhef_BASIC.py for other muon/tau/jet/MET collections
$ make_package TopViewAODtoNtuple-00-12-13-03

A typical Grid job

Normally, a Grid job is defined via a 'jdl'-file (and managed with LCG commands). The GridTools provide a slightly easier approach using 'jtf'-files, which make it easier to define all the steps and necessary modules that need to be in a job. In the end, these jtf-files are converted to jdl-files and standard LCG commands are used, but the bookkeeping is simpler.

The first time around you need to make a .jobreqs file in your home directory; GridTools/jobmgr can take care of that. For example, before running the job below you need to do:

$ jobmgr defreq AtlasRelease12.0.6 'Member(#@VO-atlas-production-12.0.6@#,#other.GlueHostApplicationSoftwareRunTimeEnvironment)'
(Pay attention: use only single quotes ' around the definition.)

Defining a job

A typical AOD -> TVNTUPLE conversion job, eg. named test_Nikhef_AODtoTVNtuple.jtf might look like this:

--req AtlasRelease12.0.6

# --- Set up Atlas release 12.0.6 ---
AtlasRelease12.0.6

# --- Copy AOD from Grid SE ---
DataManager copy dset:test.AOD.v12000601[$SEQ/10] InputFiles

# --- Convert AOD to ntuple with TopView ---
TopViewAODtoNtuple-00-12-13-03 fullsim doFix

# --- Store output on Grid SE ---
DataManager copy OutputFiles dset:test.TVNTUPLE.v12000601

The job first runs the AtlasRelease12.0.6 package to set up Athena release 12.0.6. The DataManager package then copies AODs from the SE to the worker node (WN), the TopViewAODtoNtuple package converts them to TVNtuples, and finally the TVNtuples are copied to the SE.
The line with [$SEQ/10] divides the AOD set into 10 equally sized subsets. So if there are 50 AODs in the dataset, each job (1-10) converts 5 AODs. Be careful: the number of AODs in a dataset may change over time!
TopViewAODtoNtuple can be run with or without the 1mm-bug fix: doFix or noFix.
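
To see why a changing dataset size matters, here is a small arithmetic sketch of the [$SEQ/10] splitting. It assumes the dataset is cut into contiguous blocks and that the block size is rounded up; the real DataManager may order or split the files differently, and subset_range is a hypothetical name used only for this illustration.

```shell
# Illustration only: which files would job number $1 get if a dataset of
# $3 files is cut into $2 contiguous blocks of equal size (rounded up)?
subset_range() {
  seq_no=$1; nparts=$2; nfiles=$3
  per_job=$(( ($nfiles + $nparts - 1) / $nparts ))
  first=$(( ($seq_no - 1) * $per_job + 1 ))
  last=$(( $seq_no * $per_job ))
  if [ "$last" -gt "$nfiles" ]; then last=$nfiles; fi
  if [ "$first" -gt "$nfiles" ]; then
    echo "none"              # this job has nothing left to do
  else
    echo "$first-$last"
  fi
}

subset_range 3 10 50   # 50 files: job 3 gets files 11-15
subset_range 3 10 55   # 55 files: job 3 now gets files 13-18
```

Under these assumptions, job 3 converts a different set of AODs once five files have been added to the dataset, so ntuples from an earlier submission no longer line up with the new subsets.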

Playing around with a job

The best thing to do is to have a look at the GridTools yourself, but a few standard commands are worthwhile to mention here.
To submit the jobs defined in test_Nikhef_AODtoTVNtuple.jtf one has to do:

$ gridmgr submit --vo atlas test_Nikhef_AODtoTVNtuple.jtf:1-10

To retrieve the output from the jobs (NB: the stdout and stderr, not the TVNtuples; these are stored on the SE):

$ gridmgr retrieve --dir ./output_test -a

To retrieve the TVNtuple's from the SE:

$ dmgr retrieve test.TVNTUPLE.v12000601 /data/atlas1/public/CSC/testsamples

Before sending a complete set of jobs, it is handy to run one job locally to check whether it is correctly defined. jobmgr does this for you:

$ jobmgr run test_Nikhef_AODtoTVNtuple.jtf:3

When you want to put requirements on the CEs, or veto certain CEs, you can define these with jobmgr (see defreq above). To list the defined requirements:

$ jobmgr listreq    

Requirement Name      Definition
--------------------  -------------------
AtlasRelease12.0.6    Member( "VO-atlas-production-12.0.6" , other.GlueHostApplicationSoftwareRunTimeEnvironment)

To use such a requirement for a job, this requirement has to be mentioned in the first line of the jtf-file. See the previous section where the AtlasRelease12.0.6 was required.

Ranja

With the GridTools and GridModules in place, you can now manage jobs easily. To do this on a larger scale, though, it is handy to use the Ranja python scripts on top of them. Ranja does the 'stupid' work, which is very time-consuming because of the bookkeeping, so you can sit back and relax...

Usage

Ranja does three things:

search for dataset files missing in the local LFC [lfc03.nikhef.nl] and copy them using the Tier-1 LFCs
(re)submit jobs by checking whether an ntuple has been made for each AOD
retrieve ntuples from the Grid SE [tbn18.nikhef.nl] to the local machine [/data/atlas1/public/CSC]

Installation

get ranja.
put it in the GridTools directory
in the GridTools directory: tar xvzf ranja_VERSION.tar.gz

You should have the following files:

ranja/
ranja/setup.py
ranja/__init__.py
ranja/tools.py
ranja/jobmgr.py
ranja/datamgr.py
ranja/lfcmgr.py
ranja_Nikhef_AODtoTVNtuple_00-12-13-03.jtf
ranja.py

Running

Before running Ranja make sure the grid environment is set up:

source /global/ices/lcg/current/etc/profile.d/grid_env.sh
voms-proxy-init --voms atlas

For a list of options:

$ ranja.py --help

Note:

/grid/atlas/users/<username>/datasets should exist (do a $ lfc-mkdir /grid/atlas/users/<username>/datasets)
before a submit, make sure you already did a copy (otherwise you do not have a dataset to run over in /grid/atlas/users/<username>/datasets/)
although multiple instances of ranja.py can be run in parallel, gridmgr can NOT: only do one submit at a time, otherwise the job database of gridmgr gets messed up!
to see the job status, just use 'gridmgr status -a'
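
One way to make sure only one gridmgr runs at a time is to wrap it in a simple lock. The sketch below is not part of GridTools: with_lock and the lock location are assumptions, and it relies only on mkdir being atomic.

```shell
# Hypothetical wrapper: refuse to start a command while the lock exists.
LOCKDIR=${LOCKDIR:-${HOME}/.gridmgr.lock}   # assumed location of the lock

with_lock() {
  if ! mkdir "$LOCKDIR" 2>/dev/null; then
    echo "another gridmgr seems to be running (lock: $LOCKDIR)" >&2
    return 1
  fi
  status=0
  "$@" || status=$?          # run the wrapped command, remember its exit code
  rmdir "$LOCKDIR"           # release the lock again
  return $status
}

# Usage (not executed here):
#   with_lock gridmgr submit --vo atlas test_Nikhef_AODtoTVNtuple.jtf:1-10
```

If a wrapped command is killed before it can release the lock, the lock directory has to be removed by hand.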

Example:

copy
ranja.py -d trig1_misal1_csc11.005402.SU2_jimmy_susy.recon.AOD.v12000601_tid005862 -t trig1_misal1_csc11 -v -p copy
running in pretend-mode (-p) can be very useful in the beginning.
submit
ranja.py -d trig1_misal1_csc11.005402.SU2_jimmy_susy.recon.AOD.v12000601_tid005862 -t trig1_misal1_csc11 -g 20 -n TVNTUPLE_fix -j ranja_Nikhef_AODtoTVNtuple_00-12-13-03.jtf -v -p submit
when you're not running in pretend-mode, you can check the status of your jobs as explained above
count AODs, ntuples on the Grid and ntuples locally
ranja.py -d trig1_misal1_csc11.005402.SU2_jimmy_susy.recon.AOD.v12000601_tid005862 -t trig1_misal1_csc11 -n TVNTUPLE_fix count

WARNING

Be careful when erasing datasets:

Deleting AODs

To delete AOD sets do NOT use (unless you know what you are doing):

$ dmgr delete <dset>
    
 OR:
 
$ lcg-del -v --vo atlas lfn:/grid/atlas/users/<username>/datasets/<dset>/<filename>

Instead, use:

$ ranja.py -t <sim_type> -d <dset> erase

Deleting TVNTUPLEs

It is safe to delete TVNTUPLEs using dmgr, lcg-del or ranja.py.

Explanation

The AODs used are either

Atlas replicas
private copies

In the former case, the LFC entries (/grid/atlas/users/<user>/datasets/<dset>) refer to Atlas replicas (replicated centrally with DDM). Using dmgr delete <dset> or lcg-del would delete both the LFC entry and the physical file on the SE. But you do not want the 'official' replica to be deleted!!! Only the LFC entry should be removed. This is the case if you used ranja.py copy. It seems that Atlas members can simply delete AOD replicas from an SE. The AODs at the SARA/Nikhef Tier-1 are protected against this, but this is not the case at all sites (hopefully it will be in the near future).

In the latter case, you do want both the LFC entry and the physical file to be deleted from the SE: the AOD is a private copy, and erasing only the LFC entry would make it impossible (?) to find the physical file again ( = dark matter! ). This is the case if you used dq2_get and lcg-cr to copy AODs from a grid site to your favourite grid site for private use (discouraged!). The same holds for TVNTUPLEs: they are privately made.

ranja.py erase checks whether an AOD is an Atlas replica or a private copy (this is determined by looking at the physical location on the SE). If it is a replica, the LFC entry is unlinked and erased (lcg-uf <guid> <surl>). If it is a private copy, both the LFC entry and the file are deleted using lcg-del. In any other case, the program stops with a warning.
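
The decision described above can be sketched as follows. This only illustrates the logic and is not the code of ranja.py. The two path patterns are made-up placeholders (the real criteria for recognising a replica or a private copy are not given here), and the functions only print what would be done.

```shell
# Sketch of the decision only. The two path patterns are hypothetical
# placeholders, not the ones ranja.py really uses.
classify_surl() {
  case "$1" in
    */hypothetical/ddm/*)  echo "replica" ;;
    */hypothetical/user/*) echo "private" ;;
    *)                     echo "unknown" ;;
  esac
}

erase_action() {
  case "$(classify_surl "$1")" in
    replica) echo "unlink LFC entry only (lcg-uf), keep the file" ;;
    private) echo "delete LFC entry and file (lcg-del)" ;;
    *)       echo "stop with a warning" ;;
  esac
}

erase_action "srm://tbn18.nikhef.nl/hypothetical/ddm/some.AOD.pool.root"
erase_action "srm://tbn18.nikhef.nl/hypothetical/user/some.AOD.pool.root"
```

Anything that matches neither pattern falls through to the warning, which is the safe default: when in doubt, nothing is deleted.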

Comments

Use at your own risk ;-)

Shortcomings

Nope, let's not call them 'features'.

gridmgr: Running multiple instances of gridmgr at a time messes up the job database ${HOME}/.gridjobs. So just get coffee when you are submitting 100 jobs! By the way, it does not mean your job gets messed up (it will keep running on the grid), though it might not be possible anymore to retrieve the log files with gridmgr. So, it is not disastrous.
ranja.py: Running over multiple files with ranja.py -g works as expected on the first run over a dataset. On a resubmission, ranja.py does notice which files have been done and which have not (that's why I wrote the script), but it cannot form groups of only those files that have not been done yet. So it is up to the user to decide whether it is more efficient to run over 10 files (including, say, the 4 files already done) or over a smaller group of files. In the limit (and by default :-) ), you run over just one file per job, which is not too bad either! Possible solution: create a temporary '<dataset>_todo' directory with links to the AODs still to be done, and run over this 'todo' directory instead of the normal <dataset> directory.
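
The first step of such a 'todo' approach, finding the AODs that still need an ntuple, could look like the sketch below. It is not implemented in ranja.py. It assumes you have two plain text lists with one AOD file name per line: the whole dataset, and the AODs for which an ntuple already exists. The file names in the example are invented, and todo_list is a hypothetical helper.

```shell
# Hypothetical sketch: print the entries of list $1 that are not in list $2.
todo_list() {
  work=$(mktemp -d)
  sort "$1" > "$work/all"
  sort "$2" > "$work/done"
  comm -23 "$work/all" "$work/done"   # lines that occur only in the first list
  rm -rf "$work"
}

# Example with invented file names
tmp=$(mktemp -d)
printf '%s\n' AOD._00001.pool.root AOD._00002.pool.root AOD._00003.pool.root > "$tmp/all.txt"
printf '%s\n' AOD._00003.pool.root AOD._00001.pool.root > "$tmp/done.txt"
todo_list "$tmp/all.txt" "$tmp/done.txt"   # prints AOD._00002.pool.root
rm -rf "$tmp"
```

The resulting list could then be used to fill the '<dataset>_todo' directory; how to create the links there is left open.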