Using DQ2 at NIKHEF


NOTE: DQ2 is ATLAS-specific software (not LCG)

This page explains how to set up the DQ2 end-user tools for ATLAS DDM (Distributed Data Management) for use at NIKHEF. The DDM tools operate on top of the GRID middleware, so any use of DDM requires that you load your GRID certificate first. If you don't have a Grid certificate yet, consult [1]
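
A valid grid proxy is assumed for all commands on this page. How the proxy is created depends on your local grid installation; with the VOMS client tools loaded by the grid environment described below, it typically looks like this (the exact command is an assumption, not something prescribed by this page):

  # Create a grid proxy with ATLAS VO membership (assumed command; adapt to your site)
  voms-proxy-init -voms atlas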


Set up

You can include these commands in your .cshrc to automate the setup of the DQ2 tools and the grid environment:

    source /global/ices/lcg/current/etc/profile.d/grid_env.csh
    source /project/atlas/nikhef/dq2/dq2_setup.csh.NIKHEF

For reference, the contents of /project/atlas/nikhef/dq2/dq2_setup.csh.NIKHEF are reproduced below.

# Derived from
# /afs/usatlas.bnl.gov/Grid/Don-Quijote/dq2_user_client/setup.sh.any

if (! $?LCG_LOCATION ) then
        echo 'ERROR : setup Grid first!'
        exit
endif

# DQ2 server
setenv DQ2_URL_SERVER http://atlddmcat.cern.ch/dq2/
setenv DQ2_URL_SERVER_SSL https://atlddmcat.cern.ch:443/dq2/

# local site ID
#
# e.g., for CERN
# setenv DQ2_LOCAL_ID CERN
#
# if your site doesn't deploy a DQ2 site service
# setenv DQ2_LOCAL_ID ''
setenv DQ2_LOCAL_ID NIKHEF

# access protocol to local storage
#
# for CASTOR
# setenv DQ2_LOCAL_PROTOCOL rfio
#
# for dCache
# setenv DQ2_LOCAL_PROTOCOL dcap
#
# for normal disk storage
# setenv DQ2_LOCAL_PROTOCOL unix
setenv DQ2_LOCAL_PROTOCOL dpm


# root directory of local storage.
#
# e.g.,
# setenv DQ2_STORAGE_ROOT /castor
# setenv DQ2_STORAGE_ROOT /pnfs
#
# if you don't have special mount point for storage
# setenv DQ2_STORAGE_ROOT ''
setenv DQ2_STORAGE_ROOT /dpm

# local SRM host
#
# e.g.
# setenv DQ2_SRM_HOST srm://castorgrid.cern.ch:8443
#
# if your site doesn't deploy an SRM server
# setenv DQ2_SRM_HOST ''
setenv DQ2_SRM_HOST srm://tbn18.nikhef.nl:8443

# local GSIFTP host
#
# e.g.
# setenv DQ2_GSIFTP_HOST gsiftp://castorgrid.cern.ch:2811
#
# if your site doesn't deploy a GSIFTP server
# setenv DQ2_GSIFTP_HOST ''
setenv DQ2_GSIFTP_HOST gsiftp://tbn18.nikhef.nl:2811

# use SRM for all data transfer (default: False)
setenv DQ2_USE_SRM False

# LFC
setenv LCG_CATALOG_TYPE lfc
setenv DQ2_LFC_HOME /grid/atlas

# which command is called in dq2_get. Specify this when srmcp doesn't work in your environment
#setenv DQ2_COPY_COMMAND 'lcg-cp -v --vo atlas'
# GOSSIE: the next line is tricky; defining it explicitly works better than leaving it unset.
setenv DQ2_COPY_COMMAND 'srmcp'

# PATH
setenv PATH /afs/usatlas.bnl.gov/Grid/Don-Quijote/dq2_user_client:$PATH

# prevent the wild-card expansion
alias dq2_ls 'set noglob; \dq2_ls \!*; unset noglob'
alias dq2_cr 'set noglob; \dq2_cr \!*; unset noglob'
alias dq2_get 'set noglob; \dq2_get \!*; unset noglob'
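
After sourcing these scripts (for instance in a fresh shell), you can quickly check that the environment is in place by inspecting a few of the variables and the dq2 command path. This is only a sanity check, not part of the official setup:

  # Quick sanity check of the DQ2 environment (illustrative only)
  echo $DQ2_URL_SERVER
  echo $DQ2_LOCAL_ID
  which dq2_ls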

Key concepts of Atlas Distributed Data Management

Atlas DDM revolves around the concept of a 'dataset', which is a collection of files. The definition of which files belong to a dataset is kept in a central Oracle database at CERN.

Each file in DQ2 (and on the grid) has a logical file name, for easy human use, and a GRID unique identifier, or GUID for short, which is used internally to identify the file. The GUID of each file is assigned when it is stored on the GRID for the first time.

A key concept of Atlas DDM, as well as of the grid middleware, is that a file may have multiple replicas, i.e. the same file is stored on multiple storage elements on the GRID.

The ATLAS computing model prescribes that all AOD files are replicated to the storage elements of all 10 Tier-1 centers, of which SARA/NIKHEF is one. Thus there are nominally 10 copies of each AOD file available. In addition, two copies of each ESD file are kept (currently one, because of lack of disk space), distributed over the Tier-1s (each Tier-1 holds two times 1/10 of the ESD data). Each Tier-1 site is also associated with one or more Tier-2 sites that can hold part of the AOD data.

The ATLAS DDM tool suite, named dq2, manages the data replication between the Tier-0 (CERN), the Tier-1s and the Tier-2s. The replication strategy is based on a subscription model, i.e. each site can enter a subscription to obtain a replica from a given source location. Management of these subscriptions is handled centrally by the ATLAS computing operations and data management teams.

There exists a separate suite of dq2 'end-user' tools that physicists can use to inquire which datasets exist and to copy data files from one of the grid storage elements to their local computer.

Dataset naming convention

The general naming convention of datasets is the following; a short example that splits a name into its tokens is shown after the list below.

trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457

  1. The first part, trig1_misal1_mc12, is the production project; it shows which configuration was run, e.g. whether the trigger was on, which detector alignment was used and which set of input generator files was used. Details on the naming convention of the project can be found here: https://twiki.cern.ch/twiki/bin/view/Atlas/FileNamingConvention
  2. The next token, 005850, is the dataset number. This uniquely identifies the Monte Carlo dataset. The authoritative definition of each dataset is given by the corresponding jobOptions files, which are stored in Generators/DC3_joboptions. Here is a link to the CVS head of that package: http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Generators/DC3_joboptions/share/
  3. The token 'recon' (which follows the physics sample name, WH120bb_pythia in this example) shows the production step that produced the output files, in this case reconstruction. Allowed values are 'evgen' for event generation, 'digit' for simulation and digitization (run as a single task) and 'recon' for reconstruction.
  4. The next token shows the type of file stored in this collection, here 'ESD'. Existing file types are 'HITS' for simulation output, 'RDO' for digitization output (the 'raw data' equivalent), 'ESD' for Event Summary Data, 'AOD' for Analysis Object Data, 'NTUP' for CBNTAA ntuples and 'log' for the log files.
  5. The last token shows the version of the ATLAS software that was used to execute this processing step. A production version consists of an ATLAS release and a patch level. In the example above the release is 12.0.6 with patch level 3, usually denoted as 12.0.6.3. The lowest patch version possible is 1, so 12.0.6.1 is equivalent to release 12.0.6. Finally, all production tasks first store their output in a temporary dataset with a _tid00XXXX suffix appended to the name. Only when the production task is finished is all data copied over to the dataset with the final name. It is perfectly OK to use data from _tid datasets as long as you understand what data it is (e.g. unvalidated pilot samples). You can look up the status of production tasks in the Panda browser by entering the task ID in the bottom field of the form at http://lxfsrk522.cern.ch:28243/?mode=taskquery
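
As a quick illustration (not part of the DQ2 tools themselves, just standard shell commands), a dataset name can be split into these tokens like this:

  # Print the dot-separated tokens of a dataset name, one per line (illustrative only)
  echo trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457 | tr '.' '\n'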


Browsing datasets with DQ2 end-user tools

To inquire which datasets exist, use the dq2_ls tool. You can use wildcards:

  dq2_ls <pattern>

The setup script defines the dq2_ls command such that you don't have to put wildcards (*) in quotes. For example, the following command will list all datasets with AOD files of sample 5200 (ttbar) that were reconstructed with ATLAS release 12.0.6.1:

unix> dq2_ls *5200*AOD*v12000601*

trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840
trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid005997
trig1_misal1_mc12_V1.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig1_misal1_mc12_V1.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006839
trig1_misal1_mc12_V1_V1.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig1_misal1_mc12_V1_V1.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid007021
trig1_misal1_mc12_V2.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig1_misal1_mc12_V2.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid007021
user.FrankE.Paige.trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601.THIN.v2
user.FrankE.Paige.trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601.THIN.v3
user.TARRADEFabien.trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601.HiggsToTauTau_00_00_39.AAN
user.VivekJain.Btag_Valid_trig1_misal1_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601

In this example you see both the final dataset names and the 'temporary' dataset names of datasets that are still being filled by the production system, as well as a couple of user datasets.

By default dq2_ls shows you all datasets available. If you want to restrict the list to the datasets for which a replica exists at a given site, e.g. NIKHEF, you can do

  dq2_ls -s NIKHEF <pattern>
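
For example, to restrict the earlier wildcard search to datasets that have a replica at NIKHEF:

  dq2_ls -s NIKHEF *5200*AOD*v12000601*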

To examine the file count of a given dataset do

  dq2_ls -f <datasetName>

For example

unix> dq2_ls -f trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601   Total: 239  - Local: 0

The total file count refers to the number of files that exist in the dataset whereas the local count refers to the number of files replicated to the local storage element. For the dq2 setup at NIKHEF the local storage element is SARA/NIKHEF. In the CERN setup of dq2 it is CERN-CASTOR.

Finally, you can see the exact file contents of a dataset as follows:

  dq2_ls -g -f -l <datasetName>

For example:

unix> dq2_ls -g -f -l trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601
    trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601   Total: 239  - Local: 0
    2E75C2D6-C5D9-DB11-BDC0-001422732AC3 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00001.pool.root.1 101467770
    40F3A844-CDD9-DB11-8911-001422730F00 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00002.pool.root.1 100407509
    2ABDB7AA-D6D9-DB11-830C-001422730CD2 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00003.pool.root.1 94570398
    2A2AC3AA-D6D9-DB11-B964-00142272F9BE trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00004.pool.root.1 96730415
    4C03183F-90DC-DB11-B921-00123F20A25B trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00005.pool.root.2 97204028
    62FD451C-93DC-DB11-9F35-00A0D1E50595 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00006.pool.root.2 96898637
    9247C695-92DC-DB11-9B16-00A0D1E50539 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00007.pool.root.2 95691866
    B4BE56DF-92DC-DB11-9AA4-00A0D1E507E7 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00008.pool.root.2 95925431
    1ACDB1AD-C5D9-DB11-B040-00A0D1E4F835 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00009.pool.root.1 97868381
    76CF554D-C4D9-DB11-9F75-00123F20B89C trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00010.pool.root.1 98572268
    F4D9179A-C3D9-DB11-85DD-00123F20ABE2 trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00011.pool.root.1 95862624
    760DD38F-C6D9-DB11-AED6-00A0D1E5030D trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601_tid006840._00012.pool.root.1 96828192
    ...

To restrict the list of files to those that are replicated at the local storage element, omit the '-g' flag.

In general dq2 transactions on large datasets (e.g. the complete sample 5200 consists of almost 10000 files) can take several minutes. For your peace of mind you can always add the '-v' flag for extra verbosity. This will show all the transactions with the database server as they happen and give an indication of progress.
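
For example, a verbose listing of only the locally replicated files of the dataset used above could look like this (assuming the flags can be combined, as is usual for these tools):

  dq2_ls -v -f -l trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601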


Retrieving datasets with DQ2 end-user tools

To copy the files of a dataset to your local hard disk, use the dq2_get utility. Without further options, the utility will only copy files that are present in the 'local' storage element, SARA in the NIKHEF setup:

dq2_get [-v] <datasetName>

Note that dq2_get by default runs three file transfers in parallel.
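
For example, using the dataset from the listing examples above:

dq2_get -v trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601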

If you are also interested in the files of the dataset that are not at the local SE, add the -r option:

dq2_get [-v] -r <datasetName>

and the utility will attempt to retrieve the files that are missing from the local storage element from one of the other storage elements where the file is replicated. If there are multiple replicas available, you will be prompted for every file to specify the location to retrieve it from. You can streamline this process by providing a remote location with the -c <SElocation> command line option.
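
For example, to also fetch the missing files and take them from one specific remote storage element without being prompted (replace <SElocation> with one of the site names reported by the tools):

dq2_get -v -r -c <SElocation> trig0_calib0_mc12.005200.T1_McAtNlo_Jimmy.recon.AOD.v12000601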


Links