Using DQ2 at NIKHEF
NOTE: DQ2 is ATLAS-specific software (not LCG)
This page explains how to set up the DQ2 end-user tools for ATLAS DDM (Distributed Data Management) for use at NIKHEF. The DDM tools operate on top of the Grid middleware, so any use of DDM requires that you load your Grid certificate first. If you don't have a Grid certificate yet, consult [1]
Set up
You can include the following commands in your .cshrc to automate the setup of the DQ2 tools and the Grid environment:
source /global/ices/lcg/current/etc/profile.d/grid_env.csh
source /project/atlas/nikhef/dq2/dq2_setup.csh.NIKHEF
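Before you can use any of the DQ2 tools you also need a valid Grid proxy. A minimal sketch, assuming the VOMS client tools are available after sourcing the environment above (the exact command may differ with your middleware version):
voms-proxy-init -voms atlas
voms-proxy-info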
Key concepts of Atlas Distributed Data Management
Atlas DDM revolves around the concept of a 'dataset', which is a collection of files. The definition of which files belong to a dataset is kept in a central Oracle database at CERN.
Each file in DQ2 (and on the Grid) has a logical file name, for easy human use, and a Grid unique identifier, or GUID for short, that is used internally to identify the file. The GUID of each file is assigned when it is stored on the Grid for the first time.
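For illustration, a logical file name and its GUID might look like this (a hypothetical example; the file name follows the dataset naming convention described below, and the GUID is machine-generated):
trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457._00001.pool.root
guid: 8B5A9F2C-1D34-4E6A-9C0B-7F3D2A514E98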
A key concept of Atlas DDM, as well as the grid middleware, is that a file may have multiple replicas, i.e. the same file is stored in multiple storage elements on the GRID.
The ATLAS computing model prescribes that all AOD files are replicated to the storage elements of all 10 Tier-1 centers, of which SARA/NIKHEF is one. Thus there are nominally 10 copies of each AOD file available. In addition, there are two copies of each ESD file available (currently one, because of lack of disk space) that are distributed over the Tier-1s (each Tier-1 holds two times 1/10 of the ESD data). Each Tier-1 site is also associated with one or more Tier-2 sites that can hold part of the AOD data.
The ATLAS DDM tools, collectively named dq2, manage the data replication between the Tier-0 (CERN), the Tier-1s and the Tier-2s. The replication strategy is based on a subscription model, i.e. each site can enter a subscription to obtain a replica from a given source location. Management of these subscriptions is handled centrally by ATLAS computing operations and data management people.
There exists a separate suite of dq2 'end-user' tools that physicists can use to inquire which datasets exist and to copy data files from one of the Grid storage elements to their local computer.
Dataset naming convention
The general naming convention for datasets is the following:
trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457
- The first part, trig1_misal1_mc12, is the production project. It shows which configuration was run, e.g. whether the trigger was on, which detector alignment was used and which set of input generator files was used. Details on the naming convention of the project can be found here: https://twiki.cern.ch/twiki/bin/view/Atlas/FileNamingConvention
- The next token, 005850, is the dataset number. This number uniquely identifies the Monte Carlo dataset. The authoritative definition of each dataset is given by the corresponding jobOptions files, which are stored in Generators/DC3_joboptions. Here is a link to the CVS head of that package: http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Generators/DC3_joboptions/share/
- The 'recon' token shows the production task that produced the output files, in this case reconstruction. Allowed values are 'evgen' for event generation, 'digit' for simulation and digitization (run as a single task) and 'recon' for reconstruction.
- The next token, ESD in this example, shows the type of file stored in this collection. Existing file types are 'HITS' for simulation output, 'RDO' for digitization output (the 'raw data' equivalent), 'ESD' for Event Summary Data, 'AOD' for Analysis Object Data, 'NTUP' for CBNTAA ntuples and 'log' for the log files.
- The version token shows the version of the ATLAS software that was used to execute this processing step. A production version consists of an ATLAS release and a patch level; in the example above the release is 12.0.6 with patch level 3, usually denoted as 12.0.6.3. The lowest patch version possible is 1, so 12.0.6.1 is equivalent to release 12.0.6. Finally, all production tasks first store their output in a temporary dataset with _tid00XXXX appended to the name. Only when the production task is finished is all data copied over to the dataset with the final name. It is perfectly OK to use data from _tid datasets as long as you understand what data it is (e.g. unvalidated pilot samples). You can look up the status of production tasks in the Panda browser by entering the task ID in the bottom field of the form at http://lxfsrk522.cern.ch:28243/?mode=taskquery
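Putting this together, the example dataset name above decomposes into the following tokens (the WH120bb_pythia part, not discussed above, is the short name of the physics sample and generator):
trig1_misal1_mc12   production project
005850              dataset number
WH120bb_pythia      physics sample / generator short name
recon               production task
ESD                 file type
v12000603           software version (release 12.0.6, patch level 3)
_tid007457          temporary production task ID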
Basic use of DQ2 end-user tools
To inquire which datasets exist, use the dq2_ls tool; wild cards are allowed. The basic commands are:
- dq2_ls <pattern>
  search for datasets matching <pattern>
- dq2_ls -g <dataset>
  list the files in <dataset>
- dq2_get -v -r <dataset>
  get (download) the files of <dataset>, with verbose output (-v)
The option -r is needed to download the files 'remotely' via the Grid. As far as I know (correct me if I'm wrong) it is not possible to copy files directly from the NIKHEF SE (as one can with CASTOR at CERN).
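For example, to locate and download the dataset used as an example earlier on this page (the name is purely illustrative; substitute the dataset you actually need, and note that depending on your shell you may have to quote the wildcard):
dq2_ls 'trig1_misal1_mc12.005850*'
dq2_ls -g trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457
dq2_get -v -r trig1_misal1_mc12.005850.WH120bb_pythia.recon.ESD.v12000603_tid007457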