VL-e Resource Guide
This document lists the grid resources that are available to VL-e project members. It does not explain how to use these resources; for that, you should consult the Grid Tutorial documentation. New users are recommended to attend the Grid Tutorial, which is held every year.
The grid resources for VL-e are provided jointly by Nikhef and Sara, which participate in a larger framework for grid computing worldwide, in particular for the high-energy physics experiments of the LHC. The grid middleware is provided by the European EGEE project. As a national project, VL-e has to share these resources with other applications.
Nikhef and Sara play somewhat different roles: Nikhef focuses mainly on computational clusters, while Sara also has a tape storage facility for long-term data storage.
Computing
The clusters are accessible through the gLite stack of software developed by the EGEE project, based largely on Globus. You can use the command-line tools, as explained in the Grid Tutorial.
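To sketch what a job looks like in this stack (the file name and VO here are illustrative; consult the Grid Tutorial for the authoritative workflow), a minimal job description in JDL might be:

```
// hello.jdl -- minimal illustrative job description
Executable          = "/bin/hostname";
StdOutput           = "std.out";
StdError            = "std.err";
OutputSandbox       = {"std.out", "std.err"};
VirtualOrganisation = "pvier";
```

Such a file would typically be submitted with edg-job-submit hello.jdl, monitored with edg-job-status, and its output retrieved with edg-job-get-output.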
The compute elements
Rather than explaining how to make use of the resources, this document complements the Grid Tutorial by listing which resources are available to VL-e participants.
The following table shows the compute elements that are available to VL-e VOs. This information was retrieved with lcg-infosites --vo pvier.
The information system can be queried publicly with standard LDAP tools, such as ldapsearch:
$ ldapsearch -x -H ldap://bdii03.nikhef.nl:2170/ -b 'mds-vo-name=NIKHEF-ELPROD,mds-vo-name=local,o=grid'
An excellent LDAP browser (written in Java) can be found here [1]. To browse the information system, connect to bdii03.nikhef.nl on port 2170 and enter 'o=grid' as the base DN.
Two common mistakes in writing grid jobs (from personal experience): forgetting to ship the executable script in the InputSandbox, and forgetting to set the executable flag on the script.

Resource name                                   Max wall time  Comments
tbn20.nikhef.nl:2119/jobmanager-pbs-qlong       30 hours       the Nikhef cluster has approx. 400 CPUs
tbn20.nikhef.nl:2119/jobmanager-pbs-qshort      5 hours
mu6.matrix.sara.nl:2119/jobmanager-pbs-short    4 hours        the Matrix cluster will be gone soon
mu6.matrix.sara.nl:2119/jobmanager-pbs-medium   33 hours
ce.gina.sara.nl:2119/jobmanager-pbs-short       4 hours        GINA (Grid In Almere) has 128 CPUs
ce.gina.sara.nl:2119/jobmanager-pbs-medium      33 hours
Matrix and GINA also have express queues for short jobs; these queues are ideal for tests, since jobs will run almost immediately. These queues are not for production work, and jobs that run for more than 10 minutes will automatically be terminated.
Each of these queues can be accessed directly through the Globus Gatekeeper (Globus 2 flavour), but the preferred way of running jobs is submission through a resource broker.
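As an illustration of direct gatekeeper access (a sketch only; this bypasses the resource broker, and the queue is one of the contact strings from the table above), a trivial test job could be run as:

```
$ globus-job-run tbn20.nikhef.nl:2119/jobmanager-pbs-qshort /bin/hostname
```

This requires a valid grid proxy and prints the hostname of the worker node that ran the job.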
If you submit work by other means, for instance as part of a workflow package, the Condor-G [2] method is preferred over globus-job-run.
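For reference, a Condor-G submission to one of the compute elements above might use a submit file along these lines (a sketch; the file names and the chosen CE are illustrative):

```
# hello.sub -- illustrative Condor-G submit description
universe      = grid
grid_resource = gt2 tbn20.nikhef.nl:2119/jobmanager-pbs-qshort
executable    = hello.sh
output        = hello.out
error         = hello.err
log           = hello.log
queue
```

The job would then be submitted with condor_submit hello.sub and monitored with condor_q.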
Your VO affiliation greatly affects your ability to run jobs. While the Dutch sites support the VL-e VOs, they also support many other VOs on the same infrastructure, which leads to competition for cycles. The mechanism to address this is fair-share scheduling: each VO is allotted a fair share of the cycles, and VOs that have used little of their fair share in the recent period are given higher priority.
The exact calculations of the fair share are somewhat of a black art.
Storage
Like computing, storage on the grid is about scale. Large-scale computations go hand in hand with large-scale demands for (intermediate) storage.
The storage provided by Nikhef and Sara is arranged as follows (output of lcg-infosites --vo pvier se, as of 22 August 2007):

Avail Space (kB)  Used Space (kB)  SE                  Remarks
482906560         1587920944       tbn15.nikhef.nl     classic SE; don't use this old one
1710000000        137730           tbn18.nikhef.nl     modern DPM system with SRM interface; use this one
317044396         539916244        mu2.matrix.sara.nl  dCache system with SRM interface
Note that the numbers may be different for different VOs, and there may actually be fewer SEs showing up.
Besides these SRM-enabled systems, there is an SRB system provided by Sara. Unfortunately, it cannot be used with the standard grid tools for logical file names, replicas, etc.; however, there is a GridFTP front end. You need to obtain an account by contacting Sara (see the quickstart document).
The bottom line for storage is that you should voice your particular needs, because by default there are no guarantees regarding availability of space or protection against data corruption. If your use case involves large quantities of data, contact the sites in advance to make arrangements.
* Nikhef is disk only, using DPM (Disk Pool Management) on tbn18.
* Sara has disk and tape, but you have to contact them if you have special needs (such as tape storage).
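As a sketch of how a file might be stored on one of these SEs with the standard data-management tools (the local path and the logical file name are illustrative), assuming a valid proxy:

```
$ lcg-cr --vo pvier -d tbn18.nikhef.nl \
    -l lfn:/grid/pvier/example/data.dat \
    file:///home/user/data.dat
```

This registers the file in the catalogue under the given LFN; it can later be copied back with, for example, lcg-cp --vo pvier lfn:/grid/pvier/example/data.dat file:///tmp/data.dat.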
You can supply the InputData key in your JDL and give a logical file name. This triggers the resource broker to attempt to find a compute element 'close' to some replica of the mentioned file.
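A JDL fragment using this mechanism might look as follows (a sketch; the LFN and the protocol list are illustrative):

```
// illustrative JDL fragment: ask the broker for a CE close to a replica
InputData          = {"lfn:/grid/pvier/example/data.dat"};
DataAccessProtocol = {"gsiftp"};
```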
The brokerinfo will then show a SURL of the file, from which a transfer URL can be obtained with the SRM tools.

Grid UI machines
Accessing the Grid means using grid tools; installing these tools can be tricky (see the VL-e PoC distribution page). Currently, there are a number of choices:
* ui.grid.sara.nl, a centrally provided machine at Sara. Contact Sara to obtain a local account.
* install the VL-e PoC distribution
* install the VMware image of the VL-e PoC distribution.
Monitoring
Documentation
Links
* The Grid Tutorial 2006: https://gforge.vl-e.nl/docman/view.php/10/88/GridTutorial2006.pdf
* To see the status/load of the Matrix cluster, look at the Ganglia tools:
  o Matrix cluster load for last hour
  o Matrix joblist report
* To see system status (maintenance, etc.): http://www.sara.nl/systemstatus/systemstatus_eng.php3