VL-e Resource Guide

This document focuses on listing the grid resources that are available for [http://www.vl-e.nl VL-e] project members. It does not explain how to use these resources; for that, you should consult the [[Media:GridTutorial2006.pdf|Grid Tutorial]] documentation. It is recommended that new users attend the Grid Tutorial, which is held every year.

The grid resources for VL-e are provided jointly by Nikhef and Sara, who participate in a larger framework for grid computing worldwide, in particular for the high energy physics experiments of the LHC. The grid middleware is provided by the European EGEE project. As a national project, VL-e has to share the resources with other applications.

Nikhef and Sara play somewhat different roles: Nikhef focuses mainly on computational clusters, while Sara has a tape storage facility for long-term data storage.
 
== Computing ==
 
The clusters are accessible through the [http://glite.web.cern.ch/glite/ gLite] stack of software developed by the EGEE project, based largely on Globus. You can use the command-line tools, as explained in the [[Media:GridTutorial2006.pdf|Grid Tutorial]]. If you require the VL-e Proof of Concept software distribution, you should add a requirement to your JDL file that says:

 Requirements = Member("nl.vl-e.poc-release-2", other.GlueHostApplicationSoftwareRunTimeEnvironment);
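
In context, this requirement sits alongside the usual job description attributes. A minimal JDL sketch (the script and file names below are made up; only the Requirements line is specific to the PoC distribution) could look like:

 [
   Executable    = "runme.sh";                       // your own script, shipped in the InputSandbox
   StdOutput     = "stdout.log";
   StdError      = "stderr.log";
   InputSandbox  = {"runme.sh"};
   OutputSandbox = {"stdout.log", "stderr.log"};
   Requirements  = Member("nl.vl-e.poc-release-2", other.GlueHostApplicationSoftwareRunTimeEnvironment);
 ]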
  
  
 
The following list shows the compute elements that are available to VL-e VOs. This information can be retrieved with

 lcg-infosites --vo pvier ce
 
and will vary according to per-VO configurations at the sites.
 
The information system can be publicly queried by standard LDAP tools, such as ldapsearch:
  
 
 ldapsearch -x -H ldap://bdii03.nikhef.nl:2170/ -b 'mds-vo-name=NIKHEF-ELPROD,mds-vo-name=local,o=grid'
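
If you only need a specific piece of information, you can add an LDAP filter and a list of attributes to the query. As a sketch (the attribute names below follow the GLUE 1.x schema; verify them against the BDII you query), the queues and their wall-clock limits can be listed with:

 ldapsearch -x -H ldap://bdii03.nikhef.nl:2170/ -b 'mds-vo-name=NIKHEF-ELPROD,mds-vo-name=local,o=grid' \
     '(objectClass=GlueCE)' GlueCEUniqueID GlueCEPolicyMaxWallClockTime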
   
 
  
An excellent LDAP browser (written in Java) can be found [http://www.mcs.anl.gov/~gawor/ldap here]. To browse the information system, connect to bdii03.nikhef.nl on port 2170 and enter 'o=grid' as the base DN.
  
{| cellpadding="10" cellspacing="0" style="border: 1px solid #eee"
|+ align="bottom" style="font-style: italic"|List of available queues
|-
!style="text-align: left;"|Resource name
!style="text-align: left;"|Maximum wall time
!style="text-align: left;"|Comments
|-bgcolor="#eee"
|tbn20.nikhef.nl:2119/jobmanager-pbs-qlong ||30 hours ||The Nikhef cluster has 1156 CPU cores
|-
|tbn20.nikhef.nl:2119/jobmanager-pbs-qshort ||5 hours ||
|-bgcolor="#eee"
|ce.gina.sara.nl:2119/jobmanager-pbs-short ||4 hours ||GINA (Grid In Almere) has 160 CPU cores
|-
|ce.gina.sara.nl:2119/jobmanager-pbs-medium ||33 hours ||
|}
  
Your VO affiliation greatly affects your ability to run jobs. While the Dutch sites support the VL-e VOs, they also support many other VOs on the same infrastructure. This leads to competition for cycles. The mechanism to address this issue is to allow a [[fair share]] of the cycles to be used by a VO, and to give higher priority to VOs who have used little of their fair share in the last period.

[[Scheduling]] on the cluster is straightforward, but from a user perspective it is less straightforward to predict how quickly your job will run. The parameter that is of most interest is the [[estimated response time]], which expresses the expected time it will take for a job that is submitted right now to start running on a cluster. See for instance [http://www.nikhef.nl/grid/stats/ndpf-prd/grisview-short this graph].
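
The estimated response time is also published in the information system, so you can query it directly; a sketch along the lines of the ldapsearch example above (again assuming the GLUE 1.x attribute names):

 ldapsearch -x -H ldap://bdii03.nikhef.nl:2170/ -b 'mds-vo-name=NIKHEF-ELPROD,mds-vo-name=local,o=grid' \
     '(objectClass=GlueCE)' GlueCEUniqueID GlueCEStateEstimatedResponseTime
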
=== Running tests and debugging ===
  
If your grid jobs are not behaving as expected, debugging can be a really frustrating ordeal. You have no way to inspect a running job up close, and the turnaround for each modification to your job is high.

To shorten the turnaround, you may request (from mailto:grid.support@sara.nl) the privilege to use the express queues on the GINA and Matrix clusters. The restriction is that jobs may last no longer than a couple of minutes.
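
Once granted, you can point a short test job directly at such a queue, bypassing the broker's normal matchmaking. A sketch with the gLite WMS command-line tools (the queue name below is hypothetical; use the one grid support gives you):

 glite-wms-job-submit -a -r ce.gina.sara.nl:2119/jobmanager-pbs-express test.jdl
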
If this is still not enough, and an application requires serious testing and debugging, you may request the use of the VL-e P4 Certification Test Bed. Contact mailto:vle-pfour-team@lists.vl-e.nl for support.
  
 
== Storage ==
  
Grid Storage Elements can be discovered with the command

 lcg-infosites --vo pvier se

(where you should replace ''pvier'' by the name of your own VO).
{|cellpadding="10" cellspacing="0" style="border: 1px solid #eee"
|+align="bottom" style="font-style: italic"|Storage elements reported on August 22, 2007 for pvier
|-
!style="text-align: left;"|Avail Space (Kb)
!style="text-align: left;"|Used Space (Kb)
!style="text-align: left;"|SEs
!style="text-align: left;"|Remarks
|-bgcolor="#eee"
|482906560 ||1587920944 ||tbn15.nikhef.nl ||classic SE; don't use this old one.
|-
|1710000000 ||137730 ||tbn18.nikhef.nl ||Modern DPM system with SRM interface; use this one.
|-bgcolor="#eee"
|317044396 ||539916244 ||mu2.matrix.sara.nl ||dCache system with SRM interface
|}
Note that the numbers may be different for other VOs, and there may actually be fewer SEs showing up. If you need more disk quota, please contact grid support (mailto:grid.support@nikhef.nl or mailto:grid.support@sara.nl).

=== LFC ===
The Logical File Catalog (LFC) is a way to have easy-to-remember aliases for grid storage files. Think symbolic links. To use the lfc-* tools, set

 LFC_HOST=lfc.grid.sara.nl
 export LFC_HOST

and of course you need a valid proxy. Now you can do, e.g.

 lfc-ls /grid/vlemed/
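
Copying a local file onto one of the storage elements and registering an easy-to-remember LFC alias for it in one go can be done with the lcg data management tools. A sketch (the VO, local file and alias below are made up; substitute your own):

 lcg-cr --vo vlemed -d tbn18.nikhef.nl -l lfn:/grid/vlemed/demo/results.tar.gz file:///home/myuser/results.tar.gz
 lcg-cp --vo vlemed lfn:/grid/vlemed/demo/results.tar.gz file:///tmp/results.tar.gz

The first command uploads the file to the DPM at Nikhef and registers the alias; the second retrieves it again by its logical name.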
  
=== Tape storage ===
  
There is no default storage to tape. If you need tape storage, you have to explicitly request it. Sara can provide tape storage with various regimes, such as automatic disk-to-tape migration. Contact mailto:grid.support@sara.nl for more information.
  
=== About data replication ===
The Grid Tutorial explains how you can replicate your data to multiple storage elements, so that when you submit a job that needs this data, the resource broker may find a compute element 'close' to one of your copies. The current situation for VL-e members is such that all clusters on which a job may land are within the Netherlands. Thus, there is no gain in having more than one replica of your data around to improve the transfer efficiency.
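
Should you nevertheless want a second copy, replicating a file that is already registered in the LFC to another storage element is a single command with the lcg tools (file name as in the made-up example above):

 lcg-rep --vo vlemed -d mu2.matrix.sara.nl lfn:/grid/vlemed/demo/results.tar.gz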
  
== Grid Access ==
Accessing the Grid means using grid tools; installing these tools can be tricky. Currently, there are a number of choices:
  
* ui.grid.sara.nl, a centrally provided machine at Sara. Contact mailto:grid.support@sara.nl to obtain a local account.
* install the VL-e PoC distribution (see the [http://poc.vl-e.nl/distribution/ VL-e PoC distribution] page)
* install the [http://poc.vl-e.nl/distribution/vmware_r2.html VMWare image] of the VL-e PoC distribution.
  
 
== Monitoring ==
  
There are various ways to get information about the current state of the Grid and your grid jobs.

If you want to be informed of updates, downtimes, etc. you can subscribe to the [https://lists.vl-e.nl/mailman/listinfo/infrastructure-announce infrastructure-announce mailing list].

Sara uses Ganglia:
* [http://ganglia.sara.nl/?c=GINA%20Cluster&m=&r=hour&s=descending&hc=4 GINA cluster load for last hour]
* [http://ganglia.sara.nl/addons/job_monarch/?c=GINA%20Cluster GINA joblist report]

Nikhef shows some useful statistics:
* [http://www.nikhef.nl/grid/stats/ndpf-prd/ Nikhef production facility statistics]

External monitoring tools:
* [http://goc.grid.sinica.edu.tw/gstat/ gStat]
* [http://gridportal.hep.ph.ic.ac.uk/rtm/ GridPP Real Time Monitor]

VO specific monitoring:
* [http://opkamer.nikhef.nl/ VLemed dashboard]
  
== Documentation ==
=== Links ===
  
* Fokke Dijkstra, Jeroen Engelberts, Sjors Grijpink, David Groep, Arnold Meijster, Jeff Templon. ''[[Media:Grid Tutorial 2007.pdf|The Grid Tutorial handouts 2007]]''. This is the '''must read''' beginner's guide to the grid for VL-e users; it covers topics like
** Grid security; getting a certificate, registering with a VO,
** Job submission, and
** Data Management.
* Stephen Burke, Simone Campana, Antonio Delgado Peris, Flavia Donno, Patricia Méndez Lorenzo, Roberto Santinelli, Andrea Sciabà, [https://edms.cern.ch/file/722398//gLite-3-UserGuide.html gLite 3 User Guide]
* [http://www.nikhef.nl/grid Nikhef Grid information]
* [[NDPF Node Functions]]
* [http://www.sara.nl/systemstatus/systemstatus_eng.php3 Sara system status]
* [[NDPF News]] Nikhef Data Processing Facility news
* [http://ca.dutchgrid.nl/ the DutchGrid and Nikhef Certification Authority]
* [http://poc.vl-e.nl/distribution/quickstart/ HOWTO: Start using the grid, a quickstart tutorial] This quickstart guide is aimed at vlemed users, but it has lots of good stuff for other users, too.
* ''[[Media:VLeMed-Silvia-26Sept2007.pdf|VLeMed fMRI presentation]]'' by Silvia D. Olabarriaga, presented at the Grid Tutorial, 26-9-2007, Surfnet, Utrecht
