Using the Grid/Black hole

From BiGGrid Wiki

High job failure rates

It may happen that your jobs fail at a high rate. This can have several causes; when in doubt contact support.

Your jobs may land in any of the queues on the clusters. Some of these queues are only meant for small jobs that take a few minutes to compute.

To schedule jobs on the shared grid resources, the grid makes use of wallclock time. If your job occupies a worker node for longer than the queue allows, it will be aborted.

The solution is to estimate the expected wallclock time (the time your job will need to run) and state it as a requirement in the Job Description.

We advise you to always include such a statement!

You should approach this as follows:

1. Estimate the runtime of your job as T minutes.
2. Make sure your job lands in a queue that allows it to run for at least T minutes without being aborted.
3. Do this by specifying the appropriate requirement in your JDL:

Requirements = (other.GlueCEPolicyMaxWallClockTime >= T);

In the example below, the job is expected to need at most 120 minutes. More precisely, the requirement asks for a queue that allows the job to run for at least 120 minutes:

Requirements = (other.GlueCEPolicyMaxWallClockTime >= 120);

The above statement restricts the job to queues whose maximum wallclock time is at least 120 minutes. It will therefore not end up in queues for short jobs.
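
For context, a complete job description containing this requirement might look like the sketch below. The executable and the output file names are placeholders chosen for illustration and are not taken from this page; only the Requirements line is. Adapt the other attributes to your own job.

```
// Minimal JDL sketch; the executable and file names are hypothetical placeholders
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};

// Only match queues that allow at least 120 minutes of wallclock time
Requirements  = (other.GlueCEPolicyMaxWallClockTime >= 120);
```

If your estimate is uncertain, choosing a somewhat larger value than your expected runtime leaves a margin before the queue limit is reached.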

You can see the available queues with the lcg-infosites command.

lcg-infosites --vo yourVO ce

For queue wallclock configurations see: lcg-info

When in doubt, you can use the glite-wms-job-match command to see which queues match the requirements in your JDL.