Enabling multicore jobs and jobs requesting large amounts of memory

This article describes how to enable and submit multicore and large-memory grid jobs on the Cream Computing Elements at Nikhef.

Introduction

Certain applications will benefit from access to more than one core (logical CPU) on the same physical computer. Grid jobs that use more than one core are referred to as multicore jobs.

Other applications require a specific amount of memory to run efficiently or successfully. Such jobs are called large-memory jobs in this article (because they often require a higher-than-default amount of memory on the machine).

The Cream Computing Elements (CreamCEs) offer support for multicore jobs and large-memory jobs, although they need some additional configuration to forward the job requirements to the batch system.

This article describes the support of multicore jobs or large-memory jobs at the Cream Computing Elements at Nikhef. Section System Configuration describes the setup of the system. In section Submitting multicore or (large) memory jobs, the information relevant to end users is presented.

The information presented here is valid for the UMD-1 version of the CreamCE in combination with a batch system based on Torque 2.3. Other versions of the CreamCE (in particular the nearly unsupported gLite 3.2 version) may require different configuration. Other versions of the Torque batch system may work fine, although that hasn't been verified. Different batch systems fall outside the scope of this article.

System Configuration

Two services are involved in the submission of grid jobs requiring multiple cores or specific amounts of memory: the CreamCE and the Torque batch system. The CreamCE is the entry point for a grid job at a site. The CreamCE processes the resource requests (passed via the JDL file) and translates them into a format that is specific for the batch system implementation. The batch system can then allocate the requested resources.

Setup at the CreamCE

To support multicore jobs or memory requests, the CreamCE must recognize these requests and translate them into directives for the Torque batch system. In the implementation discussed here, the CreamCE needs 3 files:

  • A script to process specific resource requests Media:pbs_local_submit_attributes.sh. This file should be installed as /usr/libexec/pbs_local_submit_attributes.sh on the CreamCE (for UMD-2 and UMD-3; it was /usr/bin for UMD-1). It processes memory resource requests, and requests for a minimum CPU or wall time.
  • A configuration file to activate this submit filter Media:torque.cfg. This file should be stored as /var/spool/pbs/torque.cfg (a minimal example is shown after this list).
  • A Torque submit filter to write the input for the batch system Media:torque_submit_filter-v2.pl. This file should be installed as /usr/local/sbin/torque_submit_filter.pl (actually: the location specified in file torque.cfg above as value for SUBMITFILTER). The submit filter is discussed below in more detail.
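
For reference, the torque.cfg mentioned in the list above typically contains little more than the line that activates the submit filter; the path must match the location where the filter is actually installed:

 SUBMITFILTER /usr/local/sbin/torque_submit_filter.pl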

The submit filter

The submit filter processes the initial input for the batch system. It inspects the requested resources and may change them before writing the (possibly modified) input to standard output (which is forwarded to the batch server).

By default, the batch system allocates 1 core and (optionally) a certain amount of memory to each job. It is reasonable that a multicore job requesting N cores can also access N times the memory of a single-core job.

For jobs requesting a certain amount of memory, the situation is reversed. When a job requests M times the amount of memory of a default job, M-1 other jobs cannot run on the same machine. From a resource point of view, the large-memory job is thus equivalent to a job requesting M cores. In the resource usage accounting system, both types of jobs should be treated as if they were N or M simple jobs.

The submit filter presented here scales the amount of memory available to a multicore job with the number of requested cores. The amount of memory per core is obtained from the queue definition (see subsection Setup at the batch system), possibly capped by a maximum (which is also taken from the queue definition). Torque allows defaults and maximum limits to be defined for physical memory (option mem), physical memory per process (pmem), virtual memory (vmem) and virtual memory per process (pvmem). The submit filter scales mem and vmem with the number of cores; the per-process limits pmem and pvmem scale with the maximum number of cores on any of the allocated physical machines (in case more than one physical machine is used).
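
As an illustration (the values are hypothetical and the exact lines written by the filter may differ), assume the queue defines a default of 4096mb per job slot and a job requests 4 cores on a single node:

 # incoming request
 #PBS -l nodes=1:ppn=4
 # after filtering: memory limits scaled by the number of cores (4 x 4096mb)
 #PBS -l nodes=1:ppn=4
 #PBS -l mem=16384mb
 #PBS -l pvmem=16384mb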

For large-memory jobs, the submit filter scales the number of allocated cores with the amount of requested memory divided by the amount of memory per single job; the number of cores is rounded up to the nearest integer. A large-memory job thus also gets more than one core allocated for computing.
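
Again as an illustration (hypothetical values; the exact output of the filter may differ), assume a default of 4096mb per job slot and a job that only requests 10000mb of memory:

 # incoming request
 #PBS -l mem=10000mb
 # after filtering: ceil(10000/4096) = 3 cores are allocated
 #PBS -l nodes=1:ppn=3
 #PBS -l mem=10000mb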

After the number of cores has been established, the maximum CPU time available to the job will be computed. This limit is the product of the allocated number of cores and the default CPU time limit for a single core job.
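
For example, with the default single-core CPU time limit of 35:38:24 used in section Current configuration below, a job that is allocated 4 cores would get a CPU time limit of 4 x 35:38:24 = 142:33:36.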

Note that if an incoming job requests both a certain amount of memory and a certain number of cores, only the CPU time limit is scaled (with the number of cores).

Setup at the batch system

As described in the previous section, the submit filter will try to determine the default amount of memory per job slot and the CPU time limit per core from the queue definition. In addition, there may be limits to the amount of memory. These limits may be defined per queue in Torque via the parameters resources_default and resources_max:

set queue <QUEUE> resources_default.cput = <HH:MM:ss>
set queue <QUEUE> resources_default.mem = <N>mb
set queue <QUEUE> resources_max.cput = <HH:MM:ss>
set queue <QUEUE> resources_max.mem = <M>mb

The above may be extended with limits or defaults for vmem, pmem and pvmem, depending on site policies.

Note: although not required by Torque, the grid information system for certain VOs expects to find a CPU time limit. So not defining a CPU time limit will cause problems.

Current configuration

The current configuration used for most queues is similar to the following:

set queue medium resources_max.walltime = 36:00:00
set queue medium resources_default.cput = 35:38:24
set queue medium resources_default.pvmem = 4096mb

The maximum wall time guarantees a certain throughput of jobs per day and prevents jobs that are not properly functioning from hanging forever.

The default CPU time is set for a single-core job and is a bit lower than the maximum wall time. There is no maximum CPU time defined, although an effective maximum is computed by the submit filter (scaling the default with the number of allocated cores). Furthermore, each job gets assigned a selected set of cores on which it can execute processes, using the Linux taskset command, so there is no need to limit the CPU time beyond the maximum wall time that the job can run.
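
For illustration only (the core numbers are chosen by the batch system, not by the user), confining the processes of a job to a specific set of cores with taskset looks roughly like this, where run_job.sh is a hypothetical job script:

 taskset -c 4-7 ./run_job.sh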

The default pvmem is defined to limit the virtual memory usage per process. The submit filter scales this default with the number of allocated job slots, so there is no need for a maximum. The pvmem limit translates into an effective ulimit on the address space. When a job tries to allocate more memory than allowed by this limit, the allocation will fail and the job can (should!) deal with that exception in a graceful manner. The submitter can find the reason for the failure in the log files. Using a vmem limit instead would lead to a different response: the job would be killed by the batch system, which is rather inelegant and gives the user no feedback about why the job failed.
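
A job can inspect the resulting address-space limit (reported in kilobytes) from within its own environment, for example with:

 ulimit -v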

Submitting multicore or (large) memory jobs

Resource requirements are specified in the JDL file that is passed during submission to the WMS or CreamCE.

For submission to the Nikhef CEs, make sure the JDL file is submitted to the Cream endpoint:

<CE>.nikhef.nl:8443/cream-pbs-flex

or make sure to submit to queue flex. This queue will only accept a limited number of jobs!
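
For direct submission to the CreamCE (rather than via the WMS), a command along the following lines could be used; job.jdl is a placeholder for the user's JDL file and the -a option requests automatic proxy delegation:

 glite-ce-job-submit -a -r <CE>.nikhef.nl:8443/cream-pbs-flex job.jdl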

Note: jobs requesting multiple cores on the same physical machine (or equivalently, large amounts of memory) will typically wait much longer in the batch queue than single-core jobs before they can run. On busy systems it will take some time before the scheduler can find all requested resources on one physical machine.

Note: the examples below are valid for the UMD-1 version of the CreamCE. They differ from the parameters that should be used when submitting to the gLite 3.2 version of the CreamCE!

Multicore jobs

Job requirements concerning the number of reserved cores and the number of cores on a physical machine can be specified via the attributes CpuNumber and SMPgranularity, respectively.

Example: to request 4 cores on 1 node, use the following statements in the JDL:

 SMPgranularity = 4;
 CpuNumber = 4;

Note: this will fail if the requested number of cores is larger than the number of cores present in the physical machine (actually: the configured number of cores in the physical machine).

A similar construction can be used to request multiple cores on multiple physical machines, for example M cores spread over several machines, with N cores per machine (where M > N):

 SMPgranularity = N;
 CpuNumber = M;

The batch system will try to find (M/N) physical machines, with N cores per machine. If M is not an integer multiple of N, the batch system will reserve int(M/N) machines with N cores and 1 machine with (M mod N) cores. Say M=8 and N=3, then there will be int(8/3)=2 machines with 3 cores and 1 machine with (8 mod 3)=2 cores.
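
As a concrete instance of the example above (M=8, N=3), the JDL would contain:

 SMPgranularity = 3;
 CpuNumber = 8;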

Large-memory jobs

The JDL attribute CERequirements can be used to request a physical amount of memory on the node via the field other.GlueHostMainMemoryRAMSize. The unit is MB.

Example: to request at least 8 GB physical memory on a node, use the following statement in the JDL:

 CERequirements = "other.GlueHostMainMemoryRAMSize >= 8192";

Whole-node jobs

To request all resources on one node, the attribute WholeNodes should be set to true:

 WholeNodes = true;
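
For completeness, a minimal (hypothetical) JDL requesting a whole node could look as follows; the executable, output files and sandbox entries are only placeholders:

 [
   Executable = "/bin/hostname";
   StdOutput = "std.out";
   StdError = "std.err";
   OutputSandbox = {"std.out", "std.err"};
   WholeNodes = true;
 ]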