Stoomboot cluster

The Stoomboot cluster is the local batch computing facility at Nikhef. Users from the scientific groups can use it for tasks such as data analysis and Monte Carlo calculations.

The cluster consists of 93 nodes with 8 cores each, running Scientific Linux CERN 6 (SLC6) as the operating system. For a limited time, 15 additional nodes (8 cores each) run Scientific Linux CERN 5 to ease the transition to SLC6.

The Stoomboot cluster uses the Torque resource manager in combination with the Maui scheduler. To interact with the batch system (e.g. to submit a job, query a job's status or delete a job), you need to log in to a Linux machine, either on the console or via ssh. Suitable machines include desktops managed by the CT department, the login hosts and the interactive Stoomboot nodes.

Job Submission

The command qsub is used to submit jobs to the cluster. A typical use of this command is:

qsub [-q <queue>] [-l resource_name[=[value]][,resource_name[=[value]],...]] [script]

The optional argument script is the user-provided script that does the work. If no script is provided, the input is read from the console (STDIN).

Please read the section Queues below for more information about available queues and their properties. It is recommended to specify a queue.

For simple jobs, it is usually not necessary to provide the option -l with its resource list. However, if the job needs more than one core or node, or if the wall time limit of the job (the maximum elapsed time that the job may run) should be specified, this option should be used. Examples:

  • -l nodes=1:ppn=4 requests 4 cores on 1 node
  • -l walltime=32:10:05 requests a wall time of 32 hours, 10 minutes and 5 seconds
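
The same options can also be placed inside the job script as #PBS directive lines, which qsub reads from the top of the script. The following is a minimal sketch, not a site-endorsed template: the script name myjob.sh is hypothetical, and the long queue is chosen only because the requested wall time exceeds 24 hours. The qsub call is guarded so that the snippet only prints a message on a machine without the Torque client.

```shell
#!/bin/sh
# Write a minimal job script. The file name myjob.sh is a hypothetical example.
cat > myjob.sh <<'EOF'
#!/bin/sh
#PBS -q long
#PBS -l nodes=1:ppn=4
#PBS -l walltime=32:10:05
# Torque starts the job in the home directory; change to the directory
# from which the job was submitted (fall back to "." outside the batch system).
cd "${PBS_O_WORKDIR:-.}"
echo "running on $(hostname)"
EOF

# Submit only where the Torque client is installed.
if command -v qsub >/dev/null 2>&1; then
    qsub myjob.sh
else
    echo "qsub not found; would run: qsub myjob.sh"
fi
```

Options given on the qsub command line take precedence over #PBS lines in the script, so the same script can be resubmitted with, for example, a different wall time.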

More detailed information can be found in the manual page for qsub:

man qsub

Job Status

Users can look at the status of all current jobs with the command qstat:

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.allier                                user1           21:22:33 R stbcq
1002.allier               myscript.csh     user2           08:15:38 C long
1003.allier               myscript.csh     user2           00:02:13 R long
1004.allier               myscript.csh     user2                  0 Q long

The above example shows 4 jobs from 2 different users. Two jobs are running (1001 and 1003), one has finished (1002) and one is still waiting in the queue (1004).
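
With many jobs in the listing, a per-state summary can be easier to read than the full table. The sketch below assumes the default qstat layout shown above, in which the state letter is the second-to-last column and the first two lines are headers. The function name count_states is a hypothetical helper, not part of Torque. The sample listing is fed in from a here-document so that the snippet runs without a batch system.

```shell
#!/bin/sh
# count_states: read default qstat output on stdin and print "STATE COUNT" lines.
# Assumption: two header lines, state letter in the second-to-last column.
count_states() {
    awk 'NR > 2 && NF >= 2 { n[$(NF-1)]++ } END { for (s in n) print s, n[s] }' | sort
}

# Demonstration on the sample listing from this page.
count_states <<'EOF'
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.allier                                user1           21:22:33 R stbcq
1002.allier               myscript.csh     user2           08:15:38 C long
1003.allier               myscript.csh     user2           00:02:13 R long
1004.allier               myscript.csh     user2                  0 Q long
EOF
# prints:
# C 1
# Q 1
# R 2
```

On a machine with access to the batch system, the same helper would be used as: qstat | count_states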

More detailed output is shown with qstat -n1:

qstat -n1
                                                                         Req'd  Req'd   Elap
Job ID               Username Queue    Jobname          SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------------- ------ ----- --- ------ ----- - -----
1001.allier.nikh     user1    stbcq                     10280  --    --  --     36:00 R 21:22   stbc-081
1002.allier.nikh     user2    long     myscript.csh     28649  --    --  --     96:00 C 08:15   stbc-043
1003.allier.nikh     user2    long     myscript.csh     12365  --    --  --     96:00 R 00:02   stbc-028
1004.allier.nikh     user2    long     myscript.csh     --     --    --  --     96:00 Q --      --

More information can be found in the manual page:

man qstat


Queues

queue      walltime [HH:MM]  remarks
stbcq      09:00             gluster access; current default
generic    24:00
short      04:00
long       48:00             max. walltime of 96:00 via resource list
multicore  96:00             multicore jobs only
legacy     09:00             gluster access
iolimited  09:00             gluster access
express    00:10             intended for test jobs
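
As an illustration of how the table might be used when choosing a queue, the sketch below maps a requested wall time to the smallest general-purpose queue that covers it. It rests on two assumptions that this page does not state: that the listed walltime acts as a limit for each queue, and that only express, short, generic and long are considered. The function name pick_queue is hypothetical.

```shell
#!/bin/sh
# pick_queue HOURS MINUTES: print the smallest queue whose listed walltime
# covers the request. Pass plain decimal numbers (8, not 08).
# Assumption: the walltime column of the queue table is treated as a limit.
pick_queue() {
    total=$(( $1 * 60 + $2 ))          # requested wall time in minutes
    if   [ "$total" -le 10 ];   then echo express
    elif [ "$total" -le 240 ];  then echo short
    elif [ "$total" -le 1440 ]; then echo generic
    elif [ "$total" -le 5760 ]; then echo long   # above 48:00 also needs -l walltime=...
    else
        echo "no queue allows more than 96:00" >&2
        return 1
    fi
}

pick_queue 0 5      # prints: express
pick_queue 32 10    # prints: long
```

The result could then be passed to qsub, for example: qsub -q "$(pick_queue 32 10)" -l walltime=32:10:00 with the job script as the last argument.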