Stoomboot cluster

The Stoomboot cluster is the local batch computing facility at Nikhef. It is accessible to users from the scientific groups for tasks such as data analysis and Monte Carlo calculations.

The Stoomboot facility consists of 3 interactive nodes and a batch cluster of 93 nodes with 8 cores each, most of which run the Scientific Linux CERN 6 (SLC6) operating system. A few batch nodes currently run CentOS 7; their number will increase as demand for computing on CentOS 7 grows. The dCache storage system is accessible from all interactive and worker nodes.

Interactive nodes

There are three interactive nodes: stbc-i1, stbc-i2 and stbc-i4. Each of these machines is equipped with 64 GB of RAM and 16 cores, and runs Scientific Linux CERN 6 as its operating system.

These machines are intended for the following purposes:

  • Running interactive jobs, like analysis work (making plots etc).
  • Interaction with the batch system (see below for the relevant commands).

They are not suitable for very resource (memory or CPU) intensive work because they are shared with other users. If your program needs a lot of memory, or needs to run for a long time, and/or heavily uses multiple cores, you should use the batch nodes instead.

If, for some reason, you have to run a long CPU-intensive job on the interactive nodes (i.e. you cannot run it on the batch system), please be friendly to your fellow users of the system and run your job with low CPU priority using the nice command:

nice -20 <your script or program and its arguments>

This way, your program gets a lower priority than other programs, which means it will take a bit longer to complete for you, but makes the system more responsive for other users.

You can log in to these machines via ssh.
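
For example (replace <username> with your own account name; the fully qualified hostname used here is an assumption):

ssh <username>@stbc-i1.nikhef.nl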

Batch cluster

The Stoomboot batch cluster uses a combination of the Torque resource manager and the Maui scheduler. To interact with the batch system (e.g. submit a job, query a job status or delete a job), you need to log in on a Linux machine (either on the console or via ssh). Machines that can be used include desktops managed by the CT department, the login hosts and the interactive Stoomboot nodes.

Job Submission

The command qsub is used to submit jobs to the cluster. A typical use of this command is:

qsub [-q <queue>] [-l resource_name[=[value]][,resource_name[=[value]],...]] [script]

The optional argument script is the user-provided script that does the work. If no script is provided, qsub reads the script from standard input (STDIN).

Please read the section Queues below for more information about available queues and their properties. It is recommended to specify a queue.

For simple jobs it is usually not necessary to provide the option -l with its resource list. However, if the job needs more than one core or node, or if the wall time limit of the job (the maximum time that the job can exist) should be specified, this option should be used. Examples:

  • -l nodes=1:ppn=4 requests 4 cores on 1 node
  • -l walltime=32:10:05 requests a wall time of 32 hours, 10 minutes and 5 seconds
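
The queue and the resource list can also be written into the job script itself as #PBS directives, so that the qsub command line stays short. A minimal sketch (the script name, program and input file are placeholders):

#!/bin/sh
#PBS -q generic6
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=1
# Torque starts the job in the home directory; move to the directory qsub was run from
cd $PBS_O_WORKDIR
# run the actual work (placeholder program and input)
./my_analysis input.dat

Such a script is then submitted simply with:

qsub myjob.sh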

More detailed information can be found in the manual page for qsub:

man qsub

Submitting under another group

By default, qsub uses the primary group of the Unix account when submitting a job. Sometimes it may be necessary to use a different group, for example because the primary group has no allocation on the batch server. To submit with a different group, add -W group_list=<new-group> to qsub. The following example forces the use of group atlas:

qsub -W group_list=atlas some-script.sh

Job Status

Users can look at the status of all current jobs with the command qstat:

qstat
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1001.burrell              script.sh        user1           21:22:33 R generic6          
1002.burrell              myscript.csh     user2           08:15:38 C long6          
1003.burrell              myscript.csh     user2           00:02:13 R long6
1004.burrell              myscript.csh     user2                  0 Q long6

The above example shows 4 jobs from 2 different users. Two jobs are running (1001 and 1003), one has finished (1002) and one is still waiting in the queue (1004).

More detailed output is shown with qstat -n1:

qstat -n1
                                                                         Req'd  Req'd   Elap
Job ID                  Username Queue       Jobname          SessID NDS   TSK Memory Time  S Time
--------------------    -------- --------    ---------------- ------ ----- --- ------ ----- - -----
1001.burrell.nikhef.n   user1    generic6    script.sh        10280   --   --    --   36:00 R 21:22   stbc-081
1002.burrell.nikhef.n   user2    long6       myscript.csh     28649   --   --    --   96:00 C 08:15   stbc-043
1003.burrell.nikhef.n   user2    long6       myscript.csh     12365   --   --    --   96:00 R 00:02   stbc-028
1004.burrell.nikhef.n   user2    long6       myscript.csh       --    --   --    --   96:00 Q   --     --
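
To list only your own jobs, the output can be restricted to a single user with the standard -u option:

qstat -u <username>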

More information can be found in the manual page:

man qstat

Job Deletion

Users can delete their own jobs with the command qdel:

qdel 1001

This example removes the job with ID 1001. The ID can be obtained with the qstat command described above.

More information can be found in the manual page:

man qdel

Queues

The queues are divided into queues running on SLC6 nodes, queues running on CentOS 7 nodes, and routing queues that currently send jobs to the SLC6 queues.

SLC6 queues      walltime [HH:MM]   remarks
express6         00:10              test jobs (max. 2 running jobs per user)
short6           04:00
generic6         24:00              default queue
long6            48:00              max. walltime of 96:00 via resource list

CentOS 7 queues  walltime [HH:MM]   remarks
express7         00:10              test jobs (max. 2 running jobs per user)
short7           04:00
generic7         24:00              default queue
long7            48:00              max. walltime of 96:00 via resource list

Routing queues   destination queue
express          express6
short            short6
generic          generic6
long             long6

General Considerations on submission and scheduling

The system is a shared facility: mistakes you make can block other users or cause them to lose work, so please keep the following considerations in mind.

  • If you'll be doing a lot of I/O, think about it a bit (or come talk to us in the PDP group). Some example gotchas:
    • large output files (like logs) going to /tmp might fill up /tmp if many of your jobs land on the same node, and this will hang the node.
    • submitting lots of jobs, each of which opens lots of files, can cause problems on the storage server. Organizing information in thousands of small files is problematic.
  • there is an inherent dead time of about 30 seconds associated with each job. Very short jobs are therefore horrendously inefficient, and they also put a high load on several services, which sometimes causes problems for others. If your average job run time is not at least 1 minute, please consider how to re-pack your work into fewer, longer jobs (see the sketch after this list).
  • the system does not handle a combination of multicore and single-core scheduling within a single queue; this is why there is a separate multicore queue. Do not submit single-core work to the multicore queue (in fact, you are only allowed to use it by request to the admin team) and do not submit multicore work to any other queue.
  • make sure that your program actually behaves as expected with respect to cores: some programs automatically try to grab all the cores they see. Such jobs should be submitted to the multicore queue, or else the feature should be turned off so that the program only uses a single core.
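
As an illustration of re-packing short tasks into longer jobs, a single job script can loop over several inputs instead of submitting one job per input. A minimal sketch (the program and file names are placeholders):

#!/bin/sh
#PBS -q generic6
# move to the directory the job was submitted from
cd $PBS_O_WORKDIR
# process several short tasks inside one batch job instead of one job each
for input in task_001.in task_002.in task_003.in; do
    ./my_analysis "$input" > "${input%.in}.out"
done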

How scheduling is done

Or, some things to think about if you don't understand why your jobs are not running, or why fewer of them are running than you expected.

The system works on a fair-share scheduling basis. Each group at Nikhef gets an equal share allocated, and within each group all users are equal. The scheduler makes its decision based on a couple of pieces of information:

  • how much time has your group (e.g. atlas, virgo or km3net) used over the last 8 hours
  • how much time have you (e.g. templon, verkerke or stanb) used over the last 8 hours

The group usage is converted to a group ranking component: if your group has used less than the standard share in the last 8 hours, this component is positive, getting larger the less the group has used. If the group has used more than the standard share, the component is negative, getting more negative the more it has used. The algorithm is exactly the same for all groups at Nikhef. There is a similar conversion for the user usage, with the scale of the group component being larger than that of the user one. The two components are added, resulting in a ranking; the jobs with the highest ranking run first (a schematic sketch is given after the list below). So jobs are essentially run in this order:

  • low group usage in the past 8 hours, also low user usage
  • low group usage in the past 8 hours, higher user usage
  • higher group usage in the past 8 hours, lower user usage
  • higher group usage in the past 8 hours, higher user usage
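
Schematically (purely illustrative; the actual weights and normalisation are internal to the scheduler configuration):

ranking ~ W_group * (group share - group usage in last 8 hours) + W_user * (user share - user usage in last 8 hours),  with W_group > W_user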

The system is fair in that it gives priority to jobs from users and groups that haven't used many cycles in the last 8 hours. However, the response is not always immediate: the scheduler can't start a new job until another one ends, and if several groups are running, your group may rank higher than some of them but lower than others.

The scheduling for the multicore queue and for the rest of the (single-core) queues is independent. There is one more consideration in scheduling: the maximum number of running jobs is limited per queue and per user. Some example reasons why: the long queue is limited to 312 running single-core jobs, less than half of the total capacity; this is done to prevent long jobs from blocking all other use of the cluster for a long (yes, that's why the queue is called long) period of time. Also, although the generic queue can run at most 390 jobs, a single user cannot run more than 342 of them; this is to prevent a single user from grabbing all the slots in the generic queue for herself. The numbers are a compromise: if the cluster is empty you want them to be big, if it is full you want them to be smaller, and since there is no good way to automate this we set them at a compromise.

If somebody submits multicore work to a single core queue, this has the potential to block all other users from the same group, due to a limitation in the scheduling software, so please don't submit multicore work to single-core queues.
