TorqueMaui

Installing a Torque and Maui cluster

Prerequisites

Installing the server

Install the following RPMs on the Torque/Maui server:

  • torque
  • torque-docs
  • torque-server
  • torque-client (for testing)


On the worker nodes you only need the Torque MOM and PAM packages:

  • torque
  • torque-mom
  • torque-pam

On the submission hosts (desktops and the user interface system) you need the packages below; an example installation command for all three host types follows the list:

  • torque
  • torque-client
  • torque-docs
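
All of the above can be installed with yum; a minimal sketch, assuming the RPMs are available from a repository already configured on the machines (the exact repository and package versions are site-specific):

 # On the Torque/Maui server
 yum install torque torque-docs torque-server torque-client
 # plus the Maui scheduler RPM (see the Scheduling section below)

 # On the worker nodes
 yum install torque torque-mom torque-pam

 # On the submission hosts
 yum install torque torque-client torque-docs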

Configuring the server

An example configuration (a set of qmgr commands) for recent Torque versions is shown below:

#
# Create and define queue test
#
create queue test
set queue test queue_type = Execution
set queue test resources_max.cput = 00:30:00
set queue test resources_max.walltime = 00:35:00
set queue test acl_group_enable = True
set queue test acl_groups = hadron
set queue test acl_groups += emin
set queue test acl_groups += bfys
set queue test keep_completed = 600
set queue test enabled = True
set queue test started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_host_enable = True
set server acl_hosts = localhost
set server acl_hosts += localhost.localdomain
set server acl_hosts += *.nikhef.nl
set server submit_hosts = *.nikhef.nl
set server managers = davidg@*.nikhef.nl
set server managers += tond@*.nikhef.nl
set server operators = andrevk@*.nikhef.nl
set server operators += a03@*.nikhef.nl
set server default_queue = test
set server log_events = 127
set server mail_from = root
set server query_other_jobs = False
set server scheduler_iteration = 60
set server node_ping_rate = 300
set server node_check_rate = 600
set server tcp_timeout = 6
set server default_node = farm
set server node_pack = False
set server job_stat_rate = 45
set server poll_jobs = True
set server log_level = 1
set server server_name = pinnacle.nikhef.nl
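
This configuration can be loaded into the server with qmgr. A possible sequence, assuming a freshly initialised server and the qmgr commands above saved in a file (the name server-config.qmgr is just an example):

 # Create a new, empty server database (only on first installation;
 # this wipes any existing server configuration)
 pbs_server -t create

 # Feed the configuration shown above to the running server
 qmgr < server-config.qmgr

 # Verify the result
 qmgr -c 'print server'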

Configuring the pbs_mom (worker node)

This configuration uses the shared NFS space to copy files around and to return output directly to the user's directories (instead of going through an extra scp step via the server):

$clienthost pinnacle.nikhef.nl
$tmpdir /tmp
$logevent 255
$pbsmastername pinnacle.nikhef.nl
$restricted pinnacle.nikhef.nl
$usecp *.nikhef.nl:/ /
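
This configuration normally lives in the mom_priv/config file under the Torque spool directory on each worker node (typically /var/spool/torque/mom_priv/config, depending on how the RPMs were built). After changing it, restart pbs_mom and check that the node is visible from the server; for example (assuming the init script is called pbs_mom and using a hypothetical node name wn-01):

 # On the worker node
 service pbs_mom restart

 # On the server: the node should be listed (state 'free' when idle)
 pbsnodes -a

 # Query the mom directly for its state and configuration
 momctl -d 3 -h wn-01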

Additionally, pbs_mom now comes with a PAM module that allows users to log in to a specific worker node, but only while they have a running job there (e.g. for diagnostics). This is enabled as follows:

  • install the torque-pam RPM
  • add the following configuration to /etc/pam.d/sshd:
 account    sufficient   pam_pbssimpleauth.so debug
 account    required     pam_access.so

and in /etc/security/access.conf:

 -:ALL EXCEPT root @ct-b:ALL
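
A simple way to test this setup (assuming the test queue defined above; the node name wn-01 is only an example) is to submit a job and log in to the node while it runs:

 # From a submission host: run a job that sleeps for a while
 qsub -q test -l walltime=00:20:00 <<EOF
 sleep 900
 EOF

 # Find out which node the job landed on
 qstat -n

 # Logging in to that node should now work; after the job finishes,
 # the same ssh attempt should be denied by pam_access
 ssh wn-01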

Scheduling

The Maui scheduler should be installed on the server node from the RPMs. The default configuration file is /var/spool/maui/maui.cfg, and looks like this:

# MAUI configuration example

SERVERHOST              pinnacle.nikhef.nl
ADMIN1                  root tond
ADMIN3                  andrevk
ADMINHOST               pinnacle.nikhef.nl
RMTYPE[0]           PBS
RMHOST[0]           pinnacle.nikhef.nl
RMSERVER[0]         pinnacle.nikhef.nl

RMCFG[0]            TIMEOUT=90
SERVERPORT            40559
SERVERMODE            NORMAL
RMPOLLINTERVAL        00:02:00
LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        50000000
LOGLEVEL              3
LOGFILEROLLDEPTH      30

NODESETPOLICY           ONEOF
NODESETATTRIBUTE        FEATURE
NODESETLIST             stoomboot
NODESETDELAY            0:00:00

NODESYNCTIME        0:00:30

NODEACCESSPOLICY            SHARED
NODEAVAILABILITYPOLICY      DEDICATED:PROCS
NODELOADPOLICY              ADJUSTPROCS
DEFERTIME                   0
JOBMAXOVERRUN               0
REJECTNEGPRIOJOBS           FALSE
FEATUREPROCSPEEDHEADER      xps

# Policies
BACKFILLPOLICY              ON
BACKFILLTYPE                FIRSTFIT
NODEALLOCATIONPOLICY        FASTEST
RESERVATIONPOLICY           CURRENTHIGHEST
RESERVATIONDEPTH            12

# Weights of various components in scheduling ranking calc

QUEUETIMEWEIGHT         0
XFACTORWEIGHT           1
XFACTORCAP         100000
RESWEIGHT              10

CREDWEIGHT             10
USERWEIGHT             10
GROUPWEIGHT            10

FSWEIGHT                1
FSUSERWEIGHT            1
FSGROUPWEIGHT          43
FSQOSWEIGHT           200

# FairShare
# use dedicated CPU ("wallclocktime used") metering
# decays over 24 "days"
FSPOLICY              DEDICATEDPES%
FSDEPTH               24
FSINTERVAL            24:00:00
FSDECAY               0.99
FSCAP                 100000

USERCFG[DEFAULT]    FSTARGET=1                    MAXJOBQUEUED=1024
GROUPCFG[DEFAULT]   FSTARGET=1    PRIORITY=1      MAXPROC=64

GROUPCFG[computer]  FSTARGET=1    PRIORITY=1000   MAXPROC=42

GROUPCFG[atlas]     FSTARGET=30    PRIORITY=100                 QDEF=lhcatlas
GROUPCFG[atlassrc]  FSTARGET=70    PRIORITY=100                 QDEF=lhcatlas
QOSCFG[lhcatlas]    FSTARGET=40                   MAXPROC=96

GROUPCFG[bfys]      FSTARGET=100   PRIORITY=100                 QDEF=lhclhcb
QOSCFG[lhclhcb]     FSTARGET=30                   MAXPROC=96

GROUPCFG[alice]     FSTARGET=100   PRIORITY=100                 QDEF=lhcalice
QOSCFG[lhcalice]    FSTARGET=10                   MAXPROC=64

GROUPCFG[theory]    FSTARGET=100   PRIORITY=100                 QDEF=niktheory
QOSCFG[niktheory]   FSTARGET=20                   MAXPROC=96

USERCFG[davidg]                   PRIORITY=80000
USERCFG[templon]                  PRIORITY=80000
USERCFG[ronalds]                  PRIORITY=80000
USERCFG[tond]                     PRIORITY=80000
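
After editing maui.cfg, restart the Maui daemon and check that it communicates with pbs_server and applies the fairshare and priority settings; a sketch (assuming the init script installed by the RPM is called maui):

 service maui restart

 # Show the queue as Maui sees it (running, idle and blocked jobs)
 showq

 # Dump the configuration as Maui parsed it
 showconfig

 # Inspect fairshare usage against the targets per user, group and QOS
 diagnose -f

 # Explain why a particular job is or is not running
 checkjob <jobid>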