cluster:Torque/236/server
TORQUE server v.2.3.6
Prepare
- Operating system
- Scientific Linux version 5.4 64 bit
If you want to use the maui scheduler instead of the torque scheduler, do not start TORQUE's pbs_sched daemon after the torque installation.
- Firewall configuration
If firewalls are running on the server or node machines, make sure they allow connections on the appropriate ports for each machine. TORQUE pbs_mom daemons use UDP port 1023, and the pbs_server/pbs_mom daemons use ports 15001-15004 by default (how to open port in firewall).
Firewall-based issues are often associated with server-to-MOM communication failures and messages such as 'premature end of message' in the log files. The tcpdump program can be used to verify that the correct network packets are being sent.
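As a rough illustration of the default port range mentioned above, the following sketch prints iptables commands for review instead of applying them. It is not part of the original page. It assumes that iptables is the firewall in use, that rules belong in the INPUT chain, and that the 15001-15004 range should be opened for both TCP and UDP. The function name print_torque_fw_rules is made up for this example.

```shell
#!/bin/bash
# Sketch: print (do not apply) iptables rules for the default
# pbs_server/pbs_mom port range 15001-15004.
# Assumptions: iptables firewall, INPUT chain, TCP and UDP both needed.
print_torque_fw_rules() {
  local proto
  for proto in tcp udp; do
    echo "iptables -I INPUT -p ${proto} --dport 15001:15004 -j ACCEPT"
  done
}

# Review the printed commands before running them as root.
print_torque_fw_rules
```

Printing the rules first keeps the sketch harmless on a production host. Adapt the chain name and protocols to the local firewall policy before applying anything.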
administrator's script: prepare.sh
#!/bin/bash
# prepare torque server
# Declare the variables section ------------
# Please insert your actual configuration
# WNs=(WNs addresses)
# np=number of processors per WN
# declare middleware servers
# middleware=(declare middleware servers)
# HOST=current host name
# EDGINFO_UID=User id for edginfo as on GLITE server
# RGMA_UID=User id for rgma as on GLITE server
# from here ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
WNs=(dgiref-iwn01.fzk.de)
np=2
middleware=(sn06 sn04 sn03 sn02)
HOST=`hostname -f`
# check UIDs on glite machine for users
# edginfo, rgma
EDGINFO_UID=101
RGMA_UID=102
# till here ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#-> start routine
REPO_URL="http://dgiref.d-grid.de/svn/dgiref/PROD/cf3/repl/repos/external/"
wget -O /etc/yum.repos.d/sl-dgiref.repo ${REPO_URL}/sl-dgiref.repo
# Create users for gLite
echo `useradd -u $EDGINFO_UID edginfo -d /localhome/edginfo`
echo `useradd -u $RGMA_UID rgma -d /localhome/rgma`
#<- end routine
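prepare.sh calls useradd unconditionally, so a second run reports errors for accounts that already exist. A minimal sketch of a re-runnable variant is shown below. The helper name ensure_user and the DRY_RUN switch are assumptions made for this example and do not come from the original scripts. The UIDs are the example values used in prepare.sh.

```shell
#!/bin/bash
# Sketch: create a gLite account only if it does not exist yet.
# DRY_RUN=1 (the default here) prints the useradd command instead of running it.
DRY_RUN=${DRY_RUN:-1}

ensure_user() {
  local name=$1 uid=$2
  # id -u succeeds only when the account is already known to the system
  if id -u "$name" >/dev/null 2>&1; then
    echo "user ${name} already exists, skipping"
    return 0
  fi
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: useradd -u ${uid} -d /localhome/${name} ${name}"
  else
    useradd -u "${uid}" -d "/localhome/${name}" "${name}"
  fi
}

ensure_user edginfo 101
ensure_user rgma 102
```

Set DRY_RUN=0 and run as root to actually create the accounts.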
Install
- Torque client and server should be installed on the torque server host (client includes qmgr).
administrator's script: install.sh
#!/bin/bash
# install torque server
# load parameters from prepare section
cd `dirname $0`
source prepare.sh
#-> start routine
yum -y install torque-2.3.6-1cri.sl5 torque-server-2.3.6-1cri.sl5
yum -y install torque-client-2.3.6-1cri.sl5
#<-end routine
Configure
- Configure torque server
- Set up the WN addresses. The configuration file /var/spool/pbs/server_priv/nodes on the torque server must hold the host names of all WNs.
- Fill out the configuration file /etc/hosts.equiv with the list of the FQDNs of all Torque clients, e.g. gLite-CE, Globus GRAM and Unicore NJS
- Configure maui
- Create users for gLite: edginfo and rgma with the same UIDs as on the gLite-CE. gLite requires this user information; the gLite-CE asks the MAUI server about it.
- Create queues
- The D-Grid operational concept provides for two queues: dgiseq for sequential and dgipar for parallel jobs (the currently assigned queue limits are based on the values used in GridKa)
- Make NFS export /var/spool/pbs/server_logs
do not forget to update the queues for every VO with
More: http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml
administrator's script: configure.sh
#!/bin/bash
# configure torque server
# load parameters from prepare section
cd `dirname $0`
source prepare.sh
#-> begin routine
function sendTo {
  shift $(($OPTIND - 1))
  file_name=${!#}
  mv -f ${file_name} ${file_name}-`date +"%Y%m%d"`
  touch ${file_name}
  for host in $@
  do
    if [ "$host" != "${file_name}" ]
    then
      if [ "${file_name}" == "/var/spool/pbs/server_priv/nodes" ]
      then
        echo "${host} np=${np}" >> ${file_name}
      else
        echo "${host}" >> ${file_name}
      fi
    fi
  done
}
# setup configuration files
sendTo "${HOST}" "/var/spool/pbs/server_name"
# generate list of WNs
sendTo "${WNs[@]}" "/var/spool/pbs/server_priv/nodes"
# store list of middleware servers into /etc/hosts.equiv
sendTo "${middleware[@]}" "/etc/hosts.equiv"
# Create queues
echo `qmgr <<+
create queue dgiseq
set queue dgiseq queue_type = Execution
set queue dgiseq Priority = 20
set queue dgiseq resources_max.cput = 48:00:00
set queue dgiseq resources_max.pcput = 48:00:00
set queue dgiseq resources_max.walltime = 96:00:00
set queue dgiseq resources_max.ncpus = 1
set queue dgiseq resources_max.nodect = 1
set queue dgiseq resources_default.cput = 48:00:00
set queue dgiseq resources_default.pcput = 48:00:00
set queue dgiseq resources_default.walltime = 96:00:00
set queue dgiseq resources_default.nodes = nodes=1:ppn=1
set queue dgiseq enabled = True
set queue dgiseq started = True
create queue dgipar
set queue dgipar queue_type = Execution
set queue dgipar Priority = 10
set queue dgipar resources_max.cput = 960:00:00
set queue dgipar resources_max.pcput = 48:00:00
set queue dgipar resources_max.ncpus = 40
set queue dgipar resources_max.nodect = 10
set queue dgipar resources_max.walltime = 96:00:00
set queue dgipar resources_default.ncpus = 1
set queue dgipar resources_default.cput = 96:00:00
set queue dgipar resources_default.pcput = 48:00:00
set queue dgipar resources_default.walltime = 96:00:00
set queue dgipar enabled = True
set queue dgipar started = True
+
`
#<- end routine
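The step "Make NFS export /var/spool/pbs/server_logs" from the list above is not covered by configure.sh. The sketch below shows one way to prepare the matching /etc/exports lines. The helper name render_exports, the read-only export options and the choice of client hosts are assumptions for illustration; the page does not say which hosts mount the log directory.

```shell
#!/bin/bash
# Sketch: build /etc/exports lines for the TORQUE server log directory.
# Assumptions: clients mount read-only; the host names are examples only.
render_exports() {
  local dir=$1
  shift
  local host
  for host in "$@"; do
    echo "${dir} ${host}(ro,sync)"
  done
}

# Print the lines for review; once they look right, append them to
# /etc/exports and reload the export table with: exportfs -ra
render_exports /var/spool/pbs/server_logs sn06 sn04
```

Generating the lines separately from writing them makes it easy to check the result before the NFS server is touched.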
Proceed
To start / stop use the commands:
administrator's script: proceed.sh
#!/bin/bash
# proceed
# start server
echo `chkconfig pbs_server on`
echo `/etc/init.d/pbs_server start`
Initial test
- The result of the command qstat -Q should be something like:
# Queue | Max | Tot | Ena | Str | Que | Run | Hld | Wat | Trn | Ext T
# dgiseq | 0 | 0 | yes | yes | 0 | 0 | 0 | 0 | 0 | 0 E
# dgipar | 0 | 0 | yes | yes | 0 | 0 | 0 | 0 | 0 | 0 E
administrator's script: test.sh
#!/bin/bash
#-> start routine
# verify all queues are properly configured
qstat -q
# view additional server configuration
qmgr -c 'p s'
# verify all nodes are correctly reporting
# If everything is OK, you can see the details of all WNs with the WN status "free".
pbsnodes -a
# submit a basic job
echo "sleep 30" | qsub
# verify jobs display
qstat
# check queues:
qstat -Q
# shutdown server
qterm -t quick
#<- end routine
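The visual check of the qstat -Q output can also be automated. The following sketch reads qstat -Q style text on standard input and reports every queue that is not both enabled and started. It assumes the whitespace-separated column order Queue, Max, Tot, Ena, Str shown in the expected output above. The function name check_queues is hypothetical.

```shell
#!/bin/bash
# Sketch: report queues that are not both enabled (Ena) and started (Str).
# Input: qstat -Q style text on stdin, columns: Queue Max Tot Ena Str ...
check_queues() {
  awk '
    NF == 0           { next }   # skip empty lines
    $1 == "Queue"     { next }   # skip the header row
    $1 ~ /^-+$/       { next }   # skip the dashed separator row
    $4 != "yes" || $5 != "yes" { print "queue not ready: " $1; bad = 1 }
    END { exit bad + 0 }         # non-zero exit if any queue is not ready
  '
}

# Usage on the torque server:
#   qstat -Q | check_queues && echo "all queues ready"
```

The function exits with status 1 as soon as one queue is disabled or stopped, so it can be used directly in a monitoring script.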
Update
- Upgrading TORQUE can generally be done without shutting down the whole cluster and disrupting running jobs. Simply build and install the new version and restart the daemons. Here is the safest procedure for upgrading TORQUE:
- Kill the scheduler.
- Wait a few minutes for all new jobs to complete startup.
- All running jobs in qstat -a have some elapsed walltime.
- Restart pbs_server.
- Verify the new pbs_server is working correctly.
- nodes should come up (not down or state-unknown)
- job walltimes should increase
- If upgrading from an earlier 2.1 build, MOMs can automatically restart themselves with:
- momctl -q enablemomrestart=1 -h :ALL
- Start the scheduler.
- If upgrading from 2.0 or earlier:
- Restart MOMs on all idle nodes.
- Wait a minute, make sure node and job states are updating correctly.
- Delete the previous static archive library files (libattr.a, libcmds.a, liblog.a, libnet.a, libpbs.a, libsite.a).
- Mark busy nodes offline.
- Start the scheduler.
- Restart MOMs on offline nodes after their jobs exit.
All external software like maui, perl-PBS, or pbs_python built with the 2.0.x static archives will need to be rebuilt with the newer 2.1.x shared libraries.
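The order of the upgrade steps for a 2.1 or later installation can be sketched as a dry-run script. This is only an illustration of the sequence listed above. It assumes that the new packages are already installed, that the scheduler is maui controlled through /etc/init.d/maui, that the pbs_server init script accepts restart, and that a fixed wait of 180 seconds is long enough. None of these details are given in the original page. The names run and upgrade_server are made up for the example.

```shell
#!/bin/bash
# Sketch of the upgrade order; DRY_RUN=1 (the default) only prints each step.
# Assumptions: new packages already installed, scheduler is /etc/init.d/maui.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

upgrade_server() {
  run /etc/init.d/maui stop                  # kill the scheduler
  run sleep 180                              # let new jobs complete startup
  run /etc/init.d/pbs_server restart         # restart pbs_server
  run pbsnodes -l                            # lists nodes that are down, offline or unknown
  run momctl -q enablemomrestart=1 -h :ALL   # MOMs restart themselves
  run /etc/init.d/maui start                 # start the scheduler
}

upgrade_server
```

In a real run, inspect the pbsnodes -l output and the job walltimes in qstat -a by hand before starting the scheduler again.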
administrator's script: update.sh
#!/bin/bash
# remove torque server
yum remove torque
# update torque server
yum update torque