cluster:Torque/257
Introduction
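TORQUE and Maui together form the batch system used to submit jobs to the cluster. Batch systems are comprised of four different component types:
- Master node
- Submit/interactive nodes
- Compute nodes
- Resources
The life cycle of a job can be divided into four stages (basic job flow):
- Creation
- Submission
- Execution
- Finalization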
TORQUE server v.2.5.7
Prepare
- Operating system
- Scientific Linux version 5.6 64 bit
Configure the following settings for the host:
- UMD repo
- NFS-server
- Create Host Based Authentication for Torque clients. See ssh auth
If you do not want to use the Maui scheduler in place of the built-in Torque scheduler, omit the Maui package installation and the Maui configuration from the scripts.
administrator's script: prepare.sh
#!/bin/bash
inside_network="10.0.171.0/255.255.255.0"
# install umd
# clean old repo files and old downloads
rm -f /etc/yum.repos.d/UMD* /etc/yum.repos.d/epel*
rm -f epel-release-5-4.noarch.rpm umd-release-1.0.2-1.el5.noarch.rpm
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://repository.egi.eu/sw/production/umd/1/sl5/x86_64/updates/umd-release-1.0.2-1.el5.noarch.rpm
yum -y install epel-release-5-4.noarch.rpm
yum -y install yum-priorities
yum -y install umd-release-1.0.2-1.el5.noarch.rpm
sed -i -e "s/priority=.*/priority=5/g" /etc/yum.repos.d/UMD-1-base.repo
sed -i -e "s/priority=.*/priority=4/g" /etc/yum.repos.d/UMD-1-updates.repo
# nfs-server
yum -y install nfs-utils-lib nfs-utils
# no space between the network and the option list, otherwise the options apply to all hosts
echo "/var/torque/server_logs ${inside_network}(rw,async,no_root_squash)" >> /etc/exports
echo "/var/torque/server_priv/accounting ${inside_network}(rw,async,no_root_squash)" >> /etc/exports
/etc/rc.d/init.d/nfslock start
/etc/rc.d/init.d/nfs start
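In /etc/exports the client specification and its option list are written without a space between them; with a space, the options apply to all hosts. Running the prepare script twice also appends the same entries twice. The following is a minimal sketch of a helper that builds one entry and appends it only if it is not already present. The function names exports_entry and add_export are hypothetical and not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original scripts.

# Print one /etc/exports entry: "<dir> <clients>(<options>)".
exports_entry() {
    local dir="$1" clients="$2" options="$3"
    printf '%s %s(%s)\n' "$dir" "$clients" "$options"
}

# Append the entry to the given exports file unless an identical line exists.
add_export() {
    local exports_file="$1"; shift
    local entry
    entry="$(exports_entry "$@")"
    grep -qxF -- "$entry" "$exports_file" 2>/dev/null \
        || printf '%s\n' "$entry" >> "$exports_file"
}

# Example (run as root on the Torque server):
# add_export /etc/exports /var/torque/server_logs "10.0.171.0/255.255.255.0" "rw,async,no_root_squash"
# exportfs -ra
```

Passing the exports file as the first argument keeps the helper testable against a scratch file before it is pointed at /etc/exports.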
Install
- Install Torque-server with yum from umd repo
- Install MAUI scheduler
administrator's script: install.sh
yum -y install torque.x86_64 torque-server
yum -y install maui maui-server maui-client
Configure
- Create default configuration for torque-server
- Configure server name in /var/torque/server_name
- Configure allowed WNs with number of CPUs into /var/torque/server_priv/nodes
- Customization of Torque qmgr
- Create queue
- Set server parameters (authorized users, timeouts, log level, Torque managers, etc.)
- Create munge key for torque communication
- Create maui scheduler configuration file
administrator's script: configure.sh
#!/bin/bash
/usr/sbin/pbs_server -t create
echo `hostname -f` > /var/torque/server_name
echo "wn01 np=2" > /var/torque/server_priv/nodes
echo "wn02 np=2" >> /var/torque/server_priv/nodes
echo "wn03 np=2" >> /var/torque/server_priv/nodes
echo "wn04 np=2" >> /var/torque/server_priv/nodes
echo "wn05 np=2" >> /var/torque/server_priv/nodes
echo "wn06 np=2" >> /var/torque/server_priv/nodes
echo "wn07 np=2" >> /var/torque/server_priv/nodes
echo "wn08 np=2" >> /var/torque/server_priv/nodes
echo "wn09 np=2" >> /var/torque/server_priv/nodes
echo "wn10 np=2" >> /var/torque/server_priv/nodes
# Create queues
qmgr <<'EOF'
create queue dgipar
set queue dgipar queue_type = Execution
set queue dgipar max_queuable = 10
set queue dgipar max_running = 5
set queue dgipar acl_host_enable = false
set queue dgipar resources_max.walltime = 72:00:00
set queue dgipar resources_default.walltime = 72:00:00
set queue dgipar enabled = True
set queue dgipar started = True
#
# Create and define queue dgiseq
#
create queue dgiseq
set queue dgiseq queue_type = Execution
set queue dgiseq Priority = 100
set queue dgiseq max_queuable = 150
set queue dgiseq max_running = 130
set queue dgiseq acl_host_enable = False
set queue dgiseq resources_max.nodect = 6
set queue dgiseq resources_max.walltime = 288:00:00
set queue dgiseq resources_default.nodes = 5:ppn=2
set queue dgiseq enabled = True
set queue dgiseq started = True
#
# Set server attributes.
#
set server acl_hosts = dgiref-batch.fzk.de
set server managers = root@*
set server managers += tomcat@dgiref-glite.fzk.de
set server operators = root@*
set server default_queue = dgiseq
set server log_events = 511
set server mail_from = admin
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 12
set server poll_jobs = False
set server log_level = 3
set server log_file_roll_depth = 7
set server next_job_number = 2387
set server authorized_users += *@dgiref-globus.fzk.de
set server authorized_users += *@dgiref-login.fzk.de
set server authorized_users += *@dgiref-ogsadai.fzk.de
set server authorized_users += *@dgiref-batch.fzk.de
set server authorized_users += *@dgiref-glite.fzk.de
set server authorized_users += *@dgiref-unicore.fzk.de
set server authorized_users += *@dgiref-dcache.fzk.de
set server authorized_users += *@dgiref-glite32.fzk.de
set server authorized_users += *@dgiref-globus50.fzk.de
set server authorized_users += *@dgiref-globus40.fzk.de
EOF
# configure munge
/usr/sbin/create-munge-key
# Generating a pseudo-random key using /dev/urandom completed.
#chown munge:munge /var/log/munge/munged.log*
# configure for maui
cat > /var/spool/maui/maui.cfg <<'EOF'
# MAUI configuration example
# @(#)maui.cfg David Groep 20031015.1
# for MAUI version 3.2.5
#
SERVERHOST dgiref-batch.fzk.de
ADMIN1 root
ADMINHOST sn05
RMTYPE[0] PBS
RMHOST[0] sn05
RMSERVER[0] sn05
SERVERPORT 40559
SERVERMODE NORMAL
# Set PBS server polling interval. Since we have many short jobs
# and want fast turn-around, set this to 10 seconds (default: 2 minutes)
RMPOLLINTERVAL 00:00:10
# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 3
EOF
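The ten echo lines that fill /var/torque/server_priv/nodes all follow the same pattern, so the file can also be generated in a loop. The sketch below assumes that worker nodes are named with a common prefix and a two-digit index and that all have the same number of CPUs; the function name gen_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print "<prefix>NN np=<np>" for nodes 1..count.
gen_nodes() {
    local prefix="$1" count="$2" np="$3" i
    for ((i = 1; i <= count; i++)); do
        printf '%s%02d np=%d\n' "$prefix" "$i" "$np"
    done
}

# Example (on the Torque server), equivalent to the ten echo lines:
# gen_nodes wn 10 2 > /var/torque/server_priv/nodes
```

Writing the file in one redirection also avoids the situation where a rerun leaves duplicate node entries behind.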
Proceed
Register pbs_server, munge, and maui as system services and enable them at boot so that they start automatically.
administrator's script: proceed.sh
service munge restart
service maui restart
service pbs_server restart
chkconfig maui on
chkconfig munge on
chkconfig pbs_server on
chkconfig pbs_mom off
Initial test
- From a user account, it should be possible to submit a 'Hello World' job and to open an interactive shell on a WN.
- The job results are written to the files STDIN.o<JOBID> (standard output) and STDIN.e<JOBID> (standard error).
- Test MAUI
- The test on the gLite-CE should work as the edginfo user, as configured by the gLite packages.
- To check the status of a job, run the qstat command while the submitted job is still active.
administrator's script: test.sh
#!/bin/bash
#-> start routine
# verify all queues are properly configured
qstat -q
# view additional server configuration
qmgr -c 'p s'
# verify all nodes are correctly reporting
# If everything is OK, you will see the details of all WNs with the status "free".
pbsnodes -a
# submit a basic job
echo "sleep 30" | qsub
# verify jobs display
qstat
# check queues:
qstat -Q
# shutdown server
qterm -t quick
#<- end routine
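The node check in the test script relies on reading the pbsnodes -a output by eye. The following sketch counts the nodes whose state is exactly "free". It assumes that each node block contains an indented line of the form "state = <value>"; the function name count_free_nodes is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: count nodes reported as "free".
# Reads `pbsnodes -a` style text on stdin and counts lines that consist
# of "state = free" with optional surrounding whitespace.
count_free_nodes() {
    grep -cE '^[[:space:]]*state = free[[:space:]]*$'
}

# Example (on the Torque server):
# free=$(pbsnodes -a | count_free_nodes)
# echo "free nodes: $free"
```

Matching the whole line means that combined states such as "down,offline" are not counted as free.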
TORQUE v.2.5.7 client
Prepare
- Operating system
- Scientific Linux version 5.6 64 bit
Configure the following settings for the host:
- UMD repo
- NFS-client
- Create Host Based Authentication for Torque server. See ssh auth
The Torque client should also be installed on the server so that Torque can be managed from there.
administrator's script: prepare.sh
# install umd
# clean old repo files and old downloads
rm -f /etc/yum.repos.d/UMD* /etc/yum.repos.d/epel*
rm -f epel-release-5-4.noarch.rpm umd-release-1.0.2-1.el5.noarch.rpm
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://repository.egi.eu/sw/production/umd/1/sl5/x86_64/updates/umd-release-1.0.2-1.el5.noarch.rpm
yum -y install epel-release-5-4.noarch.rpm
yum -y install yum-priorities
yum -y install umd-release-1.0.2-1.el5.noarch.rpm
sed -i -e "s/priority=.*/priority=5/g" /etc/yum.repos.d/UMD-1-base.repo
sed -i -e "s/priority=.*/priority=4/g" /etc/yum.repos.d/UMD-1-updates.repo
Install
- Install Torque-client with yum from umd repo
administrator's script: install.sh
yum -y install torque-client.x86_64
# the emi torque package can be used as well
# yum install emi-torque-client
Configure
- Configure server name in /var/torque/server_name
- Copy munge key for torque communication from torque-server
administrator's script: configure.sh
BATCH_SERVER=dgiref-batch.fzk.de
echo "$BATCH_SERVER" > /var/torque/server_name
scp $BATCH_SERVER:/etc/munge/munge.key /etc/munge/munge.key
chown munge:munge /etc/munge/munge.key
chown munge:munge /var/log/munge/munged.log*
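After the key has been copied, it is worth confirming that it is not readable by group or others, since munged rejects a key file with insecure permissions. The following is a minimal sketch of such a check, assuming GNU stat is available; the function name key_is_private is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the key file exists and grants
# no permissions to group or others (for example mode 400 or 600).
key_is_private() {
    local key="$1" mode
    [ -f "$key" ] || return 1
    mode="$(stat -c '%a' "$key")" || return 1
    case "$mode" in
        [0-7]00) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (on a client, after copying the key):
# key_is_private /etc/munge/munge.key || chmod 400 /etc/munge/munge.key
```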
Proceed
- Enable munge autoboot
administrator's script: proceed.sh
service munge restart
chkconfig munge on
Initial test
- Create test script
- Test submit job with qsub
- Check queue and trace job
administrator's script: test.sh
# Try
cat > test.sh <<'EOF'
#!/bin/bash
sleep 60
EOF
qsub -q dgiseq test.sh
# to check supported queues:
# ldapsearch -x -H ldap://<CE_FQDN>:2170 -b mds-vo-name=resource,o=grid
# To detect the jobid:
qstat -Q
qstat -a
# To check the job (replace <jobid> with the identifier printed by qsub):
tracejob <jobid>
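qsub prints the identifier of the new job on standard output, normally in the form <number>.<server>, so the job id can be captured directly instead of being read from the qstat listing. The sketch below extracts the numeric part and derives the default output file names; the function names job_number and job_output_files are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the original scripts.

# Strip the server suffix from a job identifier: "2387.host" -> "2387".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Print the default stdout and stderr file names for a job name and identifier.
job_output_files() {
    local name="$1" num
    num="$(job_number "$2")"
    printf '%s.o%s\n%s.e%s\n' "$name" "$num" "$name" "$num"
}

# Example (on a submit host):
# jobid=$(qsub -q dgiseq test.sh)
# tracejob "$(job_number "$jobid")"
# job_output_files test.sh "$jobid"
```

For a job piped to qsub on standard input the job name is STDIN, which gives the STDIN.o<JOBID> and STDIN.e<JOBID> files mentioned for the server test.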
Update
An update mostly consists of installing the new package versions.
administrator's script: update.sh
yum -y update torque-client.x86_64
TORQUE v.2.5.7 wn client
Prepare
- Operating system
- Scientific Linux version 5.6 64 bit
Configure the following settings for the host:
- UMD repo
- Create Host Based Authentication for Torque server and Middleware hosts. See ssh auth
administrator's script: prepare.sh
# install umd
# clean old repo files and old downloads
rm -f /etc/yum.repos.d/UMD* /etc/yum.repos.d/epel*
rm -f epel-release-5-4.noarch.rpm umd-release-1.0.2-1.el5.noarch.rpm
wget http://download.fedoraproject.org/pub/epel/5/i386/epel-release-5-4.noarch.rpm
wget http://repository.egi.eu/sw/production/umd/1/sl5/x86_64/updates/umd-release-1.0.2-1.el5.noarch.rpm
yum -y install epel-release-5-4.noarch.rpm
yum -y install yum-priorities
yum -y install umd-release-1.0.2-1.el5.noarch.rpm
sed -i -e "s/priority=.*/priority=5/g" /etc/yum.repos.d/UMD-1-base.repo
sed -i -e "s/priority=.*/priority=4/g" /etc/yum.repos.d/UMD-1-updates.repo
Install
- Install Torque-mom with yum from umd repo
administrator's script: install.sh
# the emi torque package can be used as well
yum -y install torque-client torque-mom
Configure
If a separate internal network is used for communication between the batch server and the WNs, create a hostname alias for the batch server in the hosts file.
All nodes must be listed in /var/torque/server_priv/nodes on the batch server host. See configure batch server.
- Customization of mom config file
- Prepare munge service
administrator's script: configure.sh
BATCH_SERVER=dgiref-batch.fzk.de
BATCH_SERVER_INTERNAL_IP="10.0.171.205"
echo "$BATCH_SERVER_INTERNAL_IP $BATCH_SERVER" >> /etc/hosts
echo "$BATCH_SERVER" > /var/torque/server_name
echo "\$pbsserver $BATCH_SERVER
\$logevent 255
\$ideal_load 4
\$max_load 10
\$usecp *:/home /home
\$usecp *:/srv/nfs/home /srv/nfs/home" > /var/torque/mom_priv/config
scp $BATCH_SERVER:/etc/munge/munge.key /etc/munge/munge.key
chown munge:munge /var/log/munge/munged.log*
chown munge:munge /etc/munge/munge.key
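Because the configure script appends the batch server alias to /etc/hosts, running it a second time adds a duplicate entry. The following sketch appends the alias only when no existing non-comment line already lists the name; the function name add_host_alias and its optional third argument are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: append "<ip> <name>" to a hosts file only if no
# existing non-comment line already lists <name> as a host name field.
add_host_alias() {
    local ip="$1" name="$2" hosts_file="${3:-/etc/hosts}"
    if ! awk -v n="$name" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit found ? 0 : 1 }' "$hosts_file" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    fi
}

# Example (on a WN, as root):
# add_host_alias "10.0.171.205" dgiref-batch.fzk.de
```

The third argument defaults to /etc/hosts and exists only so that the helper can be tried on a scratch file first.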
Proceed
- Enable autoboot and start services
- munge
- pbs_mom
administrator's script: proceed.sh
/etc/init.d/pbs_mom restart
service munge restart
chkconfig munge on
chkconfig pbs_mom on
Initial test
- Check information about WN from batch server
- Start test job for WN
administrator's script: test.sh
wn="wn10"
# show information about wn
qnodes -q $wn
# Try
cat > test.sh <<'EOF'
#!/bin/bash
sleep 60
EOF
qsub -q dgiseq test.sh
# to check supported queues:
# ldapsearch -x -H ldap://<CE_FQDN>:2170 -b mds-vo-name=resource,o=grid
# To detect the jobid:
qstat -Q
qstat -a
# To check the job (replace <jobid> with the identifier printed by qsub):
tracejob <jobid>