cluster:Torque/257

Introduction

TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager providing control over batch jobs and distributed compute nodes. Maui is an advanced cluster scheduler capable of optimizing scheduling and node allocation decisions. It allows site administrators extensive control over which jobs are considered eligible for scheduling, how the jobs are prioritized, and where these jobs are run. Maui supports advance reservations, QOS levels, backfill, and allocation management. Each of these features, if enabled, may require some adjustment on the part of the user to optimize system performance.

TORQUE and Maui together form the batch system through which jobs are submitted to the cluster. A batch system comprises four component types:

  1. Master Node: a batch system has a master node where pbs_server is running. Depending on the needs of the system, a master node may be dedicated to this task or may fulfill the roles of other components as well.
  2. Submit/Interactive Nodes provide an entry point to the system for users to manage their workload. From these nodes, users can submit and track their jobs. Additionally, some sites have one or more nodes reserved for interactive use, such as testing and troubleshooting environment problems. These nodes have the client commands (e.g., qsub, qhold) available.
  3. Compute Nodes are the workhorses of the system. Their role is to execute submitted jobs. On each compute node, pbs_mom runs to start, kill, and manage submitted jobs. It communicates with pbs_server on the master node.
  4. Resources: some systems are organized for the express purpose of managing a collection of resources beyond compute nodes. Resources can include high-speed networks, storage systems, license managers, etc. The availability of these resources is limited and needs to be managed intelligently to promote fairness and increased utilization.

The life cycle of a job can be divided into four stages (basic job flow):

  1. Creation. Typically, a submit script is written to hold all of the job parameters. These parameters could include how long a job should run (i.e., wall time), what resources are necessary to run, and what to execute.
  2. Submission. A job is submitted with the command qsub. Once submitted, the policies set by the administration and technical staff of the site dictate the priority of the job and, therefore, when it will start executing.
  3. Execution. Jobs often spend most of their life cycle executing. While a job is running, its status can be queried with qstat.
  4. Finalization. When a job has completed, by default, the stdout and stderr files are copied to the directory where the job was submitted.
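
The creation, submission, and execution stages above can be sketched as a short shell session. This is a minimal sketch, not this site's actual job script: the file name hello.sh, the job name, and the resource values are illustrative assumptions, and the qsub and qstat calls run only if the Torque client commands are installed.

```shell
#!/bin/sh
# Creation: write a submit script that holds the job parameters.
# File name, job name, and resource values are illustrative assumptions.
workdir=$(mktemp -d)
job_script="$workdir/hello.sh"

cat > "$job_script" <<'EOF'
#!/bin/sh
#PBS -N hello
#PBS -l nodes=1:ppn=1
#PBS -l walltime=00:05:00
# PBS_O_WORKDIR is the directory from which qsub was run.
cd "$PBS_O_WORKDIR"
echo "Hello World from $(hostname)"
EOF

if command -v qsub >/dev/null 2>&1; then
    # Submission: qsub prints the identifier of the new job.
    job_id=$(qsub "$job_script")
    # Execution: query the job while it is queued or running.
    qstat "$job_id"
else
    echo "qsub not found; submit script written to $job_script"
fi
```

With a job name of hello, the stdout and stderr files of the finalization stage would be expected to carry that name as their prefix.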

Package:   torque v. 2.5.7 (MAUI v. 3.2.6p21)
OS:        Scientific Linux version 5.6 64 bit
Server:    dgiref-batch.fzk.de
Manuals:   server / client / wn client


Please open an NGI-DE ticket if you experience any installation or configuration problem.

TORQUE server v.2.5.7

Prepare

Operating system
Scientific Linux version 5.6 64 bit

Configure the following settings for the host:

  • UMD repo
  • NFS-server
  • Create Host Based Authentication for Torque clients. See ssh auth

Note: If you want to use the Torque scheduler instead of the Maui scheduler, leave the Maui package installation and the Maui configuration out of the scripts.


Install

  • Install Torque-server with yum from umd repo
  • Install MAUI scheduler

Configure

  • Create the default configuration for torque-server
  • Configure the server name in /var/torque/server_name
  • List the allowed WNs with their number of CPUs in /var/torque/server_priv/nodes
  • Customize Torque with qmgr
    • Create a queue
    • Set the server parameters (authorized users, timeouts, log level, Torque managers, etc.)
  • Create the munge key for Torque communication
  • Create the Maui scheduler configuration file
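
The configuration steps above could look roughly like the following sketch, which only stages the files in a scratch directory for review and does not touch /var/torque. The worker-node names, CPU counts, queue name, and parameter values are hypothetical; only the server name and the target paths come from this page.

```shell
#!/bin/sh
# Stage the server-side files in a scratch directory for review.
# Worker-node names, CPU counts, and the queue name are hypothetical.
server=dgiref-batch.fzk.de
stage=$(mktemp -d)

# Content for /var/torque/server_name: the batch server host name.
echo "$server" > "$stage/server_name"

# Content for /var/torque/server_priv/nodes: one WN per line, np = CPU count.
cat > "$stage/nodes" <<'EOF'
wn01.example.org np=4
wn02.example.org np=8
EOF

# qmgr directives: one execution queue plus a few server parameters.
cat > "$stage/qmgr.txt" <<EOF
create queue batch queue_type=execution
set queue batch enabled=true
set queue batch started=true
set server default_queue=batch
set server scheduling=true
set server managers += root@$server
set server log_events=511
EOF

# Minimal maui.cfg pointing Maui at the Torque (PBS) resource manager.
cat > "$stage/maui.cfg" <<EOF
SERVERHOST      $server
ADMIN1          root
RMCFG[base]     TYPE=PBS
RMPOLLINTERVAL  00:00:30
EOF

echo "Staged files in $stage:"
ls "$stage"
```

After review, the files would be copied to their target locations and the directives applied on the server by feeding the file to qmgr on standard input (qmgr < qmgr.txt). The munge key is typically generated once on the server, for example with the create-munge-key helper if the installed munge package provides it.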

Proceed


Note: Register pbs_server, munge, and maui as OS services and enable them at boot so that they start automatically.
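
On a SysV-init system such as Scientific Linux 5, this step might be sketched as follows. The sketch assumes the init scripts carry the same names as the daemons listed above, which depends on the installed packages. By default it only prints the commands; setting APPLY=1 (as root) would execute them.

```shell
#!/bin/sh
# Dry-run by default: print each command instead of running it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

# Enable each service at boot and start it now.
enable_services() {
    for svc in "$@"; do
        run chkconfig "$svc" on
        run service "$svc" start
    done
}

# munge goes first because Torque communication is authenticated with it.
enable_services munge pbs_server maui
```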

Initial test

  • From a user account, it should be possible to submit a 'Hello World' job and to open an interactive shell on a WN
  • The job results are written to the files STDIN.o<JOBID> (std-output) and STDIN.e<JOBID> (std-error).
  • Test MAUI
  • The test on the gLite-CE should work as the edginfo user once the gLite packages are configured.
  • To check the status of a job, use the qstat command during the lifetime of the submitted job.
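
A possible form of the 'Hello World' test is sketched below. The job is piped to qsub on standard input, which is why the result files start with STDIN. The helper result_files and the job number 123 are illustrative assumptions, and the Torque commands run only when qsub is installed.

```shell
#!/bin/sh
# Derive the result file names from a job identifier such as
# "123.dgiref-batch.fzk.de": the suffix is the part before the first dot.
result_files() {
    seq=${1%%.*}
    echo "STDIN.o$seq STDIN.e$seq"
}

if command -v qsub >/dev/null 2>&1; then
    # Submit a one-line job on standard input; qsub prints the job ID.
    job_id=$(echo 'echo "Hello World"' | qsub)
    echo "submitted $job_id, expecting: $(result_files "$job_id")"
    # Check the job while it still exists in the queue.
    qstat "$job_id"
    # An interactive shell on a WN would be requested with: qsub -I
else
    echo "qsub not found; example for job 123: $(result_files 123.dgiref-batch.fzk.de)"
fi
```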



TORQUE v.2.5.7 client

Prepare

Operating system
Scientific Linux version 5.6 64 bit

Configure the following settings for the host:

  • UMD repo
  • NFS-client
  • Create Host Based Authentication for Torque server. See ssh auth

Note: The Torque client should also be installed on the server so that Torque can be managed from there.

Install

  • Install Torque-client with yum from umd repo

Configure

  • Configure server name in /var/torque/server_name
  • Copy munge key for torque communication from torque-server
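
After the key has been copied, a quick sanity check along these lines may help. It is a sketch that assumes the common default key path /etc/munge/munge.key (override it with the hypothetical MUNGE_KEY variable) and GNU stat. The munge round trip runs only if the tools are installed.

```shell
#!/bin/sh
# Succeeds when the key file is private to its owner (mode 400 or 600).
key_mode_ok() {
    mode=$(stat -c %a "$1") || return 1
    [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

key=${MUNGE_KEY:-/etc/munge/munge.key}
if [ -e "$key" ]; then
    if key_mode_ok "$key"; then
        echo "$key: permissions look fine"
    else
        echo "$key: should be readable by its owner only (chmod 400)"
    fi
else
    echo "$key not found or not accessible; copy it from the Torque server first"
fi

# Round-trip test of the local munge daemon, if the tools are installed.
if command -v munge >/dev/null 2>&1 && command -v unmunge >/dev/null 2>&1; then
    munge -n | unmunge
fi
```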

Proceed

  • Enable munge autoboot

Initial test

  • Create test script
  • Test submit job with qsub
  • Check queue and trace job

Update

Updating mostly amounts to installing the new package versions.


TORQUE v.2.5.7 wn client

Prepare

Operating system
Scientific Linux version 5.6 64 bit

Configure the following settings for the host:

  • UMD repo
  • Create Host Based Authentication for Torque server and Middleware hosts. See ssh auth

Install

  • Install Torque-mom with yum from umd repo

Configure

Note: Create a hostname alias for the Batch server in the hosts file if a separate internal network is used for communication between the Batch server and the WNs.

Note: All nodes should be specified in /var/torque/server_priv/nodes on the batch server host. See configure batch server.

  • Customize the mom config file
  • Prepare the munge service
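
A minimal mom configuration could be staged as in the sketch below. It assumes the config file lives under the same /var/torque tree used elsewhere on this page (mom_priv/config). The log level and the IP address in the hosts fragment are placeholders, not site values.

```shell
#!/bin/sh
# Stage a minimal pbs_mom configuration in a scratch directory.
server=dgiref-batch.fzk.de
stage=$(mktemp -d)

# Content for /var/torque/mom_priv/config on the WN.
# The backslashes keep the leading "$" of the directives literal.
cat > "$stage/config" <<EOF
\$pbsserver $server
\$logevent 255
EOF

# Optional /etc/hosts line when an internal network is used.
# 192.0.2.10 is a documentation placeholder address.
echo "192.0.2.10 $server" > "$stage/hosts.fragment"

cat "$stage/config"
```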

Proceed

  • Enable autoboot and start services
    • munge
    • pbs_mom

Initial test

  • Check information about WN from batch server
  • Start test job for WN
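
Both checks might be run from the batch server roughly as follows. The WN name is hypothetical and the helper node_state is an illustrative assumption. The Torque commands are called only when they are installed.

```shell
#!/bin/sh
# Extract the "state = ..." value from pbsnodes-style text on stdin.
node_state() {
    awk '$1 == "state" && $2 == "=" { print $3; exit }'
}

wn=${1:-wn01.example.org}   # hypothetical WN name

if command -v pbsnodes >/dev/null 2>&1; then
    # Ask the server what it knows about this WN.
    state=$(pbsnodes -a "$wn" | node_state)
    echo "$wn state: ${state:-unknown}"
    if [ "$state" = "free" ] && command -v qsub >/dev/null 2>&1; then
        # Test job pinned to this WN; it prints the executing host name.
        echo 'hostname' | qsub -l nodes="$wn"
    fi
else
    echo "pbsnodes not found; run this on the batch server"
fi
```

A state of free indicates that the WN has registered with the server and can accept jobs; states such as down or offline usually point to pbs_mom, munge, or name resolution problems.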