cluster:Torque


TORQUE (Tera-scale Open-source Resource and QUEue manager) is a resource manager that provides control over batch jobs and distributed compute nodes. Maui is an advanced cluster scheduler capable of optimizing scheduling and node allocation decisions. It gives site administrators extensive control over which jobs are considered eligible for scheduling, how the jobs are prioritized, and where they are run. Maui supports advance reservations, QOS levels, backfill, and allocation management. Each of these features, if enabled, may require some adjustment on the part of the user to optimize system performance.

TORQUE and Maui together form the batch system used to submit jobs to the cluster. A batch system comprises four types of components:

  1. Master Node. A batch system has a master node where pbs_server runs. Depending on the needs of the system, the master node may be dedicated to this task or may also fulfill the roles of other components.
  2. Submit/Interactive Nodes provide an entry point to the system where users manage their workload. From these nodes, users can submit and track their jobs. Additionally, some sites have one or more nodes reserved for interactive use, such as testing and troubleshooting environment problems. These nodes have the client commands (e.g., qsub, qhold) available.
  3. Compute Nodes are the workhorses of the system. Their role is to execute submitted jobs. On each compute node, pbs_mom runs to start, kill, and manage submitted jobs. It communicates with pbs_server on the master node.
  4. Resources. Some systems are organized for the express purpose of managing a collection of resources beyond compute nodes. Resources can include high-speed networks, storage systems, license managers, etc. The availability of these resources is limited, so they need to be managed intelligently to promote fairness and increase utilization.

The life cycle of a job can be divided into four stages (basic job flow):

  1. Creation. Typically, a submit script is written to hold all of the job parameters. These parameters can include how long the job should run (i.e., wall time), what resources it needs, and what to execute.
  2. Submission. A job is submitted with the command qsub. Once it is submitted, the policies set by the administration and technical staff of the site dictate the priority of the job and, therefore, when it will start executing.
  3. Execution. Jobs often spend most of their life cycle executing. While a job is running, its status can be queried with qstat.
  4. Finalization. When a job has completed, by default, the stdout and stderr files are copied to the directory from which the job was submitted.
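
The four stages above can be sketched in a few lines of Python. This is only a minimal illustration, not a site-specific recipe: the helper names (build_submit_script, submit, job_status, default_output_files), the job name "hello", and the requested resources are all assumptions made for the example. The #PBS directives (-N, -l walltime, -l nodes=...:ppn=...) and the PBS_O_WORKDIR variable are standard TORQUE features, but the values a job should request depend on the site's queues and policies.

```python
import subprocess


def build_submit_script(job_name, walltime, nodes, ppn, command):
    """Creation: return the text of a submit script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",               # job name
        f"#PBS -l walltime={walltime}",      # maximum run time, HH:MM:SS
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        'cd "$PBS_O_WORKDIR"',               # directory where qsub was invoked
        command,                             # what to execute
    ]
    return "\n".join(lines) + "\n"


def submit(script_path):
    """Submission: run qsub and return the job ID it prints on stdout."""
    result = subprocess.run(
        ["qsub", str(script_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def job_status(job_id):
    """Execution: return the raw qstat output for one job."""
    result = subprocess.run(
        ["qstat", job_id],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def default_output_files(job_name, job_id):
    """Finalization: default stdout/stderr file names, assuming no -o/-e/-j options."""
    sequence_number = job_id.split(".")[0]  # job IDs look like <number>.<server>
    return f"{job_name}.o{sequence_number}", f"{job_name}.e{sequence_number}"


# Example values only; nothing is submitted here.
script = build_submit_script("hello", "00:10:00", 1, 1, "echo hello from $(hostname)")
print(script)
```

On a submit node, the script text would be written to a file (for example hello.pbs, a name chosen here for illustration), passed to submit(), and the returned job ID handed to job_status() while the job runs. After completion, default_output_files() gives the names of the files to look for in the submission directory.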

Package:   torque v. 2.3.6 (MAUI v.326p21)
 os:       Scientific Linux version 5.4, 64-bit
 server:   dgiref-batch.fzk.de
 manuals:  server (maui) / client (maui)

