cluster:Torque/216/server


See also troubleshooting for this page.


Please open an NGI-DE ticket if you experience any installation or configuration problems.

Torque/216/server

Prepare

Operating system
Scientific Linux version 4.5 64 bit

Optimizing the configuration:


Use a minimal operating system installation without a firewall. To verify that a package is installed, use the command:

  • rpm -qa | grep package_name

Install the following additional packages:

  • yum -y install wget yum rpm make gcc gcc-c++ tar sed zlib openssl
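To confirm that the packages listed above are present after the yum run, a check along the following lines may help. This is a minimal sketch: the function name missing_packages and the PKG_QUERY override are hypothetical names introduced for illustration, and the query command defaults to rpm -q.

```shell
#!/bin/sh
# PKG_QUERY is a hypothetical override hook for this sketch; it defaults to "rpm -q".
PKG_QUERY="${PKG_QUERY:-rpm -q}"

# Print "missing: <name>" for every package the query command does not find.
missing_packages() {
    for pkg in "$@"; do
        $PKG_QUERY "$pkg" >/dev/null 2>&1 || echo "missing: $pkg"
    done
}

# Package names taken from the yum command above.
missing_packages wget make gcc gcc-c++ tar sed zlib openssl
```

No output means every listed package was found.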

After the installation is complete, turn off any unnecessary services (like gpm, sendmail, cups, haldaemon, messagebus, pcmcia, anacron, atd) with the following command:

  • chkconfig <SERVICE> off
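The chkconfig call can be repeated for each service in a loop. The sketch below uses the service names from the text; the DRY_RUN switch and the function name are assumptions of this example, and with the default DRY_RUN=1 the commands are only printed, not executed.

```shell
#!/bin/sh
# Services named in the text above; adjust the list to what is actually installed.
services="gpm sendmail cups haldaemon messagebus pcmcia anacron atd"

# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
disable_services() {
    for svc in $services; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "chkconfig $svc off"
        else
            chkconfig "$svc" off
        fi
    done
}

disable_services
```

Run it with DRY_RUN=0 as root to apply the changes. chkconfig only changes the boot-time setting, so a service that is already running stays up until it is stopped or the machine is rebooted.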

Configure the following settings for the server:


Firewall configuration

If firewalls are running on the server or node machines, make sure they allow connections on the appropriate ports for each machine. By default, the TORQUE pbs_mom daemons use UDP ports 1023 and below, and the pbs_server and pbs_mom daemons use ports 15001-15004.


Note: Firewall-related issues are often associated with server-to-MOM communication failures and messages such as 'premature end of message' in the log files.

Also, the tcpdump program can be used to verify the correct network packets are being sent.
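As an illustration of the port settings above, the following sketch prints iptables rules that accept the default port range 15001-15004. Opening both TCP and UDP, appending to the INPUT chain, and the interface name eth0 in the tcpdump comment are assumptions of this example, not requirements stated on this page.

```shell
#!/bin/sh
# Print (not apply) iptables rules for the default TORQUE ports 15001-15004.
# Assumption: both TCP and UDP are opened on the INPUT chain.
torque_fw_rules() {
    for proto in tcp udp; do
        echo "iptables -A INPUT -p $proto --dport 15001:15004 -j ACCEPT"
    done
}

torque_fw_rules

# To watch the traffic while a node contacts the server (as root; eth0 is assumed):
#   tcpdump -n -i eth0 'port 15001 or port 15002 or port 15003 or port 15004'
```

Review the printed rules before running them as root, and place them ahead of any rule that rejects the traffic.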

Install

The D-Grid reference installation uses the Torque batch system for job management. The installation covers three kinds of machines:

  1. Torque server (runs pbs_server and the Maui scheduler)
  2. Torque clients (gLite-CE, Globus GRAM and UNICORE NJS), which submit batch jobs
  3. Worker nodes (the cluster machines on which the pbs_mom daemon runs)

WARNING: The three middleware front ends for job submission (gLite-CE, Globus GRAM and UNICORE NJS) must be configured as Torque clients with Maui. The following packages are required on each of the three front ends:

  • torque
  • torque-client
  • maui
  • maui-client

Unless stated otherwise, perform the installation and configuration steps as the root user.

Selecting the cluster scheduler is an important decision that significantly affects cluster utilization, responsiveness, availability, and intelligence. The default TORQUE scheduler, pbs_sched, is very basic and will provide poor utilization of your cluster's resources. Other options, such as the Maui Scheduler, are highly recommended. If using pbs_sched, start this daemon now. If using Maui, refer to the next installation steps.


Configure

  • Configure torque server
    • Set up the WN addresses. The configuration file /var/spool/pbs/server_priv/nodes on the Torque server must list the host names of all WNs.
    • Fill in the configuration file /etc/hosts.equiv with the FQDNs of all Torque clients, e.g. gLite-CE, Globus GRAM and UNICORE NJS.
  • Configure Maui
  • Create the users edginfo and rgma for gLite, with the same UIDs as on the gLite-CE. gLite requires these user accounts; the gLite-CE queries the Maui server for this information.
  • Create queues
    • The D-Grid operational concept provides for two queues: dgiseq for sequential jobs and dgipar for parallel jobs (the currently assigned queue limits are based on the values used at GridKa).
  • Export /var/spool/pbs/server_logs via NFS
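The configuration steps above can be sketched as a small script. All host names, the np value and the NFS client are placeholders, and the helper functions are hypothetical names introduced for this example. The queue commands create the queues only; the limits mentioned above are not set here.

```shell
#!/bin/sh
# Sketch only: host names, np values and the NFS client are placeholders.

# Emit one line per worker node in the server_priv/nodes format "<hostname> np=<cores>".
make_nodes_file() {
    np=$1; shift
    for host in "$@"; do
        echo "$host np=$np"
    done
}

# Emit the qmgr directives that create and enable one execution queue.
make_queue_cmds() {
    queue=$1
    echo "create queue $queue queue_type=execution"
    echo "set queue $queue enabled=true"
    echo "set queue $queue started=true"
}

make_nodes_file 4 wn01.example.org wn02.example.org
make_queue_cmds dgiseq
make_queue_cmds dgipar

# Applying the output (as root on the Torque server):
#   make_nodes_file 4 wn01.example.org wn02.example.org > /var/spool/pbs/server_priv/nodes
#   make_queue_cmds dgiseq | qmgr
#   make_queue_cmds dgipar | qmgr
# /etc/hosts.equiv takes one client FQDN per line, for example:
#   ce.example.org
# A possible /etc/exports entry for the log directory, activated with "exportfs -ra":
#   /var/spool/pbs/server_logs ce.example.org(ro,sync)
```

Printing the content first makes it easy to review before anything is written to the server configuration.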


Proceed

To start or stop the daemons, use the following commands:


Initial test

The pbs_server daemon was started on the TORQUE server when the torque.setup file was executed or when it was manually configured. It must now be restarted so it can reload the updated configuration changes.
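A restart followed by a few basic checks might look like the sketch below. The run helper and the DRY_RUN switch are assumptions of this example; with the default DRY_RUN=1 the commands are only printed, so the sequence can be reviewed first.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1 (the default in this sketch), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

restart_and_check() {
    run qterm -t quick    # shut down pbs_server, leaving running jobs in place
    run pbs_server        # start it again so that it rereads its configuration
    run qmgr -c 'p s'     # print the server and queue configuration
    run pbsnodes -a       # list all nodes with their current state
    run qstat -a          # list the jobs known to the server
}

restart_and_check

# A simple submission test, run as a non-root user:
#   echo "sleep 30" | qsub
```

Run it with DRY_RUN=0 as root on the TORQUE server to execute the commands.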


Update

Upgrade
Upgrading TORQUE can generally be done without shutting down the whole cluster and disrupting running jobs. Simply build and install the new version and restart the daemons. Here is the safest procedure for upgrading TORQUE:
  1. Kill the scheduler.
  2. Wait a few minutes for all new jobs to complete startup.
    1. Check that all running jobs in qstat -a show some elapsed walltime.
  3. Restart pbs_server.
  4. Verify the new pbs_server is working correctly.
    1. Nodes should come up (not down or state-unknown).
    2. Job walltimes should increase.
  5. If upgrading from an earlier 2.1 build, MOMs can automatically restart themselves with:
    1. momctl -q enablemomrestart=1 -h :ALL
    2. Start the scheduler.
  6. If upgrading from 2.0 or earlier,
    1. Restart MOMs on all idle nodes.
    2. Wait a minute, make sure node and job states are updating correctly.
    3. Delete the previous static archive library files (libattr.a, libcmds.a, liblog.a, libnet.a, libpbs.a, libsite.a).
    4. Mark busy nodes offline.
    5. Start the scheduler.
    6. Restart MOMs on offline nodes after their jobs exit.
Note: All external software such as maui, perl-PBS, or pbs_python that was built with the 2.0.x static archives will need to be rebuilt with the newer 2.1.x shared libraries.
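For the step "Mark busy nodes offline" in the upgrade procedure from 2.0 or earlier, the commands can be generated per node as in the following sketch. The node names and the function name are placeholders for illustration; which nodes are busy has to be determined beforehand, for example from the output of pbsnodes -a.

```shell
#!/bin/sh
# Print one "pbsnodes -o" command per busy node; -o marks a node offline
# so that no new jobs are scheduled on it.
offline_cmds() {
    for node in "$@"; do
        echo "pbsnodes -o $node"
    done
}

# wn01 and wn02 are placeholder node names.
offline_cmds wn01 wn02

# Once the jobs on a node have exited and its MOM has been restarted,
# the offline mark can be cleared again:
#   pbsnodes -c wn01
```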

