cluster:Torque/236/server

Please open an NGI-DE ticket if you experience any installation or configuration problems.

TORQUE server v.2.3.6

Prepare

Operating system
Scientific Linux version 5.4 64 bit
Note: If you want to use the Maui scheduler instead of the TORQUE scheduler, do not start TORQUE's pbs_sched daemon after installing TORQUE.

Firewall configuration

If firewalls are running on the server or node machines, allow connections on the appropriate ports for each machine. By default, TORQUE pbs_mom daemons use UDP ports 1023 and below (privileged ports), and the pbs_server and pbs_mom daemons use TCP and UDP ports 15001-15004.

Note: Firewall problems often show up as server-to-MOM communication failures and as messages such as 'premature end of message' in the log files.

Also, the tcpdump program can be used to verify the correct network packets are being sent.
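
As a rough sketch of the firewall step, the functions below only print commands; they change nothing by themselves. The function names, the default subnet 192.0.2.0/24 and the interface eth0 are placeholders chosen for illustration, and opening the range for both TCP and UDP is an assumption that should be checked against your site policy.

```shell
#!/bin/sh
# Print (do not apply) iptables rules that accept traffic on the default
# TORQUE port range 15001-15004 from one source network.
# The default subnet is a placeholder, not a value from this guide.
torque_fw_rules() {
    src="${1:-192.0.2.0/24}"
    for proto in tcp udp; do
        printf 'iptables -I INPUT -s %s -p %s --dport 15001:15004 -j ACCEPT\n' "$src" "$proto"
    done
}

# Print a tcpdump command that watches the same ports on one interface.
# The default interface name is an assumption.
torque_tcpdump_cmd() {
    iface="${1:-eth0}"
    printf "tcpdump -n -i %s 'port 15001 or port 15002 or port 15003 or port 15004'\n" "$iface"
}
```

For example, `torque_fw_rules 10.1.0.0/16` prints two rules that can be reviewed and then run as root on the server and on each WN.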

Install

  • Torque client and server should be installed on the torque server host (client includes qmgr).

Configure

  • Configure torque server
    • Set up the WN addresses. The configuration file /var/spool/pbs/server_priv/nodes on the TORQUE server must list the host names of all WNs.
    • Fill in the configuration file /etc/hosts.equiv with the FQDNs of all TORQUE clients, e.g. gLite-CE, Globus GRAM and Unicore NJS.
  • Configure maui
  • Create the gLite users edginfo and rgma with the same UIDs as on the gLite-CE. gLite requires these accounts: the gLite-CE uses them to query the MAUI server.
  • Create queues
    • The D-Grid operational concept provides for two queues: dgiseq for sequential jobs and dgipar for parallel jobs (the currently assigned queue limits are based on the values used at GridKa).
  • Export /var/spool/pbs/server_logs via NFS.
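
The two configuration files from the first step could be generated along the following lines. This is only a sketch: the function names and the example.org host names are hypothetical, and the `np=` value (processors per node) has to match your hardware.

```shell
#!/bin/sh
# Print one nodes-file line per worker node: "<hostname> np=<cores>".
# Arguments: cores per node, then the WN host names.
torque_nodes_lines() {
    np="$1"; shift
    for host in "$@"; do
        printf '%s np=%s\n' "$host" "$np"
    done
}

# Print one FQDN per line, the layout used in /etc/hosts.equiv.
torque_hosts_equiv_lines() {
    for fqdn in "$@"; do
        printf '%s\n' "$fqdn"
    done
}
```

A possible use on the server, after reviewing the output, would be `torque_nodes_lines 8 wn01.example.org wn02.example.org > /var/spool/pbs/server_priv/nodes`.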
Note: Do not forget to update the queues for every VO with:
  • qmgr -c "set queue dgiseq acl_groups += ${vo_names}"
  • qmgr -c "set queue dgipar acl_groups += ${vo_names}"

More: http://www.clusterresources.com/torquedocs21/4.1queueconfig.shtml
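
A minimal sketch of the queue setup, assuming plain execution queues with group ACLs switched on. The helper name and the VO group names passed to it are hypothetical, and queue limits (which this guide bases on GridKa values) are deliberately left out. The function only prints qmgr commands so that they can be checked before they are run.

```shell
#!/bin/sh
# Print the qmgr commands that create one execution queue and restrict it
# to the given VO groups. Nothing is executed here; pipe the output to sh
# on the TORQUE server once it has been reviewed.
torque_queue_cmds() {
    queue="$1"; shift
    printf 'qmgr -c "create queue %s queue_type=execution"\n' "$queue"
    printf 'qmgr -c "set queue %s acl_group_enable = true"\n' "$queue"
    for vo in "$@"; do
        printf 'qmgr -c "set queue %s acl_groups += %s"\n' "$queue" "$vo"
    done
    printf 'qmgr -c "set queue %s enabled = true"\n' "$queue"
    printf 'qmgr -c "set queue %s started = true"\n' "$queue"
}
```

For example, `torque_queue_cmds dgiseq vo1 vo2` prints five commands, two of them `acl_groups +=` lines.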

Proceed

To start / stop use the commands:


Initial test

  • The output of the command qstat -Q should look something like:
# Queue   |   Max  | Tot |  Ena |  Str  | Que  | Run  | Hld  | Wat  | Trn  | Ext T
# dgiseq  |     0  |   0 |  yes |  yes  |   0  |   0  |   0  |   0  |   0  |   0 E
# dgipar  |     0  |   0 |  yes |  yes  |   0  |   0  |   0  |   0  |   0  |   0 E
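
The check can also be scripted. The sketch below reads qstat -Q style output on standard input and tests whether a queue is both enabled and started; it assumes that Ena and Str are the fourth and fifth columns, as in the listing above, and the function name is hypothetical.

```shell
#!/bin/sh
# Read `qstat -Q` style output on stdin and succeed only if the named
# queue exists and is both enabled (Ena) and started (Str).
# "#" and "|" decorations, as in the listing above, are stripped first.
torque_queue_ready() {
    awk -v q="$1" '
        { gsub(/[#|]/, " ") }     # drop decoration; awk re-splits the fields
        $1 == q { found = 1; ok = ($4 == "yes" && $5 == "yes") }
        END { exit ((found && ok) ? 0 : 1) }
    '
}
```

Typical use would be `qstat -Q | torque_queue_ready dgiseq && echo "dgiseq is ready"`.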

Update

Upgrading TORQUE can generally be done without shutting down the whole cluster and disrupting running jobs. Simply build and install the new version and restart the daemons. Here is the safest procedure for upgrading TORQUE:
  1. Kill the scheduler.
  2. Wait a few minutes for all new jobs to complete startup.
    1. All running jobs in qstat -a have some elapsed walltime.
  3. Restart pbs_server.
  4. Verify the new pbs_server is working correctly.
    1. Nodes should come up (not down or state-unknown).
    2. Job walltimes should increase.
  5. If upgrading from an earlier 2.1 build, MOMs can automatically restart themselves with:
    1. momctl -q enablemomrestart=1 -h :ALL
    2. Start the scheduler.
  6. If upgrading from 2.0 or earlier,
    1. Restart MOMs on all idle nodes.
    2. Wait a minute, make sure node and job states are updating correctly.
    3. Delete the previous static archive library files (libattr.a, libcmds.a, liblog.a, libnet.a, libpbs.a, libsite.a).
    4. Mark busy nodes offline.
    5. Start the scheduler.
    6. Restart MOMs on offline nodes after their jobs exit.
Note: All external software such as maui, perl-PBS, or pbs_python that was built with the 2.0.x static archives will need to be rebuilt with the newer 2.1.x shared libraries.
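
Step 2 of the procedure (waiting until every running job shows some elapsed walltime) can be checked with a small filter such as the one below. It is a sketch under assumptions: the job state is taken to be the second-to-last column of qstat -a and the elapsed time the last one, a job that has not started counting is taken to show "--" there, and the function name is hypothetical.

```shell
#!/bin/sh
# Read `qstat -a` output on stdin and print the IDs of running jobs whose
# elapsed-time column is still "--". Assumes at least 11 columns per job
# line, with the state second-to-last and the elapsed time last.
torque_jobs_starting() {
    awk 'NF >= 11 && $(NF-1) == "R" && $NF == "--" { print $1 }'
}
```

If `qstat -a | torque_jobs_starting` prints nothing, no running job is still in startup and pbs_server can be restarted.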

