GridEngine User Guide

Table of Contents

Introduction
Setting Up Your Environment
GridEngine Script Generating Tool (GEST)
Basics of Using GE

  • ARL MSRC Filesystems
  • ARL GE Job Policies
  • Simple Job script
  • Submitting Jobs to GE
  • Syntax for GE Job Submission
  • Platform Complexes
  • CPU Time Complexes for Serial Jobs
  • Parallel Environments
  • Additional Info on Job Submission for Dedicated Nodes
  • Embedding qsub options in a script
  • Checking the status of your jobs

More Basics of Using GE (optional)

  • Abaqus, MSI, Fluent and LS-Dyna License Tracking in GE
  • Interactive Jobs
  • GE Graphical User Interface: qmon

FAQ

  • Why does my job not start?
  • How do I launch MPI jobs?
  • Credentials error message
  • What queues can my job run in?
  • "poe_file does not exist" Error Message
  • "tcsh: Permission denied" Error Message

Beyond the Basics in Using GE

  • More qsub options
  • Other useful GE commands
  • Tar your files to reduce file unmigration times
  • GE Support for Totalview Debugger for MPI Jobs on the IBM
  • GE Support for Debug Jobs
  • Special Feature for Parametric Study
  • Serial Post Processing Work
  • CTH Register for stop_now file

Advanced Use of GE

Introduction

GridEngine (GE) is the queuing system used at ARL on our older machines. GridEngine is an important interface between users and the HPC machines: users submit jobs through GE and check on their status with it. Currently, one IBM SP4 (shelton) and one Linux cluster (powell) execute under the control of GE. Powell is available only under the HPC reservation system; you can reserve dedicated nodes on powell via the web-based HPC reservation system available from our main web page.

GE's job is to start each job as soon as possible with the resources that job requires. When a user submits a job, the user specifies what the job needs in terms of machine type (IBM or Linux), CPU time, memory, etc. GE then determines when and where to run the job, based on the resources requested as well as other factors such as the priority the project has in the GE share tree, queue configuration, machine load, and current memory utilization.

When a user submits a job, GE sometimes has multiple machines from which to choose. Since the user will not know at submission time which machine the job will run on, the job script must be written so that it can execute on any of the machines that match the resources requested. Since the home filesystem is NFS mounted on all machines, this is easy to do, and Simple Job Script explains how.

First click on Setting up your environment for GE to see what to add to your initialization files to use GE. Then try our new GridEngine Script Generating Tool (GEST). For basic information, click on the links under "Basics of Using GE" to see the fundamentals. Once you are comfortable with the first three sections under the Basics, you are ready to run your batch jobs under GE. If you need to run an interactive job, or you wish to use the GUI (graphical user interface), click on the appropriate link. When you are ready to learn more about GE, click on the links under "Beyond the Basics in Using GE."

Setting up your environment for GE

Execute the commands below appropriate for your shell. Copy them into your .cshrc (csh or tcsh users) or .profile (Korn or Bourne shell users) so they are automatically executed each time you log in.

For csh:

if ( -f /usr/msrc/modules/3.1.6/init/csh ) then
    source /usr/msrc/modules/3.1.6/init/csh
    module load Master modules
endif

For tcsh:

if ( -f /usr/msrc/modules/3.1.6/init/tcsh ) then
    source /usr/msrc/modules/3.1.6/init/tcsh
    module load Master modules
endif

For sh:

if [ -f /usr/msrc/modules/3.1.6/init/sh ]
then
    . /usr/msrc/modules/3.1.6/init/sh
    module load Master modules
fi

For ksh:

if [ -f /usr/msrc/modules/3.1.6/init/ksh ]
then
    . /usr/msrc/modules/3.1.6/init/ksh
    module load Master modules
fi

To avoid problems with terminal settings in a batch job, add the following to your .login file before your 'stty' command, or any command referencing the 'term' variable:

if ( $?JOB_NAME ) then
    exit
endif

Once you have sourced the appropriate initialization file, your PATH and MANPATH environment variables are set up to execute GE commands and to get help on them. To get help with any GE command, issue the command with the -help switch, or use man with the command name to see its man page.

command -help

man command
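
For instance, for the qsub command:

qsub -help
man qsub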

The GE commands are stored in /usr/ge/bin. After the initialization above has run, execute "which qsub" and verify that you are getting the qsub from /usr/ge/bin.
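
For example (the % prompt and output shown are illustrative):

% which qsub
/usr/ge/bin/qsub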

ARL MSRC Filesystems

The ARL MSRC has several filesystems that GE users should be aware of. Each user is given a home filesystem on disk, plus an archive filesystem on joice/bob. The archive filesystem is used for storage of files before and after execution of jobs. In the job execution script, the input files should be copied to /usr/var/tmp/user_id/... since the temp directory /usr/var/tmp has the fastest I/O. After execution completes, the output files should be copied to the archive filesystem for safekeeping. See the sample script in these Web pages for an example of how to do this.

Attribute / Home Filesystem / Archive Filesystem / Execution Space
Pathname / /home / /archive / /usr/var/tmp
Hardware / Sun Fileservers / Sun Fileservers (joice, bob) / Local disks
HPC Machines Served / IBM, Linux / IBM, Linux / IBM, Linux
Type / Shared (NFS) / Shared (NFS) / Unique per machine
Purpose / Login; small file storage / Long term storage / Job Execution
Capacity / 1 Gbyte limit per user / No space limit / Several hundred Gbytes
File Lifetime / No time limit / No time limit / 14 days
File Migration / No / Yes / No
Pros / No delay in access, long term / Unlimited capacity, long term / Fastest access
Cons / Limited capacity / Delay in file demigration; slow access (NFS) / Limited capacity and lifetime, not backed up

ARL GE Job Policies

ARL has a limit of 100 running or pending jobs per user. The GE share tree ensures that no single user can hog system resources at the expense of other users. ARL users may run jobs as follows:

  • standard project jobs up to 96 hours/processor
  • Challenge jobs
  • debug jobs (10 minute per processor CPU limit) that start within minutes
  • background jobs up to 48 CPU hours/processor

Simple Job Script

The home filesystem is NFS mounted to all HPC machines for your convenience. Although NFS makes accessing files convenient, it does incur significant overhead, which slows down file accesses. Therefore, users should run their jobs in /usr/var/tmp, which provides much better I/O performance.

While the home filesystem is accessible on all machines, /usr/var/tmp is local to each machine and is distinct (i.e., /usr/var/tmp on one machine is NOT the same as /usr/var/tmp on another machine). GE creates a temporary directory for each job and gives it a unique name. This directory, in /usr/var/tmp, can be referenced as $TMP or $TMPDIR. Alternatively, you can establish your own, such as /usr/var/tmp/$LOGNAME/$JOB_ID. Using either of these temporary directories, a user's GE script should do the following:

  1. copy input tarfile from $HOME to /usr/var/tmp/...
  2. untar tarfile to create input files
  3. execute in /usr/var/tmp/...
  4. tar output files
  5. copy the output tarfile back to $HOME or the archive filesystem

The home filesystem has slower access, but provides a permanent storage place for files. /usr/var/tmp does not support long-term storage, but provides good I/O bandwidth for execution. The directory $TMP created by GE is removed at the end of the job. To avoid this, mkdir your own subdirectory in /usr/var/tmp/$LOGNAME.

A very simple GE script is:

#!/bin/csh
# build a unique working directory under /usr/var/tmp
set TMPD=/usr/var/tmp/$LOGNAME/$JOB_ID
echo TMPD is $TMPD
mkdir -p $TMPD
# stage the input tarfile from the home directory and unpack it
cp input_tarfile $TMPD
cd $TMPD
tar xf input_tarfile
# run the executable (assumed to have been unpacked from the input tarfile)
./a.out
# bundle the results and copy them to the archive filesystem for safekeeping
tar cf output_tarfile output1 results/output2 results/output3
cp output_tarfile /archive/army/thompson
echo job ended on `hostname` at `date`

When your GE script begins execution, it logs in as you and executes your initialization files (.cshrc, .profile, .login), so your job starts in your home directory. The mkdir and cp lines in the script above create a working subdirectory and copy the input tarfile from your home directory into it, giving your job a unique directory to work in. The job then changes to that directory, executes, and finally copies the output tarfile to the archive filesystem for safekeeping. Many other things can be done in a script, but this basic script will run on any of the machines and will give good I/O performance.

The $TMP variable defined by GE includes the GE job number and is thus unique. Because it is unique, you can run multiple GE jobs, even on one machine, without worrying that files generated by one job will be affected by another. This directory is removed by GE at the end of the job, so be sure to copy output files to your home or archive filesystem at the end of your script. If you instead define your temporary (work) directory as /usr/var/tmp/$LOGNAME/$JOB_ID, that directory and its files will remain after the job completes. This is helpful in debugging, but once your script works, it is advisable to use $TMP so as not to leave unneeded files in /usr/var/tmp.
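
For reference, here is a minimal sketch of the same job using the GE-provided $TMP directory rather than a user-created one (it assumes, as above, that the executable is packed in input_tarfile and that /archive/army/thompson is your archive directory):

#!/bin/csh
# $TMP is created by GE for this job and removed automatically when the job ends
cp input_tarfile $TMP
cd $TMP
tar xf input_tarfile
./a.out
# copy results out before GE deletes $TMP
tar cf output_tarfile output1 results/output2 results/output3
cp output_tarfile /archive/army/thompson
echo job ended on `hostname` at `date`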

Submitting Jobs to GE

Jobs are submitted to GE using the qsub(1) command or via the qmon GUI. The GE graphical user interface qmon, which is explained in GE GUI qmon, has the same functionality as qsub.

The job script that you submit must be on the host from which you are submitting the job (the submit host). Upon submission, GE saves a copy of the script, so after you qsub a script you are free to edit it and submit it again as another job. As soon as an execution host that meets the needs of your job is ready to run it, GE transfers the script to that host and executes it.

There is a small machine called qmaster that controls job initiation, job accounting, and many other GE functions. All submitted jobs are processed by qmaster, and it determines which execution host will run your job. There is no relationship between the machine you submit a job on and the machine it runs on. GE's qmaster will determine when and where to run your job based on data it receives from each machine about how busy the machine is (load factor), how much memory and swap space is currently used, how many jobs of the same type are already running on that machine (as well as globally), how your priority as a user compares to other pending jobs, etc.

Jobs are submitted in two ways, depending on whether they are serial (single CPU) or parallel. For all jobs, use the -l option to specify a complex, which tells GE the machine type, CPU time, and memory your job needs. For parallel jobs, also use the -pe option to specify a parallel environment, followed by the number of processors.

Syntax for GE Job Submission:

When submitting a GE job, one must specify to GE what the job requires. Normally this includes the platform type and CPU time, though other items may also be specified. For parallel jobs, a PE (parallel environment) must be specified. The platform type (IBM or Linux) is specified by a platform complex, and the CPU time is specified by a CPU time complex. Each complex is preceded by "-l". The PE is preceded by "-pe" and followed by the quantity of processors requested.

Serial job: qsub -l platform_complex -l cpu_time_complex script
Parallel job: qsub -l platform_complex -pe pe_name num_CPU script

The lists of complexes and PEs are shown below. Most complexes allow jobs up to 4 GBytes of memory; there are special complexes for jobs that require more.

For example, to submit a 24 hour, 4 GByte, serial IBM job, use the following command:

qsub -l ibmp4 -l 24hr job_script

Also, there are priority queues available on all machines. These queues require special permission. Contact the ARL Helpdesk if you wish to use the priority queues.

Note: The values in the tables below change from time to time as we tune GE. To find the current values for a queue, execute:

qconf -sq queue_name

and look for h_data (memory) and h_cpu (CPU time in seconds).
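
For example, to see these limits for a hypothetical queue named ibm_24hr (substitute a queue name from your site):

qconf -sq ibm_24hr | egrep 'h_cpu|h_data'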

Platform Complexes

Complex Name / Description
linux / to run on powell
ia32 / to run on powell
ibmp4 / to run on an IBM SP4

CPU Time Complexes for Serial Jobs

Complex Name / CPU Limit
4hr / 4 hours
12hr / 12 hours
24hr / 24 hours
48hr / 48 hours
96hr / 96 hours

Specification of a PE (parallel environment) is required for parallel jobs. The PE is specified in the qsub command by "-pe PE_name number_proc".

For example:

qsub -pe pe_24hr 12 -l ibmp4 script

to run a shared memory job that requires up to 24 hours/processor and 12 CPUs on shelton.

PE (Parallel Environments)

New PEs (parallel environments) were defined for GE effective April 4, 2001 to cover the IBM and the HPCMO job categories of (1) Urgent, (2) Challenge, (3) Priority, (4) Standard, and (5) Background.

PE Category / PE Name / CPU Limit / Comments
Urgent / pe_urgent / custom / only for projects declared urgent by the HPCMP Director
Challenge / pe_chal_12hr / 12 hour / only for Challenge projects
Challenge / pe_chal_24hr / 24 hour / only for Challenge projects
Challenge / pe_chal_48hr / 48 hour / only for Challenge projects
Challenge / pe_chal_96hr / 96 hour / only for Challenge projects
Challenge / pe_chal_240hr / 240 hour / only for Challenge projects
Priority / pe_pri / custom / requires approval
Standard / pe_4hr / 4 hour / shared memory jobs
Standard / pe_12hr / 12 hour / shared memory jobs
Standard / pe_24hr / 24 hour / shared memory jobs
Standard / pe_48hr / 48 hour / shared memory jobs
Standard / pe_96hr / 96 hour / shared memory jobs
Standard / mpi_4hr_ibm_p4 / 4 hour / IBM MPI jobs (multinode)
Standard / mpi_12hr_ibm_p4 / 12 hour / IBM MPI jobs (multinode)
Standard / mpi_24hr_ibm_p4 / 24 hour / IBM MPI jobs (multinode)
Standard / mpi_48hr_ibm_p4 / 48 hour / IBM MPI jobs (multinode)
Standard / mpi_96hr_ibm_p4 / 96 hour / IBM MPI jobs (multinode)
Standard / mpi_4hr_ibm_p4_dnode / 4 hour / IBM MPI jobs (dedicated nodes)
Standard / mpi_12hr_ibm_p4_dnode / 12 hour / IBM MPI jobs (dedicated nodes)
Standard / mpi_24hr_ibm_p4_dnode / 24 hour / IBM MPI jobs (dedicated nodes)
Standard / mpi_48hr_ibm_p4_dnode / 48 hour / IBM MPI jobs (dedicated nodes)
Standard / mpi_96hr_ibm_p4_dnode / 96 hour / IBM MPI jobs (dedicated nodes)
Reservation / mpi_resv_glinux / per reservation / Linux MPI jobs
Reservation / mpi_resv_glinux_dnode / per reservation / Linux MPI jobs (dedicated nodes)
Reservation / mpi_resv_glinux_gcc / per reservation / Linux MPI jobs compiled with gcc
Background / pe_background / 48 hours / IBM shared memory jobs
Background / pe_background_mpi_ibm_p4 / 24 hours / IBM SP4 MPI jobs
Debug / pe_debug / 10 min / IBM jobs
Debug / mpi_debug_ibm_p4 / 10 min / IBM MPI debug jobs (limit 4 proc)
Interactive / pe_interactive / 4 or 12 hours / interactive job

The execution of qsub will return a message saying that your job has been submitted and will give you the GE job number assigned to your job. Save this job number, since it is used to reference your job. By default, the standard output and standard error files generated by your job include this GE job number in their names, so it is easy to associate them with this particular job.
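
For example (the job number is illustrative, and the exact wording of GE's message may vary slightly):

% qsub -l ibmp4 -l 24hr job_script
your job 123456 ("job_script") has been submitted

After the job finishes, the standard output and standard error appear by default in files named after the script and the job number, e.g. job_script.o123456 and job_script.e123456.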

Additional Info on Job Submission for Dedicated Nodes

GE on the IBM or Linux Cluster allows the use of dedicated nodes by requesting one of the dedicated node parallel environments.

IBM SP4:

mpi_4hr_ibm_p4_dnode

mpi_12hr_ibm_p4_dnode

mpi_24hr_ibm_p4_dnode

mpi_48hr_ibm_p4_dnode

mpi_96hr_ibm_p4_dnode

Linux:

mpi_resv_glinux

mpi_resv_glinux_dnode

mpi_resv_glinux_gcc

When requesting dedicated nodes, the user must specify a multiple of the number of processors per node (32 for IBM SP4, or 2 for Linux) as the number of slots when requesting the PE on the qsub command. For example, to submit a job with a 4 hour limit to run on 4 dedicated IBM SP4 nodes, the user would specify:

qsub -pe mpi_4hr_ibm_p4_dnode 64 job_script

GE also supports several special environment variables that can be set when submitting a job to run with fewer MPI tasks than slots.

SGE_TOTAL_MPI_TASKS
    the total number of MPI tasks to run in the job. The tasks will be distributed in a round-robin fashion among the dedicated nodes selected by GE for the job.

SGE_MPI_TASKS_PER_NODE
    the number of MPI tasks to run on each dedicated node.

These environment variables can be specified on the qsub command line when submitting the job by using the qsub -v option. For example, to run 32 total MPI tasks with 64 total CPUs on 4 dedicated nodes, the user would specify:

qsub -v SGE_TOTAL_MPI_TASKS=32 -pe mpi_4hr_ibm_p4_dnode 64 job_script

To run with 'n' MPI tasks per dedicated node:

qsub -pe mpi_4hr_ibm_p4_dnode N -v SGE_MPI_TASKS_PER_NODE='n' job_script

The number of MPI tasks per node can also be specified by including the following line in the script:

#$ -v SGE_MPI_TASKS_PER_NODE='n'

To run with 'n' total MPI tasks using N CPUs:

qsub -pe mpi_4hr_ibm_p4_dnode N -v SGE_TOTAL_MPI_TASKS='n' job_script

The total number of MPI tasks can also be specified by including the following line in the script:

#$ -v SGE_TOTAL_MPI_TASKS='n'

Embedding qsub options in a script

GE supports the embedding of qsub options in the job script itself. This is done by using #$ as the first two characters in a line in your job script, followed by a valid qsub option. GE will read these options when the job is submitted and treat them as if you had specified them on the command line or in the GUI. This eliminates the need to type in the options each time a job is submitted, and it also provides a written record.
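
For instance, a minimal sketch of a script with embedded options, reusing the 24-hour serial IBM job from the earlier example (paths and file names are illustrative), is:

#!/bin/csh
#$ -l ibmp4
#$ -l 24hr
set TMPD=/usr/var/tmp/$LOGNAME/$JOB_ID
mkdir -p $TMPD
cp input_tarfile $TMPD
cd $TMPD
tar xf input_tarfile
./a.out
tar cf output_tarfile output1
cp output_tarfile /archive/army/thompson

With the options embedded, the job can be submitted simply as "qsub job_script".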

Below is an example job script that incorporates qsub -l options. If you specify resources in the script, be sure not to specify them again on the qsub command line, because incompatible resource requests will prevent your job from starting.