Running Production Jobs

Overview of the Load Leveler Batch System

All production jobs on the IBM cluster are run using the LoadLeveler batch system. Interactive jobs (Ares class “pdebug”) may be run to debug a problem; batch jobs, by contrast, are controlled via scripts. These scripts tell the system which resources a job will require and for how long, and are then submitted to the LoadLeveler queue manager for processing. LoadLeveler uses the name “class” instead of “queue”, and a job is submitted directly to its destination class. Be certain that the specific resources requested (such as number of nodes and runtime hours) are within the limits offered by the class. For a complete list of available classes, see the llclass output below. Table 4 below lists frequently used LoadLeveler commands:

Table 4: Common LoadLeveler commands and their purpose
Command    Purpose
llclass    Lists LoadLeveler classes (aka queues) for the system. (see also “llclass -H”)
llsubmit   Submits a job to a LoadLeveler class to be run. (see also “llsubmit -H”)
llq        Displays jobs in the classes: running, waiting, held, etc. (see also “llq -H”)
llcancel   Removes a job or jobs from the class. (see also “llcancel -H”)
llstatus   Displays the status of one or more nodes on the system. (see also “llstatus -H”)

Example Usage:

The command llclass lists available classes:


[user@ares_n01]:>llclass
Name                 MaxJobCPU     MaxProcCPU  Free   Max Description          
                    d+hh:mm:ss     d+hh:mm:ss Slots Slots                      
--------------- -------------- -------------- ----- ----- ---------------------
pinteractive          08:00:00       00:30:00    11   284 1 node / 16 procs Max
pdebug                08:00:00       00:30:00    11   564 1 node / 16 procs Max
plarge              3+22:00:00       04:00:00    11   568 35 nodes / 560 procs Max 
psmall              4+00:00:00     1+00:00:00    11   568 6 Nodes / 96 Procs Max
pmedium             5+08:00:00       12:00:00    11   568 16 nodes / 256 procs Max
plongrun            6+00:00:00     3+00:00:00    11   568 3 node / 48 procs Max
popenmp            32+00:00:00      undefined     7   142 1 node / 16 procs Max
--------------------------------------------------------------------------------
"Free Slots" values of the classes "pinteractive", "pdebug", "plarge", "psmall", 
"pmedium", "plongrun", "popenmp" are constrained by the MAX_STARTERS limit(s).

All available classes (pinteractive, pdebug, plarge, psmall, pmedium, plongrun, and popenmp) are listed along with a description and resource limits for each of them. For example, class “psmall” allows at most 6 nodes (96 processors), class “pmedium” allows at most 16 nodes (256 processors), and class “plarge” allows at most 35 nodes (560 processors); the MaxJobCPU and MaxProcCPU columns give the corresponding CPU-time limits.

The command llsubmit script submits the given LoadLeveler script for processing.


[user@ares_n01]:>llsubmit test.job 
llsubmit: The job "ares_n01.ccs.miami.edu.487" has been submitted. 

The command llq will show all jobs currently running or queued on the system.


[user@ares_n01]:>llq
Id                       Owner      Submitted   ST PRI Class        Running on
------------------------ ---------- ----------- -- --- ------------ -----------
ares_n02.50.0            bkirtman    8/26 09:06 R  50  small        ares_n14
ares_n01.482.0           mmagaldi    8/26 09:22 R  50  small        ares_n04
ares_n01.483.0           milicak     8/26 11:58 R  50  small        ares_n10
ares_n01.471.0           jming       8/24 11:08 I  50  test
ares_n01.487.0           ashwanth    8/26 17:26 I  50  test
ares_n01.483.1           milicak     8/26 11:58 NQ 50  small

6 job step(s) in queue, 2 waiting, 0 pending, 3 running, 1 held, 0 preempted

For details about your particular job, issue the command llq -l jobid, where jobid is obtained from the “Id” field of the llq output. To remove a job, issue the command llcancel jobid. This command removes the job from the class and terminates the job if it is running.
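For example, the long listing for the job submitted earlier could be requested as follows (the long-format output is lengthy and is not reproduced here):


[user@ares_n01]:>llq -l ares_n01.487.0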


[user@ares_n01]:>llcancel 487 
llcancel: Cancel command has been sent to the central manager. 

You must write a script containing the information LoadLeveler needs to allocate the resources your job requires, to handle standard I/O streams, and to run the job. Please see the example scripts below.

An example script for an MPI Job


    #!/bin/ksh
    # @ output = $(executable).$(jobid).out
    # @ error = $(executable).$(jobid).err
    # @ environment = COPY_ALL
    # @ notification  = never
    # @ network.mpi = sn_single,shared,US
    # @ node_usage = shared
    # @ wall_clock_limit = 00:30:00
    # @ job_type = parallel
    # @ node = 2
    # @ tasks_per_node = 8
    # @ class = psmall
    # @ queue

    ./a.out

Here is a line-by-line breakdown of the keywords and their assigned values listed in this MPI script:

#!/bin/ksh

Specifies the shell to be used when executing the command portion of the script. The default is the Korn shell.

#@ output = $(executable).$(jobid).out

Standard output from this job will be written to the specified file. The output filename is generated at runtime according to the definition given. For example, if the LoadLeveler script is named myjob, using the above format will create myjob.123.out as the output file (where 123 is the jobid).

#@ error = $(executable).$(jobid).err

Standard error generated during runtime is written to a file different from the output file using the same syntax as described above.

#@ environment = COPY_ALL

Sets the given variables in the environment of the job. A number of variables can be specified this way. COPY_ALL specifies that all the environment variables from your shell be copied to the job.
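For instance, assuming your LoadLeveler version accepts mixed entries (COPY_ALL alongside an explicit assignment; MY_DATA_DIR here is a hypothetical variable), the keyword could look like:

    # @ environment = COPY_ALL; MY_DATA_DIR=/scratch/user/data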

#@ notification = never

LoadLeveler will “never” send email notification of events regarding this job. Other notification options include “always”, “error”, and “start”.

#@ network.mpi = sn_single,shared,US

Sets the communication protocol and network. For MPI jobs on Ares, you should always use these settings.

#@ node_usage = shared

Specifies that, once allocated, the nodes assigned to your job may be shared with other jobs. Sharing nodes significantly improves overall system efficiency; the alternative, node_usage = not_shared, reserves the nodes exclusively for your job.

#@ wall_clock_limit = 00:30:00

Specifies the wall clock time requirements of your job, in HH:MM:SS format. This value must fall within the limits of the specified class (see llclass for current limits). Be aware that the specified wall_clock_limit must be long enough for the job to complete; jobs with shorter wall_clock_limits have a better chance of starting sooner (via backfilling) after waiting in the class.
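LoadLeveler limit keywords can also take a hardlimit,softlimit pair; as an illustrative sketch (verify against your installation):

    # @ wall_clock_limit = 00:30:00,00:25:00   # hard limit, soft limit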

#@ job_type = parallel

The parallel job type should always be used with MPI jobs.

#@ node = 2

Tells LoadLeveler that your job requires 2 nodes. This request must fall within the limits of the class with which the job will be run.

#@ tasks_per_node = 8

The product of tasks_per_node and the node value determines the total number of MPI processes to be created. Often, tasks_per_node will equal the number of physical processors per node; however, other scenarios exist. For instance, a hybrid MPI/OpenMP code might run with one MPI process per node and spawn enough OpenMP threads per node to use all of the node's processors.
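As a sketch of the hybrid scenario just described (assuming a.out was built with both MPI and OpenMP and that eight cores per node are usable), the header might look like:


    #!/bin/ksh
    # Hypothetical hybrid MPI/OpenMP layout: one MPI task per node,
    # with OpenMP threads filling the rest of each node.
    # @ output = $(executable).$(jobid).out
    # @ error = $(executable).$(jobid).err
    # @ network.mpi = sn_single,shared,US
    # @ wall_clock_limit = 00:30:00
    # @ job_type = parallel
    # @ node = 2
    # @ tasks_per_node = 1
    # @ class = psmall
    # @ queue

    export OMP_NUM_THREADS=8   # threads per MPI task (assumed core count)
    ./a.out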

#@ class = psmall

Specifies the LoadLeveler class where the job will be run.

#@ queue

Required as the final LoadLeveler specification. This keyword queues your job. Any LoadLeveler keywords following #@ queue are ignored.

./a.out

Runs the executable a.out from the specified path. In this case the path is the current working directory.
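If the executable should instead run from the directory where llsubmit was invoked, one approach (assuming the standard LoadLeveler environment variable LOADL_STEP_INITDIR, which holds the submission directory) is:

    # LOADL_STEP_INITDIR is set by LoadLeveler to the submission directory
    cd $LOADL_STEP_INITDIR
    ./a.out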

An example script for an OpenMP Job


    #!/bin/ksh

    #@ output = $(executable).$(jobid).out
    #@ error  = $(executable).$(jobid).err
    #@ environment = COPY_ALL
    #@ notification  = never
    #@ node_usage = shared
    #@ wall_clock_limit = 02:30:00
    #@ job_type = parallel
    #@ node = 1
    #@ tasks_per_node = 1
    #@ class = psmall
    #@ queue

    export OMP_NUM_THREADS=8

    ./a.out

Here is a line-by-line breakdown of the sample OpenMP script:

#!/bin/ksh

Specifies the shell to be used when executing the command portion of the script. The default is the Korn shell.

#@ output = $(executable).$(jobid).out

Standard output from this job will be written to the specified file. The output filename is generated at runtime according to the definition given. For example, if the LoadLeveler script is named myjob, using the above format will create myjob.123.out as the output file (where 123 is the jobid).

#@ error = $(executable).$(jobid).err

Standard error generated during runtime is written to a file different from the output file using the same syntax as described above.

#@ environment = COPY_ALL

Sets the given variables in the environment of the job. A number of variables can be specified this way. COPY_ALL specifies that all the environment variables from your shell be copied to the job.

#@ notification = never

LoadLeveler will “never” send email notification of events regarding this job. Other notification options include “always”, “error”, and “start”.

#@ node_usage = shared

Specifies that, once allocated, the nodes assigned to your job may be shared with other jobs. Sharing nodes significantly improves overall system efficiency; the alternative, node_usage = not_shared, reserves the nodes exclusively for your job.

#@ wall_clock_limit = 02:30:00

Specifies the wall clock time requirements for your job, in HH:MM:SS format. This value must fall within the limits of the specified class (see llclass for current limits). Be aware that the specified wall_clock_limit must be long enough for the job to complete; jobs with shorter wall_clock_limits have a better chance of starting sooner (via backfilling) after waiting in the class.

#@ job_type = parallel

The serial or parallel job types can be used with threaded jobs.

#@ node = 1

Tells LoadLeveler that your job requires 1 node.

#@ tasks_per_node = 1

The tasks_per_node keyword indicates the number of processes to be started on the node. Pure OpenMP jobs have one process with multiple threads. Note that this example uses node_usage = shared; we recommend running OpenMP jobs with node_usage = not_shared on Ares so the threads do not compete with other jobs for the node's processors.
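Reflecting that recommendation, the relevant header lines would change to (illustrative):

    # @ node_usage = not_shared
    # @ node = 1
    # @ tasks_per_node = 1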

#@ class = psmall

Specifies the LoadLeveler class where the job will be run.

#@ queue

Required as the final LoadLeveler specification. This keyword queues your job. Any LoadLeveler keywords following #@ queue are ignored.

export OMP_NUM_THREADS=8

Sets the number of OpenMP threads used to 8.

./a.out

Runs the executable a.out.
Note: for the job_type keyword, single-process and SMP (OpenMP) programs count as serial; MPI programs are parallel.
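For completeness, here is a minimal serial job sketch built from the same keywords (illustrative; adjust the class and limits as needed):


    #!/bin/ksh
    # Minimal serial job: a single process, no MPI, no threads.
    # @ output = $(executable).$(jobid).out
    # @ error  = $(executable).$(jobid).err
    # @ notification = never
    # @ wall_clock_limit = 00:10:00
    # @ job_type = serial
    # @ class = psmall
    # @ queue

    ./a.out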

Job Chaining and Job Steps

A job chain is constructed by a job command file, which calls llsubmit at some stage of a running job to submit the next job of the chain. The command file of the next job in the chain may be constructed by the submitting job based on the results of the current job. An example is shown below:


   #!/bin/ksh
   #@ output = $(executable).$(jobid).out
   #@ error = $(executable).$(jobid).err
   #@ environment = COPY_ALL; $PATH
   #@ notification  = never
   #@ node_usage = shared
   #@ wall_clock_limit=3600
   #@ job_type = serial
   #@ class = psmall
   #@ queue

    ./a.out

    llsubmit nextJob.cmd

Job chains are potentially dangerous if there is no mechanism to prevent endless chains or to handle exceptions, such as an accidental abort of the production commands executed by a job.
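One simple safeguard, sketched below, stops the chain when the production command fails and caps the total number of links (the chain.count file and the limit of 10 are illustrative):


    # Stop the chain immediately if the production command fails.
    ./a.out || exit 1

    # Cap the chain length: read a counter (0 if the file is absent) and
    # resubmit only while it is below the illustrative limit of 10.
    count=$(cat chain.count 2>/dev/null || echo 0)
    if [ "$count" -lt 10 ]; then
        echo $((count + 1)) > chain.count
        llsubmit nextJob.cmd
    fi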

An alternative to job chaining is job stepping, which specifies multiple job steps within a single script. Job stepping is used when individual resource requests (e.g., node and tasks_per_node) or job actions change during the execution of a sequence of distinct but interrelated jobs. The exit status of a previous job step can be used to influence the execution of a subsequent step, or the steps can be run independently. Job steps are set up with two LoadLeveler keywords: #@ step_name, the name of a job step, and #@ dependency, used to specify dependencies between job steps based on the return values and other attributes of previously run steps.

Here is an example script of a multi-step job which does the following:

  • Pre-processes input data using the ‘psmall’ class.
  • Runs the analysis using the ‘pmedium’ class.
  • Post-processes outputs using the ‘psmall’ class.

#!/usr/bin/ksh
# @ step_name        = prep
# @ environment      = COPY_ALL; $PATH
# @ error            = $(executable).$(jobid).$(stepid).err
# @ output           = $(executable).$(jobid).$(stepid).out
# @ notification     = error
# @ job_type         = serial
# @ class            = psmall
# @ queue
#
# @ dependency       = ( prep == 0 )
# @ step_name        = analysis
# @ environment      = COPY_ALL; $PATH
# @ error            = $(executable).$(jobid).$(stepid).err
# @ output           = $(executable).$(jobid).$(stepid).out
# @ notification     = error
# @ job_type         = parallel
# @ node_usage       = not_shared
# @ node             = 2
# @ tasks_per_node   = 8
# @ network.MPI      = sn_all,shared,us
# @ class            = pmedium
# @ wall_clock_limit = 4:00:00
# @ queue
#
# @ dependency       = ( analysis == 0 )
# @ step_name        = post
# @ environment      = COPY_ALL; $PATH
# @ error            = $(executable).$(jobid).$(stepid).err
# @ output           = $(executable).$(jobid).$(stepid).out
# @ notification     = error
# @ job_type         = serial
# @ class            = psmall
# @ queue

# The environment variable "$LOADL_STEP_NAME" is set by loadleveler.
# The value of this variable is identical to the value of the loadleveler
# keyword 'step_name' for a particular step.  We use this variable to
# control which portion of the script gets executed for each step.
#

case "$LOADL_STEP_NAME" in
    "prep")
        # Preprocessing: Run the prep.com in the $WORKDIR to create input files
        #
        cd $WORKDIR/
        ./prep.com
        ;;
    "analysis")
        # analysis: Run the application in $WORKDIR and create output files.
        #
        cd $WORKDIR/
        ./run_analysis.com
        ;;
    "post")
        # Postprocessing: Run the post.com in $WORKDIR to postprocess output.
        #
        cd $WORKDIR/
        ./post.com
        ;;
esac

## END OF FILE ########

Notice that the preprocessing, computation, and postprocessing are all done by a single script, whereas traditional job chaining would require 3 scripts. When the script is submitted, 3 steps appear in the queue as expected. While the first step (prep) is running, the second and third steps wait in the NOT QUEUED state. If the first step exits successfully, the next step (analysis) will begin running; however, if the first step fails, the next two steps will not run because the dependencies cannot be met.
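Dependency expressions are not limited to equality tests. As an illustrative variation (verify the expression syntax against your LoadLeveler version), a cleanup step could run after ‘analysis’ completes regardless of its exit code:

# @ dependency       = ( analysis >= 0 )
# @ step_name        = cleanup
# @ job_type         = serial
# @ class            = psmall
# @ queue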