Pegasus Cluster Job Scheduling

Pegasus currently uses the LSF resource manager to schedule all compute resources. LSF (Load Sharing Facility) supports over 1500 users and over 200,000 simultaneous job submissions. Jobs are submitted to queues, the software categories we define in the scheduler to organize work more efficiently. LSF distributes jobs submitted by users to compute nodes according to queue, user priority, and available resources.

Pegasus queues are organized using limits like job size, job length, job purpose, and project. In general, users run jobs on Pegasus with equal resource shares. The more resources a user has consumed, currently or recently, the lower the priority LSF assigns to new jobs from that user’s account. Parallel jobs are more difficult to schedule because they are inherently larger. Serial jobs can “fit into” the gaps left by larger jobs if they request short enough run time limits and few enough processors.

You may compile and test jobs on login nodes. However, any jobs exceeding 30 minutes of run time or using excessive resources on the login nodes will be terminated and the CCS account responsible for those jobs may be suspended.

LSF Batch System

LSF 9.1.1 Documentation

Batch jobs are self-contained programs that require no intervention to run. Batch jobs are defined by resource requirements such as how many cores, how much memory, and how much time they need to complete. A script file is one way to tell LSF your job requirements.

Common LSF commands and descriptions:

Command            Purpose
bsub < ScriptFile  Submits a job via script file to LSF to be run. NOTE: the redirection symbol, “<”, is required when submitting the job.
bjobs              Displays your running and pending jobs.
bhist              Displays historical information about your finished jobs.
bkill              Removes/cancels a job or jobs from the queue.
bqueues            Shows the current configuration of queues.
bhosts             Shows the load on each node.
bpeek              Displays stderr and stdout from your unfinished job.

The command bsub < ScriptFile submits the given script for processing. For more information about flags, type bsub -h at the Pegasus prompt. More detailed information can be displayed with man bsub. You must write a script containing the information LSF needs to allocate the resources your job requires, handle standard I/O streams, and run the job. Please see the example scripts below. On submission, LSF returns the job ID, which can be used to keep track of your job.

[username@pegasus ~]$ bsub < test.job
Job <4225> is submitted to queue <general>.
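
Scripts that submit jobs often need the job ID for later bjobs, bpeek, or bkill calls. The following is a minimal sketch of one way to pull it out of the confirmation message. The helper name extract_job_id is hypothetical, and the parsing assumes only that the message begins with "Job <ID>" as in the output shown here.

```shell
#!/bin/bash
# Hypothetical helper: read bsub's confirmation message on stdin and
# print the numeric job ID found between the first pair of angle brackets.
extract_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Demonstration with a sample confirmation line (no scheduler required).
sample='Job <4225> is submitted to queue <general>.'
jobid=$(printf '%s\n' "$sample" | extract_job_id)
echo "Submitted job ID: $jobid"

# On the cluster, the same helper could be used as:
#   jobid=$(bsub < test.job | extract_job_id)
#   bjobs -l "$jobid"
```

Lines that do not match the expected message produce no output, so an empty jobid is a reasonable signal that the submission did not succeed.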

The command bjobs displays information about your own pending, running, and suspended jobs.

[username@pegasus ~]$ bjobs
JOBID  USER     STAT  QUEUE    FROM_HOST  EXEC_HOST  JOB_NAME  SUBMIT_TIME
4225   usernam  RUN   general  m1         16*n060    testjob   Mar  2 11:53
                                          16*n061
                                          16*n063
                                          16*n064

For details about your particular job, issue the command bjobs -l jobID where jobID is obtained from the JOBID field of the above bjobs output. To display a specific user’s jobs, use bjobs -u username. To display all user jobs in paging format, pipe output to less:

[username@pegasus ~]$ bjobs -u all | less
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5990529   axt651  RUN   interactiv login4.pega n002        bash       Feb 13 15:23
6010636   zxh69   RUN   general    login4.pega 16*n178     *acsjob-01 Feb 23 11:36
                                               16*n180
                                               16*n203
                                               16*n174
6014246   swishne RUN   interactiv n002.pegasu n002        bash       Feb 24 14:10
6017561   asingh  PEND  interactiv login4.pega             matlab     Feb 25 14:49
...
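
When the listing is long, a per-state summary can be easier to read than the full table. The sketch below counts jobs by their STAT column. The helper names count_by_state and sample_bjobs are hypothetical, and the sample input is copied from the listing above so the example runs without LSF.

```shell
#!/bin/bash
# Hypothetical helper: count jobs per STAT value in bjobs output read from stdin.
# Only lines whose first field is a numeric job ID are counted, so the header
# and the continuation lines that list extra execution hosts are skipped.
count_by_state() {
    awk '$1 ~ /^[0-9]+$/ { n[$3]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input taken from the bjobs listing above.
sample_bjobs() {
cat <<'EOF'
JOBID     USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
5990529   axt651  RUN   interactiv login4.pega n002        bash       Feb 13 15:23
6010636   zxh69   RUN   general    login4.pega 16*n178     *acsjob-01 Feb 23 11:36
                                               16*n180
6014246   swishne RUN   interactiv n002.pegasu n002        bash       Feb 24 14:10
6017561   asingh  PEND  interactiv login4.pega             matlab     Feb 25 14:49
EOF
}

sample_bjobs | count_by_state

# On the cluster: bjobs -u all | count_by_state
```

For the sample input this reports one pending job and three running jobs.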

bhist displays information about your recently finished jobs. CPU time is not normalized in bhist output. To see your finished and unfinished jobs, use bhist -a.

By default, bkill kills the last job submitted by the user running the command. The command bkill jobID removes a specific job from the queue and terminates the job if it is running. bkill 0 kills all jobs belonging to the current user.

[username@pegasus ~]$ bkill 4225
Job <4225> is being terminated

On Pegasus (Unix), SIGINT and SIGTERM are sent to give the job a chance to clean up before termination, then SIGKILL is sent to kill the job.
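
A job script can use that window to remove temporary files. The following is a minimal sketch of signal handling in bash; the scratch directory and the cleanup function name are illustrative assumptions, not part of any Pegasus-provided template.

```shell
#!/bin/bash
# Sketch of signal handling inside a job script. The scratch directory and
# the cleanup function name are illustrative assumptions.
scratch=$(mktemp -d)

cleanup() {
    # Remove temporary files; safe to call more than once.
    rm -rf "$scratch"
}

# Run cleanup whenever the script exits, and exit promptly on
# SIGINT/SIGTERM so the script finishes before SIGKILL arrives.
trap cleanup EXIT
trap 'exit 130' INT
trap 'exit 143' TERM

# ... the job's real work would go here, writing temporary data to "$scratch" ...
echo "working in $scratch"
```

Because the EXIT trap also fires when the INT or TERM handler calls exit, the cleanup logic lives in one place. Note that bash runs a trap handler only after the foreground command it is waiting on returns.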

bqueues displays information about queues such as queue name, queue priority, queue status, job slot statistics, and job state statistics. CPU time is normalized by CPU factor.

[username@pegasus ~]$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
bigmem          500  Open:Active       -   16    -    -  1152  1120    32     0
visx            500  Open:Active       -    -    -    -     0     0     0     0
hihg            500  Open:Active       -    -    -    -     0     0     0     0
hpc             300  Open:Active       -    -    -    -  2561  1415  1024     0
debug           200  Open:Active       -    -    -    -     0     0     0     0
gpu             200  Open:Active       -    -    -    -     0     0     0     0
...
general         100  Open:Active       -    -    -    -  9677  5969  3437     0
interactive      30  Open:Active       -    4    -    -    13     1    12     0

bhosts displays information about all hosts such as host name, host status, job state statistics, and job slot limits. bhosts -s displays information about numeric resources (shared or host-based) and their associated hosts. bhosts hostname displays information about an individual host and bhosts -w displays more detailed host status. closed_Full means the configured maximum number of running jobs has been reached: running jobs are not affected, but no new jobs will be assigned to this host.

[username@pegasus ~]$ bhosts -w | less
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
n001               ok              -     16     14     14      0      0      0
n002               ok              -     16      4      4      0      0      0
...
n342               closed_Full     -     16     16     12      0      0      4
n343               closed_Full     -     16     16     16      0      0      0
n344               closed_Full     -     16     16     16      0      0      0
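
The MAX and NJOBS columns can be combined to see which hosts still have open job slots. The sketch below does this with awk. The helper names free_slots and sample_bhosts are hypothetical, and the sample input is a subset of the listing above.

```shell
#!/bin/bash
# Hypothetical helper: print "host free_slots" for every host in bhosts output
# (read from stdin) whose STATUS is "ok" and whose MAX exceeds NJOBS.
free_slots() {
    awk '$2 == "ok" && $4 ~ /^[0-9]+$/ && ($4 + 0) > ($5 + 0) { print $1, $4 - $5 }'
}

# Sample input taken from the bhosts listing above.
sample_bhosts() {
cat <<'EOF'
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV
n001               ok              -     16     14     14      0      0      0
n002               ok              -     16      4      4      0      0      0
n342               closed_Full     -     16     16     12      0      0      4
n343               closed_Full     -     16     16     16      0      0      0
EOF
}

sample_bhosts | free_slots

# On the cluster: bhosts -w | free_slots
```

For the sample input, n001 has 2 free slots and n002 has 12; the closed_Full hosts are omitted.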

Use bpeek jobID to monitor the progress of a job and identify errors. If errors are observed, valuable user time and system resources can be saved by terminating an erroneous job with bkill jobID. By default, bpeek displays the standard output and standard error produced by one of your unfinished jobs, up to the time the command is invoked. bpeek -q queuename operates on your most recently submitted job in that queue and bpeek -m hostname operates on your most recently submitted job dispatched to the specified host. bpeek -f jobID displays live output from a running job and can be stopped with Ctrl-C.

Example script for a serial job

#!/bin/bash
#BSUB -J serialjob
#BSUB -P myproject
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 1:00
#BSUB -q general
#BSUB -n 1
#BSUB -R "rusage[mem=512]"
#BSUB -B
#BSUB -N
#BSUB -u example@miami.edu
#
# Run serial executable on 1 cpu of one node
# cd /nethome/jbaringer/example
cd ${HOME}/path/to/current/directory
./test.x a b c

Here is a detailed line-by-line breakdown of the keywords and their assigned values listed in this script:

#!/bin/bash
Specifies the shell to be used when executing the command portion of the script.
The default is Bash shell.

#BSUB -J serialjob
Assigns a name to the job. The name of the job will show in the bjobs output.

#BSUB -P myproject
Specifies the project to use when submitting the job. This is required when a user has more than one project on Pegasus.

#BSUB -o %J.out
Redirects standard output to the specified file; %J is replaced by the job ID.

#BSUB -e %J.err
Redirects standard error to the specified file.

#BSUB -W 1:00
Sets a wall clock run time limit of 1 hour. Without this option, the queue-specific default run time limit is applied.

#BSUB -q general
Specifies the queue to be used. Without this option, the default 'general' queue is applied.

#BSUB -n 1
Specifies the number of processors. In this job, a single processor is requested.

#BSUB -R "rusage[mem=512]"
Specifies that this job requests 512 megabytes of RAM. Without this option, a default RAM setting of 1.5 GB is applied.

#BSUB -B
Sends mail to the specified email address when the job is dispatched and begins execution.

#BSUB -u example@miami.edu
Sends notifications through email to example@miami.edu.

#BSUB -N
Sends a job statistics report through email when the job finishes.
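
When many similar serial jobs are needed, the directives described above can be filled in from a few parameters. The following is a sketch only: the function name make_serial_job and its arguments (job name, run time in minutes, memory in megabytes) are hypothetical, and the executable line is the placeholder from the example script.

```shell
#!/bin/bash
# Hypothetical generator: print a serial job script to stdout.
# Arguments: job name, run time limit in whole minutes, memory in MB.
make_serial_job() {
    local name=$1 minutes=$2 mem_mb=$3
    local walltime
    # Convert minutes to the H:MM form used with -W in the example script.
    walltime=$(printf '%d:%02d' $((minutes / 60)) $((minutes % 60)))
    cat <<EOF
#!/bin/bash
#BSUB -J $name
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W $walltime
#BSUB -q general
#BSUB -n 1
#BSUB -R "rusage[mem=$mem_mb]"
./test.x a b c
EOF
}

# Print a script equivalent to the 1-hour, 512 MB example.
make_serial_job serialjob 60 512

# On the cluster the output could be submitted directly, since bsub reads
# the script from standard input:
#   make_serial_job serialjob 60 512 | bsub
```

Generating the script on the fly keeps the run time and memory requests next to the code that chooses them, at the cost of not having a job file on disk to inspect later.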

Example scripts for parallel jobs

Example script for Intel/Intel MPI

#!/bin/bash
#BSUB -J mpijob
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 1:30
#BSUB -q general
#BSUB -n 32
#BSUB -R "span[ptile=16]"
#

mpiexec foo.exe

foo.exe is the name of the MPI executable. It can be followed by its own argument list.

The ptile=16 argument requires the LSF job scheduler to allocate 16 processors per host. For optimum performance, all MPI jobs on Pegasus should use this flag to make sure all processors on a single host are used by this job. Otherwise, other jobs may be assigned to the same host. Parallel job performance may be affected, or even interrupted, by other badly-configured jobs running on the same host.
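
With -n 32 and ptile=16, the job occupies 32 / 16 = 2 whole hosts. A quick check before submitting can confirm that the processor count is a multiple of the per-host value. The sketch below is illustrative; the function name hosts_needed is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: given the -n value and the ptile value, print the
# number of hosts the job will span, and warn on stderr when the processor
# count does not divide evenly into whole hosts.
hosts_needed() {
    local slots=$1 ptile=$2
    if [ $((slots % ptile)) -ne 0 ]; then
        echo "warning: $slots is not a multiple of $ptile; the job will not fill whole hosts" >&2
    fi
    # Round up so a partly used host is still counted.
    echo $(( (slots + ptile - 1) / ptile ))
}

hosts_needed 32 16   # prints 2
```

Choosing -n as a multiple of 16 keeps every allocated host fully occupied by the job, which is the situation the ptile recommendation above aims for.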

Example script for OpenMPI

#!/bin/bash
#BSUB -J mpijob
#BSUB -o %J.out
#BSUB -e %J.err
#BSUB -W 1:30
#BSUB -q general
#BSUB -n 32
#BSUB -R "span[ptile=16]"
#
mpiexec --mca btl self,sm,openib foo.exe

The command line is similar to the Intel MPI job above. The option “--mca btl self,sm,openib” tells OpenMPI to use loopback, shared memory, and openib for inter-process communication.

Running An Interactive Job

HPC clusters primarily take batch jobs and run them in the “background”; users do not need to interact with the job during the execution. However, sometimes users do need to interact with the application; for example, the application may need input from the command line or wait for a mouse event in X windows. Use the bsub options -Is -q interactive for an interactive job, for example:

$ bsub -Is -q interactive matlab -nodisplay

or

$ bsub -Is -q interactive -XF java -jar ~/.local/apps/ImageJ/ij.jar -batch ~/.local/apps/ImageJ/macros/screenmill.txt

Additionally, the interactive queue can run X11 jobs. The bsub -XF option is used for X11 jobs, for example:

$ bsub -q interactive -Is -XF matlab
Job <50274> is submitted to queue <interactive>.
<<ssh X11 forwarding job>>
<<Waiting for dispatch ...>>
<<Starting on n003.pegasus.edu>> 

Upon exiting the interactive job, you will be returned to one of the login nodes. If you are running an X11 application, you will need to establish an X tunnel with ssh when connecting to Pegasus. For example,

ssh -X user@pegasus.ccs.miami.edu
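
When X forwarding is active, ssh sets the DISPLAY environment variable in the login session. A small check before submitting an X11 job can catch a missing tunnel early. This is a sketch under that assumption; the function name x11_ready is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only when DISPLAY is set and non-empty,
# which is what a session opened with "ssh -X" normally provides.
x11_ready() {
    [ -n "${DISPLAY:-}" ]
}

if x11_ready; then
    echo "DISPLAY is set to $DISPLAY; X11 jobs can be submitted with bsub -XF"
else
    echo "DISPLAY is not set; reconnect with ssh -X before using bsub -XF" >&2
fi
```

A non-empty DISPLAY does not guarantee that the tunnel works end to end, so this check is a first filter rather than a full test.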