SLURM Job Scheduler


Overview

Resource management and load balancing on the GPC are handled by its job scheduler. Running a batch job on the GPC begins with creating a wrapper script, which is then submitted to one of the partitions. Prepared wrapper-script templates for some of the most popular software packages on the GPC are available in /data/shared/scripts once you are logged in. These example scripts are currently still written for the previous scheduler, SGE, but can be adapted to SLURM by following the SGE to SLURM command conversions below.
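In outline, the workflow looks like this (the wrapper name my_job.sh is only a placeholder):

# Submit a wrapper script to the scheduler (defaults to the normal partition)
sbatch my_job.sh

# Check the status of your queued and running jobs
squeue -u $USER

# Cancel a job that is no longer needed
scancel <job_ID>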

Helpful Links

SLURM quickstart guide: https://slurm.schedmd.com/quickstart.html

SLURM command summary PDF: https://slurm.schedmd.com/pdfs/summary.pdf

SLURM manual pages: https://slurm.schedmd.com/man_index.html

Partitions

The following partitions are available on the GPC:

Available Partitions

Queue    | Usage
normal   | General jobs; 192GB memory available per node.
highmem  | High-memory jobs; 256GB memory available per node.
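To check the current state of these partitions and their nodes, you can query the scheduler directly; for example:

# List all partitions and node states
sinfo

# Show only the highmem partition
sinfo --partition=highmem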

 

Interactive jobs can be run using one srun session at a time per user (default normal partition: "srun --pty bash"; highmem: "srun --partition=highmem --pty bash"), and each session can run for no more than 24 hours. Users who need GPC resources for longer than 24 hours should submit a batch job to the scheduler using the instructions on this page. To use the highmem partition in a batch job, add this line to your job wrapper: #SBATCH -p highmem
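If your interactive work needs more than the default resources, the same options used in batch wrappers can be passed to srun on the command line; a sketch (adjust the partition and values to your needs):

srun --partition=highmem --cpus-per-task=4 --mem=8gb --time=04:00:00 --pty bash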

 

Example Single-Thread Job Wrapper

#!/bin/bash
#SBATCH -J serial_job                    # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>      # Where to send mail
#SBATCH --ntasks=1                       # Run a single task; defaults to a single CPU
#SBATCH --mem=1gb                        # Job memory request per node
#SBATCH --time=10:00:00                  # Time limit hrs:min:sec
#SBATCH -o test.%j.out                   # Standard output to current dir
#SBATCH -e test.%j.err                   # Error output to current dir

# Enable Additional Software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram

Note: The --mem flag specifies the maximum memory per node. There are other ways to specify memory, such as --mem-per-cpu; use only one of them so they do not conflict.

 

 

Example Multi-Thread Job Wrapper

Note: The job must support multithreading through a library such as OpenMP, and you must have it loaded via the appropriate module.

#!/bin/bash
#SBATCH -J parallel_job                  # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>      # Where to send mail
#SBATCH --nodes=1                        # Run all processes on a single node
#SBATCH --ntasks=1                       # Run a single task
#SBATCH --cpus-per-task=4                # Number of CPU cores per task
#SBATCH --mem=1gb                        # Job memory request per node
#SBATCH --time=10:00:00                  # Time limit hrs:min:sec
#SBATCH -o test.%j.out                   # Standard output to current dir
#SBATCH -e test.%j.err                   # Error output to current dir

# When using OpenMP, you may need to set this environment variable.
# Best OpenMP performance is typically with this set to 1 or equal to
# --cpus-per-task, depending on your particular program's implementation.
export OMP_NUM_THREADS=4

# Enable Additional Software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram

Note: The --mem flag specifies the maximum memory per node. There are other ways to specify memory, such as --mem-per-cpu; use only one of them so they do not conflict.

Note: In this OpenMP example, we use 1 task with 4 CPU cores because OpenMP runs a job as one process with multiple threads. Some libraries, such as Python's multiprocessing, instead use multiple single-threaded processes; in that case 4 tasks with 1 CPU per task is the better request (see the sketch below).
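A minimal sketch of the resource request for that multi-process case (replacing the --nodes/--ntasks/--cpus-per-task lines above; OMP_NUM_THREADS is not needed for multi-process programs):

#SBATCH --nodes=1                        # Keep all processes on one node
#SBATCH --ntasks=4                       # Four single-threaded processes
#SBATCH --cpus-per-task=1                # One CPU core per process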

 

Example Multi-Node Job Wrapper

(Multiple single-threaded processes across multiple nodes)

Note: The job must support cross-node processes through libraries such as OpenMPI, and you must have those loaded via the appropriate module.

Note: This example uses 24 cores/threads in total (24 tasks, 1 CPU per task).

#!/bin/bash
#SBATCH -J parallel_multinode_job        # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>      # Where to send mail
#SBATCH --ntasks=24                      # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1                # Number of cores per MPI task
#SBATCH --nodes=2                        # Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12             # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6            # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic     # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb              # Memory per processor core
#SBATCH --time=10:00:00                  # Time limit hrs:min:sec
#SBATCH -o test.%j.out                   # Standard output to current dir
#SBATCH -e test.%j.err                   # Error output to current dir

# Enable Additional Software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
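How the 24 MPI ranks are actually launched depends on your program; many MPI applications are started through a launcher inside the wrapper rather than invoked directly. A sketch, assuming an MPI module such as OpenMPI is loaded:

# Launch one copy of the program per requested task across the allocation
srun ./myprogram
# or, with OpenMPI's own launcher:
# mpirun ./myprogram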

 

 

Example Multi-Node, Multi-Thread Job Wrapper

(Multiple multi-threaded processes across multiple nodes)

Note: The job must support cross-node multithreading through libraries such as OpenMPI and OpenMP, and you must have those loaded via the appropriate module.

Note: This example uses 32 cores/threads total (8 tasks, 4 cpus-per-task)

#!/bin/bash
#SBATCH -J parallel_multinode_job        # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>      # Where to send mail
#SBATCH --ntasks=8                       # Number of MPI ranks
#SBATCH --cpus-per-task=4                # Number of cores per MPI rank
#SBATCH --nodes=2                        # Number of nodes
#SBATCH --ntasks-per-node=4              # How many tasks on each node
#SBATCH --ntasks-per-socket=2            # How many tasks on each socket
#SBATCH --mem-per-cpu=600mb              # Memory per core
#SBATCH --distribution=cyclic:cyclic     # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --time=10:00:00                  # Time limit hrs:min:sec
#SBATCH -o test.%j.out                   # Standard output to current dir
#SBATCH -e test.%j.err                   # Error output to current dir

# Enable Additional Software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram

 

 

Example Array (Multiple Runs) Job Wrapper

#!/bin/bash
#SBATCH -J serial_array_job              # Job name
#SBATCH --mail-type=END,FAIL             # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<email_address>      # Where to send mail
#SBATCH --ntasks=1                       # Run a single task; defaults to a single CPU
#SBATCH --array=1-5                      # Run 5 iterations of the job
#SBATCH --mem=1gb                        # Job memory request per node
#SBATCH --time=10:00:00                  # Time limit hrs:min:sec
#SBATCH -o test.%j.out                   # Standard output to current dir
#SBATCH -e test.%j.err                   # Error output to current dir

# Enable Additional Software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram

Note: Array jobs use a slightly different job ID notation of the form JobID_ArrayID, such as 12345_1, 12345_2, etc.

Note: The maximum number of simultaneously running array tasks can be limited with the % delimiter; for example, this submits 1000 iterations, 5 at a time: sbatch --array=1-1000%5 testarray.sh

Full array job documentation: https://slurm.schedmd.com/job_array.html 
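Within each array task, SLURM sets the environment variable SLURM_ARRAY_TASK_ID, which is typically used to select per-task input. A minimal sketch (the input and output file names are hypothetical):

# Each array task processes its own numbered input file, e.g. input_1.dat ... input_5.dat
./myprogram -f input_${SLURM_ARRAY_TASK_ID}.dat -o output_${SLURM_ARRAY_TASK_ID}.dat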


Local Scratch Storage

Each compute node has ~1TB available for use as scratch space. I/O-intensive jobs can copy their data onto a node to speed up access during the job's runtime. This is beneficial when your program makes many small reads/writes, which would have high latency on shared network storage. Keep in mind that the initial copying of files into scratch adds some overhead, so it is not always the fastest choice; weigh local scratch against network storage on a per-job basis. The following can be added to your job wrapper to use scratch:

# Specify a few variables needed for this job
PROGRAM=$HOME/MyProgram/myprogram
DATASET=$HOME/MyProgram/datafile
SCRATCHDIR=/scratch/yourUsername/$SLURM_JOBID

# Check whether the scratch directory exists and create it as needed
if [[ ! -d "$SCRATCHDIR" ]]; then
    mkdir -p "$SCRATCHDIR"
fi

# Check whether the data is already in scratch and copy it as needed
if [[ ! -e "$SCRATCHDIR/datafile" ]]; then
    cp "$DATASET" "$SCRATCHDIR/datafile"
    cp "$PROGRAM" "$SCRATCHDIR/myprogram"
fi

# Navigate to the scratch dir
cd "$SCRATCHDIR"

# Run our job commands from within the scratch dir
./myprogram -f datafile -o outfile

# Copy the output from the job commands to your home directory,
# then delete the scratch dir
cp outfile "$HOME/data/outputfile.$SLURM_JOBID"
rm -rf "$SCRATCHDIR"

This job creates a scratch directory on the node it runs on, copies the data and program into that directory, runs the job from there, and copies the job output back to your network home directory. After the job completes, the temporary scratch directory is deleted.

 

 

SGE to SLURM Conversion

As of 2021, the GPC has switched from the SGE job scheduler to SLURM. Along with this come some new terms and a new set of commands: what were previously known as queues are now referred to as partitions, qsub is now sbatch, and so on. Please see the tables below for a 1:1 conversion guide between the SGE commands previously used on the GPC and the SLURM commands to use now.

 


Common job commands

Command                               | SGE                 | SLURM
Cluster status                        | -                   | sinfo
Job submission                        | qsub <job_script>   | sbatch <job_script>
Start an interactive job              | qlogin or qrsh      | srun <args> --pty bash
Job deletion                          | qdel <job_ID>       | scancel <job_ID>
Job status (all)                      | qstat               | squeue
Job status by job                     | qstat -j <job_ID>   | squeue -j <job_ID>
Job status by user                    | qstat -u <user>     | squeue -u <user>
Job status detailed                   | qstat -j <job_ID>   | scontrol show job <job_ID>
Show expected start time              | qstat -j <job_ID>   | squeue -j <job_ID> --start
Hold a job                            | qhold <job_ID>      | scontrol hold <job_ID>
Release a job                         | qrls <job_ID>       | scontrol release <job_ID>
Queue list / information              | qconf -sql          | scontrol show partition
Queue details                         | qconf -sq <queue>   | scontrol show partition <queue>
Node list                             | qhost               | scontrol show nodes
Node details                          | qhost -F <node>     | scontrol show node <node>
X forwarding                          | qsh <args>          | salloc <args> or srun <args> --pty
Monitor or review job resource usage  | qacct -j <job_ID>   | sacct -j <job_ID> (see the sketch after this table)
GUI                                   | qmon                | sview
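For reviewing a finished job's resource usage in more detail, sacct accepts a --format option listing the fields to display. A brief sketch using standard sacct fields:

sacct -j <job_ID> --format=JobID,JobName,Partition,State,Elapsed,MaxRSS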

Job submission options in scripts

Option                           | SGE (qsub)                                      | SLURM (sbatch)
Script directive                 | #$                                              | #SBATCH
Job name                         | -N <name>                                       | --job-name=<name>
Standard output file             | -o <file_path>                                  | --output=<file_path>
Standard error file              | -e <file_path>                                  | --error=<file_path>
Combine stdout/stderr to stdout  | -j yes                                          | --output=<file_path>
Working directory                | -wd <directory_path>                            | --workdir=<directory_path>
Request notification             | -m <events>                                     | --mail-type=<events>
Email address                    | -M <email_address>                              | --mail-user=<email_address>
Job dependency                   | -hold_jid [job_ID | job_name]                   | --dependency=after:<job_ID>[:<job_ID>...] (also afterok:, afternotok:, afterany:); see the sketch after this table
Copy environment                 | -V                                              | --export=ALL (default)
Copy environment variable        | -v <variable[=value][,variable2=value2[,...]]>  | --export=<variable[=value][,variable2=value2[,...]]>
Node count                       | -                                               | --nodes=<count>
Request specific nodes           | -l hostname=<node>                              | --nodelist=<node[,node2[,...]]> or --nodefile=<node_file>
Processor count per node         | -pe <count>                                     | --ntasks-per-node=<count>
Processor count per task         | -                                               | --cpus-per-task=<count>
Memory limit                     | -l mem_free=<limit>                             | --mem=<limit> (in megabytes, MB)
Minimum memory per processor     | -                                               | --mem-per-cpu=<memory>
Wall time limit                  | -l h_rt=<seconds>                               | --time=<hh:mm:ss>
Queue                            | -q <queue>                                      | --partition=<queue>
Request specific resource        | -l resource=<value>                             | --gres=gpu:<count> or --gres=mic:<count>
Job array                        | -t <array_indices>                              | --array=<array_indices>
Licenses                         | -l licence=<licence_spec>                       | --licenses=<licence_spec>
Assign job to the project        | -P <project_name>                               | --account=<project_name>
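As a short sketch of chaining jobs with --dependency (the script names are placeholders): sbatch's --parsable option prints only the job ID, so it can be captured in a shell variable and passed to the dependent submission.

# Submit the first job and capture its job ID
first_id=$(sbatch --parsable preprocess.sh)

# Submit a second job that starts only if the first completes successfully
sbatch --dependency=afterok:$first_id analysis.sh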


Job Script Comparison:

 

SGE

#!/bin/bash
#
#$ -N sge_test
#$ -j y
#$ -o test.output
# Current working directory
#$ -cwd
#$ -M <email_address>
#$ -m bea
# Request 8 hours run time
#$ -l h_rt=8:0:0
# Specify the project for the job
#$ -P your_project_name_here
#
#$ -l mem=4G

echo "start job"
sleep 120
echo "bye"

SLURM

#!/bin/bash
#
#SBATCH -J slurm_test
#SBATCH -o test.output
#SBATCH -e test.output
# Default in SLURM
#SBATCH -D ./
#SBATCH --mail-user=<email_address>
#SBATCH --mail-type=ALL
# Request 8 hours run time
#SBATCH -t 8:0:0
# Specify the project for the job
#SBATCH -A your_project_name_here
#
#SBATCH --mem=4000

echo "start job"
sleep 120
echo "bye"