Overview
Resource management and load balancing are controlled by GPC's scheduler, SLURM. Running a batch job on GPC begins with creating a wrapper script, which is then submitted to one of the queues. Prepared wrapper-script templates for some of the most popular software packages on the GPC are available in /data/shared/scripts once you are logged in. These example scripts are currently still written for the previous scheduler, SGE, but can be adapted for SLURM by following the SGE to SLURM command conversions below.
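For example, a typical workflow is to copy one of the templates into your home directory, adapt it, and submit it with sbatch. The template filename below is only a placeholder, not an actual file in /data/shared/scripts:

# Copy a template wrapper (placeholder name), edit it, and submit it
cp /data/shared/scripts/example_wrapper.sh ~/my_job.sh
nano ~/my_job.sh        # adapt the SGE directives to SLURM as described below
sbatch ~/my_job.sh      # submit the wrapper to the scheduler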
Helpful Links
SLURM quickstart guide: https://slurm.schedmd.com/quickstart.html
SLURM command summary PDF: https://slurm.schedmd.com/pdfs/summary.pdf
SLURM manual pages: https://slurm.schedmd.com/man_index.html
Partitions
The following partitions are available on the GPC:
Available Partitions
Queue | Usage
normal | General jobs. 192GB memory available per node.
highmem | High memory jobs. 256GB memory available per node.
Interactive jobs can be run using one srun session at a time per user (default normal partition: "srun --pty bash"; highmem: "srun --partition=highmem --pty bash"), and may run for no more than 24 hours. Users who need GPC resources for longer than 24 hours should submit a batch job to the scheduler using the instructions on this page. To use the highmem partition in a batch job, add this line to your job wrapper: #SBATCH -p highmem
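The same commands, gathered in one place for reference; the explicit --time flag is an optional standard srun option shown here only to make the 24-hour limit visible:

# Interactive session on the default (normal) partition
srun --pty bash

# Interactive session on the highmem partition, explicitly capped at 24 hours
srun --partition=highmem --time=24:00:00 --pty bash

# In a batch wrapper, select the highmem partition
#SBATCH -p highmem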
Example Single-Thread Job Wrapper
#!/bin/bash
#SBATCH -J serial_job                      # Job name
#SBATCH --mail-type=END,FAIL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<your_email_address>   # Where to send mail
#SBATCH --ntasks=1                         # Run a single task, defaults to single CPU
#SBATCH --mem=1gb                          # Job memory request per node
#SBATCH --time=10:00:00                    # Time limit hrs:min:sec
#SBATCH -o test."%j".out                   # Standard output to current dir
#SBATCH -e test."%j".err                   # Error output to current dir

# Enable additional software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
Note: The --mem flag specifies the maximum memory per node. There are other ways to specify memory, such as --mem-per-cpu; make sure you use only one of them so they do not conflict.
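Once the wrapper is saved, for example as serial_job.sh (the filename here is just for illustration), it can be submitted and monitored with the standard SLURM commands listed in the conversion tables below:

sbatch serial_job.sh    # submit the wrapper; sbatch prints the assigned job ID
squeue -u $USER         # check the status of your queued and running jobs
scancel <job_ID>        # cancel the job if needed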
Example Multi-Thread Job Wrapper
Note: Job must support multithreading through libraries such as OpenMP/OpenMPI and you must have those loaded via the appropriate module.
#!/bin/bash
#SBATCH -J parallel_job                    # Job name
#SBATCH --mail-type=END,FAIL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<your_email_address>   # Where to send mail
#SBATCH --nodes=1                          # Run all processes on a single node
#SBATCH --ntasks=1                         # Run a single task
#SBATCH --cpus-per-task=4                  # Number of CPU cores per task
#SBATCH --mem=1gb                          # Job memory request per node
#SBATCH --time=10:00:00                    # Time limit hrs:min:sec
#SBATCH -o test."%j".out                   # Standard output to current dir
#SBATCH -e test."%j".err                   # Error output to current dir

# When using OpenMP, you may need to set this environment variable.
# Best OpenMP performance is typically with this set to 1 or equal to
# --cpus-per-task, depending on your particular program's implementation.
export OMP_NUM_THREADS=4

# Enable additional software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
Note: The --mem flag specifies the maximum memory per node. There are other ways to specify memory, such as --mem-per-cpu; make sure you use only one of them so they do not conflict.
Note: In this OpenMP example, we use 1 task with 4 CPU threads, since OpenMP treats a job as one process with multiple threads. Some libraries, such as Python's multiprocessing, instead use multiple single-threaded processes, in which case 4 tasks with 1 CPU per task is the better layout; see the sketch below.
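For that multiple-single-threaded-process case, a minimal sketch of the changed resource request (same total of 4 cores, everything else in the wrapper unchanged) would be:

#SBATCH --ntasks=4           # Four single-threaded processes
#SBATCH --cpus-per-task=1    # One CPU core per process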
Example Multi-Node Job Wrapper
(Multiple single thread processes across multiple nodes)
Note: Job must support cross-node processes through libraries such as OpenMPI and you must have those loaded via the appropriate module.
Note: This example uses 24 cores/threads total (24 tasks, 1 cpu-per-task)
#!/bin/bash
#SBATCH -J parallel_multinode_job          # Job name
#SBATCH --mail-type=END,FAIL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<your_email_address>   # Where to send mail
#SBATCH --ntasks=24                        # Number of MPI tasks (i.e. processes)
#SBATCH --cpus-per-task=1                  # Number of cores per MPI task
#SBATCH --nodes=2                          # Maximum number of nodes to be allocated
#SBATCH --ntasks-per-node=12               # Maximum number of tasks on each node
#SBATCH --ntasks-per-socket=6              # Maximum number of tasks on each socket
#SBATCH --distribution=cyclic:cyclic       # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --mem-per-cpu=600mb                # Memory per processor core
#SBATCH --time=10:00:00                    # Time limit hrs:min:sec
#SBATCH -o test."%j".out                   # Standard output to current dir
#SBATCH -e test."%j".err                   # Error output to current dir

# Enable additional software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
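MPI programs are usually started through a launcher rather than invoked directly. Depending on how the MPI library loaded on GPC was built, the final line of the wrapper above might instead look like one of the following; this is a sketch, not a GPC-specific recommendation:

srun ./myprogram      # let SLURM launch all 24 MPI tasks across the allocated nodes
# or
mpirun ./myprogram    # use the MPI library's own launcher, which picks up the SLURM allocation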
Example Multi-Node, Multi-Thread Job Wrapper
(Multiple multi-threaded processes across multiple nodes)
Note: Job must support cross-node multithreading through libraries such as OpenMPI and OpenMP. You must have those loaded via the appropriate module.
Note: This example uses 32 cores/threads total (8 tasks, 4 cpus-per-task)
#!/bin/bash
#SBATCH -J parallel_multinode_job          # Job name
#SBATCH --mail-type=END,FAIL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<your_email_address>   # Where to send mail
#SBATCH --ntasks=8                         # Number of MPI ranks
#SBATCH --cpus-per-task=4                  # Number of cores per MPI rank
#SBATCH --nodes=2                          # Number of nodes
#SBATCH --ntasks-per-node=4                # How many tasks on each node
#SBATCH --ntasks-per-socket=2              # How many tasks on each CPU or socket
#SBATCH --mem-per-cpu=600mb                # Memory per core
#SBATCH --distribution=cyclic:cyclic       # Distribute tasks cyclically first among nodes and then among sockets within a node
#SBATCH --time=10:00:00                    # Time limit hrs:min:sec
#SBATCH -o test."%j".out                   # Standard output to current dir
#SBATCH -e test."%j".err                   # Error output to current dir

# Enable additional software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
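For hybrid MPI+OpenMP jobs like the one above, it is common (though not required) to tie the OpenMP thread count to the SLURM allocation so the two stay in sync; a minimal sketch:

# Use one OpenMP thread per core allocated to each MPI rank
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK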
Example Array (Multiple Runs) Job Wrapper
#!/bin/bash
#SBATCH -J serial_array_job                # Job name
#SBATCH --mail-type=END,FAIL               # Mail events (NONE, BEGIN, END, FAIL, ALL)
#SBATCH --mail-user=<your_email_address>   # Where to send mail
#SBATCH --ntasks=1                         # Run a single task, defaults to single CPU
#SBATCH --array=1-5                        # Run 5 iterations of the job
#SBATCH --mem=1gb                          # Job memory request per node
#SBATCH --time=10:00:00                    # Time limit hrs:min:sec
#SBATCH -o test."%j".out                   # Standard output to current dir
#SBATCH -e test."%j".err                   # Error output to current dir

# Enable additional software
. /etc/profile.d/modules.sh
module load yourRequiredModule(s)

# Run the job commands
./myprogram
Note: Array jobs will have a slightly different job ID notation in the form of JobID_ArrayID such as 12345_1, 12345_2, etc.
Note: The maximum number of simultaneous array tasks can be specified with the % delimiter. For example, this submits a 1000-iteration array that runs at most 5 tasks at a time: sbatch --array=1-1000%5 testarray.sh
Full array job documentation: https://slurm.schedmd.com/job_array.html
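Inside an array job, each task can read the SLURM_ARRAY_TASK_ID environment variable to pick its own input. A minimal sketch, assuming a hypothetical datafile.1 ... datafile.5 naming scheme:

# Each array task processes its own numbered input file
./myprogram -f datafile.$SLURM_ARRAY_TASK_ID -o outfile.$SLURM_ARRAY_TASK_ID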
Local Scratch Storage
Each compute node has ~1TB available for use as scratch space. I/O-intensive jobs can move data onto a node to speed up access during the job's runtime. This is beneficial when your program makes many small reads/writes, which would have high latency on shared network storage. Keep in mind that there is some overhead from the initial copying of files into scratch, so it is not always the fastest choice; the performance benefit of local scratch vs. network storage should be considered on a per-job basis. The following can be added to your job wrapper to utilize scratch:
# Specify a few variables needed for this job
PROGRAM=$HOME/MyProgram/myprogram
DATASET=$HOME/MyProgram/datafile
SCRATCHDIR=/scratch/yourUsername/$SLURM_JOBID

# Check whether the scratch directory exists and create it as needed
if [[ ! -d "$SCRATCHDIR" ]]
then
    mkdir -p $SCRATCHDIR
fi

# Check whether the data is in scratch and copy it as needed
if [[ ! -e "$SCRATCHDIR/datafile" ]]
then
    cp $DATASET $SCRATCHDIR/datafile
    cp $PROGRAM $SCRATCHDIR/myprogram
fi

# Navigate to the scratch dir
cd $SCRATCHDIR

# Run our job commands from within the scratch dir
./myprogram -f datafile -o outfile

# Copy the output from the job commands to your home dir,
# then delete the scratch dir
cp outfile $HOME/data/outputfile.$SLURM_JOBID
rm -rf $SCRATCHDIR
This job creates a scratch directory on the node it runs on, copies the data and the program into that scratch directory, runs the program from there, and then copies the job output back to the network home directory. After the job completes, the temporary scratch directory is deleted.
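If the job commands can fail partway through, a bash trap is an optional addition to the wrapper above that removes the scratch directory even when the script exits with an error:

# Clean up the scratch dir on exit, whether the job succeeded or failed
trap 'rm -rf "$SCRATCHDIR"' EXIT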
SGE to SLURM Conversion
As of 2021, GPC has switched from the SGE job scheduler to SLURM. Along with this change come some new terms and a new set of commands: what were previously known as queues are now referred to as partitions, qsub is now sbatch, and so on. Please see the tables below for a 1:1 conversion guide between the SGE commands previously used on GPC and the SLURM commands to use now.
Common job commands
Command | SGE | SLURM
Cluster status | - | sinfo
Job submission | qsub <job_script> | sbatch <job_script>
Start an interactive job | qlogin or qrsh | srun <args> --pty bash
Job deletion | qdel <job_ID> | scancel <job_ID>
Job status (all) | qstat | squeue
Job status by job | qstat -j <job_ID> | squeue -j <job_ID>
Job status by user | qstat -u <user> | squeue -u <user>
Job status detailed | qstat -j <job_ID> | scontrol show job <job_ID>
Show expected start time | qstat -j <job_ID> | squeue -j <job_ID> --start
Hold a job | qhold <job_ID> | scontrol hold <job_ID>
Release a job | qrls <job_ID> | scontrol release <job_ID>
Queue list / information | qconf -sql | scontrol show partition
Queue details | qconf -sq <queue> | scontrol show partition <queue>
Node list | qhost | scontrol show nodes
Node details | qhost -F <node> | scontrol show node <node>
X forwarding | qsh <args> | salloc <args> or srun <args> --pty
Monitor or review job resource usage | qacct -j <job_ID> | sacct -j <job_ID>
GUI | qmon | sview
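For example, sacct can report per-job resource usage after the job has finished; the field list below is just one possible selection of standard sacct columns:

sacct -j <job_ID> --format=JobID,JobName,Elapsed,MaxRSS,State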
Job submission options in scripts
Option | SGE (qsub) | SLURM (sbatch)
Script directive | #$ | #SBATCH
Job name | -N <name> | --job-name=<name>
Standard output file | -o <file_path> | --output=<file_path>
Standard error file | -e <file_path> | --error=<file_path>
Combine stdout/stderr to stdout | -j yes | --output=<file_path>
Working directory | -wd <directory_path> | --workdir=<directory_path>
Request notification | -m <events> | --mail-type=<events>
Email address | -M <email_address> | --mail-user=<email_address>
Job dependency | -hold_jid <job_ID or job_name> | --dependency=after:job_ID[:job_ID...], --dependency=afterok:job_ID[:job_ID...], --dependency=afternotok:job_ID[:job_ID...], --dependency=afterany:job_ID[:job_ID...]
Copy environment | -V | --export=ALL (default)
Copy environment variable | -v <variable[=value][,variable2=value2[,...]]> | --export=<variable[=value][,variable2=value2[,...]]>
Node count | - | --nodes=<count>
Request specific nodes | -l hostname=<node> | --nodelist=<node[,node2[,...]]> or --nodefile=<node_file>
Processor count per node | -pe <count> | --ntasks-per-node=<count>
Processor count per task | - | --cpus-per-task=<count>
Memory limit | -l mem_free=<limit> | --mem=<limit> (in megabytes, MB)
Minimum memory per processor | - | --mem-per-cpu=<memory>
Wall time limit | -l h_rt=<seconds> | --time=<hh:mm:ss>
Queue | -q <queue> | --partition=<queue>
Request specific resource | -l resource=<value> | --gres=gpu:<count> or --gres=mic:<count>
Job array | -t <array_indices> | --array=<array_indices>
Licences | -l licence=<licence_spec> | --licenses=<license_spec>
Assign job to a project | -P <project_name> | --account=<project_name>
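As a usage sketch of the dependency option above, a follow-up job can be held until an earlier job finishes successfully; the script names are placeholders:

first=$(sbatch --parsable preprocess.sh)            # --parsable makes sbatch print only the job ID
sbatch --dependency=afterok:$first postprocess.sh   # starts only if the first job completed successfully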
Job Script Comparison:
SGE:

#!/bin/bash
#
#
#$ -N sge_test
#$ -j y
#$ -o test.output
# Current working directory
#$ -cwd
#$ -M <your_email_address>
#$ -m bea
# Request 8 hours of run time
#$ -l h_rt=8:0:0
# Specify the project for the job
#$ -P your_project_name_here
#
#$ -l mem=4G

echo "start job"
sleep 120
echo "bye"

SLURM:

#!/bin/bash
#
#SBATCH -J slurm_test
#SBATCH -o test.output
#SBATCH -e test.output
# Default in slurm
#SBATCH -D ./
#SBATCH --mail-user=<your_email_address>
#SBATCH --mail-type=ALL
# Request 8 hours of run time
#SBATCH -t 8:0:0
# Specify the project for the job
#SBATCH -A your_project_name_here
#
#SBATCH --mem=4000

echo "start job"
sleep 120
echo "bye"