Running Jobs
- 0.0.1 What is a job?
- 0.0.2 What is a job scheduler?
- 0.0.3 Slurm commands
- 0.0.4 Custom CARC Slurm commands
- 0.0.5 Batch jobs
- 0.0.6 Interactive jobs
- 0.0.7 Resource requests
- 0.0.8 Queue times
- 0.0.9 Job monitoring
- 0.0.10 Job exit codes
- 0.0.11 Slurm environment variables
- 0.0.12 Job limits
- 0.0.13 Process limits on login nodes
- 0.0.14 Account limits
- 0.0.15 Additional resources
This guide describes how to reserve compute resources and run and monitor jobs on CARC’s high-performance computing (HPC) clusters, including both the general-use Discovery cluster and the Endeavour condo cluster, using the Slurm job scheduler.
To learn how to run and monitor jobs using CARC OnDemand, a web-based access point for CARC systems, see the Running Jobs with CARC OnDemand user guide.
0.0.1 What is a job?
A job consists of all the data, commands, scripts, and programs that will be used to obtain results.
Jobs can be either batch jobs or interactive jobs, but both types have two main components:
- A request for compute resources
- A set of actions to run on those compute resources
Do not run heavy workloads directly on a login node (e.g., discovery.usc.edu or endeavour.usc.edu). Unless you are only running a small test, run your program as a job. Processes left running on login nodes may be terminated without warning.
0.0.2 What is a job scheduler?
The Discovery cluster is a general-use resource shared by all CARC users, and each Endeavour condo cluster partition is shared among the members of a specific research group. To ensure fair access, we use a job scheduler to manage all requests for compute resources. Specifically, we use the open-source Slurm (Simple Linux Utility for Resource Management) job scheduler, which allocates compute resources on the clusters for queued, user-defined jobs. It performs the following functions:
- Schedules user-submitted jobs
- Allocates user-requested compute resources
- Processes user-submitted jobs
When users submit jobs with Slurm, the available compute resources are divided among the current job queue using a fairshare algorithm. Using Slurm means your program runs as a job on one or more compute nodes instead of running directly on the cluster’s login node.
Jobs also depend on project account allocations, and each job will subtract from a project’s allocated system units. You can run the myaccount command to see your available and default accounts and your usage for each.
0.0.3 Slurm commands
The following table provides a summary of Slurm commands:
Category | Command | Purpose |
---|---|---|
Cluster info | sinfo | Display compute partition and node information |
Submitting jobs | sbatch | Submit a job script for remote execution |
| | srun | Launch parallel tasks (i.e., job steps) (typically for MPI jobs) |
| | salloc | Allocate resources for an interactive job |
| | squeue | Display status of jobs and job steps |
| | sprio | Display job priority information |
| | scancel | Cancel pending or running jobs |
Monitoring jobs | sstat | Display status information for running jobs |
| | sacct | Display accounting information for jobs |
| | seff | Display job efficiency information for past jobs |
Other | scontrol | Display or modify Slurm configuration and state |
To learn about all the options available for each command, enter man <command> for a specific command or see the official Slurm documentation.
0.0.4 Custom CARC Slurm commands
The following table lists custom CARC commands (based on Slurm commands):
Command | Purpose |
---|---|
myaccount | View account information for user |
nodeinfo | View node information by partition, CPU/GPU models, and state |
noderes | View node resources |
myqueue | View job queue information for user |
jobqueue | View job queue information |
jobhist | View compact history of user’s jobs |
jobinfo | View detailed job information |
Each command has an associated help page (e.g., jobinfo --help).
0.0.5 Batch jobs
A batch job is a set of actions that is performed remotely on one or more compute nodes. Batch jobs are implemented in job scripts, which are a special type of Bash script that specifies requested resources and a set of actions to perform. The main advantage of batch jobs is that they do not require manual intervention to run properly. This makes them ideal for programs that run for a long time.
The following is a typical batch job workflow:
- Create job script
- Submit job script with sbatch
- Check job status with myqueue
- When job completes, check log file
- If job failed, modify job and resubmit (back to step 1)
  - Check job information with jobinfo if needed
- If job succeeded, check job information with jobinfo
  - If possible, use fewer resources next time for a similar job
Below is an example batch job script that runs a Python script:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
module purge
module load gcc/11.3.0
module load python/3.11.3
python3 script.py
The top line:
#!/bin/bash
specifies which shell interpreter to use when running your script. The bash interpreter is specified, so everything in the job script should be in Bash syntax.
The next set of lines:
#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00
use Slurm options to specify the requested resources for your job.
Make sure to use the correct account for your jobs (<project_id>). Without the --account option, your default project account will be used. This is fine if you only have one project account. Your project IDs can be found in the User Portal on the Allocations page. Project IDs are of the form <PI_username>_<id>, where <PI_username> is the username of the project’s Principal Investigator (PI) and <id> is a 2- or 3-digit project ID number (e.g., ttrojan_123). Enter myaccount to see your available project accounts.
The next set of lines:
module purge
module load gcc/11.3.0
module load python/3.11.3
load the required software modules. The module purge command first clears any previously loaded modules so the job starts from a clean environment, and each module load command makes the listed software available.
The final line:
python3 script.py
is the command that runs a Python script. More generally, these final lines will be the command or set of commands needed to run your program or multiple programs.
To submit a batch job, use Slurm’s sbatch command when logged in to the Discovery or Endeavour cluster:
sbatch my.job
where the argument to the command is the job script’s file name (e.g., my.job).
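When the job is accepted, sbatch prints the assigned job ID, which you can use with the monitoring commands described later in this guide. For example (the job ID shown is just illustrative):
sbatch my.job
Submitted batch job 4822945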
Submitted jobs are sent to the job scheduler, placed in the queue, and then processed remotely when the requested resources become available. The job’s terminal output is recorded and written to an output file in the same directory where your job script is stored. By default, this output file is named slurm-<jobid>.out. This is a plain-text file, so you can view it using the less command:
less slurm-<jobid>.out
To exit, enter q.
If you are submitting jobs on the Endeavour condo cluster (endeavour.usc.edu), you will need to specify the name of a condo partition along with the appropriate project ID. This can be done using the --partition and --account options in your job script:
#SBATCH --account=<project_id>
#SBATCH --partition=<partition_name>
If you encounter an error similar to:
Invalid account or account/partition combination specified
when submitting a job, double check that you have entered the right partition name and project ID. Run the myaccount command to see your accounts and allowed Endeavour partitions.
Slurm can also provide e-mail notifications when your job begins and ends. Enable notifications by adding the following options in your job script:
#SBATCH --mail-type=all
#SBATCH --mail-user=<e-mail address>
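If you prefer fewer messages, you can limit notifications to specific events instead of all; for example, to be notified only when the job ends or fails:
#SBATCH --mail-type=end,fail
#SBATCH --mail-user=<e-mail address>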
0.0.6 Interactive jobs
An interactive job logs you on to one or more compute nodes where you can work interactively. All actions are performed on the command line. The main advantage of interactive jobs is that you get immediate feedback and the job will not end (and relinquish your compute resources) if your command, script, or program encounters an error and terminates. This is especially useful for developing scripts and programs as well as debugging.
Use Slurm’s salloc command to reserve resources on a node:
[user@discovery1 ~]$ salloc --time=2:00:00 --cpus-per-task=8 --mem=16G --account=<project_id>
salloc: Pending job allocation 324316
salloc: job 324316 queued and waiting for resources
salloc: job 324316 has been allocated resources
salloc: Granted job allocation 324316
salloc: Waiting for resource configuration
salloc: Nodes d17-03 are ready for job
Make sure to change the resource requests (the --time=2:00:00 --cpus-per-task=8 --mem=16G --account=<project_id> part after the salloc command) as needed, such as the number of cores and amount of memory required. Also make sure to replace <project_id> with your project ID.
Once you are granted the resources and logged in to a compute node, you can then begin entering commands (such as loading software modules):
[user@d17-03 ~]$ module load gcc/11.3.0 python/3.11.3
[user@d17-03 ~]$ python --version
Python 3.11.3
[user@d17-03 ~]$
Notice that the shell prompt has changed from user@discovery1 to user@<nodename> to indicate that you are now on a compute node (e.g., d17-03).
To exit the node and relinquish the job resources, enter exit in the shell. This will return you to the login node:
[user@d17-03 ~]$ exit
exit
salloc: Relinquishing job allocation 324316
[user@discovery1 ~]$
0.0.7 Resource requests
Slurm allows you to specify many different types of resources. The following table describes the more common resource request options and their default values:
Resource | Default value | Description |
---|---|---|
--nodes=<number> | 1 | Number of nodes to use |
--ntasks=<number> | 1 | Number of processes to run |
--ntasks-per-node=<number> | 1 | Number of processes to run (per node) |
--cpus-per-task=<number> | 1 | Number of CPU cores per task |
--mem=<number> | N/A | Total memory (per node) |
--mem-per-cpu=<number> | 2G | Memory per CPU core |
--partition=<partition_name> | main | Request nodes on specific partition |
--constraint=<features> | N/A | Request nodes with specific features (e.g., xeon-4116) |
--nodelist=<nodes> | N/A | Request specific nodes (e.g., e09-18,e23-02) |
--exclude=<nodes> | N/A | Exclude specific nodes (e.g., e09-18,e23-02) |
--exclusive | N/A | Request all CPUs and GPUs on nodes |
--time=<D-HH:MM:SS> | 1:00:00 | Maximum run time |
--account=<project_id> | default project account | Account to charge resources to |
If a resource option is not specified, the default value will be used.
CARC compute nodes have varying numbers of cores and amounts of memory. For more information on node specs, see the Discovery Resource Overview or Endeavour Resource Overview.
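For example, the --constraint option can be combined with other options to target nodes with a particular CPU model, using the example feature name from the table above (available features vary by partition; see the nodeinfo command):
#SBATCH --partition=main
#SBATCH --constraint=xeon-4116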
0.0.7.1 Nodes, tasks, and CPUs
The default value for the --nodes option is 1. This number should typically be increased if you are running a parallel job using MPI, but otherwise it should be unchanged. The default value for --ntasks is also 1, and it should typically be increased if running a multi-node parallel job.

The --cpus-per-task option refers to logical CPUs. On CARC compute nodes, there are typically two physical processors (sockets) per node with multiple cores per processor and one thread per core, such that 1 logical CPU = 1 core = 1 thread. These terms may be used interchangeably. Nodes on the main partition have varying numbers of cores (16, 20, 24, 32). This option should be changed depending on the nature of your job. Serial jobs only require 1 core, the default value. Additionally, single-threaded MPI jobs only require 1 core per task. For multi-threaded jobs of any kind, the value should be increased as needed to take advantage of multiple cores.
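As an illustration, the sketches below contrast a single-node multi-threaded job with a multi-node MPI job (the specific values are arbitrary):
# Single-node, multi-threaded job: 1 task using many cores
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Multi-node MPI job: many single-core tasks spread across nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1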
For more information on MPI jobs, see the MPI guide.
0.0.7.2 Memory
The #SBATCH --mem=0 option tells Slurm to reserve all of the available memory on each compute node requested. Otherwise, the maximum memory (#SBATCH --mem=<number>) or maximum memory per CPU (#SBATCH --mem-per-cpu=<number>) can be specified as needed.
Note that some memory on each node is reserved for system overhead, which is easy to overlook when requesting memory. Consider the following scenario:
A compute node with 24 CPUs and specs stating 96 GB of shared memory really has ~92 GB of usable memory. You may calculate “96 GB / 24 CPUs = 4 GB per CPU” and add
#SBATCH --mem-per-cpu=4G
to your job script. Slurm may alert you to an incorrect memory request and not submit the job.
In this case, setting
#SBATCH --mem-per-cpu=3G
or
#SBATCH --mem=0
or some value less than 92 GB will resolve the issue.
0.0.7.3 GPUs
To request a GPU on Discovery’s GPU partition, add the following line to your Slurm job script:
#SBATCH --partition=gpu
Also add one of the following sbatch options to your Slurm job script to request the type and number of GPUs you’d like to use:
#SBATCH --gpus-per-task=<number>
or
#SBATCH --gpus-per-task=<gpu_type>:<number>
where:
- <number> is the number of GPUs per task requested
- <gpu_type> is a GPU model
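For example, to request two GPUs of a specific model for a single task (the a100 model name here is only an example; see the GPUs guide or the nodeinfo command for the GPU types available):
#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=a100:2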
For more information, see the GPUs guide.
0.0.8 Queue times
Generally, if you request a lot of resources or very specific, limited resources, you can expect to wait longer for Slurm to assign resources to your job. Additionally, if the Discovery cluster or your condo cluster partition is especially active, you can also expect to wait longer for Slurm to assign resources to your job.
If your job is pending with a reason of Priority, this means other users’ jobs have higher priority at that time. Job priority is mostly determined by your fairshare score, which is based on your resource usage over the past 30 days. Priority also depends on how large the job is and how long it has been pending in the queue.
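To see how Slurm ranks your pending jobs, or to get a rough estimate of when a pending job might start, you can use commands like the following (start-time estimates change as other jobs complete or are submitted):
sprio -u $USER
squeue --me --start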
0.0.9 Job monitoring
There are a number of ways to monitor your jobs, either with Slurm or other tools.
0.0.9.1 Job monitoring with Slurm
After submitting a job, there are a few different ways to monitor its progress with Slurm. The first is to check the job queue with the squeue command. For example, to check your own jobs:
squeue --me
The output will look similar to the following:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
4822945 main test ttrojan PD 0:00 1 (Priority)
It provides the following information:
Output | Definition |
---|---|
JOBID | Unique numeric ID assigned to each job |
PARTITION | Partition the job is running on |
NAME | Job name (by default, the filename of the job script) |
USER | User that submitted the job |
ST | Current state of the job (see table below) |
TIME | Amount of time the job has been running |
NODES | Number of nodes the job is running on |
NODELIST(REASON) | If running, the list of nodes the job is running on. If pending, the reason the job is waiting. |
The ST column refers to the state of the job. The following table provides common codes:
Code | State | Meaning |
---|---|---|
PD | Pending | Job is pending (e.g., waiting for requested resources) |
R | Running | Job is running |
CG | Completing | Job is completing |
CD | Completed | Job has completed |
CA | Cancelled | Job was cancelled |
The information that squeue returns can be customized; refer to the squeue manual for more information. CARC also provides a custom output format with the myqueue and jobqueue commands.
When a job is running, you can use the sstat command to monitor the job for specific values of interest at the time the command is entered. For example, to focus on a few important values, use a command like the following:
sstat -o JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite -j <jobid>
When a job is running or completed, you can use the sacct command to obtain accounting information about the job. Entering sacct without options will provide information for your jobs submitted on the current day, with a default output format. To focus on values similar to those from sstat after a job has completed, use a command like the following:
sacct -o JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite,State,ExitCode -j <jobid>
Once a job has completed, you can also use the jobinfo command to obtain job information, including CPU and memory efficiency. For example:
jobinfo <jobid>
0.0.9.2 Job monitoring from output
Most programs will generate some form of output when run. This can be in the form of status messages sent to the terminal or newly generated files. If running a batch job, Slurm will redirect output meant for the terminal to an output file of the form slurm-<jobid>.out. View the contents of this file to see output from the job. If needed, you can also run your programs through profiling or process monitoring tools in order to generate additional output.
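For a batch job that is still running, you can also follow new output as it is written with the tail command (press Ctrl+C to stop following):
tail -f slurm-<jobid>.out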
0.0.9.3 Job monitoring on compute nodes
When your job is actively running, the output from squeue will report the compute node(s) that your job has been allocated. You can log on to these nodes using the command ssh <nodename> from one of the login nodes and then use a process monitoring tool like ps, top, htop, atop, iotop, or glances to check the CPU, memory, or I/O usage of your job processes. For GPUs, you can use the nvidia-smi or nvtop tools.
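For example, if squeue reports that your job is running on node d17-03 (illustrative), you could log on and check your own processes with a command like top:
ssh d17-03
top -u $USER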
0.0.10 Job exit codes
Slurm provides exit codes when a job completes. An exit code of 0 means success and anything non-zero means failure. The following is a reference guide for interpreting these exit codes:
Code | Meaning | Note |
---|---|---|
0 | Success | |
1 | General failure | |
2 | Incorrect use of shell built-ins | |
3-124 | Some error in job | Check software exit codes |
125 | Out of memory | |
126 | Command cannot execute | |
127 | Command not found | |
128 | Invalid argument to exit | |
129-192 | Job terminated by Linux signals | Subtract 128 from the number and match to signal code. Enter kill -l to list signal codes and enter man signal for more information. |
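As a worked example, an exit code of 137 corresponds to signal 9 (137 - 128 = 9), i.e., SIGKILL, which often indicates that a process was killed for exceeding its memory request. You can look up a signal name directly:
kill -l 9
This prints KILL.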
0.0.11 Slurm environment variables
Slurm creates a set of environment variables when jobs run. These can be used in job or application scripts to dynamically create directories, assign the number of threads, etc. depending on the job specification. Some example variables:
Variable | Description |
---|---|
SLURM_JOB_ID | The ID of the job allocation |
SLURM_JOB_NODELIST | List of nodes allocated to the job |
SLURM_JOB_NUM_NODES | Total number of nodes in the job’s resource allocation |
SLURM_NTASKS | Number of tasks requested |
SLURM_CPUS_PER_TASK | Number of CPUs requested per task |
SLURM_SUBMIT_DIR | The directory from which sbatch was invoked |
SLURM_ARRAY_TASK_ID | Job array ID (index) number |
For example, to assign OpenMP threads, you could include a line like the following in your job script:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
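These variables can also be used to organize job output; for instance, a job script might create a separate results directory for each run, as in this sketch (the results_ directory name is hypothetical):
# Create a unique directory for this job's output under the submission directory
OUTDIR="$SLURM_SUBMIT_DIR/results_$SLURM_JOB_ID"
mkdir -p "$OUTDIR"
cd "$OUTDIR" || exit 1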
0.0.12 Job limits
Discovery is a shared resource, so we put limits on the size and duration of jobs to ensure everyone has a chance to run jobs:
Partition | Maximum run time | Maximum concurrent CPUs | Maximum concurrent GPUs | Maximum concurrent memory | Maximum concurrent jobs running | Maximum concurrent jobs queued |
---|---|---|---|---|---|---|
main + epyc-64 | 48 hours | 2,000 | 36 | — | 100 | 5,000 |
gpu | 48 hours | 400 | 36 | — | 36 | 100 |
oneweek | 168 hours | 208 | — | — | 50 | 50 |
largemem | 168 hours | 64 | — | 1500 GB | 3 | 10 |
debug | 1 hour | 48 | 4 | — | 5 | 5 |
Endeavour condo partitions can also be configured with custom limits upon request.
0.0.13 Process limits on login nodes
Login nodes serve as the main user interface for the CARC clusters and are shared among all users across the university. These nodes are only intended for basic tasks, such as managing files, editing scripts, and managing jobs. It only takes a few users performing computationally intensive tasks on the login nodes to cause slow performance for all users across the CARC clusters.
To keep these shared resources from becoming saturated and sluggish, there is a limit on the total number of processes an individual user can spawn on the login nodes.
Each login node has a per-user limit of 64 processes, 4 CPU cores, and 32 GB of memory.
If a user exceeds these limits, they may not be able to access the cluster or their application may be aborted. Here are a few examples of process utilization:
- Each connection to the login node through the terminal spawns two processes, as well as one process for the ssh-agent per user.
Please note that accessing the cluster via the Remote SSH extension of VSCode may be blocked, since this extension spawns too many processes on the login nodes, exceeding the limit. Additionally, the processes started by Remote SSH are not properly killed after the user logs out of the application. This may cause an account lockout, preventing the user from accessing the cluster, even from the terminal. It is recommended to use the SSH-FS extension in VSCode instead.
- When a user launches Python and imports a package that relies on OpenBLAS (e.g., NumPy), the library will auto-detect the number of CPUs available on the node and create a thread pool based on that number. This can exceed the process limit imposed on the login nodes and cause Python to crash. If it is absolutely necessary to run this kind of script on the login node, limit the number of threads by setting these environment variables:
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
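If the package links against OpenBLAS directly, it may also help to limit OpenBLAS’s own thread pool:
export OPENBLAS_NUM_THREADS=1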
The best approach to installing a package or debugging code on the cluster is to request an interactive session on the debug partition (or any other compute node) to complete your tasks. To do this, use a command like the following:
salloc -p debug -c 4
This will request 4 CPU cores on a debug node for 1 hour.
0.0.14 Account limits
Jobs on Discovery also depend on your project account allocations, and each job will subtract from your project’s allocated System Units (SUs) depending on the types of resources you request. SUs are determined by CPU, memory, and GPU usage combined with job run time. The following table breaks down SUs charged per minute:
Resource reserved for 1 minute | SUs charged |
---|---|
1 CPU | 1 |
4 GB memory | 1 |
1 A100 or A40 GPU | 8 |
1 V100 or P100 GPU | 4 |
1 K40 GPU | 2 |
For example:
- 1 CPU with 8 GB memory used for 1 hour = (1 + 8/4) SUs per minute × 60 minutes = 180 SUs
- 8 CPUs with 320 GB memory and 8 A100 GPUs used for 1 hour = (8 + 320/4 + 8 × 8) SUs per minute × 60 minutes = 9120 SUs
SUs are charged based on resources reserved for your job, not the resources that your job actually uses. For example, if you reserve 8 CPUs for your job but your job actually only uses 1 CPU, your account still gets charged for 8 CPUs. Try not to request more resources than your job actually requires.
Use the myaccount command to see your available and default account allocations.
To see current usage for a project account and individual contributions to the total usage, use Slurm’s sshare command like the following:
sshare --clusters=discovery --format=Account,GrpTRESMins,User,EffectvUsage,GrpTRESRaw%100 --all --accounts=ttrojan_123
Make sure to substitute the project account ID that you want to see. Multiple accounts can be specified with comma separation.
0.0.15 Additional resources
- Slurm documentation
- Slurm Quick Start User Guide
- Slurm tutorials
- CARC Slurm Cheatsheet
- CARC Slurm Job Script Templates
- CARC workshop materials for Slurm Job Management