Running Jobs on CARC Systems

This guide describes how to reserve compute resources and run and monitor jobs on CARC's high-performance computing (HPC) clusters, including both the general-use Discovery cluster and the Endeavour condo cluster, using the Slurm job scheduler.

This guide focuses on the command line. Jobs can also be run and monitored using CARC OnDemand, a web-based access point for CARC systems. See the Getting Started with CARC OnDemand user guide for more information.

What is a job?

A job consists of all the data, commands, scripts, and programs that will be used to obtain results.

Jobs can be either batch jobs or interactive jobs, but both types have two main components:

  • A request for compute resources
  • A set of actions to run on those compute resources

A common mistake for new users of HPC clusters is to run heavy workloads directly on a login node (e.g., discovery.usc.edu or endeavour.usc.edu). Unless you are only running a small test, please make sure to run your program as a job. Processes left running on login nodes may be terminated without warning.

What is a job scheduler?

The Discovery cluster is a general-use shared resource, and each Endeavour condo cluster partition is shared among the members of a specific research group. To ensure fair access, we use a job scheduler to manage all requests for compute resources. Specifically, we use the open-source Slurm (Simple Linux Utility for Resource Management) job scheduler, which allocates compute resources on the clusters for queued, user-defined jobs. It performs the following functions:

  • Schedules user-submitted jobs
  • Allocates user-requested compute resources
  • Processes user-submitted jobs

When users submit jobs with Slurm, the available compute resources are divided among the current job queue using a fairshare algorithm. Using Slurm means your program will be run as a job on one or more compute nodes instead of being run directly on a login node.

Jobs also depend on project account allocations, and each job will subtract from a project's allocated system units. You can use the myaccount command to see your available and default accounts and your usage for each.

Slurm commands

The following table provides a summary of Slurm commands:

| Category | Command | Purpose |
|----------|---------|---------|
| Cluster info | sinfo | Display compute partition and node information |
| Submitting jobs | sbatch | Submit a job script for remote execution |
| | srun | Launch parallel tasks (i.e., job steps) (typically for MPI jobs) |
| | salloc | Allocate resources for an interactive job |
| | squeue | Display status of jobs and job steps |
| | sprio | Display job priority information |
| | scancel | Cancel pending or running jobs |
| Monitoring jobs | sstat | Display status information for running jobs |
| | sacct | Display accounting information for jobs |
| | seff | Display job efficiency information for past jobs |
| Other | scontrol | Display or modify Slurm configuration and state |

To learn about all the options available for each command, enter man <command> for a specific command or see the official Slurm documentation.

Custom CARC Slurm commands

The following table lists custom CARC commands (based on Slurm commands):

| Command | Purpose |
|---------|---------|
| myaccount | Display user's Slurm account information |
| acctusage | Display Slurm account usage information |
| nodeinfo | Display node information by partition, CPU/GPU models, and state |
| gpuinfo | Display GPU states |
| myqueue | Display user's job queue information |
| cqueue | Display job queue information (filter by partition and/or users) |
| jobhist | Display a compact history of user's jobs with basic information |
| jobinfo | Display detailed information about a job (add the job ID after the command) |

Batch jobs

A batch job is a set of actions that is performed remotely on one or more compute nodes. Batch jobs are implemented in job scripts, which are a special type of Bash script that specifies requested resources and a set of actions to perform. The main advantage of batch jobs is that they do not require manual intervention to run properly. This makes them ideal for programs that run for a long time.

The following is a typical batch job workflow:

  1. Create job script
  2. Submit job script with sbatch
  3. Check job status with myqueue
  4. When job completes, check log file
  5. If job failed, modify job and resubmit (back to step 1)
    • Check job information with jobinfo if needed
  6. If job succeeded, check job information with jobinfo
    • If possible, request fewer resources next time for a similar job

Below is an example batch job script that runs a Python script:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00

module purge
module load gcc/11.3.0
module load python/3.9.12

python3 script.py

The top line:

#!/bin/bash

specifies which shell interpreter to use when running your script. The bash interpreter is specified, so everything in the job script should be in Bash syntax.

The next set of lines:

#SBATCH --account=<project_id>
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=1:00:00

use Slurm options to specify the requested resources for your job.

Make sure to use the correct account for your jobs (<project_id>). Without the --account option, your default project account will be used, which is fine if you only have one project account. Your project IDs can be found in the User Portal on the Allocations page. Project IDs are of the form <PI_username>_<id>, where <PI_username> is the username of the project's Principal Investigator (PI) and <id> is a 2- or 3-digit project ID number (e.g., ttrojan_123). Enter myaccount to see your available project accounts.

The next set of lines:

module purge
module load gcc/11.3.0
module load python/3.9.12

reset the module environment (module purge) and load the software modules required for the job (module load ...).

The final line:

python3 script.py

is the command that runs a Python script. More generally, these final lines will be the command or set of commands needed to run your program (or programs).
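
For example, a job script can run more than one command and pass arguments to your program. The script names, arguments, and file names below are only illustrative:

python3 preprocess.py data.csv
python3 script.py --input data.csv --output results.txt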

To submit a batch job, use Slurm's sbatch command when logged in to the Discovery or Endeavour cluster:

sbatch my.job

where the argument to the command is the job script's file name (e.g., my.job).

Submitted jobs are sent to the job scheduler, placed in the queue, and then processed remotely when the requested resources become available. Output from the job (anything that would normally be printed to the terminal) is written to an output file in the directory from which the job was submitted. By default, this output file is named slurm-<jobid>.out. This is a plain-text file, so you can view it using the less command:

less slurm-<jobid>.out

To exit, enter q.

If you are submitting jobs on the Endeavour condo cluster (endeavour.usc.edu), you will need to specify the name of a condo partition along with the appropriate project ID. This can be done using the --partition and --account options in your job script:

#SBATCH --account=<project_id>
#SBATCH --partition=<partition_name>

If you encounter an error similar to:

Invalid account or account/partition combination specified

when submitting a job, double check that you have entered the right partition name and project ID or contact us at carc-condo@usc.edu to obtain the correct combination.

Slurm can also provide e-mail notifications when your job begins and ends. Enable notifications by adding the following options in your job script:

#SBATCH --mail-type=all
#SBATCH --mail-user=<e-mail address>

Interactive jobs

An interactive job logs you on to one or more compute nodes where you can work interactively. All actions are performed on the command line. The main advantage of interactive jobs is that you get immediate feedback and the job will not end (and relinquish your compute resources) if your command, script, or program encounters an error and terminates. This is especially useful for developing scripts and programs as well as debugging.

Use Slurm's salloc command to reserve resources on a node:

[user@discovery1 ~]$ salloc --time=2:00:00 --cpus-per-task=8 --mem=16G --account=<project_id>
salloc: Pending job allocation 324316
salloc: job 324316 queued and waiting for resources
salloc: job 324316 has been allocated resources
salloc: Granted job allocation 324316
salloc: Waiting for resource configuration
salloc: Nodes d17-03 are ready for job

Make sure to change the resource requests (the --time=2:00:00 --cpus-per-task=8 --mem=16G --account=<project_id> part after the salloc command) as needed, such as the number of cores and amount of memory required. Also make sure to replace <project_id> with your project ID.

Once you are granted the resources and logged in to a compute node, you can then begin entering commands (such as loading software modules):

[user@d17-03 ~]$ module load gcc/11.3.0 python/3.9.12
[user@d17-03 ~]$ python --version
Python 3.9.12
[user@d17-03 ~]$

Notice that the shell prompt has changed from user@discovery1 to user@<nodename> to indicate that you are now on a compute node (e.g., d17-03).

To exit the node and relinquish the job resources, enter exit in the shell. This will return you to the login node:

[user@d17-03 ~]$ exit
exit
salloc: Relinquishing job allocation 324316
[user@discovery1 ~]$

Resource requests

Slurm allows you to specify many different types of resources. The following table describes the more common resource request options and their default values:

| Resource | Default value | Description |
|----------|---------------|-------------|
| --nodes=<number> | 1 | Number of nodes to use |
| --ntasks=<number> | 1 | Number of processes to run |
| --ntasks-per-node=<number> | 1 | Number of processes to run (per node) |
| --cpus-per-task=<number> | 1 | Number of CPU cores per task |
| --mem=<number> | N/A | Total memory (per node) |
| --mem-per-cpu=<number> | 2G | Memory per CPU core |
| --partition=<partition_name> | main | Request nodes on a specific partition |
| --constraint=<features> | N/A | Request nodes with specific features (e.g., xeon-4116) |
| --nodelist=<nodes> | N/A | Request specific nodes (e.g., e09-18,e23-02) |
| --exclude=<nodes> | N/A | Exclude specific nodes (e.g., e09-18,e23-02) |
| --exclusive | N/A | Request all CPUs and GPUs on nodes |
| --time=<D-HH:MM:SS> | 1:00:00 | Maximum run time |
| --account=<project_id> | default project account | Account to charge resources to |

If a resource option is not specified, the default value will be used.

CARC compute nodes have varying numbers of cores and amounts of memory. For more information on node specs, see the Discovery Resource Overview or Endeavour Resource Overview.

Nodes, tasks, and CPUs

The default value for the --nodes option is 1. This number should typically be increased if you are running a parallel job using MPI, but otherwise it should be unchanged. The default value for --ntasks is also 1, and it should typically be increased if running a multi-node parallel job.

The --cpus-per-task option refers to logical CPUs. On CARC compute nodes, there are typically two physical processors (sockets) per node with multiple cores per processor and one thread per core, such that 1 logical CPU = 1 core = 1 thread. These terms may be used interchangeably. Nodes on the main partition have varying numbers of cores (16, 20, 24, 32). This option should be changed depending on the nature of your job. Serial jobs only require 1 core, the default value. Additionally, single-threaded MPI jobs only require 1 core per task. For multi-threaded jobs of any kind, the value should be increased as needed to take advantage of multiple cores.
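
As a rough sketch (the values below are only examples, not recommendations), a single-node multi-threaded job and a multi-node MPI job might request cores like this:

# Single-node, multi-threaded job: 1 task using 16 cores
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16

# Multi-node MPI job: 2 nodes, 16 single-threaded tasks per node
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=1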

For more information on MPI jobs, see the Using MPI user guide.

Memory

The #SBATCH --mem=0 option tells Slurm to reserve all of the available memory on each compute node requested. Otherwise, the max memory (#SBATCH --mem=<number>) or max memory per CPU (#SBATCH --mem-per-cpu=<number>) can be specified as needed.

Note that some memory on each node is reserved for system overhead. Here is a scenario where you may forget about considering memory overhead:

A compute node with 24 CPUs whose specs list 96 GB of shared memory really has ~92 GB of usable memory. You might calculate "96 GB / 24 CPUs = 4 GB per CPU" and add #SBATCH --mem-per-cpu=4G to your job script. Because 24 CPUs × 4 GB exceeds the usable memory, Slurm may flag the memory request as invalid and refuse to submit the job.

In this case, setting #SBATCH --mem-per-cpu=3G, using #SBATCH --mem=0, or requesting a total amount of memory below 92 GB will resolve the issue.

GPUs

To request a GPU on Discovery's GPU partition, add the following line to your Slurm job script:

#SBATCH --partition=gpu 

Also add one of the following sbatch options to your Slurm job script to request the type and number of GPUs you'd like to use:

#SBATCH --gpus-per-task=<number>

or

#SBATCH --gpus-per-task=<gpu_type>:<number>

where:

  • <number> is the number of GPUs per task requested, and
  • <gpu_type> is a GPU model.
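
For example, to request one GPU for a single task on the gpu partition (the a100 model name here is only an illustration; see the Using GPUs guide for the available models):

#SBATCH --partition=gpu
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=a100:1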

For more information, see the Using GPUs user guide.

Queue times

Generally, if you request a lot of resources or very specific, limited resources, you can expect to wait longer for Slurm to assign resources to your job. Additionally, if the Discovery cluster or your condo cluster partition is especially active, you can also expect to wait longer for Slurm to assign resources to your job.

If your job is pending with a reason of Priority, this means other users have higher priority at that time. Job priority is mostly determined by your fairshare score, which is based on your resource usage over the past 30 days. It also depends on how large the job is and how long it has been pending in the queue.
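
To see how your pending jobs compare, you can check Slurm directly. For example:

# Show priority factors for your pending jobs
sprio -u $USER

# Show Slurm's estimated start times for your pending jobs (when available)
squeue --me --start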

Job monitoring

There are a number of ways to monitor your jobs, either with Slurm or other tools.

Job monitoring with Slurm

After submitting a job, there are a few different ways to monitor its progress with Slurm. The first is to check the job queue with the squeue command. For example, to check your own jobs:

squeue --me

The output will look similar to the following:

  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
4822945      main     test  ttrojan PD       0:00      1 (Priority)

It provides the following information:

| Output | Definition |
|--------|------------|
| JOBID | Unique numeric ID assigned to each job |
| PARTITION | Partition the job is running on |
| NAME | Job name (by default, the filename of the job script) |
| USER | User that submitted the job |
| ST | Current state of the job (see table below) |
| TIME | Amount of time the job has been running |
| NODES | Number of nodes the job is running on |
| NODELIST(REASON) | If running, the list of nodes the job is running on; if pending, the reason the job is waiting |

The ST column refers to the state of the job. The following table provides common codes:

| Code | State | Meaning |
|------|-------|---------|
| PD | Pending | Job is pending (e.g., waiting for requested resources) |
| R | Running | Job is running |
| CG | Completing | Job is completing |
| CD | Completed | Job has completed |
| CA | Cancelled | Job was cancelled |

The information that squeue returns can be customized; refer to the squeue manual for more information. CARC also provides a custom output format with the myqueue and cqueue commands.
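
For example, one possible custom format uses the --format option (the column widths below are arbitrary):

squeue --me --format="%.10i %.9P %.20j %.8T %.10M %.20R"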

When a job is running, you can use the sstat command to monitor the job for specific values of interest at the time the command is entered. For example, to focus on a few important values, use a command like the following:

sstat -o JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite -j <jobid>

When a job is running or completed, you can use the sacct command to obtain accounting information about the job. Entering sacct without options will provide information for your jobs submitted on the current day, with a default output format. To focus on values similar to those from sstat after a job has completed, use a command like the following:

sacct -o JobID,MaxRSS,AveCPUFreq,MaxDiskRead,MaxDiskWrite,State,ExitCode -j <jobid>

Once a job has completed, you can also use the jobinfo command to obtain job information including CPU and memory efficiency. For example:

jobinfo <jobid>

Job monitoring from output

Most programs will generate some form of output when run. This can be in the form of status messages sent to the terminal or newly generated files. If running a batch job, Slurm will redirect output meant for the terminal to an output file of the form slurm-<jobid>.out. View the contents of this file to see output from the job. If needed, you can also run your programs through profiling or process monitoring tools in order to generate additional output.
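
For example, to follow a running batch job's output file in real time, you can use the tail command (press Ctrl+C to stop following):

tail -f slurm-<jobid>.out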

Job monitoring on compute nodes

When your job is actively running, the output from squeue will report the compute node(s) that your job has been allocated. You can log on to these nodes using the command ssh <nodename> from one of the login nodes and then use a process monitoring tool like ps, top, htop, atop, iotop, or glances in order to check the CPU, memory, or I/O usage of your job processes. For GPUs, you can use the nvidia-smi or nvtop tools.
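
For example, if squeue reports that your job is running on node d17-03, you could check its processes like this (the node name is only an illustration):

ssh d17-03
top -u $USER    # CPU and memory usage for your processes
nvidia-smi      # GPU usage (on GPU nodes only)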

Job exit codes

Slurm provides exit codes when a job completes. An exit code of 0 means success and anything non-zero means failure. The following is a reference guide for interpreting these exit codes:

| Code | Meaning | Note |
|------|---------|------|
| 0 | Success | |
| 1 | General failure | |
| 2 | Incorrect use of shell builtins | |
| 3-124 | Some error in job | Check software exit codes |
| 125 | Out of memory | |
| 126 | Command cannot execute | |
| 127 | Command not found | |
| 128 | Invalid argument to exit | |
| 129-192 | Job terminated by Linux signals | Subtract 128 from the number and match to a signal code. Enter kill -l to list signal codes and enter man signal for more information. |
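
For example, an exit code of 137 corresponds to signal 9 (137 - 128 = 9). Listing that signal code shows it is SIGKILL, which often means the job was killed for exceeding a resource limit such as memory:

kill -l 9
KILL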

Slurm environment variables

Slurm creates a set of environment variables when jobs run. These can be used in job or application scripts to dynamically create directories, assign the number of threads, etc. depending on the job specification. Some example variables:

| Variable | Description |
|----------|-------------|
| SLURM_JOB_ID | The ID of the job allocation |
| SLURM_JOB_NODELIST | List of nodes allocated to the job |
| SLURM_JOB_NUM_NODES | Total number of nodes in the job's resource allocation |
| SLURM_NTASKS | Number of tasks requested |
| SLURM_CPUS_PER_TASK | Number of CPUs requested per task |
| SLURM_SUBMIT_DIR | The directory from which sbatch was invoked |
| SLURM_ARRAY_TASK_ID | Job array ID (index) number |

For example, to assign OpenMP threads, you could include a line like the following in your job script:

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
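
Similarly, you could use these variables to create a job-specific directory for output files (the directory name below is only an illustration):

mkdir -p "$SLURM_SUBMIT_DIR/results_$SLURM_JOB_ID"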

Job limits

Discovery is a shared resource, so we put limits on the size and duration of jobs to ensure everyone has a chance to run jobs:

| Partition | Maximum run time | Maximum concurrent CPUs | Maximum concurrent GPUs | Maximum concurrent memory | Maximum concurrent jobs running | Maximum concurrent jobs queued |
|-----------|------------------|-------------------------|-------------------------|---------------------------|---------------------------------|--------------------------------|
| main | 48 hours | 2,000 | 36 | --- | 500 | 5,000 |
| epyc-64 | 48 hours | 2,000 | --- | --- | 500 | 5,000 |
| gpu | 48 hours | 400 | 36 | --- | 36 | 100 |
| oneweek | 168 hours | 208 | --- | --- | 50 | 50 |
| largemem | 168 hours | 64 | --- | 1000 GB | 3 | 10 |
| debug | 1 hour | 48 | 4 | --- | 5 | 5 |

Endeavour condo partitions can also be configured with custom limits upon request.

Process limits on login nodes

Login nodes serve as the main user interface for the CARC clusters and are shared among all the users across the university. These nodes are only intended for basic tasks, such as managing files, editing scripts, and managing jobs. It only takes a few users performing computationally intensive tasks on the login nodes to result in slow performance for all users across the CARC clusters.

To ensure smooth operation for everyone, there is a limit on the total number of processes an individual user can spawn on the login nodes in an effort to prevent these shared resources from becoming saturated and sluggish.

Each login node enforces per-user limits of 64 processes, 4 CPU cores, and 32 GB of memory.

If a user exceeds these limits, they may not be able to access the cluster or their application may be aborted. Here are a few examples of process utilization:

  • Each connection to the login node through the terminal spawns two processes, as well as one process for the ssh-agent per user.

Please note that accessing the cluster via the Remote SSH extension of VSCode may be blocked, since this extension spawns too many processes on the login nodes, exceeding the limit. Additionally, the processes started by Remote SSH are not properly killed after the user logs out of the application. This may cause an account lockout, preventing the user from accessing the cluster, even from the terminal. It is recommended to use the SSH-FS extension in VSCode instead.

  • When a user launches Python and imports a package that relies on OpenBLAS (e.g., NumPy), this library will auto-detect the number of CPUs available on the node and create a thread pool based on that number. This exceeds the process limit imposed on the login nodes and causes Python to crash. If it is absolutely necessary to run this kind of script on a login node, limit the number of threads by setting these environment variables:
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1

The best approach to installing a package or debugging code on the cluster is to request an interactive job on the debug partition (or any other compute partition) to complete your tasks. To do this, use a command like the following:

salloc -p debug -c 4

This will request 4 CPU cores on a debug node for 1 hour.

Account limits

Jobs on Discovery also depend on your project account allocations, and each job will subtract from your project's allocated System Units (SUs) depending on the types of resources you request. SUs are determined by CPU, memory, and GPU usage combined with job run time. The following table breaks down SUs charged per minute:

| Resource reserved for 1 minute | SUs charged |
|--------------------------------|-------------|
| 1 CPU | 1 |
| 4 GB memory | 1 |
| 1 A100 or A40 GPU | 8 |
| 1 V100 or P100 GPU | 4 |
| 1 K40 GPU | 2 |

For example:

  • 1 CPU with 8 GB memory used for 1 hour = 180 SUs
  • 8 CPUs with 320 GB memory and 8 A100 GPUs used for 1 hour = 9120 SUs
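
Using the rates in the table above, the second example works out as follows: 8 CPUs charge 8 SUs per minute, 320 GB of memory charges 320 / 4 = 80 SUs per minute, and 8 A100 GPUs charge 8 × 8 = 64 SUs per minute, for a total of 152 SUs per minute, or 152 × 60 = 9,120 SUs for the hour.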

SUs are charged based on resources reserved for your job, not the resources that your job actually uses. For example, if you reserve 8 CPUs for your job but your job actually only uses 1 CPU, your account still gets charged for 8 CPUs. Try not to request more resources than your job actually requires.

Use the myaccount command to see your available and default account allocations.

To see current usage for a project account and individual contributions to the total usage, use Slurm's sshare command like the following:

sshare --clusters=discovery --format=Account,GrpTRESMins,User,EffectvUsage,GrpTRESRaw%100 --all --accounts=ttrojan_123

Make sure to substitute the project account ID that you want to see. Multiple accounts can be specified with comma separation.

Additional resources

Slurm documentation
Slurm Quick Start User Guide
Slurm tutorials
CARC Slurm Cheatsheet
CARC Slurm Job Script Templates
CARC workshop materials for Slurm Job Management
