Using MPI
- 0.0.1 Loading MPI implementations
- 0.0.2 Compiling MPI programs
- 0.0.3 Running single-threaded MPI programs
- 0.0.4 Running multi-threaded MPI programs
- 0.0.5 Process and thread affinity
- 0.0.6 Running OpenMPI programs
- 0.0.7 Running MPICH programs
- 0.0.8 Running MVAPICH2 programs
- 0.0.9 Running Intel MPI programs
- 0.0.10 Additional resources
The Message Passing Interface (MPI) is a message-passing standard used in parallel programming, typically for multi-node, distributed processor and distributed memory use cases. The MPI standard exists in numerous implementations, four of which CARC provides support for:
- OpenMPI
- MPICH
- MVAPICH2
- Intel MPI
0.0.1 Loading MPI implementations
Begin by logging in. You can find instructions for this in the Getting Started with Discovery or Getting Started with Endeavour user guides.
Each MPI implementation has advantages and disadvantages.
MPI Implementation | Advantages | Disadvantages | Modules |
---|---|---|---|
OpenMPI | Flexible usage | Not ABI-compatible with other MPI implementations | module load gcc/13.3.0 openmpi/5.0.5 |
MPICH | Reference implementation, flexible usage | Not optimized for a specific platform | module load gcc/13.3.0 mpich/4.2.2 |
MVAPICH2 | Optimized for InfiniBand networks | Less flexible than other MPI implementations | module load gcc/13.3.0 mvapich2/2.3.7 |
Intel MPI | Optimized for Intel processors | Not optimized for non-Intel processors | module load intel-oneapi/2024.1.0 |
Load the appropriate MPI implementation for your program. Other versions of the MPI implementations may be available. To see all available versions, search with `module spider <MPI implementation>`.
MPICH, MVAPICH2, and Intel MPI can be freely interchanged because of their common Application Binary Interface (ABI). This means, for example, that you can compile a program with MPICH but run it using the Intel MPI libraries, thus taking advantage of the functionality of Intel MPI.
These four MPI implementations are stable and thread-safe and, prior to the advent of the Unified Communication X (UCX) framework, they exhibited similar performance. The UCX framework is a collaboration between industry, laboratories, and academia formed to create an open-source, production-grade communication framework for data-centric and high-performance applications. UCX is performance-oriented: it keeps overhead in communication paths low, which allows near-native performance, while providing a unified cross-platform API that supports various network Host Channel Adapters and processor technologies.
CARC clusters use InfiniBand networks.
MPI-UCX libraries are available only under the GCC programming environment because the Intel compilers cannot build the UCX framework. Both the OpenMPI and MPICH modules have been built with UCX support under the `gcc` compiler module trees. We have found the OpenMPI-UCX and MPICH-UCX libraries to exhibit the best performance out of the four different MPI libraries available on CARC clusters.
0.0.2 Compiling MPI programs
Below is a list of MPI-specific compiler commands with their equivalent standard command versions:
Language | MPI Command (GCC, Intel) | Standard Command (GCC, Intel) |
---|---|---|
C | mpicc, mpiicc | gcc, icc |
C++ | mpicxx, mpiicpc | g++, icpc |
Fortran 77/90 | mpifort/mpif90/mpif77, mpiifort | gfortran, ifort |
Note: Intel MPI supplies separate compiler commands (wrappers) for the Intel compilers, in the form of `mpiicc`, `mpiicpc`, and `mpiifort`. Using `mpicc`, `mpicxx`, and `mpif90` with Intel MPI will call the GCC compilers.
When you compile an MPI program, record the module and version of MPI used and load the same module in your Slurm job script.
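One way to keep that record is to save the list of loaded modules next to the binary at build time. This is a minimal sketch, assuming the module system exports the colon-separated LOADEDMODULES variable (Lmod and Environment Modules normally do). The function name record_modules and the file name build_modules.txt are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: save the currently loaded modules, one per line,
# so the same modules can be loaded later in the Slurm job script.
# Assumes the module system sets LOADEDMODULES (colon-separated).
record_modules() {
    local outfile="${1:-build_modules.txt}"
    printf '%s\n' "${LOADEDMODULES:-}" | tr ':' '\n' > "$outfile"
}

# Example: call this right after compiling the MPI program.
record_modules build_modules.txt
```

The resulting file can then be consulted when writing the module load lines of the job script.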
0.0.3 Running single-threaded MPI programs
On CARC clusters, we recommend running MPI programs with Slurm's `srun` command, which launches the parallel tasks. For help with `srun`, please consult the manual page by entering `man srun` or view the available options by entering `srun --help`. You can also use the `mpiexec` or `mpirun` commands to run MPI programs, which may provide more options depending on the MPI implementation used.
To run single-threaded MPI programs, use a command like the following within a job script:
srun --mpi=pmix_v5 -n $SLURM_NTASKS ./mpi_program
The important parameter to include is the number of MPI processes (`-n`). The Slurm-provided environment variable `SLURM_NTASKS` corresponds to the Slurm task count (i.e., the number of MPI ranks) requested with the `#SBATCH --ntasks` option in the job script (or the product of `#SBATCH --nodes` and `#SBATCH --ntasks-per-node`).
The default MPI process management interface for `srun` is pmix_v5, so `srun ...` is equivalent to `srun --mpi=pmix_v5 ...`. The pmix_v5 interface is only used for the OpenMPI implementation, however. For other MPI implementations, use pmi2 instead: `srun --mpi=pmi2 ...`.
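The pairing of MPI implementation and process management interface described above can be captured in a small lookup. The following is only an illustrative sketch: the function name mpi_type_for and the lowercase implementation labels are hypothetical, and the launch line is printed rather than executed.

```shell
#!/bin/bash
# Hypothetical helper: map an MPI implementation name to the srun --mpi
# value described above (pmix_v5 for OpenMPI, pmi2 for the others).
mpi_type_for() {
    case "$1" in
        openmpi)                   echo "pmix_v5" ;;
        mpich|mvapich2|intel-mpi)  echo "pmi2" ;;
        *) echo "unknown MPI implementation: $1" >&2; return 1 ;;
    esac
}

# Print the launch line that would be used for MPICH (not executed here).
echo "srun --mpi=$(mpi_type_for mpich) -n \$SLURM_NTASKS ./mpi_program"
```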
An example job script for a single-threaded program:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7513
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=24:00:00
module purge
module load gcc/13.3.0
module load openmpi/5.0.5
ulimit -s unlimited
srun --mpi=pmix_v5 -n $SLURM_NTASKS ./mpi_program
- `--constraint` specifies a node feature (e.g., CPU model) to use.
- `--nodes` specifies how many nodes to use.
- `--ntasks-per-node` specifies the number of tasks (MPI ranks) to run per node.
- `--cpus-per-task` specifies the number of CPUs (threads) to use per task. There is 1 thread per CPU, so only 1 CPU per task is needed for a single-threaded MPI job.
- `--mem=0` requests all available memory per node. Alternatively, use `--mem-per-cpu`.
For best performance, MPI jobs should typically use entire nodes, ideally of the same type (CPU model, network interface, etc.). This can be achieved by using the `--constraint` option to specify a CPU model to use, which will ensure a homogeneous hardware environment and consistency of communications for tasks and reduce latency. The other job options used should reflect the specific node characteristics (e.g., matching the number of tasks per node to the total number of CPUs per node) so that nodes are fully utilized.
Use the `nodeinfo` command to see node types by partition (the CPU model column) and other node details (CPUs, memory, etc.).
When running on the main partition using OpenMPI or MPICH modules, which use the UCX framework for network communication, you may need to add `export UCX_TLS=sm,tcp,self` to the job script to restrict the network transports that can be used.
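A job script that may run on several partitions could apply this setting conditionally. The following is a minimal sketch, assuming the script runs inside a Slurm job (where Slurm sets SLURM_JOB_PARTITION). The function name set_ucx_transports is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: restrict UCX transports only when the job runs on
# the main partition. SLURM_JOB_PARTITION is set by Slurm inside a job.
set_ucx_transports() {
    if [ "${SLURM_JOB_PARTITION:-}" = "main" ]; then
        export UCX_TLS=sm,tcp,self
    fi
}

set_ucx_transports
echo "UCX_TLS=${UCX_TLS:-unset}"
```

Placing the call before the srun line ensures the variable is exported to the MPI tasks.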
0.0.4 Running multi-threaded MPI programs
Multi-threaded MPI programs, typically hybrid MPI/OpenMP programs, run multiple threads within each MPI task. If using OpenMP for threading, set the environment variable `OMP_NUM_THREADS`, which specifies the number of threads to parallelize over. The `OMP_NUM_THREADS` count should equal the value of the `--cpus-per-task` option requested in the job script. You can use the Slurm-provided environment variable `SLURM_CPUS_PER_TASK` to set `OMP_NUM_THREADS` and then use `srun`:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmix_v5 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
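As a defensive variation on the lines above (a sketch, not a CARC-recommended setting), the thread count can fall back to a default when SLURM_CPUS_PER_TASK is unset, which happens when the job does not request --cpus-per-task. The fallback value of 1 is an assumption chosen for illustration.

```shell
#!/bin/bash
# Slurm sets SLURM_CPUS_PER_TASK only when --cpus-per-task is requested,
# so fall back to 1 thread to avoid exporting an empty OMP_NUM_THREADS.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with ${OMP_NUM_THREADS} OpenMP thread(s) per MPI rank"
```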
An example job script for a multi-threaded program:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=epyc-64
#SBATCH --constraint=epyc-7513
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=32
#SBATCH --mem=0
#SBATCH --time=24:00:00
module purge
module load gcc/13.3.0
module load openmpi/5.0.5
ulimit -s unlimited
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmix_v5 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
- `--constraint` specifies a node feature (e.g., CPU model) to use.
- `--nodes` specifies how many nodes to use.
- `--ntasks-per-node` specifies the number of tasks (MPI ranks) to run per node.
- `--cpus-per-task` specifies the number of CPUs (threads) to use per task.
- `--mem=0` requests all available memory per node. Alternatively, use `--mem-per-cpu`.
For best performance, MPI jobs should typically use entire nodes, ideally of the same type (CPU model, network interface, etc.). This can be achieved by using the `--constraint` option to specify a CPU model to use, which will ensure a homogeneous hardware environment and consistency of communications for tasks and reduce latency. The other job options used should reflect the specific node characteristics (e.g., matching the number of tasks per node and CPUs per task to the total number of CPUs per node and setting CPUs per task to stay within NUMA domains) so that nodes are fully utilized.
Use the `nodeinfo` command to see node types by partition (the CPU model column) and other node details (CPUs, memory, etc.).
When running on the main partition using OpenMPI or MPICH modules, which use the UCX framework for network communication, you may need to add `export UCX_TLS=sm,tcp,self` to the job script to restrict the network transports that can be used.
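The arithmetic behind matching tasks and CPUs per task to a node can be checked before submitting a job. This is a rough sketch under stated assumptions: the function name check_layout is hypothetical, and the node figures passed to it (64 CPUs per node, 2 NUMA domains) are taken from the example job script and should be confirmed with nodeinfo for the nodes you actually use.

```shell
#!/bin/bash
# Hypothetical helper: check that a hybrid layout fills a node and that
# each task's threads fit inside a single NUMA domain.
# Arguments: cpus_per_node numa_domains ntasks_per_node cpus_per_task
check_layout() {
    local cpus_per_node=$1 numa_domains=$2 ntasks_per_node=$3 cpus_per_task=$4
    local cpus_per_domain=$(( cpus_per_node / numa_domains ))

    # Tasks times CPUs per task should equal the CPUs available on the node.
    if [ $(( ntasks_per_node * cpus_per_task )) -ne "$cpus_per_node" ]; then
        echo "layout does not match CPUs per node"; return 1
    fi
    # A task's threads should fit within one NUMA domain, and the domain
    # size should divide evenly so that no task straddles two domains.
    if [ "$cpus_per_task" -gt "$cpus_per_domain" ] ||
       [ $(( cpus_per_domain % cpus_per_task )) -ne 0 ]; then
        echo "tasks would span NUMA domains"; return 1
    fi
    echo "ok"
}

# The example job script: 64 CPUs per node, 2 NUMA domains, 2 tasks x 32 CPUs.
check_layout 64 2 2 32
```

With these inputs, a layout of 4 tasks with 16 CPUs each also passes, while 1 task with 64 CPUs is rejected because its threads would span both NUMA domains.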
0.0.5 Process and thread affinity
On CARC clusters, compute nodes have a Non-Uniform Memory Access (NUMA) shared-memory architecture, so the performance of MPI programs can be improved by pinning MPI tasks and/or OpenMP threads to CPUs (cores) or NUMA domains (equivalent to processor sockets). The pinning prevents the MPI tasks and OpenMP threads from migrating to CPUs that have a more distant path to data in memory. This is known as process and thread affinity.
All nodes on CARC clusters are configured with 2 NUMA domains per node (2 processor sockets with 1 multi-core processor and 1 NUMA domain per processor socket and an equal number of cores and amount of memory per processor). Each core can be considered a logical CPU. For example, an epyc-7542 node has 2 NUMA domains with 32 CPUs (cores) and 128 GB of memory per NUMA domain.
To view more detailed information about a node's topology, enter `module load gcc/13.3.0 hwloc` and run the `lstopo` command on that node.
Slurm's `srun` command will automatically bind MPI tasks to CPUs (cores). This is typically the best strategy for compute-bound applications. Memory-bound applications may also want to use the `#SBATCH --ntasks-per-socket` or `#SBATCH --distribution` options in the job script to spread tasks across NUMA domains. To see all binding options available with `srun`, enter `srun --cpu-bind=help`. If using `mpiexec` or `mpirun`, consult the documentation of the MPI implementation that you use for more information about process and thread affinity and default settings. For hybrid MPI/OpenMP programs, also consult the OpenMP documentation for more information about thread affinity. You may need to experiment and compare profiling results from different affinity strategies to find the optimal strategy for your specific programs.
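As one possible starting point for such experiments with hybrid MPI/OpenMP programs, thread placement can be requested through the standard OpenMP environment variables OMP_PLACES and OMP_PROC_BIND. The values below are an assumption for illustration, not a tuned recommendation for CARC nodes, and the srun line is left as a comment so the sketch runs outside a job.

```shell
#!/bin/bash
# Standard OpenMP environment variables (OpenMP 4.0 and later).
export OMP_PLACES=cores      # each place is one physical core
export OMP_PROC_BIND=close   # keep a task's threads on neighboring places
echo "OMP_PLACES=$OMP_PLACES OMP_PROC_BIND=$OMP_PROC_BIND"

# Inside a job script, the launch line could also ask srun to report the
# bindings it applies (commented out here because it needs a Slurm job):
# srun --cpu-bind=verbose -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
```

Comparing run times with different values (for example, OMP_PROC_BIND=spread) is one way to see which strategy suits a given program.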
0.0.6 Running OpenMPI programs
OpenMPI is an open source implementation developed by a consortium of academic, research, and industry partners. It supports both InfiniBand and UCX. OpenMPI-UCX will automatically select the optimal network interface on compute nodes.
For single-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load openmpi/5.0.5
srun --mpi=pmix_v5 -n $SLURM_NTASKS ./mpi_program
For multi-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load openmpi/5.0.5
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmix_v5 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
0.0.7 Running MPICH programs
MPICH is an open source reference implementation developed at Argonne National Laboratory. The newer versions support both InfiniBand and UCX. MPICH-UCX will automatically select the optimal network interface on compute nodes.
For single-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load mpich/4.2.2
srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program
For multi-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load mpich/4.2.2
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
0.0.8 Running MVAPICH2 programs
MVAPICH2, developed by the Network-Based Computing Laboratory at the Ohio State University, is an open source implementation based on MPICH and optimized for InfiniBand networks. MVAPICH2 will automatically select the optimal network interface on compute nodes.
For single-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load mvapich2/2.3.7
export MV2_USE_RDMA_CM=0
export MV2_HOMOGENEOUS_CLUSTER=1
srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program
For multi-threaded MPI programs, use the following setup:
module purge
module load gcc/13.3.0
module load mvapich2/2.3.7
export MV2_USE_RDMA_CM=0
export MV2_HOMOGENEOUS_CLUSTER=1
export MV2_ENABLE_AFFINITY=0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
0.0.9 Running Intel MPI programs
Intel MPI is an implementation based on MPICH that is optimized for Intel processors and integrates with other Intel tools (e.g., compilers and performance tools such as VTune). Intel MPI will automatically select the optimal network interface on compute nodes.
For single-threaded MPI programs, use the following setup:
module purge
module load intel-oneapi/2024.1.0
srun --mpi=pmi2 -n $SLURM_NTASKS ./mpi_program
For multi-threaded MPI programs, use the following setup:
module purge
module load intel-oneapi/2024.1.0
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 -n $SLURM_NTASKS -c $SLURM_CPUS_PER_TASK ./mpi_plus_openmp_program
0.0.10 Additional resources
If you have questions about or need help with MPI, please submit a help ticket and we will assist you.
Tutorials: