R is an open-source programming environment and language designed for statistical computing and graphics.
Using R on CARC systems
You can use R in either interactive or batch modes. In either mode, first load the corresponding software module:
module load r
This loads the default version of R, currently 4.0.0, and is equivalent to
module load r/4.0.0. If you require a different version, specify the version of R when loading. For example:
module load r/3.6.3
To see all available versions of R, enter:
module spider r
The R modules depend on the
gcc/8.3.0 and openblas/0.3.8 modules, which are loaded by default when logging in (as part of the
usc module collection). These modules need to be loaded first because R was built with the GCC 8.3.0 compiler and linked to the OpenBLAS 0.3.8 linear algebra library for improved performance and multi-threading (implicit parallelism). Loading the modules also ensures that any R packages installed from source are built using these versions of GCC and OpenBLAS. In addition, loading an R module will automatically load a few common dependency modules.
If needed, the
gcc and openblas modules should be loaded before loading an R module:
module purge
module load gcc/8.3.0
module load openblas/0.3.8
module load r/4.0.0
Or alternatively enter
module load usc and then load an R module.
Installing a different version of R
If you require a different version of R that is not currently installed, please submit a help ticket and we will install it for you. Alternatively, you could compile and install a different version of R from source within one of your directories or use a Singularity container with R installed.
Many popular R packages have already been installed and are available to use after loading one of the R modules. Enter the
library() function to view them. You can install other R packages that you need in your home or project directories (see the section on installing packages below).
Please note that we do not currently support the use of the RStudio IDE on CARC systems.
Running R in interactive mode
After loading the module, to run R interactively on the login node, simply enter
R and this will start a new R session. Using R on the login node should be reserved for installing packages and non-intensive work. Conversely, using R interactively on a compute node is necessary for more intensive work like exploring data, testing models, and debugging.
A common mistake for new users of HPC clusters is to run heavy workloads directly on a login node (e.g.,
endeavour.usc.edu). Unless you are only running a small test, please make sure to run your program as a job interactively on a compute node. Processes left running on login nodes may be terminated without warning. For more information on jobs, see our Running Jobs user guide.
To run R interactively on a compute node, first use Slurm's
salloc command to reserve job resources on a node:
user@discovery1:~$ salloc --time=2:00:00 --cpus-per-task=8 --mem=16GB --account=<project_id>
salloc: Pending job allocation 24316
salloc: job 24316 queued and waiting for resources
salloc: job 24316 has been allocated resources
salloc: Granted job allocation 24316
salloc: Waiting for resource configuration
salloc: Nodes d05-08 are ready for job
Make sure to change the resource requests (the
--time=2:00:00 --cpus-per-task=8 --mem=16GB --account=<project_id> part of your
salloc command) as needed, such as the number of cores and memory required. Also make sure to substitute your project ID, which is of the form
<PI_username>_<id>. You can find your project ID in the CARC User Portal.
Once you are granted the resources and logged in to a compute node, load the modules and then enter
R to start a new R session:
user@d05-08:~$ module load usc r/4.0.0
user@d05-08:~$ R

R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

>
Notice that the shell prompt changes from
user@discovery1 to
user@<nodename> to indicate that you are now on a compute node (e.g.,
d05-08).
To run R scripts from within R, use the
source() function. Alternatively, to run R scripts from the shell, use the
Rscript command.
To exit the node and relinquish the job resources, enter
q() to exit R and then enter
exit in the shell. This will return you to the login node:
> q()
user@d05-08:~$ exit
exit
salloc: Relinquishing job allocation 24316
user@discovery1:~$
Please note that compute nodes do not have access to the internet, so any data downloads or package installations should be completed on the login or transfer nodes, either before the interactive job or concurrently in a separate shell session.
Running R in batch mode
In order to submit jobs to the Slurm job scheduler, you will need to use R in batch mode. There are a few steps to follow:
- Create an R script
- Create a Slurm job script that runs the R script
- Submit the job script to the job scheduler using sbatch
Your R script should consist of the sequence of R commands needed for your analysis. The
Rscript command, available after the R module has been loaded, runs R scripts, and it can be used in the shell during an interactive job as well as in Slurm job scripts.
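As a minimal sketch of this workflow, the shell snippet below writes a small R script and runs it with Rscript when the command is available. The file name script.R matches the job script in this guide; the script contents are purely illustrative and not part of any CARC workflow.

```shell
# Write a small, illustrative R script (contents are an assumption for demonstration)
cat > script.R <<'EOF'
# Summarize a built-in data set and print the result
data(mtcars)
print(summary(mtcars$mpg))
EOF

# Run the script only if Rscript is on the PATH (i.e., an R module is loaded)
if command -v Rscript >/dev/null 2>&1; then
  Rscript --vanilla script.R
else
  echo "Rscript not found; load an R module first"
fi
```

The same Rscript line can be used unchanged inside a Slurm job script.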
A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job running R, a Slurm job script should look something like the following:
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16GB
#SBATCH --time=1:00:00
#SBATCH --account=<project_id>

module purge
module load gcc/8.3.0
module load openblas/0.3.8
module load r/4.0.0

Rscript --vanilla script.R
Each line is described below:
|Command or Slurm argument||Meaning|
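|#!/bin/bash||Use Bash to execute this script|
|#SBATCH||Syntax that allows Slurm to read your requests (ignored by Bash)|
|--nodes=1||Use 1 compute node|
|--ntasks=1||Run 1 task (e.g., running an R script)|
|--cpus-per-task=8||Reserve 8 CPUs for your exclusive use|
|--mem=16GB||Reserve 16 GB of memory for your exclusive use|
|--time=1:00:00||Reserve resources described for 1 hour|
|--account=<project_id>||Charge compute time to <project_id>. You can find your project ID in the CARC User Portal|
|module purge||Clear environment modules|
|module load gcc/8.3.0||Load the gcc/8.3.0 module|
|module load openblas/0.3.8||Load the openblas/0.3.8 module|
|module load r/4.0.0||Load the r/4.0.0 module|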
Make sure to adjust the resources requested based on your needs, but remember that requesting fewer resources generally leads to less queue time for your job. Note that to fully utilize the resources, especially the number of cores, you may need to explicitly change your R code to do so (see the section on parallel programming below).
You can develop R scripts and job scripts on your local computer and then transfer them to one of your directories on CARC file systems, or you can use one of the available text editor modules (e.g.,
micro) to develop the scripts directly on CARC systems.
Save the job script as
R.job, for example, and then submit it to the job scheduler with Slurm's
sbatch command:
user@discovery1:~$ sbatch R.job
Submitted batch job 170554
To check the status of your job, enter
squeue -u <username>. For example:
user@discovery1:~$ squeue -u user
JOBID   PARTITION   NAME    USER   ST   TIME   NODES   NODELIST(REASON)
170554  main        R.job   user   R    3:07   1       d11-04
If there is no job status listed, then this means the job has completed.
The results of the job will be logged and, by default, saved to a plain-text file of the form
slurm-<jobid>.out in the same directory where the job script was submitted from. To view the contents of this file, enter
less slurm-<jobid>.out, and then enter
q to exit the viewer.
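The job ID printed by sbatch determines the default log file name. As a minimal sketch, the snippet below extracts the ID from the sample sbatch message shown in this guide and builds the file name; the variable names are illustrative.

```shell
# Sample message printed by sbatch (copied from the example in this guide)
submit_msg="Submitted batch job 170554"

# The job ID is the last space-separated word of the message
job_id="${submit_msg##* }"

# Default Slurm log file name for that job
log_file="slurm-${job_id}.out"
echo "$log_file"
```

The resulting name can then be passed to less to view the log.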
For more information on running and monitoring jobs, see the Running Jobs guide.
Installing R packages
To install R packages, open an interactive session of R on a login node. This must be done on a login node because the compute nodes do not have access to the internet. Then use the
install.packages() function to install packages registered on CRAN. For example, to install the
skimr package, enter
install.packages("skimr").
R may prompt you to use a personal library; enter
yes. R will then prompt you, by default, to create a personal library in your home directory; enter
yes again. R may also prompt you to select a CRAN mirror; enter
1 for simplicity (the 0-Cloud mirror) or select a US-based mirror.
To load an R package, use the
library() function. For example, enter
library(skimr).
You can also install packages to a different location. Using your project directory is useful for project libraries shared among your research team. The best way to install and load packages to other locations is by setting the environment variable
R_LIBS_USER. You can create this variable in the shell and set it to the path of the library location. For example, enter
export R_LIBS_USER=<path>, where <path> is the library location.
R will then use that path as your default library instead of the one in your home directory. When you load and use R, you can use the
install.packages() and
library() functions normally, but the packages will be installed to and loaded from the
R_LIBS_USER location. You can add this line to your
~/.bashrc to automatically set the
R_LIBS_USER variable every time you log in. When using a different location, make sure to have separate package libraries for different versions of R. To check your library locations within an R session, use the
.libPaths() function.
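One way to keep separate libraries for different versions of R is to include the version in the library path. The sketch below assumes a hypothetical base directory (LIB_BASE, defaulting here to $HOME/R/library) and hard-codes the R version; substitute your own project directory and version.

```shell
# Hypothetical base directory for R libraries; replace with your project directory
LIB_BASE="${LIB_BASE:-$HOME/R/library}"

# Major.minor version of the R module in use (4.0 for r/4.0.0)
R_VERSION="4.0"

# Create a version-specific library and point R at it
export R_LIBS_USER="${LIB_BASE}/${R_VERSION}"
mkdir -p "$R_LIBS_USER"
echo "R_LIBS_USER=$R_LIBS_USER"
```

Switching to another R version then only requires changing R_VERSION, and each version keeps its own set of installed packages.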
For project libraries, also consider using the
renv package to create reproducible, project-specific R environments. See the renv documentation for more information.
To install unregistered or development versions of packages, such as from GitHub repos, use the
remotes package and its functions. For example, to install a package from a GitHub repo, use
remotes::install_github("<user>/<repo>").
Installing packages from Bioconductor
You can install packages from Bioconductor using the
BiocManager package and the
BiocManager::install() function. For example, to install the
GenomicFeatures package, enter
BiocManager::install("GenomicFeatures").
See more information about
Loading dependency modules
Some R packages have system dependencies, and the modules for these dependencies should be loaded before starting R and installing the packages. For example, the
xml2 package requires the
libxml2 library, so in this case load the associated module with
module load libxml2 and then load and start R and enter
install.packages("xml2"). For some packages, you may also need to specify header and library locations for dependencies when installing.
To search for available modules for dependencies, use the
module keyword <keyword> command, with
<keyword> being the name of the dependency. If you cannot find a needed module, please submit a help ticket and we will install it for you.
Parallel programming with R
R is a serial (i.e., single-core/single-threaded) programming language by default, but with additional libraries and packages it also supports both implicit and explicit parallel programming to enable full use of multi-core processors and compute nodes. This also includes the use of shared memory on a single node or distributed memory on multiple nodes. On CARC systems, 1 thread = 1 core = 1 logical CPU (requested with Slurm's
--cpus-per-task option).
Parallelizing your code to use multiple cores or nodes can reduce the runtime of your R jobs, but the speedup does not necessarily increase in a proportional manner. The speedup depends on the scale and types of computations that are involved. Furthermore, sometimes using a single core is optimal. There is a cost to setting up parallel computation (e.g., modifying code, communications overhead, etc.), and that cost may be greater than the achieved speedup, if any, of the parallelized version of the code. Some experimentation will be needed to optimize your code and resource requests (optimal number of cores and amount of memory). Also keep in mind that your project account will be charged CPU-minutes based on the cores reserved for a job, even if all those cores are not actually used during the job.
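To make the charging point concrete, here is a small worked calculation. It assumes the charge is simply the number of reserved cores multiplied by the elapsed minutes, which is an illustrative reading of the policy rather than an official formula.

```shell
# Resources reserved for the job (values taken from the example job script)
cpus=8
minutes=60

# Assumed charge: reserved cores x elapsed minutes, regardless of cores actually used
cpu_minutes=$((cpus * minutes))
echo "$cpu_minutes CPU-minutes"
```

Under this assumption, a job that keeps only one of the eight cores busy is charged the same as one that uses all eight.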
Implicit parallelism with OpenBLAS and OpenMP
Some R packages and their functions use implicit parallelism via multi-threading, so that you do not need to explicitly call for parallel computation in your R code. The R modules on CARC systems are built with an optimized, multi-threaded OpenBLAS library for linear algebra operations. Therefore, any R functions that do linear algebra operations (e.g., matrix manipulations, regression modeling, etc.) will automatically detect and use the available number of cores as needed. In addition, many R packages have interfaces to C++ programs, and when they are installed on Linux systems like the CARC clusters they will be compiled from source and with OpenMP support, which enables multi-threading. As a result, requesting multiple cores in your Slurm jobs with the
--cpus-per-task option will enable implicit parallelism via automatic multi-threading.
If needed, to explicitly set the number of cores to use, you can set the environment variable
OMP_NUM_THREADS in the shell or in your job script (e.g.,
export OMP_NUM_THREADS=8). Some R packages also have functions to set the number of cores to use, and some functions also have arguments to set the number of cores to use. If not specified, typically there are default values used, which may not be optimal.
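In a job script, one way to keep the thread count in line with the cores requested is to derive it from the SLURM_CPUS_PER_TASK variable, which Slurm sets when --cpus-per-task is specified. This is a minimal sketch; the fallback of 1 thread is an assumption for runs outside a Slurm job.

```shell
# Use the cores reserved by Slurm, or fall back to 1 outside a job (assumed default)
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```

Placing these lines before the Rscript command avoids hard-coding a thread count that could drift from the resource request.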
Explicit parallelism with various packages
Explicit parallelism means explicitly calling for parallel computation in your R code, either in relatively simple ways or potentially in more complex ways depending on the tasks to be performed. Many R packages exist for explicit parallelism, designed for different types of tasks that can be parallelized. A common task in R programming is iteration: for example, applying the same function to multiple data inputs or running the same model multiple times while varying parameters. Another common task is performing multiple, different computations as part of a workflow and then, once they are all completed, combining or using their outputs in further computations. These types of tasks can be explicitly parallelized.
The main R packages for explicit parallelism are summarized as follows:
|parallel||Primarily for iteration|
|foreach||For iteration, with parallel backend (e.g., doParallel)|
|pbdMPI||For general multi-node computing|
|BiocParallel||For parallel computing with Bioconductor objects|
|future||For asynchronous evaluations (futures)|
|rslurm||For submitting Slurm jobs from within R (e.g., for iteration)|
|slurmR||For submitting Slurm jobs from within R (e.g., for iteration)|
|targets||For defining and running workflows|
Please review the linked documentation above for examples and more information about how to use these packages and their functions.
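As a minimal sketch of parallel iteration, the shell snippet below writes an R script that uses mclapply() from the parallel package and runs it when Rscript is available. The file name parallel_example.R and the toy computation are illustrative; reading the core count from SLURM_CPUS_PER_TASK, with a fallback of 1, is an assumption about how the job was requested.

```shell
# Write an illustrative R script that iterates in parallel on a single node
cat > parallel_example.R <<'EOF'
library(parallel)

# Cores reserved by Slurm via --cpus-per-task; fall back to 1 outside a job
cores <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))

# Apply the same function to each input, spreading the work across cores
squares <- mclapply(1:8, function(x) x^2, mc.cores = cores)
print(unlist(squares))
EOF

# Run the script only if Rscript is on the PATH (i.e., an R module is loaded)
if command -v Rscript >/dev/null 2>&1; then
  Rscript --vanilla parallel_example.R
else
  echo "Rscript not found; load an R module first"
fi
```

Because mclapply() relies on forking, this pattern is limited to the cores of one node; multi-node work needs one of the other packages listed above.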
For more information about high-performance computing with R, see our workshop materials for HPC with R as well as the resources linked below.
If you have questions about or need help with R, please submit a help ticket and we will assist you.
R Package Documentation
R for Reproducible Scientific Analysis lessons
CRAN Task View on High-Performance and Parallel Computing with R
Programming with Big Data in R
Southern California R Users Group
Hands-on Programming with R
An Introduction to R
R Programming for Data Science
R for Data Science
Efficient R Programming
Tidy Modeling with R
R Graphics Cookbook
Data Visualization with R
Geocomputation with R
Mastering Software Development in R
Tidyverse Style Guide
Tidyverse Design Guide
CARC R workshop materials: