Using R

R is an open-source programming environment and language designed for statistical computing and graphics.

Intro to R video

HPC with R video

Using R on CARC systems

Begin by logging in. You can find instructions for this in the Getting Started with Discovery or Getting Started with Endeavour user guides.

You can use R in either interactive or batch modes. You can use interactive mode to install packages and explore data, for example, and you can use batch mode to run R scripts remotely.

To use R, in either mode, first load the corresponding software module:

module load r

This loads the default version of R, currently 4.0.0, and is equivalent to module load r/4.0.0. If you require a different version, specify the version of R when loading. For example:

module load r/3.6.3

To see all available versions of R, enter:

module spider r

The R modules depend on the gcc/8.3.0 and openblas/0.3.8 modules, which are loaded by default when logging in. These modules need to be loaded first because R was built with the GCC 8.3.0 compiler and linked to the OpenBLAS 0.3.8 linear algebra library for improved performance and multi-threading (implicit parallelism). Loading the modules also ensures that any R packages installed from source are built with these versions of GCC and OpenBLAS. In addition, loading an R module will automatically load a few common dependency modules.

In Slurm job scripts, the gcc and openblas modules should be loaded explicitly before loading R:

module purge
module load gcc/8.3.0
module load openblas/0.3.8
module load r/4.0.0

Installing a different version of R

If you require a different version of R that is not currently installed, please submit a help ticket and we will install it for you. Alternatively, you can compile and install a different version of R from source within your project directory.

Pre-installed packages

Many popular R packages have already been installed and are available to use after loading one of the R modules. Enter the library() function to view them. You can install other R packages that you need in your home or project directories (see the section on installing packages below).
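
For example, from within an R session you can list what is available:

library()

# View installed packages and their versions as a matrix
installed.packages()[, c("Package", "Version")]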

RStudio

Please note that we do not currently support the use of the RStudio IDE on CARC systems.

Running R in interactive mode

After loading the module, to run R interactively on the login node, simply enter R and this will start a new R session:

user@discovery1:~$ module load r
user@discovery1:~$ R
  
R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
  
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  
  Natural language support but running in an English locale
  
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
  
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
  
>

Using R on the login node should be reserved for installing packages and non-intensive work.

Conversely, using R interactively on a compute node is useful for more intensive work like exploring data, testing models, and debugging.

To run R interactively on a compute node, first use Slurm's salloc command to reserve resources on a node:

user@discovery1:~$ salloc --time=2:00:00 --cpus-per-task=8 --mem=16GB --account=<account_id>
salloc: Pending job allocation 24316
salloc: job 24316 queued and waiting for resources
salloc: job 24316 has been allocated resources
salloc: Granted job allocation 24316
salloc: Waiting for resource configuration
salloc: Nodes d05-08 are ready for job

Make sure to adjust the resource requests (the --time=2:00:00 --cpus-per-task=8 --mem=16GB options after the salloc command) as needed, such as the number of cores and amount of memory required, and make sure to substitute your project account ID for <account_id>.

Once you are granted the resources and logged in to a compute node, load the modules and then enter R:

user@d05-08:~$ module load gcc/8.3.0 openblas/0.3.8 r/4.0.0
user@d05-08:~$ R
  
R version 4.0.0 (2020-04-24) -- "Arbor Day"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
  
R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.
  
  Natural language support but running in an English locale
  
R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.
  
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
  
>

Notice that the shell prompt changes from user@discovery1 to user@<nodename> to indicate that you are now on a compute node (e.g., d05-08).

To run R scripts from within R, use the source() function. Alternatively, to run R scripts from the shell, use the Rscript command, after loading the R module.
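
For example, assuming a script named script.R in the current working directory:

# Run an R script from within an R session
source("script.R")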

To exit the node and relinquish the job resources, enter q() to exit R and then enter exit in the shell. This will return you to the login node:

> q()
  
user@d05-08:~$ exit
exit
salloc: Relinquishing job allocation 24316
  
user@discovery1:~$

Note: Compute nodes do not have access to the public internet, so any data downloads or package installations should first be completed on the login node.

Running R in batch mode

In order to submit jobs to the Slurm job scheduler, you will need to use R in batch mode. There are a few steps to follow:

  1. Create an R script
  2. Create a Slurm job script that runs the R script
  3. Submit the job script to the job scheduler using sbatch

Your R script should consist of the sequence of R commands needed for your analysis. The Rscript command, available after the R module has been loaded, runs R scripts, and it can be used in the shell during an interactive job as well as in Slurm job scripts.
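
For example, a minimal script.R might look like the following sketch (the input file and variable names here are hypothetical placeholders):

# script.R: read data, fit a model, and save the results
data <- read.csv("input.csv")     # hypothetical input file
fit <- lm(y ~ x, data = data)     # hypothetical model with variables x and y
summary(fit)
saveRDS(fit, file = "fit.rds")    # save the fitted model object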

A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job running R, a Slurm job script should look something like the following:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=16GB
#SBATCH --time=1:00:00
#SBATCH --account=<account_id>

module purge
module load gcc/8.3.0
module load openblas/0.3.8
module load r/4.0.0

Rscript --vanilla script.R

Each line is described below:

Command or Slurm argument | Meaning
#!/bin/bash | Use Bash to execute this script
#SBATCH | Syntax that allows Slurm to read your requests (ignored by Bash)
--nodes=1 | Use 1 compute node
--ntasks=1 | Run 1 task (e.g., running an R script)
--cpus-per-task=8 | Reserve 8 CPUs for your exclusive use
--mem=16GB | Reserve 16 GB of memory for your exclusive use
--time=1:00:00 | Reserve resources described for 1 hour
--account=<account_id> | Charge compute time to <account_id>. If not specified, you may use up the wrong PI's compute hours
module purge | Clear environment modules
module load gcc/8.3.0 | Load the gcc compiler environment module
module load openblas/0.3.8 | Load the openblas environment module
module load r/4.0.0 | Load the r environment module. Note that R requires both gcc and openblas
Rscript --vanilla script.R | Use Rscript to run script.R. The --vanilla option will ensure a clean R session that helps with the reproducibility of jobs

Make sure to adjust the resources requested based on your needs, but remember that requesting fewer resources typically leads to less queue time for your job. Note that to fully utilize the resources, especially the number of cores, you may need to explicitly change your R code to do so (see the section on parallel programming below).

You can develop R scripts and job scripts on your local computer and then transfer them to CARC storage, or you can use one of the available text editor modules to develop them directly on the cluster (nano, vim, or emacs).

Save the job script as R.job, for example, and then submit it to the job scheduler with Slurm's sbatch command:

user@discovery1:~$ sbatch R.job
Submitted batch job 170554

To check the status of your job, enter squeue -u <username>. For example:

user@discovery1:~$ squeue -u user
         JOBID PARTITION     NAME     USER     ST    TIME  NODES NODELIST(REASON)
        170554      main    R.job     user      R    3:07      1 d11-04

If no job status is listed, the job has completed.

The results of the job will be logged and, by default, saved to a plain-text file of the form slurm-<jobid>.out in the same directory where the job script was submitted from. To view the contents of this file, enter less slurm-<jobid>.out, and then enter q to exit the viewer.

For more information on running and monitoring jobs, see the Running Jobs guide.

Installing R packages

To install R packages, open an interactive session of R on a login node. This must be done on a login node because the compute nodes do not have access to the internet. Then use the install.packages() function to install packages registered on CRAN. For example, to install the skimr package, enter:

install.packages("skimr")

R may prompt you to use a personal library; enter yes. R will then prompt you, by default, to create a personal library in your home directory (for example, ~/R/x86_64-pc-linux-gnu-library/4.0); enter yes again. R may also prompt you to select a CRAN mirror; enter 1 for simplicity (the 0-Cloud mirror) or select a US-based mirror.

To load an R package, use the library() function. For example:

library(skimr)

You can also install packages to a different location. Using your project directory is useful for project libraries shared among your research team. The best way to install and load packages to other locations is by setting the environment variable R_LIBS_USER. You can create this variable in the shell and set it to the path of the library location:

export R_LIBS_USER=/project/ttrojan_123/R/pkgs/4.0

R will then use that path as your default library instead of the one in your home directory. When you load and use R, you can use the install.packages() and library() functions normally, but the packages will be installed and loaded from the R_LIBS_USER location. You can add this line to your ~/.bashrc to automatically set the R_LIBS_USER variable every time you log in. When using a different location, make sure to have separate libraries for different versions of R. To check your library locations within an R session, use the .libPaths() function.
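
For example, after exporting R_LIBS_USER and starting R, you can confirm the library locations and then install and load packages as usual:

# The R_LIBS_USER location should appear first among the library paths
.libPaths()

# Packages now install to and load from that location
install.packages("skimr")
library(skimr)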

For project libraries, also consider using the renv package to create reproducible, project-specific R environments. See the renv documentation for more information.
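
As a brief sketch, a typical renv workflow run from within your project directory might look like the following (renv must be installed first):

# Initialize a project-specific library for the current project
renv::init()

# Install packages as usual, then record them in the project lockfile
install.packages("skimr")
renv::snapshot()

# Later, or on another system, restore the recorded package versions
renv::restore()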

To install unregistered or development versions of packages, such as from GitHub repos, use the remotes package and its functions. For example:

remotes::install_github("USCbiostats/slurmR")

Loading dependency modules

Some R packages have system dependencies, and the modules for these dependencies should be loaded before starting R and installing the packages. For example, the xml2 package requires the libxml2 library, so in this case load the associated module with module load libxml2 and then load and start R and enter install.packages("xml2").

To search for available modules for dependencies, use the module keyword <keyword> command, with the keyword being the name of the dependency. If you cannot find a needed module, please submit a help ticket and we will install it for you.

Installing packages from Bioconductor

You can install packages from Bioconductor using the BiocManager package, which is pre-installed, and the BiocManager::install() function.

For example, to install the GenomicFeatures package, use:

BiocManager::install("GenomicFeatures")

See the BiocManager documentation for more information.

Parallel programming with R

R is a serial (i.e., single-core/single-threaded) programming language by default, but with additional libraries and packages it also supports both implicit and explicit parallel programming to enable full use of multi-core processors and compute nodes. This includes the use of shared memory on a single node or distributed memory on multiple nodes. Using these packages and functions can improve the performance of your R jobs.

Note: On CARC systems, 1 thread = 1 core = 1 logical CPU

Some R packages and their functions utilize implicit parallelism, where you do not need to explicitly call for parallel computation by modifying your R code. These packages will typically automatically detect and use the available cores. In addition, there are a number of R packages that support explicit parallelism, including the base parallel package as well as the foreach, future, BiocParallel, and pbdMPI packages, among others. These require changing your R code, either in relatively simple ways or potentially in more significant ways depending on the task to be performed.

Increasing the number of cores can speed up run times for your R jobs, though the speedup does not necessarily scale proportionally; it depends on the scale and types of computations involved. There is a cost to setting up parallel computation (e.g., modifying code, communication overhead), and in some cases that cost may be greater than the speedup gained, so performance may not improve and may even decrease. Some experimentation may be needed to optimize your code and resource requests. Also keep in mind that your project account will still be charged core-hours even if the cores that you request are not actually used.
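
One simple way to experiment is to time a small test case serially and in parallel before scaling up. A minimal sketch using the base parallel package:

library(parallel)

# A stand-in for a slow task (each call takes about half a second)
slow_task <- function(i) { Sys.sleep(0.5); i^2 }

# Compare serial and parallel timings
system.time(lapply(1:8, slow_task))                  # serial
system.time(mclapply(1:8, slow_task, mc.cores = 4))  # parallel on 4 cores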

Implicit parallelism with OpenBLAS and OpenMP

The R modules on CARC systems are built with an optimized, multi-threaded OpenBLAS library for linear algebra operations. Therefore, any R functions that perform linear algebra operations (e.g., matrix manipulations, regression modeling) will automatically use the available cores if needed. In addition, many R packages are written partly in C++, and when they are installed on Linux systems like the CARC clusters they are compiled from source with OpenMP support, which enables multi-threading. As a result, if you request multiple cores in your Slurm job script with the --cpus-per-task option, this enables implicit parallelism via automatic multi-threading. In this case, you do not have to modify your R code in any way.

If needed, to explicitly set the number of threads to use, you can set the environment variable OMP_NUM_THREADS in the shell or in your job script (e.g., export OMP_NUM_THREADS=8). Some packages also have functions to set the number of threads/cores to use, and some functions also have arguments to set the number of threads/cores to use.
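
For example, from within R you can check the thread setting inherited from the shell or job script; as one option (an assumption here, not a pre-installed package), the RhpcBLASctl package can also set BLAS and OpenMP thread counts directly:

# Check the OpenMP thread setting inherited from the environment
Sys.getenv("OMP_NUM_THREADS")

# If the RhpcBLASctl package is installed, thread counts can be set from within R
# RhpcBLASctl::blas_set_num_threads(8)
# RhpcBLASctl::omp_set_num_threads(8)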

Explicit parallelism with the parallel package

A common task in R is applying the same function to multiple data inputs, such as different samples or subsets from the same dataset. Another common task is applying the same function to the same dataset but with varying argument inputs, such as alternative model specifications or simulation runs. Typically, these tasks are accomplished with the base *apply() family of functions, which loop through the inputs, apply the function to each input in sequence, and return the results (typically as a list object). The parallel package contains parallel versions of these functions, which perform a number of these iterations concurrently, based on the number of cores available, and thus reduce the time to completion.

The following example uses the mclapply() function from the base parallel package, using a single node with multiple cores. The task in this example is to run the same regression model on multiple datasets:

# Parallel programming with R

library(parallel)

RNGkind("L'Ecuyer-CMRG")

# Define number of cores based on Slurm job script
cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")) - 1

# Create large datasets in parallel with same number of variables and store in list
datasets <- mclapply(1:200, function(x) data.frame(matrix(rexp(1000000), ncol = 1000)), mc.cores = cores)

# The serial analog is: lapply(1:200, function(x) data.frame(matrix(rexp(1000000), ncol = 1000)))

# Create model with same formula but accepting different data inputs
model <- function(x) {
  xnames <- paste0("X", 2:1000)
  formula <- as.formula(paste("X1 ~ ", paste(xnames, collapse = "+")))
  lm(formula, x)
}

# Run models in parallel and store results in list
results <- mclapply(datasets, model, mc.cores = cores)

# The serial analog is: lapply(datasets, model)

A key step here is to match the requested --cpus-per-task in your Slurm job script to the number of cores used in your code, although you will typically want to reserve one core for overhead. This can be accomplished automatically using the Slurm environment variable SLURM_CPUS_PER_TASK, which equals the --cpus-per-task requested in the Slurm job script. The line cores <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK")) - 1 defines the number of cores dynamically, so that you do not need to update your R code if you change the resources requested in the Slurm job script. The mc.cores argument for the mclapply() function then simply instructs R to use that number of cores for parallel computation.

Please note that mixing implicit and explicit parallelism can lead to conflicts and degrade the performance of your R job. When using explicit parallelism, we recommend setting the number of threads to 1 to disable multi-threading: export OMP_NUM_THREADS=1.

A Slurm job script for this example would look like the following:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=32GB
#SBATCH --time=1:00:00
#SBATCH --account=<account_id>

module purge
module load gcc/8.3.0
module load openblas/0.3.8
module load r/4.0.0

export OMP_NUM_THREADS=1

Rscript --vanilla script.R

Additional resources

If you have questions about or need help with R, please submit a help ticket and we will assist you.

R Project
R Manuals
R Package Documentation
R for Reproducible Scientific Analysis lessons
CRAN Task View on High-Performance and Parallel Computing with R
Programming with Big Data in R
rOpenSci
Bioconductor
R Cheatsheets
Southern California R Users Group

Web books:

Hands-on Programming with R
An Introduction to R
R Programming for Data Science
R for Data Science
Advanced R
Efficient R Programming
R Graphics Cookbook
Data Visualization with R
Geocomputation with R
R Packages
Mastering Software Development in R
Tidyverse Style Guide
Tidyverse Design Guide

R workshop materials:

Intro to R
HPC with R
