Data Science on HPC Systems

Last updated July 05, 2023

There are two ways to run data science scripts on CARC clusters:

  • submit a job script by running Anaconda in batch mode
  • run your script inside an existing Singularity container

CARC recommends using the first approach, as building a Conda environment does not require much storage space and users have the flexibility to add or change data science packages within the Conda environment.
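
For example, a new Conda environment with a few common data science packages might be created like this (the environment name datasci and the package list below are placeholders; adjust them for your project):

[user@discovery1 ~]$ module purge
[user@discovery1 ~]$ module load conda
[user@discovery1 ~]$ conda create --name datasci python numpy pandas scikit-learn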

Using a container is more useful for running applications with complex software dependencies. There are existing container images (e.g. Docker or Singularity) available for users. Refer to the documentation on how to use Singularity to run jobs on the cluster.
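
As a rough sketch, once you have a container image and an interactive session on a compute node, running your script inside the container might look like the following (the image file datasci.sif is a placeholder, and the --nv flag is only needed when a GPU is allocated):

[user@a02-15 ~]$ singularity exec --nv datasci.sif python script.py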

0.0.1 Testing your script interactively with Anaconda

Before submitting your job script to the cluster, we recommend first testing your script interactively on a compute node.

To use your Conda environment interactively on a compute node, follow these two steps:

  1. Reserve job resources on a node using Slurm’s salloc command.

The example salloc command below requests one GPU, 8 CPU cores, and 32 GB of memory in the gpu partition, with a time limit of 1 hour.

[user@discovery1 ~]$ salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=32GB --time=1:00:00 
salloc: Pending job allocation 15731446
salloc: job 15731446 queued and waiting for resources
salloc: job 15731446 has been allocated resources
salloc: Granted job allocation 15731446
salloc: Waiting for resource configuration
salloc: Nodes a02-15 are ready for job
[user@a02-15 ~]$ 

Change the resource requests (the --cpus-per-task=8 --mem=32GB --time=1:00:00 part of the salloc command) as needed, such as the number of cores, the amount of memory, and the time needed for your job.
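
For example, a request for more cores, more memory, and a longer time limit might look like this (the values are illustrative only):

[user@discovery1 ~]$ salloc --partition=gpu --gres=gpu:1 --cpus-per-task=16 --mem=64GB --time=2:00:00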

The shell prompt changes from user@discovery1 to user@<nodename> to indicate that you are now on a compute node (e.g., a02-15).

  2. Once resources are allocated, activate your Conda environment and run the application:
[user@a02-15 ~]$ module purge
[user@a02-15 ~]$ module load conda
[user@a02-15 ~]$ conda activate <env_name>
(env_name) [user@a02-15 ~]$ python script.py

<env_name> is the name of your Conda environment.

Here, script.py represents your application script. When it runs, your results are generated on the compute node.
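
If you requested a GPU, you can optionally confirm that it is visible on the node before running your script (this assumes the standard NVIDIA tools are installed on the GPU node):

[user@a02-15 ~]$ nvidia-smi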

To exit the compute node and relinquish the job resources, enter exit in the shell. This will return you to the login node:

(env_name) [user@a02-15 ~]$ exit
exit
salloc: Relinquishing job allocation 15731446
[user@discovery1 ~]$

0.0.2 Submitting a job script by running Anaconda in batch mode

To submit jobs to the Slurm job scheduler, run your application with your Conda environment in batch mode. The steps are:

  • Create an application script
  • Create a Slurm job script that runs the application script
  • Submit the job script to the job scheduler with sbatch

Your application script should consist of the sequence of commands needed for your analysis.

A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job using Anaconda, a Slurm job script should look something like the following:

#!/bin/bash

#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=2:00:00

module purge

eval "$(conda shell.bash hook)"

conda activate <env_name>

python script.py

Each line is described below:

Command or Slurm argument        Meaning
#!/bin/bash                      Use Bash to execute this script
#SBATCH                          Syntax that allows Slurm to read your requests (ignored by Bash)
--account=<project_id>           Charge compute time to <project_id>; you can find your project ID in the CARC user portal
--partition=gpu                  Submit the job to the gpu partition
--gres=gpu:1                     Use 1 GPU
--nodes=1                        Use 1 compute node
--ntasks=1                       Run 1 task (e.g., running a Python script)
--cpus-per-task=8                Reserve 8 CPUs for your exclusive use
--mem=32GB                       Reserve 32 GB of memory for your exclusive use
--time=2:00:00                   Reserve the resources described for 2 hours
module purge                     Clear environment modules
eval "$(conda shell.bash hook)"  Initialize the shell to use Conda
conda activate <env_name>        Activate your Conda environment <env_name>
python script.py                 Use Python to run script.py

Adjust the resources requested based on your needs. Requesting fewer resources generally results in less queue time for your job. To fully utilize the requested resources, especially multiple CPU cores, you may need to modify your application script so that it actually uses them.
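
As a hedged example, one common way to let threaded libraries use all of the allocated cores is to set their thread-count environment variables from Slurm's SLURM_CPUS_PER_TASK variable in the job script, just before running your script (whether these variables are honored depends on the libraries your script uses):

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK

python script.py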

Develop application scripts and job scripts on your local machine and then transfer them to the cluster, or use one of the available text editor modules (e.g., nano, micro, vim) to develop them directly on the cluster.
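
For example, to edit your script directly on the cluster with one of these editors (assuming the corresponding module is available, as listed above):

[user@discovery1 ~]$ module load nano
[user@discovery1 ~]$ nano script.py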

Save the job script as py.job, for example, and then submit it to the job scheduler with Slurm’s sbatch command:

[user@discovery1 ~]$ sbatch py.job
Submitted batch job 15788866

To check the status of your job, enter myqueue. If no jobs are listed, the job has completed.
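
Standard Slurm commands can also be used to check on the job; for example, using the job ID from the sbatch output above:

[user@discovery1 ~]$ squeue -u $USER
[user@discovery1 ~]$ sacct -j 15788866 --format=JobID,JobName,State,Elapsed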

The results of the job are logged and, by default, saved to a plain-text file of the form slurm-<jobid>.out in the directory the job script was submitted from. To view the contents of this file, enter less slurm-<jobid>.out, and then enter q to exit the viewer.
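
For example, with the job ID from above, you could follow the output while the job runs or page through it afterwards:

[user@discovery1 ~]$ tail -f slurm-15788866.out
[user@discovery1 ~]$ less slurm-15788866.out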

For more information on running and monitoring jobs, see the Running Jobs guide.