Data Science on HPC Systems
There are two ways to run data science scripts on CARC clusters:
- submit a job script by running Anaconda in batch mode
- use an existing container and submit your script by running a Singularity container
CARC recommends the first approach: building a Conda environment does not require much storage space, and you keep the flexibility to add or change data science packages within the environment.
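As a brief sketch of that workflow, a Conda environment with common data science packages can be created as shown below. The environment name (datasci) and the package list are placeholders to adjust for your own project; the conda module is the same one loaded later in this guide.

```bash
# Example only: create a Conda environment with a few common data science
# packages. The name "datasci" and the package list are placeholders.
module purge
module load conda
conda create --yes --name datasci python numpy pandas scikit-learn
conda activate datasci
```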
Using a container is more useful for applications with complex software dependencies; existing container images (e.g., Docker or Singularity images) are available for users. Refer to the documentation on how to use Singularity to run jobs on the cluster.
0.0.1 Submitting a Slurm job script with Anaconda
Before submitting your job script to the cluster, we recommend first testing your script interactively on a compute node.
To use your Conda environment interactively on a compute node, follow these two steps:
- Reserve job resources on a node using Slurm's salloc command. The example salloc command below requests one GPU, 8 CPU cores, and 32 GB of memory in the gpu partition, with a time limit of 1 hour.
[user@discovery1 ~]$ salloc --partition=gpu --gres=gpu:1 --cpus-per-task=8 --mem=32GB --time=1:00:00
salloc: Pending job allocation 15731446
salloc: job 15731446 queued and waiting for resources
salloc: job 15731446 has been allocated resources
salloc: Granted job allocation 15731446
salloc: Waiting for resource configuration
salloc: Nodes a02-15 are ready for job
[user@a02-15 ~]$
Change the resource requests (the --cpus-per-task=8 --mem=32GB --time=1:00:00 part of the salloc command) as needed, such as the number of cores, the amount of memory, and the time required for your test run (a CPU-only example is shown below, after the exit example).
The shell prompt changes from user@discovery1 to user@<nodename> to indicate that you are now on a compute node (e.g., a02-15).
- Once resources are allocated, activate your Conda environment and run the application:
[user@a02-15 ~]$ module purge
[user@a02-15 ~]$ module load conda
[user@a02-15 ~]$ conda activate <env_name>
(env_name) [user@a02-15 ~]$ python script.py
<env_name> is the name of your Conda environment, and script.py is a placeholder for your application script. After the script runs, it should produce your desired results on the compute node.
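Before a long run, it can be worth a quick check that the GPU is visible from inside your environment. This is only a sketch: nvidia-smi is available on GPU nodes, and the second command assumes your environment happens to include PyTorch.

```bash
# Quick sanity checks on the GPU compute node (environment already activated)
nvidia-smi    # lists the GPU(s) allocated to this job
python -c "import torch; print(torch.cuda.is_available())"    # only if PyTorch is installed
```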
To exit the compute node and relinquish the job resources, enter exit in the shell. This will return you to the login node:
(env_name) [user@a02-15 ~]$ exit
exit
salloc: Relinquishing job allocation 15731446
[user@discovery1 ~]$
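As mentioned above, the resource requests can be changed to match your workload. For example, a script that does not need a GPU could be tested in a CPU-only session on the main partition; the partition name and the amounts below are illustrative:

```bash
# Example CPU-only interactive request: 8 cores, 16 GB of memory, 1 hour
salloc --partition=main --cpus-per-task=8 --mem=16GB --time=1:00:00
```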
0.0.2 Submitting a job script by running Anaconda in batch mode
To submit jobs to the Slurm job scheduler, run the main application you use with your Conda environment in batch mode. The steps are:
- Create an application script
- Create a Slurm job script that runs the application script
- Submit the job script to the job scheduler with sbatch
Your application script should consist of the sequence of commands needed for your analysis.
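As a minimal sketch of such an application script, the example below assumes a pandas-based workflow and an input file named data.csv in the submission directory; both the file name and the analysis steps are placeholders for your own work.

```python
# script.py -- minimal example application script (placeholder analysis)
import pandas as pd

# Read input data from the directory the job is submitted from
df = pd.read_csv("data.csv")

# Stand-in for your actual analysis steps
summary = df.describe()

# Write results next to the Slurm log file
summary.to_csv("results.csv")
print("Analysis complete: results written to results.csv")
```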
A Slurm job script is a special type of Bash shell script that the Slurm job scheduler recognizes as a job. For a job using Anaconda, a Slurm job script should look something like the following:
#!/bin/bash
#SBATCH --account=<project_id>
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32GB
#SBATCH --time=2:00:00
module purge
module load conda
eval "$(conda shell.bash hook)"
conda activate <env_name>
python script.py
Each line is described below:
Command or Slurm argument | Meaning |
---|---|
#!/bin/bash | Use Bash to execute this script |
#SBATCH | Syntax that allows Slurm to read your requests (ignored by Bash) |
--account=<project_id> | Charge compute time to <project_id>. You can find your project ID in the CARC user portal |
--partition=gpu | Submit the job to the gpu partition |
--gres=gpu:1 | Use 1 GPU |
--nodes=1 | Use 1 compute node |
--ntasks=1 | Run 1 task (e.g., running a python script) |
--cpus-per-task=8 | Reserve 8 CPUs for your exclusive use |
--mem=32GB | Reserve 32 GB of memory for your exclusive use |
--time=2:00:00 | Reserve the requested resources for 2 hours |
module purge | Clear environment modules |
module load conda | Load the conda software module |
eval "$(conda shell.bash hook)" | Initialize the shell to use Conda |
conda activate <env_name> | Activate your Conda environment <env_name> |
python script.py | Use python to run script.py |
Adjust the requested resources based on your needs; requesting fewer resources typically means less queue time for your job. To fully utilize the resources you request, especially the number of cores, you may need to explicitly change your application script to do so, as sketched below.
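For example, Slurm exports the number of cores reserved for the job in the SLURM_CPUS_PER_TASK environment variable, and your script can read that value instead of hard-coding a core count. The scikit-learn lines below are commented out and only illustrate passing the value to a library that accepts a worker count.

```python
# Read the core count allocated by Slurm instead of hard-coding it
import os

# SLURM_CPUS_PER_TASK is set when --cpus-per-task is requested; default to 1 otherwise
n_cores = int(os.environ.get("SLURM_CPUS_PER_TASK", 1))
print(f"Using {n_cores} CPU core(s)")

# Illustration only (assumes scikit-learn is installed in the environment):
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier(n_jobs=n_cores)
```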
Develop application scripts and job scripts on your local machine and then transfer them to the cluster, or use one of the available text editor modules (e.g., nano, micro, vim) to develop them directly on the cluster.
Save the job script as py.job, for example, and then submit it to the job scheduler with Slurm's sbatch command:
[user@discovery1 ~]$ sbatch py.job
Submitted batch job 15788866
To check the status of your job, enter myqueue. If no job status is listed, the job has completed.
The results of the job are logged and, by default, saved to a plain-text file of the form slurm-<jobid>.out in the same directory from which the job script was submitted. To view the contents of this file, enter less slurm-<jobid>.out, and then enter q to exit the viewer.
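If you prefer a custom log file name, Slurm's --output option can be added to the #SBATCH block of the job script; %j expands to the job ID. This is optional, and the file name pattern below is just an example.

```bash
# Optional: add to the #SBATCH block of the job script to rename the log file
#SBATCH --output=py_job.%j.out
```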
For more information on running and monitoring jobs, see the Running Jobs guide.