Genome Analysis Toolkit (GATK)

Last updated November 04, 2023

The Genome Analysis Toolkit (GATK) offers a wide variety of tools with a primary focus on variant discovery and genotyping. Using GATK on a CARC cluster requires an understanding of both GATK for genome analysis and SLURM for job scheduling. This short user guide will help you get started.

0.0.1 Load the GATK module to use it in interactive mode

module purge
module load usc
module load openjdk
module load gatk
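
After the modules are loaded, it can help to confirm that the gatk launcher is actually on your PATH before starting an analysis. The following is a minimal sketch; check_gatk is a hypothetical helper name, not something the module provides. On a shared cluster, interactive work is normally done inside a compute-node session (for example, one started with salloc) rather than on a login node.

```shell
# Hypothetical helper: print the GATK version if the launcher is on PATH,
# otherwise print a hint to stderr and return a non-zero status.
check_gatk() {
    if command -v gatk >/dev/null 2>&1; then
        gatk --version
    else
        echo "gatk not found on PATH; run the module load commands first" >&2
        return 1
    fi
}

# If GATK is available, show the first part of the tool list as well.
if check_gatk; then
    gatk --list 2>&1 | head -n 20
fi
```

Tool-specific arguments can be listed in the same session with, for example, gatk HaplotypeCaller --help.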

0.0.2 Or write a SLURM batch script and submit it to the cluster

Create a batch script (gatk_job.slurm) to submit your job to the SLURM scheduler.

#!/bin/bash
#SBATCH --job-name=gatk-analysis
#SBATCH --account=ttrojan_123
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --time=24:00:00
#SBATCH --output=gatk_out_%j.txt
#SBATCH --error=gatk_err_%j.txt

module purge
module load usc
module load openjdk
module load gatk

# Navigate to your working directory (if necessary)
cd /path/to/your/working/directory

# Run GATK commands
gatk --java-options "-Xmx12G" HaplotypeCaller \
    -R reference.fasta \
    -I input.bam \
    -O output.vcf

Customize the script for your job: set the account, partition, number of nodes and tasks, memory, and runtime. Keep the Java heap size (-Xmx) below the memory requested with --mem so that the JVM has headroom for its own overhead.
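
The script above pairs a 16G memory request with a 12G Java heap, which is 75% of the allocation. One way to keep the two in step is to derive the heap size from the job's memory request. The sketch below assumes that SLURM sets SLURM_MEM_PER_NODE (in megabytes) when --mem is used; heap_mb is a hypothetical helper, and the 75% ratio is an illustrative choice rather than a GATK requirement.

```shell
# Hypothetical helper: compute a Java heap size in MB as 75% of the memory
# request. Uses the first argument if given, else SLURM_MEM_PER_NODE,
# else a 16384 MB (16G) fallback for testing outside a job.
heap_mb() {
    mem_mb="${1:-${SLURM_MEM_PER_NODE:-16384}}"
    echo $(( mem_mb * 3 / 4 ))
}

JAVA_HEAP="-Xmx$(heap_mb)m"
echo "Java heap option: ${JAVA_HEAP}"

# The value can then be passed to GATK in place of the fixed "-Xmx12G":
# gatk --java-options "${JAVA_HEAP}" HaplotypeCaller \
#     -R reference.fasta -I input.bam -O output.vcf
```

With a 16G request this yields -Xmx12288m, which matches the 12G used in the script above.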

0.0.3 Submit Your Job

Use the sbatch command to submit your script to SLURM.

sbatch gatk_job.slurm
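
If you want to use the job ID afterwards (for example, to check the queue), sbatch can print just the ID with its --parsable option. The sketch below is one way to capture it; parse_jobid is a hypothetical helper, and the block only calls sbatch and squeue when they are available.

```shell
# Hypothetical helper: keep only the job ID from `sbatch --parsable` output,
# which is either "<jobid>" or "<jobid>;<cluster>".
parse_jobid() {
    echo "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1; then
    jobid=$(parse_jobid "$(sbatch --parsable gatk_job.slurm)")
    echo "Submitted job ${jobid}"
    squeue -j "$jobid"    # show the job's current state in the queue
else
    echo "sbatch not found; run this on a cluster login node" >&2
fi
```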

0.0.4 View Job Output

Once the job completes, view the output and error files specified in the SLURM script. The %j in each file name is replaced by the job ID, shown below as <jobid>.

cat gatk_out_<jobid>.txt
cat gatk_err_<jobid>.txt
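
GATK typically writes its progress log to standard error, so the error file usually contains INFO lines even after a successful run. Rather than reading the whole file, you can search it for lines that look like failures. This is a rough sketch: log_has_errors is a hypothetical helper, the job ID 12345 is a placeholder, and the search pattern is an assumption that may need adjusting for the messages your GATK version prints.

```shell
# Hypothetical helper: succeed if the log contains lines that look like failures.
log_has_errors() {
    grep -qE 'ERROR|Exception' "$1"
}

log="gatk_err_12345.txt"    # placeholder job ID

# Print up to 20 suspicious lines, with line numbers, if the log exists.
if [ -f "$log" ] && log_has_errors "$log"; then
    grep -nE 'ERROR|Exception' "$log" | head -n 20
fi

# If job accounting is enabled on the cluster, sacct summarizes the finished job:
# sacct -j 12345 --format=JobID,State,ExitCode,Elapsed,MaxRSS
```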

0.0.4.1 Additional Tips

Always check the documentation for the specific version of GATK you’re using, as parameters and recommended practices can change between versions.

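Use GATK’s parallel processing options where appropriate to reduce runtime. Most GATK4 tools run as a single process, so common approaches are to split the work by genomic interval with the -L argument and run the pieces as separate jobs, or to use the Spark-enabled versions of tools where they exist.

Regularly check your job’s resource usage (for example, with sacct after the job finishes) to confirm that you are requesting appropriate resources from the cluster; this improves both job efficiency and overall cluster throughput.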
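
One common way to parallelize a tool such as HaplotypeCaller is to scatter it over genomic intervals with a SLURM job array and merge the per-interval results afterwards. The sketch below assumes a hypothetical file intervals.list with one interval per line (for example chr1, chr2, chr3); interval_for_task is a hypothetical helper, and the account, partition, time limit, and module load lines from the main script are omitted for brevity.

```shell
#!/bin/bash
#SBATCH --job-name=gatk-scatter
#SBATCH --array=1-3            # one array task per line of intervals.list
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
# (account, partition, time, and module load lines go here)

# Hypothetical helper: print line N (1-based) of an interval list file.
interval_for_task() {
    sed -n "${2}p" "$1"
}

# Run only inside an array task where the gatk launcher is available.
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ] && command -v gatk >/dev/null 2>&1; then
    interval=$(interval_for_task intervals.list "$SLURM_ARRAY_TASK_ID")
    gatk --java-options "-Xmx12G" HaplotypeCaller \
        -R reference.fasta \
        -I input.bam \
        -L "$interval" \
        -O "output.${interval}.vcf"
fi

# After all tasks finish, the per-interval files can be combined, e.g.:
# gatk MergeVcfs -I output.chr1.vcf -I output.chr2.vcf -I output.chr3.vcf -O output.vcf
```

Splitting by chromosome is the simplest choice; finer interval lists balance the load better but produce more files to merge.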

If you have multiple GATK jobs or a pipeline that involves several steps, consider automation tools such as Snakemake or Nextflow, which are well-suited for workflow management on clusters.
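
For a pipeline of only two or three steps, plain SLURM job dependencies can be a lighter-weight alternative to a workflow manager. The sketch below is not Snakemake or Nextflow; it simply chains two batch scripts with sbatch --dependency=afterok so the second starts only if the first succeeds. The script names call_variants.slurm and genotype.slurm and the submit helper are hypothetical, and the block defaults to a dry run that prints the commands instead of submitting them.

```shell
# With DRY_RUN=1 (the default here) the sbatch commands are printed, not run.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper. Usage: submit <dependency-job-id-or-empty> <script>
# Prints the new job ID on stdout.
submit() {
    dep="$1"; script="$2"
    if [ -n "$dep" ]; then
        set -- sbatch --parsable --dependency="afterok:${dep}" "$script"
    else
        set -- sbatch --parsable "$script"
    fi
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*" >&2          # show the command that would run
        echo "0"               # placeholder job ID for the dry run
    else
        out=$("$@") || return 1
        echo "${out%%;*}"      # strip an optional ";cluster" suffix
    fi
}

# Step 2 is queued immediately but starts only after step 1 exits successfully.
call_id=$(submit "" call_variants.slurm)
submit "$call_id" genotype.slurm >/dev/null
```

Set DRY_RUN=0 to submit for real. For longer pipelines with many samples or branching steps, a workflow manager handles retries and bookkeeping that this approach leaves to you.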

By following these steps, you should be able to use GATK effectively on a SLURM cluster for your genomic analyses. Also refer to your cluster’s documentation for any site-specific configurations or restrictions.