Managing Research Data

Last updated March 11, 2024

0.0.1 Managing data quotas

If your project requires more or less storage space, submit a help ticket under the “Accounts/Access” category. Include your project ID, the desired allocation size, and the reason for the increase or decrease. The CARC team will consult with you to determine your needs and the total cost.

0.0.2 Managing data using the command line

The following sections describe how to use command-line tools to manage data on CARC systems. To manage data with a graphical user interface, you can use CARC OnDemand or an SFTP GUI app.

Currently, CARC systems do not support the use or storage of sensitive data. If your research work includes sensitive data, including but not limited to HIPAA-, FERPA-, or CUI-regulated data, see our Secure Computing Compliance Overview or contact us at carc-support@usc.edu before using our systems.

0.0.2.1 Organizing data

Project files should be kept within a clear directory structure so that they remain organized, documented, and findable. This may include, for example, separate directories for raw data, processed data, and code.

To list files and directories, use the ls command. For example, to list files in long format for the current directory use:

ls -l

For other directories, add the directory path to the command. Enter man ls or ls --help for more information and to view all available options.

To create a directory, use the mkdir command:

mkdir directory_name

Enter man mkdir or mkdir --help for more information and to view all available options.

To copy files or directories, use the cp command:

cp /source/path /destination/path

For example, to copy a directory on /scratch1 to /project, use:

cp -r /scratch1/ttrojan/dir /project/ttrojan_123/

The -r option, recursive mode, is needed when copying directories. To print a log of the copying, add the -v option, which enables verbose mode. To copy multiple files or directories to the same destination, simply include additional source paths in the command. Enter man cp or cp --help for more information and to view all available options.

Do not use the -a or -p options with cp when copying into a CARC project directory; these options preserve ownership and permissions from the source files, which will likely result in incorrect group ownership and produce a “disk quota exceeded” error.

To move files or directories (i.e., copy and also remove the files from the source), use the mv command instead:

mv /source/path /destination/path

To rename files, you can also use the mv command:

mv /source/filename.txt /source/newfilename.txt

Do not use mv to move files from a home or scratch directory into a CARC project directory; the moved files keep their original group ownership, which will produce a “disk quota exceeded” error. Instead, use cp -r to copy the files and then rm to remove the source files.
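The copy-then-remove pattern can be sketched as follows. The paths here are hypothetical stand-ins created with mktemp; on CARC systems you would use your real home/scratch and /project paths instead:

```shell
# Stand-in paths for illustration; on CARC these would be, e.g.,
# a /home1 or /scratch1 source and a /project destination
src=$(mktemp -d)
dest=$(mktemp -d)
echo "example data" > "$src/file.txt"

# Copy recursively instead of using mv, so the copied files take on
# the destination directory's group ownership
cp -r "$src"/. "$dest"/

# Remove the source only after the copy has succeeded
rm -r "$src"
```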

If you are backing up and syncing a directory, use the rsync command. For example:

rsync /source/dir/ /destination/dir/

Rsync will copy only files that are new or have changed in the source directory. Enter man rsync or rsync --help for more information and to view all available options.

Do not use the -a or -p options with rsync when copying into a CARC project directory; these options preserve permissions and ownership from the source files, which will result in incorrect group ownership and produce a “disk quota exceeded” error. Use rsync -rlt instead.

To delete files or directories, use the rm command:

rm /path/to/file

For example, to delete a directory, use:

rm -r /scratch1/ttrojan/dir

The -r option, recursive mode, is needed to remove directories. To remove multiple files or directories, simply add additional paths to the command. Enter man rm or rm --help for more information and to view all available options.

0.0.2.2 Checking file disk usage

To check the disk usage of files and directories, use the du -h command:

du -h /path/to/file

Please note that all file systems run ZFS which compresses files, so the file size on disk may be smaller than the actual file size (on your local computer, for example). Using the du --apparent-size -h command will give the uncompressed file size, and the ls -lh command should give the same result. Enter man du or du --help for more information and to view all available options.
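The difference between on-disk and apparent size can be sketched as follows; the file here is a throwaway created with mktemp:

```shell
# Create a 1 MiB file of zeros; on a compressing file system such as ZFS,
# the on-disk size reported by plain `du` can be much smaller than 1 MiB
f=$(mktemp)
dd if=/dev/zero of="$f" bs=1024 count=1024 2>/dev/null

du -h "$f"                   # size on disk (may reflect compression)
du --apparent-size -h "$f"   # logical (uncompressed) file size
```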

To list the files or subdirectories in the current directory and sort by size, enter the command cdiskusage. This is a convenience script that uses the du command. Please note that it may take a long time to run for large directories (e.g., the root of a project directory).

0.0.2.3 Sharing data

The /project directories are the best place to share files. By default, the members of a project group will have full read, write, and execute permissions for all files in a project directory (i.e., permissions set to 770, or drwxrwx---).

You can check the current permissions for a file or directory with the command ls -l /path/to/file.

When sharing your files, please keep the following in mind:

  • Never set the permissions of your directories to 777 (drwxrwxrwx), which means that any other user on CARC systems can access, modify, and delete your files.
  • Do not share or change the permissions of your /home1 directory and its subdirectories. SSH requires strict permissions on your home directory, so if something goes wrong you may be blocked from logging in.
  • Granting other users read permission for your files (r--) and read and execute permissions (r-x) for your directories is typically sufficient for sharing. Granting write permission can result in modified or deleted files, so only provide write permission when actually needed.

You can change file and directory permissions using a chmod command.

For example, to provide read and execute permissions but not write permission (r-x) to a project subdirectory for your project group, use:

chmod 750 /project/ttrojan_123/dir

Note that if the subdirectory is nested within other subdirectories, the group also needs read and execute permissions on the full hierarchy of parent directories. Granting write permission to a directory allows users to create, modify, or delete files in that directory, subject to individual file permissions. Enter man chmod or chmod --help for more information and to view all available options.
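Granting group read and execute permission through a nested hierarchy can be sketched as follows; a temporary directory stands in for a real project path:

```shell
# Stand-in for a project directory hierarchy like /project/<id>/dir/subdir
base=$(mktemp -d)
mkdir -p "$base/level1/level2"

# The group needs r-x (750) on every directory along the path,
# not just on the final target directory
chmod 750 "$base/level1" "$base/level1/level2"
```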

0.0.3 Managing file I/O

File input/output (I/O) refers to reading and writing data. The following guide offers advice on managing file I/O for your compute jobs.

0.0.3.1 Best practices

Some I/O best practices:

  • Try to avoid disk I/O, especially for workflows that create a large number of files. Process data in memory when possible instead of writing to and reading from the disk. This will provide the best performance, though the size of the data and subsequent memory requirements may place limits on this strategy.
  • Use the local /tmp directory (the default location) on compute nodes for small-scale I/O. It is a RAM-based file system (tmpfs), so files are saved in memory, allowing for better performance than saving to the disk.

Note: the /tmp directory is limited to 1 GB of space and is shared among jobs running on the same node. The files are removed when the job ends. The size of the files and job memory requirements may place limits on this strategy.

  • Use the local /dev/shm directory on compute nodes for large-scale I/O, which is also a RAM-based file system (tmpfs). The space is limited by the memory requested for your job and includes a hard limit of half the total memory available on a node. Like /tmp, files are saved in memory for better performance than saving to the disk.

Note: files are removed when the job ends. The size of the files and job memory requirements may also place limits on this strategy.

  • Use your /scratch1 directory for disk I/O when needed, which is located on a high-performance, parallel file system.
  • Use high-level I/O libraries and file formats like HDF5 or NetCDF. These enable fast I/O through a single file format and parallel operations. The file formats are also portable across computing systems.

0.0.3.2 Redirecting temporary files

The default value of the environment variable TMPDIR for compute jobs will look like /tmp/SLURM_<job_id>. To automatically redirect temporary files from this /tmp location to another location, change the TMPDIR variable. For example, create a tmp subdirectory in your /scratch1 directory and then enter the following:

export TMPDIR=/scratch1/<username>/tmp

Include this line in job scripts to set the TMPDIR for batch jobs.
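Many standard tools honor TMPDIR, so redirection takes effect immediately. The following sketch uses a throwaway directory created with mktemp in place of /scratch1/<username>/tmp:

```shell
# Stand-in for /scratch1/<username>/tmp (a real job script would use that path)
export TMPDIR=$(mktemp -d)

# Tools that honor TMPDIR (e.g., mktemp itself) now create their
# temporary files under the new location instead of /tmp
f=$(mktemp)
echo "$f"
```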

0.0.3.3 Staging data

Some jobs may require staging data in and out of temporary directories, such as when using the tmpfs file systems /tmp or /dev/shm.

0.0.3.3.1 Beginning of job

You may need to stage data at the beginning of a job to a temporary directory, like extracting a large number of input files. When using /dev/shm, for example, enter a sequence of commands like the following:

mkdir /dev/shm/$SLURM_JOB_ID
tar -C /dev/shm/$SLURM_JOB_ID -xf /scratch1/ttrojan/input.tar.gz

This example assumes that the input files have been previously bundled in a tar archive file.
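The full bundle-then-extract round trip can be sketched as follows; temporary directories stand in for /scratch1/<username> and /dev/shm/$SLURM_JOB_ID:

```shell
# Stand-ins for /scratch1/<username> and /dev/shm/$SLURM_JOB_ID
scratch=$(mktemp -d)
shm=$(mktemp -d)

# Bundle the input files ahead of time (normally done once, before the job)
mkdir "$scratch/inputs"
echo "input data" > "$scratch/inputs/in1.txt"
tar -C "$scratch" -czf "$scratch/input.tar.gz" inputs

# At job start: extract directly into the fast tmpfs directory
tar -C "$shm" -xf "$scratch/input.tar.gz"
```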

0.0.3.3.2 End of job

If you want to keep temporary output files from a job, you may need to copy them to persistent storage. When using /dev/shm, for example, enter a command like the following:

tar -czf /scratch1/ttrojan/output.tar.gz /dev/shm/$SLURM_JOB_ID

This example bundles the temporary files in a tar archive file and saves it to the /scratch1 file system.

0.0.4 Fixing “disk quota exceeded” error

There are two main reasons you may get a “disk quota exceeded” error on CARC systems.

0.0.4.1 Hard quota limit

First, enter the command myquota and check your storage usage. You may have simply hit a hard quota limit, either total storage space or total number of files. In this case, try to delete, compress, consolidate, and/or archive files to free up space. Alternatively, for /project directories, you can request more space via a support ticket.

0.0.4.2 Incorrect group ownership

Second, for /project directories specifically, if your storage usage is not actually near the quota limits, then the likely cause is that the group ownership of some of your project files does not match the project group ID. For example, files in the project directory /project/ttrojan_123 should have group ownership by ttrojan_123:

[ttrojan@discovery1 ~]$ ls -ld /project/ttrojan_123/ttrojan/file.txt
-rw-rw---- 1 ttrojan ttrojan_123 293 Dec 10 15:10 /project/ttrojan_123/ttrojan/file.txt

In this example, ttrojan is the user owner ID and ttrojan_123 is the group owner ID.

The group ID is used to enforce the storage quota limits for project directories. By default, new files and directories should have the correct group ID, but it is possible to override this. Typically, when this error occurs, the group ID for some of your files is your personal group (the same as your username, e.g., ttrojan). The personal group has a small quota, so new files written with the personal group ID produce the “disk quota exceeded” error.

To check if you have files with the incorrect group ID, enter a command like the following, substituting your username:

beegfs-ctl --getquota --mount=/project --gid ttrojan

To find files with the incorrect group ID, enter a command like the following, substituting your project directory path and project group ID:

find /project/ttrojan_123/ttrojan \! -group ttrojan_123

The likely reason for files having the wrong group ID is using a mv, cp -a, scp -r, rsync -p, or rsync -a command that preserves ownership and permissions from the source files when moving or copying them into the project directory. Alternatively, some subdirectories within your project directory may be missing the setgid bit that determines the default group ID for new files and directories.

The best method for moving or copying files into a project directory is rsync -rlt. If needed, you can delete the source files after a successful copy or add the --remove-source-files option.

To fix this issue, enter a sequence of commands like the following, substituting your project directory path and project ID:

chgrp -R ttrojan_123 /project/ttrojan_123/ttrojan
find /project/ttrojan_123/ttrojan -type d -exec chmod g+s {} \;

These commands recursively change the group ownership of existing files to the project group and set the setgid bit on subdirectories so that new files and directories inherit it. You will get an “operation not permitted” message for files you do not own, but this can be ignored; the commands only change the files that you own.
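The setgid portion can be sketched in isolation as follows. A temporary directory stands in for a project path, and the chgrp step is omitted here since it requires membership in the project group:

```shell
# Stand-in for a project subdirectory tree
base=$(mktemp -d)
mkdir -p "$base/sub1/sub2"
chmod -R 750 "$base"

# Set the setgid bit on every directory so that new files and
# directories created inside inherit the directory's group
find "$base" -type d -exec chmod g+s {} \;

ls -ld "$base/sub1"   # the group execute slot now shows "s"
```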

It is best to run these commands only for specific subdirectories where you know you have files, because they may take a while to run, especially for large directories.

You may also need to submit these commands for each project directory you have access to.

It may take about 15 minutes for the quota to update, after which you should be able to save new files again.