Transferring Data Using the Command Line

Last updated March 11, 2024

There are many command-line tools for transferring data to and from CARC systems, each with its own intended uses and feature set.

Data transfers can be classified into two types: those between CARC systems and a local computer, and those between CARC systems and the internet.

When transferring data between your local computer and CARC systems, the recommended command-line tools are sftp, rsync, and globus-cli.

When transferring data between the internet and CARC systems, the following table lists the available command-line tools based on the kind of service being used:

Scenario          Options
File servers      sftp, lftp
Globus shares     globus-cli
Aspera servers    aspera-cli
Downloads         wget, curl, aria2c
Cloud storage     rclone
Code              git

Below, you will find descriptions, comparisons, and examples of how to use each tool.

If you have questions about transferring data using these tools, please submit a help ticket and we will assist you.

Due to security risks, please be mindful of the type of information being transferred. Where possible, omit all information that may be considered confidential. For examples of confidential information that requires additional consideration, visit https://sites.usc.edu/trojansecure/information-data-security/.

0.0.1 General recommendations

  • Only transfer data that is necessary
  • Compress large files using xz to reduce the size of the transfer (depending on network speed)
  • Archive files using tar or 7-Zip when transferring large numbers of files
  • For small-to-medium transfers to/from your local computer, use sftp or rsync
  • For large transfers to/from your local computer or other endpoint, use Globus
  • For syncing directories, use rsync
  • For transfers to/from an FTP server, use lftp
  • For faster or parallel downloads, use aria2c
  • For transfers to/from cloud storage, use rclone
  • For long-running transfers, run the command within tmux (see the example after this list)
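
For long-running transfers, a minimal sketch of running the transfer inside a tmux session (the session name transfer is just a placeholder):

# Start a named tmux session, then run the sftp or rsync command inside it
tmux new -s transfer

# Detach from the session with Ctrl-b then d; reattach later to check progress
tmux attach -t transfer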

0.0.2 Archiving and compressing files before transferring

Creating and compressing a single archive file can be useful before transferring files to or from CARC systems, especially for directories with a large number of files (e.g., > 1000, regardless of the total size of those files). Each file has associated metadata, which can slow down the transfer. Compressing files will reduce the amount of data that needs to be transferred. However, it takes time to compress and uncompress files, so the total transfer time may not necessarily decrease, depending on factors like network speeds. With fast network speeds relative to the total transfer size, compressing files is typically not worth the extra time. For more information, see the section on archiving and compressing files in the guide for Managing Research Data.
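
For example, a minimal sketch of archiving and compressing a directory with tar and xz before a transfer (the directory name results and the archive name results.tar.xz are placeholders):

# Create a single xz-compressed archive from the results directory
tar -cJf results.tar.xz results

# After the transfer, extract the archive at the destination
tar -xJf results.tar.xz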

0.0.3 Local computer ⇄ CARC systems

To copy files between your local computer and CARC systems, the available options are sftp and rsync. These are available on macOS and Linux through the native terminal applications and on Windows through applications like Windows Terminal or PuTTY. Globus also provides a command-line interface (globus-cli) that you can install; for more information, see the guide for Transferring Files Using Globus.

sftp provides an interactive mode that requires authenticating only once and maintains an open connection to transfer files as needed until the session is exited. In contrast, rsync can only be used in a non-interactive mode that may require authentication for each transfer (depending on SSH settings).

Instructions and commands for these tools are detailed in the collapsible sections below:

→ sftp  

0.0.3.1 Using sftp

sftp is a client program for transferring files using the Secure File Transfer Protocol (SFTP). It can be used in interactive or non-interactive modes to copy files between two computers over a network, one local and one remote. In interactive mode, it requires an initial login and authentication, but your session will remain open until you exit or are otherwise disconnected. You will remain connected to CARC systems with the ability to upload (put) and download (get) files without further authentication. This is a benefit of using sftp compared to the other command-line transfer tools.

To use sftp in interactive mode, from your local computer, first log in to a CARC node like hpc-transfer1 and authenticate via Duo:

sftp ttrojan@hpc-transfer1.usc.edu

If it is your first time logging in, you will be asked “Are you sure you want to continue connecting (yes/no)?”. Enter “yes”. You will see the following once you are connected:

Connected to hpc-transfer1.usc.edu.
sftp>

Enter the help command to view all the available commands. Use commands like pwd, ls, and cd, and their local equivalents lpwd, lls, and lcd, to navigate to the source and destination directories for file transfers.

sftp> lpwd
Local working directory: /home/tommy
sftp> lcd myimages
sftp> lls
myplot1.jpg myplot2.jpg
sftp> pwd
Remote working directory: /home1/ttrojan
sftp> cd /scratch1/ttrojan/images

0.0.3.4 Uploading file/directory from local computer to CARC systems

To upload a file, use the put command:

sftp> put myplot1.jpg myplot.jpg
Uploading myplot1.jpg to /scratch1/ttrojan/myplot.jpg
myplot1.jpg                                 100%   10KB   2.4MB/s   00:01    

To upload a directory recursively, add the -R option and specify the path to the local directory (e.g., put -R dir).

0.0.3.5 Downloading file/directory from CARC systems to local computer

To download a file, use the get command:

sftp> get myplot3.jpg myplot3.jpg
Fetching /scratch1/ttrojan/myplot3.jpg to myplot3.jpg
/scratch1/ttrojan/myplot3.jpg                100%   10KB   2.4MB/s   00:01    

To download a directory recursively, add the -R option and specify the path to the remote directory (e.g., get -R dir).

→ rsync  

0.0.3.6 Using rsync

Rsync is a fast and versatile transfer tool for synchronizing files and directories. It is typically used to copy, sync, and back up directories between two computers over a network, one local and one remote. Rsync is also useful for copying and syncing directories locally. It uses a delta-transfer algorithm to minimize the amount of data that needs to be transferred; only new or modified files in a directory will be transferred. By default, rsync uses SSH to securely transfer files over the network. Unlike sftp, login and authentication may be requested for each use of the rsync command (depending on SSH settings).

A generic rsync command is:

rsync <options> source destination

source and destination are file or directory paths. When one of these paths is on a remote host, the syntax becomes host:path. On CARC systems, the host is a login or transfer node. When the command is submitted, you will first need to enter your password and complete the Duo authentication, and then the transfer will begin.

When uploading a local directory to your project directory, the destination is on a remote host. From your local computer, enter a command like the following:

rsync -rltvh /home/tommy/data ttrojan@hpc-transfer1.usc.edu:/project/ttrojan_123

When downloading a directory from your project directory, the source is on a remote host. From your local computer, enter a command like the following:

rsync -rltvh ttrojan@hpc-transfer1.usc.edu:/project/ttrojan_123/data /home/tommy

The -rlt options enable transferring directories recursively, copying symbolic links, and preserving modification times. The -v option enables verbose mode. The -h option prints transfer size and related information in a human-readable format.

After making changes to a source directory, simply enter the same rsync command again to sync the destination directory. If files deleted from the source should also be deleted from the destination, add the --del option.

The rsync command is sensitive to a trailing / on the source directory (e.g., data vs data/). If not included, it will copy the directory, as well as its contents, to the destination directory as a new subdirectory. If included, it will not copy the directory itself, but only the contents to the destination directory.
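
For example, reusing the illustrative paths from the upload example above, the following two commands differ only in the trailing slash on the source directory:

# Creates a new subdirectory named data inside /project/ttrojan_123
rsync -rltvh /home/tommy/data ttrojan@hpc-transfer1.usc.edu:/project/ttrojan_123

# Copies only the contents of data directly into /project/ttrojan_123
rsync -rltvh /home/tommy/data/ ttrojan@hpc-transfer1.usc.edu:/project/ttrojan_123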

0.0.3.7 Rsync options

Rsync provides many options beyond those used in the examples above. Here are some other useful options:

Option                 Description
--del                  Delete files from destination if deleted from source
-z or --compress       Compress files during transfer
--append-verify        Keep, check, and update partially transferred files
--progress             Display progress of file transfers
--stats                Print transfer statistics
-n or --dry-run        Perform a trial run with no changes made
--log-file=rsync.log   Log what rsync does to file rsync.log

For transfers of large files that may take a long time, consider adding the -z option to compress files as well as the --append-verify option, which will keep partially transferred files. If the transfer is interrupted, re-entering the same command will restart the transfer where it stopped and append data to the partial file.
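
For example, a large upload using the same illustrative paths as above might look like:

rsync -rltvh -z --append-verify --progress /home/tommy/data ttrojan@hpc-transfer1.usc.edu:/project/ttrojan_123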

Enter man rsync or rsync --help for more information and to view all available options.

If you experience issues with disconnections during an rsync transfer, sometimes caused by network latency, try adding the option --timeout=60. If no data is transferred for 60 seconds, rsync will exit instead of hanging, and re-entering the same command will resume the transfer (especially when combined with --append-verify).

 

0.0.4 CARC systems ⇄ internet

There are many tools available to transfer files to and from CARC systems and endpoints on the public internet, such as FTP file servers or HTTP web servers. Keep in mind that CARC compute nodes do not have access to the internet, so complete these transfers on the login or transfer nodes separately from Slurm jobs.

→ File servers: sftp and lftp  

0.0.4.1 Using sftp and lftp

For file servers that use the SFTP protocol, you can use the sftp program to transfer files. Examples of how to use sftp can be found in the previous section on sftp above, with the only difference being the remote server that you interact with.

For file servers that use FTP, SFTP, or other FTP-like protocols, you can use the lftp module to transfer files: module load lftp. The lftp program has a similar interface and commands to sftp but has additional features, including multi-connection and parallel downloads. For more information and available options, enter man lftp or see the official lftp documentation.
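
For example, a brief sketch of a non-interactive lftp transfer (the server address and directory names are placeholders) that mirrors a remote directory using parallel downloads:

module load lftp

# Mirror a remote directory to a local directory using 4 parallel downloads
lftp -c "open ftp://ftp.example.com; mirror --parallel=4 remote_dir local_dir"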

The wget, curl, and aria2c programs can also be used to non-interactively download files from FTP or SFTP servers. The sftp, lftp, and curl programs can also be used to non-interactively upload files to FTP or SFTP servers.

→ Downloads: wget, curl, and aria2c  

0.0.4.2 Using wget, curl, and aria2c

The main tools focused on downloading files from the web (i.e., from sources using HTTP and HTTPS protocols, like web sites) are wget, curl, and aria2c. They can also be used to non-interactively download files from FTP or SFTP servers.

In general, wget is the simplest to use, curl offers more advanced features useful in scripting, and aria2c offers multi-connection and parallel downloads to improve the speed of large transfers.

0.0.4.3 Using wget

For simple file downloads from the web, the wget program is the easiest to use. Just provide the URL to the file:

wget <url>

Enter man wget or wget --help for more information and to view all available options.

0.0.4.4 Using curl

The curl program supports more protocols and provides more advanced features for downloading (and uploading) files, especially for scripting purposes. For a simple file download, use the -O option and provide the URL to the file:

curl -O <url>

Without the -O option, curl will simply print the contents to the screen. This is the default behavior and is useful when piping the contents of a file as input into another command.
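
For example, the -s option suppresses the progress meter so that only the file contents are sent through the pipe; the following previews the first 10 lines of a remote file:

curl -s <url> | head -n 10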

Enter man curl or curl --help for more information and to view all available options. For even more information, see the official curl documentation.

0.0.4.5 Using aria2c

For large downloads, the aria2c program is a better choice because it offers multi-connection and parallel downloads that can reduce transfer times. For a simple file download, just provide the URL to the file:

aria2c <url>

For a large file, add the -x option to use multiple connections per file, which can reduce the download time. For example, the following command opens 4 connections:

aria2c -x4 <url>

You can also specify a list of URLs in a file using the -i option and then use the -j option to specify the number of files to download in parallel. For example, given a file urls.txt that lists a URL to a file on each line, the following command will download 4 of these files concurrently:

aria2c -i urls.txt -j4

Enter man aria2c or aria2c --help for more information and to view all available options. For even more information, see the official aria2 documentation.

→ Cloud storage: rclone  

0.0.4.6 Using rclone

For cloud storage, you can use the rclone module to transfer files: module load rclone. This requires some initial setup and configuration. For more information, see the guide for Transferring Files Using Rclone.
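
As a brief sketch, after a one-time interactive setup with rclone config, you could copy a directory to a configured cloud storage remote (the remote name myremote and bucket name mybucket are placeholders):

module load rclone

# One-time interactive configuration of a cloud storage remote
rclone config

# Copy a local directory to the configured remote
rclone copy /scratch1/ttrojan/data myremote:mybucket/data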

→ Code: git  

0.0.4.7 Using git

Git is a source-code management program useful for version control and collaborative development. You can use git commands to manage code repositories and push and pull changes to and from CARC systems. We recommend using a central remote repository at services like GitHub, GitLab, or BitBucket. You can develop code directly on CARC systems in a Git repository in one of your directories and use the remote repository to back up and sync changes. You can also develop code on your local computer as part of a Git repository, push changes to a remote repository, log in to CARC systems, and pull the changes to the corresponding repository located in one of your directories.
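
For example, a typical workflow (the repository URL is a placeholder) is to clone the remote repository once on CARC systems and then pull changes pushed from your local computer:

# On CARC systems: clone the remote repository once
git clone https://github.com/username/myproject.git

# Later, after pushing changes from your local computer to the remote repository
cd myproject
git pull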

Enter man git or git --help for more information. For even more information, see the official Git documentation.

 

0.0.5 Verifying file integrity after transfers

Regardless of which command-line transfer tool you use, you may wish to ensure file integrity after file transfers. Some of the tools described above have built-in options to verify file integrity — check the tool’s documentation to confirm this and learn how to use the option. Alternatively, you can use SHA-256 checksums, for example, to verify that files were successfully copied.

To generate checksums at the source directory, the exact command to use will differ depending on the system. On Linux, the command is sha256sum; on macOS, the command is shasum; and on Windows, the command is Get-FileHash. You may also be able to use GUI apps to generate checksums. Using Linux as an example, in the source directory enter a command similar to the following:

find . -type f -exec sha256sum '{}' \; > sha256sum.txt

This will generate the file sha256sum.txt. Copy this file to the destination directory where files were transferred, and then from that directory enter:

sha256sum -c sha256sum.txt

This compares the file checksums from the source with the file checksums in the destination and prints the results. The transfer was successful if all of the checksums match, as indicated by an OK status. Note that the sha256sum.txt file itself will fail because it was not originally present in the source directory.
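
For example, the equivalent commands on macOS, which provides the shasum utility, would be:

find . -type f -exec shasum -a 256 '{}' \; > sha256sum.txt

shasum -a 256 -c sha256sum.txt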