Moving Data from S3 using s5cmd

This example demonstrates how to create a Batch Job that downloads data from a private S3 bucket. A Batch Job is used because this task is non-interactive. The workload is configured with credentials to access the S3 bucket, then runs the commands needed to install the s5cmd tool and download the data from that bucket to a mounted storage volume.

Prerequisites

The following are required before running the data download workload:

  • Storage associated with your node group
  • A private S3 bucket with data you want to download
  • AWS access credentials (access key ID and secret access key) with permissions to read from the S3 bucket
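
Before configuring the job, you can optionally confirm that the credentials can read the bucket. The following is a minimal sketch, assuming the AWS CLI is installed on your local machine; your-s3-bucket-name-here is a placeholder for your actual bucket name:

# Export the credentials and list the bucket to confirm read access
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
aws s3 ls s3://your-s3-bucket-name-here/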

Create AWS Credentials

To securely provide AWS credentials to your workload, you'll need to create secrets in DGX Cloud Lepton.

  1. Navigate to the Settings page in your workspace and click Secrets on the left side.
  2. Click + New Secret and select Custom Secret.
  3. Configure the secret:
    • Give it a descriptive name like AWS_SECRET_ACCESS_KEY.
    • Set the value to your AWS secret access key.
    • Set the visibility to Private if you do not want the secret to be used by other users in the workspace.
  4. Click Add to save.
  5. Repeat steps 2-4 for your AWS access key ID, giving it a name like AWS_ACCESS_KEY_ID.

Create the Batch Job

Navigate to the create job page to configure your data download job.

  • Job Name: Set a descriptive name like s3-data-download.
  • Resource: Select the desired node group and resource shape. A CPU-only resource is sufficient for this workload.
  • Image: Use the built-in image.
  • Run Command: Copy the following script and replace your-s3-bucket-name-here with your actual S3 bucket name:

# Download and extract s5cmd
wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
tar -xf s5cmd_2.2.2_Linux-64bit.tar.gz

# Download data from S3 to mounted storage
# (quote the wildcard so the shell passes it to s5cmd unexpanded)
./s5cmd cp 's3://your-s3-bucket-name-here/*' /dataset/
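
Optionally, a more defensive version of the same run command can make failures easier to diagnose. The following is a sketch rather than a required form, assuming the run command executes under bash; it adds standard shell error handling and logs a file-count summary:

# Fail the job on any error, unset variable, or broken pipe
set -euo pipefail

# Download and extract s5cmd
S5CMD_VERSION=2.2.2
wget -q "https://github.com/peak/s5cmd/releases/download/v${S5CMD_VERSION}/s5cmd_${S5CMD_VERSION}_Linux-64bit.tar.gz"
tar -xf "s5cmd_${S5CMD_VERSION}_Linux-64bit.tar.gz"

# Download data from S3 to the mounted storage
./s5cmd cp 's3://your-s3-bucket-name-here/*' /dataset/

# Log a summary so the transfer is easy to verify from the LOGS tab
echo "Downloaded $(find /dataset -type f | wc -l) files to /dataset"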

Select Advanced Configuration, go to Environment Variables, and click Add Secret. Add your AWS access key ID and secret access key secrets as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables; s5cmd reads these standard AWS variables to authenticate with S3. This should look like the image below.

Add AWS secrets as environment variables
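
To fail fast when the secrets are not attached correctly, you can optionally prepend a check to the run command. A minimal sketch, assuming the environment variable names AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the step above:

# Abort early with a clear message if either credential is missing
if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
  echo "AWS credentials are not set; check the job's environment variable secrets" >&2
  exit 1
fi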

In the Storages section, click + Mount Storage. Select the storage volume where you want to download the data. Enter your desired source path within the volume and /dataset as the mount path (the run command above copies data into /dataset, so the mount path must match).

Click CREATE to submit your job to the queue. The job will automatically start when cluster resources become available.

Monitor Progress

After creating the job, you will be redirected to the job overview page, where you can monitor the job status.

Once the status shows "Completed", click on the job name and go to the LOGS tab to verify that the data transfer succeeded. The logs should show s5cmd output indicating a successful transfer from your S3 bucket to the mounted storage.
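
As an additional spot check, you can compare the number of objects in the bucket with the number of files that landed on the volume. A sketch, assuming s5cmd and the credentials are available where each command runs:

# Count objects in the bucket (run wherever s5cmd and the credentials are available)
./s5cmd ls 's3://your-s3-bucket-name-here/*' | wc -l

# Count files on the mounted volume (run inside a workload with the same mount)
find /dataset -type f | wc -l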

Copyright © 2025, NVIDIA Corporation.