NVIDIA Docs Hub NVIDIA LaunchPad Build an AI Center of Excellence with Base Command Step #4: Dataset Creation (Download from S3)

Step #4: Dataset Creation (Download from S3)

There are many scenarios where datasets are stored in an object storage solution, either cloud-based or on-prem. There are a few different APIs commonly used to authenticate and access object storage, with the S3 API the most commonly used. The following examples will show you how to interact with a cloud-based S3 bucket without authentication. The steps will be comparable to the “Download from Internet” section, but instead of using a command line tool already built into the image, we’ll need to download and install a CLI tool as part of the job commands.

CLI

On the same system already authenticated with the target environment, submit the following batch run - it will download the awscli tool, unzip it, install it, and submit an s3 sync command to a publicly available dataset.

Copy
Copied!

            
            $ ngc batch run --name "aws-dl-cli" --preempt RUNONCE --instance dgxa100.80g.1.norm --commandline "curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'" --result /results --image "nvidia/pytorch:22.04-py3"
--------------------------------------------------
Job Information
    Id: 2870152
    Name: aws-dl-cli
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
Job Commands
    Command: curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'
    Dockerfile Image Entrypoint: False
Datasets, Workspaces and Results
    Result Mount Point: /results
Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
Job Status
    Created at: 2022-05-02 17:07:31 UTC
    Status: CREATED
    Preempt Class: RUNONCE
--------------------------------------------------

Submitting an ngc batch list command will list this job - once the Status becomes FINISHED_SUCCESS, we can proceed.

Copy
Copied!

            
            $ ngc batch list
+---------+----------+------------+----------+------------------+----------+----------------+
| Id      | Replicas | Name       | Team     | Status           | Duration | Status Details |
+---------+----------+------------+----------+------------------+----------+----------------+
| 2870152 | 1        | aws-dl-cli | nvbc-tme | FINISHED_SUCCESS | 0:00:21  |                |
+---------+----------+------------+----------+------------------+----------+----------------+

Using the Id value from the above output, issue an ngc dataset convert command.

Copy
Copied!

            
            $ ngc dataset convert --from-result 2870152 s3-dl-from-cli
Dataset with ID: '99885' created in ACE: 'nv-launchpad-bc-sjc-ace'.

Verify the new dataset’s existence with an ngc dataset list --owned command - s3-dl-from-cli should be at the top of the list.

Copy
Copied!

            
            $ ngc dataset list --owned
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| Id          | Integer Id | Name        | Description | ACE        | Shared | Size      | Status    | Created    | Owned | Pre-pop |
|             |            |             |             |            |        |           |           | Date       |       |         |
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| r3KPjT13TgC | 99885      | s3-dl-from- |             | nv-        | No     | 260.82 MB | COMPLETED | 2022-05-02 | Yes   | No      |
| Fw3a8Jvljow |            | cli         |             | launchpad- |        |           |           | 17:29:25   |       |         |
|             |            |             |             | bc-sjc-ace |        |           |           | UTC        |       |         |
| D9xMj6VNR-O | 99209      | linux-      |             | nv-        | No     | 1.03 GB   | COMPLETED | 2022-04-19 | Yes   | No      |
| itmadRHwFig |            | kernel-     |             | launchpad- |        |           |           | 17:27:47   |       |         |
|             |            | from-web-ui |             | bc-sjc-ace |        |           |           | UTC        |       |         |
| cHCMiESgRVe | 99164      | linux-      |             | nv-        | No     | 1.03 GB   | COMPLETED | 2022-04-18 | Yes   | No      |
| duwq8vFfIRQ |            | kernel-     |             | launchpad- |        |           |           | 21:11:38   |       |         |
|             |            | from-cli    |             | bc-sjc-ace |        |           |           | UTC        |       |         |
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+

Web UI

As with the previous rounds of data creation, we can step through the same process using the Web UI. Starting from the Base Command Dashboard screen, click the Create Job button in the Jobs Overview pane.

Populate the Create Job form as follows.

Select your ACE from the list.
Select dgxa100.80g.1.norm in the INSTANCE table.
Ignore the Datasets and Workspaces input tables for now.
Enter /results for the Data Output field.
Select the nvidia/pytorch container.
Select the tag value of 22.04-py3.
For the command, copy the following:

Copy
Copied!

            
            curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'

The remainder of the form can be left with the default values.

To verify that the correct options were selected, check the ngc batch run command on your screen with the values in the screenshot below (the --name argument will differ, and --ace and --org arguments may differ too). When ready, click the Launch Job button near the top right corner of the page.

The job shouldn’t take long to run - around 5 minutes assuming the ACE is relatively idle. You may need to refresh the page to verify that the job status is Finished Success.

Click on the newly-completed job, and navigate to the Results tab to verify that the downloaded data has been successfully downloaded.

To convert the results directory to a dataset, click on the vertical ellipsis near the top right corner of the page and click the Convert Results button.

A dialog box will appear - enter s3-dl-from-web-ui in the Name field, and click the Convert button.

After a short period of time, the Results directory for the job used to create the dataset will become empty, which indicates the dataset conversion has been executed. Click the Datasets link in the navigation pane, just below the Jobs link that has been previously used.

The new dataset should be present in the table on the Datasets page on the top data row.