Step #4: Dataset Creation (Download from S3)

Datasets are often stored in object storage, either cloud-based or on-premises. Several APIs are commonly used to authenticate to and access object storage, and the S3 API is the most widely used. The following examples show how to download from a public, cloud-based S3 bucket without authentication. The steps are comparable to those in the “Download from Internet” section, but instead of using a command-line tool already built into the image, we need to download and install a CLI tool as part of the job command.

On the same system already authenticated with the target environment, submit the following batch run. It downloads the AWS CLI installer, unzips it, installs it, and runs an aws s3 sync command against a publicly available dataset.

$ ngc batch run --name "aws-dl-cli" --preempt RUNONCE --instance dgxa100.80g.1.norm --commandline "curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'" --result /results --image "nvidia/pytorch:22.04-py3"
--------------------------------------------------
Job Information
  Id: 2870152
  Name: aws-dl-cli
  Number of Replicas: 1
  Job Type: BATCH
  Submitted By: Joseph Handzik
Job Container Information
  Docker Image URL: nvidia/pytorch:22.04-py3
Job Commands
  Command: curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'
  Dockerfile Image Entrypoint: False
Datasets, Workspaces and Results
  Result Mount Point: /results
Job Resources
  Instance Type: dgxa100.80g.1.norm
  Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
  ACE: nv-launchpad-bc-sjc-ace
  Team: nvbc-tme
Job Status
  Created at: 2022-05-02 17:07:31 UTC
  Status: CREATED
  Preempt Class: RUNONCE
--------------------------------------------------
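The job command above separates its four steps with semicolons, so each step runs even if the previous one failed. A minimal sketch of a fail-fast variant follows, assuming you would rather stop at the first error. The helper name build_job_command is hypothetical and not part of the NGC CLI, and the curl flags -fsSL and the unzip flag -q are additions that are not in the original command. How the platform reports a failed step is not covered here.

```shell
# build_job_command is a hypothetical helper, not part of the NGC CLI.
# It prints the same four steps as the tutorial command, joined with '&&'
# so that a failing step stops the chain instead of falling through.
# Usage: build_job_command <s3-uri> <dest-dir> [region]
# Assumption: the arguments contain no single quotes.
build_job_command() (
  bucket=$1
  dest=$2
  region=${3:-us-east-1}
  # curl -f returns a nonzero status on HTTP errors, which lets '&&' stop the chain.
  printf '%s && %s && %s && %s\n' \
    "curl -fsSL 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'" \
    "unzip -q awscliv2.zip" \
    "./aws/install -i /usr/local/aws-cli -b /usr/local/bin" \
    "aws s3 sync --no-sign-request --region=$region '$bucket' '$dest'"
)

# Example (not executed here): pass the generated string to --commandline.
# ngc batch run --name "aws-dl-cli" --preempt RUNONCE \
#   --instance dgxa100.80g.1.norm \
#   --commandline "$(build_job_command 's3://safecast-opendata-public-us-east-1' '/results')" \
#   --result /results --image "nvidia/pytorch:22.04-py3"
```

The body runs in a subshell, so the helper's variables do not leak into the calling shell.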

Run an ngc batch list command to see this job. Once its Status becomes FINISHED_SUCCESS, we can proceed.

$ ngc batch list
+---------+----------+------------+----------+------------------+----------+----------------+
| Id      | Replicas | Name       | Team     | Status           | Duration | Status Details |
+---------+----------+------------+----------+------------------+----------+----------------+
| 2870152 | 1        | aws-dl-cli | nvbc-tme | FINISHED_SUCCESS | 0:00:21  |                |
+---------+----------+------------+----------+------------------+----------+----------------+
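Checking the Status column by hand can be scripted. The following is a rough sketch, assuming the table layout shown above, where Status is the fifth column. The helper names job_status and wait_for_success are hypothetical, and the retry count and delay are arbitrary defaults.

```shell
# job_status is a hypothetical helper: it reads the table printed by
# 'ngc batch list' on stdin and prints the Status cell for the given job Id.
# Assumption: columns appear in the order shown in this tutorial.
job_status() {
  awk -F'|' -v id="$1" '
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }
    $2 == id { print $6; exit }
  '
}

# wait_for_success is a hypothetical helper: it polls 'ngc batch list' until
# the job reports FINISHED_SUCCESS or the attempts run out.
# Usage: wait_for_success <job-id> [attempts] [delay-seconds]
wait_for_success() (
  id=$1
  tries=${2:-30}
  delay=${3:-10}
  while [ "$tries" -gt 0 ]; do
    status=$(ngc batch list | job_status "$id")
    if [ "$status" = "FINISHED_SUCCESS" ]; then
      exit 0   # leaves only this subshell, with success
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  exit 1       # gave up before the job reported success
)

# Example (not executed here):
# wait_for_success 2870152 && echo "job finished"
```

The loop treats every status other than FINISHED_SUCCESS as "keep waiting", so a job that has failed is only detected once the attempts are exhausted.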

Using the Id value from the above output, issue an ngc dataset convert command.

$ ngc dataset convert --from-result 2870152 s3-dl-from-cli
Dataset with ID: '99885' created in ACE: 'nv-launchpad-bc-sjc-ace'.
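If the new dataset ID is needed in a later script step, it can be pulled out of the confirmation message. This is a small sketch, assuming the message keeps exactly the wording shown above. The helper name dataset_id_from_convert is hypothetical.

```shell
# dataset_id_from_convert is a hypothetical helper: it reads the output of
# 'ngc dataset convert' on stdin and prints the numeric dataset ID.
# Assumption: the message format matches "Dataset with ID: '<n>' created ...".
# It prints nothing when no line matches.
dataset_id_from_convert() {
  sed -n "s/^Dataset with ID: '\([0-9][0-9]*\)' created.*/\1/p"
}

# Example (not executed here):
# dataset_id=$(ngc dataset convert --from-result 2870152 s3-dl-from-cli | dataset_id_from_convert)
# echo "$dataset_id"
```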

Verify the new dataset’s existence with an ngc dataset list --owned command - s3-dl-from-cli should be at the top of the list.

$ ngc dataset list --owned
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| Id          | Integer Id | Name        | Description | ACE        | Shared | Size      | Status    | Created    | Owned | Pre-pop |
|             |            |             |             |            |        |           |           | Date       |       |         |
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| r3KPjT13TgC | 99885      | s3-dl-from- |             | nv-        | No     | 260.82 MB | COMPLETED | 2022-05-02 | Yes   | No      |
| Fw3a8Jvljow |            | cli         |             | launchpad- |        |           |           | 17:29:25   |       |         |
|             |            |             |             | bc-sjc-ace |        |           |           | UTC        |       |         |
| D9xMj6VNR-O | 99209      | linux-      |             | nv-        | No     | 1.03 GB   | COMPLETED | 2022-04-19 | Yes   | No      |
| itmadRHwFig |            | kernel-     |             | launchpad- |        |           |           | 17:27:47   |       |         |
|             |            | from-web-ui |             | bc-sjc-ace |        |           |           | UTC        |       |         |
| cHCMiESgRVe | 99164      | linux-      |             | nv-        | No     | 1.03 GB   | COMPLETED | 2022-04-18 | Yes   | No      |
| duwq8vFfIRQ |            | kernel-     |             | launchpad- |        |           |           | 21:11:38   |       |         |
|             |            | from-cli    |             | bc-sjc-ace |        |           |           | UTC        |       |         |
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
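The table wraps long cells across several lines, so a dataset name such as s3-dl-from-cli never appears on a single line and a plain text search for it can miss. As a rough sketch, the wrapped Name fragments can be rejoined per dataset. The helper name list_datasets is hypothetical. It assumes the column order shown above and joins fragments without a separator, which fits names that wrap at a hyphen as they do here but would be wrong for names that wrap at a space.

```shell
# list_datasets is a hypothetical helper: it reads the table printed by
# 'ngc dataset list' on stdin and prints one line per dataset in the form
#   <Integer Id> <Name> <Status>
# A row with a numeric Integer Id starts a new dataset; the rows that follow
# with an empty Integer Id are treated as wrapped continuations of it.
list_datasets() {
  awk -F'|' '
    function emit() { if (id != "") print id, name, status }
    substr($0, 1, 1) == "|" {
      for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      if ($3 ~ /^[0-9]+$/) {
        emit()
        id = $3; name = $4; status = $9
      } else if (id != "") {
        name = name $4   # append the wrapped fragment of the Name cell
      }
    }
    END { emit() }
  '
}

# Example (not executed here):
# ngc dataset list --owned | list_datasets
```

Matching on the Integer Id column instead of the name avoids the wrapping problem entirely when the ID is already known.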

As with the previous rounds of data creation, we can step through the same process using the Web UI. Starting from the Base Command Dashboard screen, click the Create Job button in the Jobs Overview pane.

step-04-image-001.png

Populate the Create Job form as follows.

  1. Select your ACE from the list.

  2. Select dgxa100.80g.1.norm in the INSTANCE table.

  3. Ignore the Datasets and Workspaces input tables for now.

  4. Enter /results for the Data Output field.

  5. Select the nvidia/pytorch container.

  6. Select the tag value of 22.04-py3.

  7. For the command, copy the following:

curl 'https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip' -o 'awscliv2.zip'; unzip awscliv2.zip; ./aws/install -i /usr/local/aws-cli -b /usr/local/bin; aws s3 sync --no-sign-request --region=us-east-1 's3://safecast-opendata-public-us-east-1' '/results'


  8. The remainder of the form can be left with the default values.

To verify that the correct options were selected, compare the ngc batch run command on your screen with the one in the screenshot below (the --name argument will differ, and the --ace and --org arguments may differ too). When ready, click the Launch Job button near the top right corner of the page.

step-04-image-002.png

The job shouldn’t take long to run - around 5 minutes assuming the ACE is relatively idle. You may need to refresh the page to verify that the job status is Finished Success.

step-04-image-003.png

Click on the newly completed job, and navigate to the Results tab to verify that the data has been successfully downloaded.

step-04-image-004.png

To convert the results directory to a dataset, click on the vertical ellipsis near the top right corner of the page and click the Convert Results button.

step-04-image-005.png

A dialog box will appear - enter s3-dl-from-web-ui in the Name field, and click the Convert button.

step-04-image-006.png

After a short period of time, the Results directory for the job used to create the dataset will become empty, which indicates the dataset conversion has been executed. Click the Datasets link in the navigation pane, just below the Jobs link that has been previously used.

step-04-image-007.png

The new dataset should appear in the top data row of the table on the Datasets page.

step-04-image-008.png

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.