Step #3: Dataset Creation (Download from Internet)

If the dataset you want to use in Base Command isn’t already on your local system, or would transfer faster from elsewhere on the internet than from your own workstation, a good option is to download it inside a job: save the data to a result directory, then convert that result directory to a dataset.

CLI

First, on a system that is already authenticated to use your target environment, run the batch job below. It uses the smallest available instance in this particular ACE to wget the target data (in this case, the Linux kernel source), un-tar the data, and delete the source tar.xz file. Remember that datasets are read-only, so it’s important to get the data into the exact form you want for future use and to remove any data you don’t want to preserve.

            $ ngc batch run --name "download-the-linux-kernel-to-result-dir" --preempt RUNONCE --min-timeslice 1s --total-runtime 0s --instance dgxa100.80g.1.norm --commandline "cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz" --result /results --image "nvidia/pytorch:22.04-py3"
--------------------------------------------------
Job Information
    Id: 2818455
    Name: download-the-linux-kernel-to-result-dir
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
Job Commands
    Command: cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz
    Dockerfile Image Entrypoint: False
Datasets, Workspaces and Results
    Result Mount Point: /results
Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
Job Status
    Created at: 2022-04-18 19:13:54 UTC
    Status: STARTING
    Status Type: OK
    Preempt Class: RUNONCE
    Minimum Timeslice: 01S
--------------------------------------------------
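When these steps are scripted, the job Id can be captured from the command output instead of being copied by hand. The following is a minimal sketch, assuming the output keeps the "Id: <number>" layout shown above; the function name extract_job_id is hypothetical and not part of the NGC CLI.

```shell
# Hypothetical helper: reads `ngc batch run` (or `ngc batch info`) output on
# stdin and prints the value of the first line whose first field is "Id:".
# Assumes the ASCII layout shown above ("    Id: 2818455").
extract_job_id() {
    awk '$1 == "Id:" { print $2; exit }'
}
```

For example, piping the ngc batch run command above into extract_job_id and storing the result in a shell variable makes the Id available to the later ngc batch info and ngc dataset convert steps.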

To check on the job’s status from the command line, run ngc batch info <job id>. In the output above, under the Job Information section, the Id value is 2818455:

            $ ngc batch info 2818455
--------------------------------------------------
Job Information
    Id: 2818455
    Name: download-the-linux-kernel-to-result-dir
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
    Container name: 0367882ff645d349a97f9fcf56e33b25bbf473907878d2797e575a0560fc7665
Job Commands
    Command: cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz
    Dockerfile Image Entrypoint: False
Datasets, Workspaces and Results
    Result Mount Point: /results
Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
Job Status
    Created at: 2022-04-18 19:13:54 UTC
    Started at: 2022-04-18 19:14:07 UTC
    Ended at: 2022-04-18 19:15:53 UTC
    Duration: 01M46S
    Status: FINISHED_SUCCESS
    Status Type: OK
    Preempt Class: RUNONCE
    Minimum Timeslice: 01S
--------------------------------------------------
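Rather than rerunning ngc batch info by hand, the check can be repeated in a loop until the job succeeds. This is a sketch only, assuming the "Status:" line keeps the layout shown above; the helper names job_status and wait_for_job, the polling interval, and the poll limit are all hypothetical choices. Statuses other than FINISHED_SUCCESS are not interpreted here, so a failed job simply runs into the poll limit.

```shell
# Hypothetical helper: prints the Status field from `ngc batch info <job id>`.
# "Status Type:" lines are skipped because their first field is "Status"
# without the trailing colon.
job_status() {
    ngc batch info "$1" | awk '$1 == "Status:" { print $2; exit }'
}

# Poll until the job reports FINISHED_SUCCESS.
# $1 = job id, $2 = seconds between polls (default 30),
# $3 = maximum number of polls (default 60).
# Returns 0 on FINISHED_SUCCESS, 1 if the poll limit is reached first.
wait_for_job() {
    job_id=$1
    interval=${2:-30}
    max_polls=${3:-60}
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        status=$(job_status "$job_id")
        echo "job $job_id: ${status:-unknown}"
        [ "$status" = "FINISHED_SUCCESS" ] && return 0
        polls=$((polls + 1))
        sleep "$interval"
    done
    return 1
}
```

Calling wait_for_job 2818455 would poll the job from this walkthrough every 30 seconds, for up to 60 polls.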

With a Status of FINISHED_SUCCESS, we can convert the /results directory’s contents into a dataset by using the ngc dataset convert command, which relies on the same job id value.

            $ ngc dataset convert --from-result 2818455 linux-kernel-from-cli
Dataset with ID: '99164' created in ACE: 'nv-launchpad-bc-sjc-ace'.
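If the new dataset's ID is needed later in a script, it can be pulled out of the confirmation message. A minimal sketch, assuming the message keeps the exact wording shown above; extract_dataset_id is a hypothetical helper, not an NGC CLI feature.

```shell
# Hypothetical helper: reads `ngc dataset convert` output on stdin and prints
# the ID found between the single quotes after "Dataset with ID:".
# Prints nothing if no such line is present.
extract_dataset_id() {
    sed -n "s/^Dataset with ID: '\([^']*\)'.*/\1/p"
}
```

Piping the ngc dataset convert command above into extract_dataset_id would print the ID reported in that message.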

Now list the datasets that your account owns; the list will include the new dataset just created from the previous job result (look for linux-kernel-from-cli).

            $ ngc dataset list --owned
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| Id                | Integer Id | Name          | Description | ACE              | Shared | Size     | Status    | Created Date | Owned | Pre-pop |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| cHCMiESgRVeduwq8v | 99164      | linux-kernel- |             | nv-launchpad-bc- | No     | 1.03 GB  | COMPLETED | 2022-04-18   | Yes   | No      |
| FfIRQ             |            | from-cli      |             | sjc-ace          |        |          |           | 21:11:38 UTC |       |         |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+

Web UI

To create the same dataset using the Web UI instead, navigate to the Jobs link in the Base Command section of the NGC navigation pane.

Click on the Create Job button at the top of the Jobs page.

Populate the Create Job web form, using the following values (for any entries that are skipped, leave the default values as you find them):

Select your ACE from the list.
Select dgxa100.80g.1.norm in the INSTANCE table.
Ignore the Datasets and Workspaces input tables for now.
Enter /results for the Data Output field.
Select the nvidia/pytorch container.
Select the tag value of 22.04-py3.
For the command, copy the following:

            cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz

The remainder of the form can be left with the default values.
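Because the command above separates its steps with semicolons, each step runs even if the previous one failed; for example, the archive would still be deleted after an incomplete extraction. As an optional variation, and not part of the original walkthrough, the extract-and-delete steps can be chained with && so the archive is removed only after a successful extraction. The helper name extract_and_remove is hypothetical.

```shell
# Hypothetical helper: extract an archive into a directory and delete the
# archive only if extraction succeeded.
# $1 = archive path, $2 = destination directory (must already exist).
extract_and_remove() {
    tar -xf "$1" -C "$2" && rm -f "$1"
}
```

The same idea applies directly in the command field by replacing each semicolon with &&, which also stops the chain early if the cd or wget step fails.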

To verify that the correct options were selected, compare the ngc batch run command shown on your screen with the values in the screenshot below (the --name argument will differ, and the --ace and --org arguments may differ too). When ready, click the Launch Job button near the top right corner of the page.

To view workspaces, navigate to the Workspaces link underneath the Base Command menu.

The job shouldn’t take long to run - around 5 minutes assuming the ACE is relatively idle. You may need to refresh the page to verify that the job status is Finished Success.

Click on the newly completed job, and navigate to the Results tab to verify that the downloaded data has been successfully extracted. The presence of the joblog.log file is expected; it will become part of the dataset, but can be ignored when reusing the dataset.

To convert the results directory to a dataset, click on the vertical ellipsis near the top right corner of the page and click the Convert Results button.

A dialog box will appear - enter linux-kernel-from-web-ui in the Name field, and click the Convert button.

After a short period of time, the Results directory for the job used to create the dataset will become empty, which indicates that the dataset conversion has completed. Click the Datasets link in the navigation pane, just below the Jobs link used earlier.

The new dataset should appear in the top data row of the table on the Datasets page.