Step #3: Dataset Creation (Download from Internet)

If the dataset you want to use in Base Command isn’t already on your local system, or would transfer faster from elsewhere on the internet than from your workstation, a good option is to download it inside a job, save it to the job’s result directory, and then convert that result directory to a dataset.

First, on a system that is already authenticated to your target environment, run the batch job below. It uses the smallest available instance in this particular ACE to wget the target data (in this case, the Linux kernel source), extract the archive, and delete the source tar.xz file. Remember that datasets are read-only, so it’s important to get the data into the exact form you want for future use and to remove anything you don’t want to preserve.

$ ngc batch run --name "download-the-linux-kernel-to-result-dir" --preempt RUNONCE --min-timeslice 1s --total-runtime 0s --instance dgxa100.80g.1.norm --commandline "cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz" --result /results --image "nvidia/pytorch:22.04-py3"
--------------------------------------------------
  Job Information
    Id: 2818455
    Name: download-the-linux-kernel-to-result-dir
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
  Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
  Job Commands
    Command: cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz
    Dockerfile Image Entrypoint: False
  Datasets, Workspaces and Results
    Result Mount Point: /results
  Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
  Job Status
    Created at: 2022-04-18 19:13:54 UTC
    Status: STARTING
    Status Type: OK
    Preempt Class: RUNONCE
    Minimum Timeslice: 01S
--------------------------------------------------
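If you submit this kind of job from a script, it can help to assemble the command from its parts. The following is a minimal Python sketch, not part of the original lab. It uses only the flags shown in the command above; the function names are hypothetical. One deliberate difference: the in-container steps are chained with && rather than ; so that a failed wget stops the chain instead of falling through to rm, which would otherwise exit cleanly. How Base Command maps the exit code to the job status is an assumption here.

```python
import shlex
from typing import List

KERNEL_URL = "https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz"


def build_download_command(url: str, workdir: str = "/results") -> str:
    """Return the shell line the container runs: download, extract, clean up."""
    archive = url.rsplit("/", 1)[-1]  # e.g. linux-5.17.3.tar.xz
    steps = [
        f"cd {shlex.quote(workdir)}",
        f"wget {shlex.quote(url)}",
        f"tar -xvf {shlex.quote(archive)}",
        f"rm -rf {shlex.quote(archive)}",
    ]
    # Assumption: '&&' replaces the ';' used in the lab so a failure stops the chain.
    return " && ".join(steps)


def build_batch_run_argv(name: str, instance: str, image: str,
                         commandline: str, result: str = "/results") -> List[str]:
    """Return the argv for 'ngc batch run' using only flags shown in the lab."""
    return [
        "ngc", "batch", "run",
        "--name", name,
        "--preempt", "RUNONCE",
        "--min-timeslice", "1s",
        "--total-runtime", "0s",
        "--instance", instance,
        "--commandline", commandline,
        "--result", result,
        "--image", image,
    ]


if __name__ == "__main__":
    argv = build_batch_run_argv(
        name="download-the-linux-kernel-to-result-dir",
        instance="dgxa100.80g.1.norm",
        image="nvidia/pytorch:22.04-py3",
        commandline=build_download_command(KERNEL_URL),
    )
    # Print a copy-pasteable command; pass argv to subprocess.run to submit it.
    print(shlex.join(argv))
```

Passing the list to subprocess.run avoids a second layer of shell quoting around the --commandline value.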

To check the job’s status from the command line, run ngc batch info <job id>. The job ID is the Id value under the Job Information section of the output above (2818455 in this example):

$ ngc batch info 2818455
--------------------------------------------------
  Job Information
    Id: 2818455
    Name: download-the-linux-kernel-to-result-dir
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
  Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
    Container name: 0367882ff645d349a97f9fcf56e33b25bbf473907878d2797e575a0560fc7665
  Job Commands
    Command: cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz
    Dockerfile Image Entrypoint: False
  Datasets, Workspaces and Results
    Result Mount Point: /results
  Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
  Job Status
    Created at: 2022-04-18 19:13:54 UTC
    Started at: 2022-04-18 19:14:07 UTC
    Ended at: 2022-04-18 19:15:53 UTC
    Duration: 01M46S
    Status: FINISHED_SUCCESS
    Status Type: OK
    Preempt Class: RUNONCE
    Minimum Timeslice: 01S
--------------------------------------------------
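Rather than rerunning ngc batch info by hand, a script can poll until the Status field reads FINISHED_SUCCESS. The sketch below is one possible approach, not an official NGC API. It assumes the report prints one "Field: value" pair per line, as in the output above, and parse_field and wait_for_status are hypothetical helper names. Because this lab only shows the STARTING and FINISHED_SUCCESS states, the loop simply gives up after a fixed number of attempts instead of guessing at failure states.

```python
import re
import time
from typing import Callable, Optional


def parse_field(report: str, field: str) -> Optional[str]:
    """Return the value of a 'Field: value' line in a job report, or None."""
    pattern = rf"^[ \t]*{re.escape(field)}:[ \t]*(.*?)[ \t]*$"
    match = re.search(pattern, report, flags=re.MULTILINE)
    return match.group(1) if match else None


def wait_for_status(fetch_report: Callable[[], str],
                    wanted: str = "FINISHED_SUCCESS",
                    attempts: int = 30,
                    delay_s: float = 10.0,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll fetch_report() until its Status equals `wanted`, else raise TimeoutError."""
    status = None
    for attempt in range(attempts):
        status = parse_field(fetch_report(), "Status")
        if status == wanted:
            return status
        if attempt < attempts - 1:
            sleep(delay_s)  # no need to sleep after the final attempt
    raise TimeoutError(f"job never reached {wanted}; last status was {status}")


# Example wiring (requires an authenticated ngc CLI on PATH):
#
#   import subprocess
#   def fetch() -> str:
#       return subprocess.run(["ngc", "batch", "info", "2818455"],
#                             capture_output=True, text=True, check=True).stdout
#   wait_for_status(fetch)
```

The anchored pattern matches the "Status:" line only, so the neighbouring "Status Type:" line and the "Job Status" heading are not mistaken for it.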

Once the Status is FINISHED_SUCCESS, convert the contents of the /results directory into a dataset with the ngc dataset convert command, passing the same job ID.

$ ngc dataset convert --from-result 2818455 linux-kernel-from-cli
Dataset with ID: '99164' created in ACE: 'nv-launchpad-bc-sjc-ace'.
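A script that needs the new dataset's ID for later steps can read it from the confirmation line. This is a small sketch that assumes the message keeps the exact wording shown above; the helper names are hypothetical.

```python
import re
from typing import List, Tuple


def convert_argv(job_id: str, dataset_name: str) -> List[str]:
    """Return the argv for converting a job's result directory to a dataset."""
    return ["ngc", "dataset", "convert", "--from-result", str(job_id), dataset_name]


def parse_convert_output(output: str) -> Tuple[str, str]:
    """Extract (dataset_id, ace) from the confirmation message."""
    # Assumption: the message format matches the example in this lab.
    match = re.search(r"Dataset with ID: '([^']+)' created in ACE: '([^']+)'", output)
    if match is None:
        raise ValueError(f"unrecognized convert output: {output!r}")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    line = "Dataset with ID: '99164' created in ACE: 'nv-launchpad-bc-sjc-ace'."
    print(convert_argv("2818455", "linux-kernel-from-cli"))
    print(parse_convert_output(line))
```

Raising on an unrecognized message keeps a changed CLI output format from silently yielding a wrong ID.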

You can now list the datasets that your account owns, which will include the dataset just created from the job result (look for linux-kernel-from-cli).

$ ngc dataset list --owned
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| Id                | Integer Id | Name          | Description | ACE              | Shared | Size     | Status    | Created Date | Owned | Pre-pop |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| cHCMiESgRVeduwq8v | 99164      | linux-kernel- |             | nv-launchpad-bc- | No     | 1.03 GB  | COMPLETED | 2022-04-18   | Yes   | No      |
| FfIRQ             |            | from-cli      |             | sjc-ace          |        |          |           | 21:11:38 UTC |       |         |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
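The table wraps long cells (the Id, Name, ACE, and Created Date values above each span two lines), which makes it awkward to grep for a dataset name. The sketch below shows one way to stitch wrapped rows back together in Python. It assumes every record is closed by a +---+ border line, as in the sample, and that wrapped fragments join without a space except in columns named in space_joined. Both are assumptions about the CLI's formatting, and parse_ascii_table is a hypothetical helper.

```python
from typing import Dict, List, Sequence

SAMPLE = """\
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| Id                | Integer Id | Name          | Description | ACE              | Shared | Size     | Status    | Created Date | Owned | Pre-pop |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
| cHCMiESgRVeduwq8v | 99164      | linux-kernel- |             | nv-launchpad-bc- | No     | 1.03 GB  | COMPLETED | 2022-04-18   | Yes   | No      |
| FfIRQ             |            | from-cli      |             | sjc-ace          |        |          |           | 21:11:38 UTC |       |         |
+-------------------+------------+---------------+-------------+------------------+--------+----------+-----------+--------------+-------+---------+
"""


def parse_ascii_table(text: str,
                      space_joined: Sequence[str] = ("Created Date",)) -> List[Dict[str, str]]:
    """Parse a bordered ASCII table, merging cells that wrap over several lines."""
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    fragments: List[List[str]] = []  # physical lines of the record being read

    def flush() -> None:
        nonlocal header
        if not fragments:
            return
        columns = list(zip(*fragments))  # one tuple of fragments per column
        if not header:
            header = [" ".join(part for part in col if part) for col in columns]
        else:
            cells = []
            for name, col in zip(header, columns):
                sep = " " if name in space_joined else ""
                cells.append(sep.join(part for part in col if part))
            rows.append(dict(zip(header, cells)))
        fragments.clear()

    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("+"):  # a border line closes the current record
            flush()
        elif line.startswith("|") and line.endswith("|"):
            fragments.append([cell.strip() for cell in line[1:-1].split("|")])
    flush()
    return rows


if __name__ == "__main__":
    for row in parse_ascii_table(SAMPLE):
        print(row["Integer Id"], row["Name"], row["Status"])
```

If the CLI offers a machine-readable output format in your version, that is likely a sturdier choice than parsing the table.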

To create the same dataset using the Web UI instead, navigate to the Jobs link in the Base Command section of the NGC navigation pane.

step-03-image-001.png

Click on the Create Job button at the top of the Jobs page.

step-03-image-002.png

Populate the Create Job web form, using the following values (for any entries that are skipped, leave the default values as you find them):

  1. Select your ACE from the list.

  2. Select dgxa100.80g.1.norm in the INSTANCE table.

  3. Ignore the Datasets and Workspaces input tables for now.

  4. Enter /results for the Data Output field.

  5. Select the nvidia/pytorch container.

  6. Select the tag value of 22.04-py3.

  7. For the command, copy the following:

cd /results; wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz


  8. The remainder of the form can be left with the default values.

To verify that the correct options were selected, compare the ngc batch run command shown on your screen with the values in the screenshot below (the --name argument will differ, and the --ace and --org arguments may differ too). When ready, click the Launch Job button near the top right corner of the page.
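The same comparison can be done mechanically by pasting the generated command into a short script. The sketch below is illustrative only. The expected values come from the CLI example earlier in this step, and the web form may spell some of them differently (for example, the image reference), so treat a reported mismatch as a prompt to look rather than as proof of an error. It handles only the "--flag value" form, and the helper names are hypothetical.

```python
import shlex
from typing import Dict, List

# Values taken from the CLI example in this step; --name, --ace and --org are
# deliberately not checked because they are expected to differ.
EXPECTED = {
    "--instance": "dgxa100.80g.1.norm",
    "--result": "/results",
    "--image": "nvidia/pytorch:22.04-py3",
    "--commandline": (
        "cd /results; "
        "wget https://cdn.kernel.org/pub/linux/kernel/v5.x/linux-5.17.3.tar.xz; "
        "tar -xvf linux-5.17.3.tar.xz; rm -rf linux-5.17.3.tar.xz"
    ),
}


def parse_flags(command: str) -> Dict[str, str]:
    """Map each '--flag' in a command string to the token that follows it."""
    tokens = shlex.split(command)
    flags: Dict[str, str] = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
            flags[token] = tokens[i + 1] if has_value else ""
            i += 2 if has_value else 1
        else:
            i += 1
    return flags


def mismatches(command: str, expected: Dict[str, str] = EXPECTED) -> List[str]:
    """Return one message per expected flag that is missing or different."""
    flags = parse_flags(command)
    return [
        f"{flag}: expected {want!r}, found {flags.get(flag)!r}"
        for flag, want in expected.items()
        if flags.get(flag) != want
    ]
```

An empty list from mismatches(...) means every checked flag matched.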

To view workspaces, navigate to the Workspaces link underneath the Base Command menu.

step-03-image-003.png

The job shouldn’t take long to run - around 5 minutes assuming the ACE is relatively idle. You may need to refresh the page to verify that the job status is Finished Success.

step-03-image-004.png

Click on the newly completed job, and navigate to the Results tab to verify that the downloaded data has been successfully extracted. The presence of the joblog.log file is expected. It will become part of the dataset, but it can be ignored when reusing the dataset.

step-03-image-005.png
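Code that later reads the dataset may want to skip joblog.log explicitly. The following is a minimal sketch, assuming the log sits at the top level of the dataset as it does in the Results tab; dataset_files is a hypothetical helper and the mount path in the usage comment is a placeholder.

```python
from pathlib import Path
from typing import Iterator


def dataset_files(root: Path, log_name: str = "joblog.log") -> Iterator[Path]:
    """Yield every file under root except the top-level job log."""
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Only the log at the dataset root is skipped; a file with the same
        # name deeper in the tree is treated as real data.
        if path.relative_to(root) == Path(log_name):
            continue
        yield path


# Example usage (the mount point is a placeholder, not a path from this lab):
#
#   for f in dataset_files(Path("/path/to/mounted/dataset")):
#       print(f)
```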

To convert the results directory to a dataset, click on the vertical ellipsis near the top right corner of the page and click the Convert Results button.

step-03-image-006.png

A dialog box will appear - enter linux-kernel-from-web-ui in the Name field, and click the Convert button.

step-03-image-007.png

After a short time, the Results directory of the job used to create the dataset will become empty, which indicates that the dataset conversion has completed. Click the Datasets link in the navigation pane, just below the Jobs link used earlier.

step-03-image-008.png

The new dataset should appear in the top data row of the table on the Datasets page.

step-03-image-009.png

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.