Step #7: Workspace Conversion to a Dataset

After iterating on a workspace for some time, part or all of it may be ready to become a dataset. As with most other dataset creation processes, a job results directory can stage the data as it moves between locations. In the following example, the workspace we’ve been working with so far will be the source for a new dataset.

At a high level, the goal here is to combine a few different workflows that we have used separately so far:

  • Mount a workspace into a job

  • Copy data into a job result directory

  • Convert a job result directory into a dataset

The data can be either copied or moved. Copy it if the workspace contents will be iterated on further and turned into another, distinct dataset; move it if the data will only ever be needed again as a read-only dataset. For this job example, the data will be moved to avoid unnecessary data duplication.
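The choice between copying and moving only changes the command that the job runs inside the container. As a minimal sketch, the string passed to --commandline could be selected as follows. The helper name staging_command is hypothetical, the paths are the mount points used in this section, and the cp form is an assumption rather than something this guide runs.

```shell
# Hypothetical helper: print the in-container command for a given mode.
# "copy" leaves the workspace contents in place; "move" removes them from
# the workspace once they are in the results directory.
staging_command() {
    case "$1" in
        copy) echo "cp -r /mount/workspace/* /results" ;;
        move) echo "mv /mount/workspace/* /results" ;;
        *)    echo "usage: staging_command copy|move" >&2; return 1 ;;
    esac
}

# Usage sketch (not executed here):
# ngc batch run ... --commandline "$(staging_command move)" ...
```

The job in this section uses the move form.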

$ ngc batch run --name "move-data-out-of-workspace" --preempt RUNONCE --min-timeslice 0s --total-runtime 0s --instance dgxa100.80g.1.norm --commandline "mv /mount/workspace/* /results" --result /results --image "nvidia/pytorch:22.04-py3" --workspace workspace-from-cli:/mount/workspace:RW
--------------------------------------------------
  Job Information
    Id: 2873764
    Name: move-data-out-of-workspace
    Number of Replicas: 1
    Job Type: BATCH
    Submitted By: Joseph Handzik
  Job Container Information
    Docker Image URL: nvidia/pytorch:22.04-py3
  Job Commands
    Command: mv /mount/workspace/* /results
    Dockerfile Image Entrypoint: False
  Datasets, Workspaces and Results
    Workspace ID: JyrRwRmXQCuptjLaCFumkQ
    Workspace Name: workspace-from-cli
    Workspace Mount Point: /mount/workspace
    Workspace Mount Mode: RW
    Result Mount Point: /results
  Job Resources
    Instance Type: dgxa100.80g.1.norm
    Instance Details: 1 GPU, 30 CPU, 244 GB System Memory
    ACE: nv-launchpad-bc-sjc-ace
    Team: nvbc-tme
  Job Status
    Created at: 2022-05-03 20:33:05 UTC
    Status: STARTING
    Status Type: OK
    Preempt Class: RUNONCE
--------------------------------------------------
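Later steps need the job ID reported in the Id field of this output. One way to capture it in a script is sketched below, assuming the CLI prints each field on its own line. The helper name job_id_from_run_output is hypothetical.

```shell
# Hypothetical helper: read `ngc batch run` output on stdin and print the
# value of the first "Id:" field. A "Workspace ID:" line does not match,
# because the pattern requires "Id:" right after the leading indentation.
job_id_from_run_output() {
    awk '/^[ \t]*Id:/ { print $2; exit }'
}

# Usage sketch (not executed here):
# job_id="$(ngc batch run ... | job_id_from_run_output)"
```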

As in other sections, check the output of ngc batch list. Once this job’s status is FINISHED_SUCCESS, we can proceed.
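Instead of re-running the command by hand, the check can be scripted. The sketch below parses the ASCII table that ngc batch list prints and polls until the job reports FINISHED_SUCCESS. The helper names, the polling interval, and the attempt limit are all hypothetical, and the parsing assumes the table layout shown in this section (Id in the first column, Status in the fifth).

```shell
# Hypothetical helper: read the table from `ngc batch list` on stdin and
# print the Status column for the row whose Id matches $1.
job_status_from_table() {
    awk -F'|' -v id="$1" '
        NF >= 6 {
            job = $2; status = $6
            gsub(/[ \t]/, "", job)
            gsub(/[ \t]/, "", status)
            if (job == id) { print status; exit }
        }'
}

# Poll until the job reports FINISHED_SUCCESS, or give up.
# $1: job ID, $2: seconds between polls, $3: maximum number of attempts.
wait_for_job() {
    attempt=0
    status=""
    while [ "$attempt" -lt "$3" ]; do
        status="$(ngc batch list | job_status_from_table "$1")"
        if [ "$status" = "FINISHED_SUCCESS" ]; then
            return 0
        fi
        attempt=$((attempt + 1))
        sleep "$2"
    done
    echo "job $1 did not reach FINISHED_SUCCESS (last status: ${status:-unknown})" >&2
    return 1
}

# Usage sketch (not executed here): wait_for_job 2873764 30 20
```

The attempt limit keeps the loop from waiting forever if the job fails or is never scheduled.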

$ ngc batch list
+---------+----------+----------------------------+----------+------------------+----------+----------------+
| Id      | Replicas | Name                       | Team     | Status           | Duration | Status Details |
+---------+----------+----------------------------+----------+------------------+----------+----------------+
| 2873764 | 1        | move-data-out-of-workspace | nvbc-tme | FINISHED_SUCCESS | 0:03:05  |                |
| 2873341 | 1        | load-linux-into-workspace  | nvbc-tme | FINISHED_SUCCESS | 0:02:01  |                |
| 2870152 | 1        | aws-dl-cli                 | nvbc-tme | FINISHED_SUCCESS | 0:00:21  |                |
+---------+----------+----------------------------+----------+------------------+----------+----------------+

Check the workspace size again - it should be 0 bytes, because all data has been moved out of it.
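The size check can also be scripted before converting the results. The following sketch reads the table printed by ngc workspace list --owned and extracts the Size column for one workspace. The helper name is hypothetical, and the parsing assumes the column order shown in this section and a workspace name short enough that the CLI does not wrap it onto a second row.

```shell
# Hypothetical helper: read the table from `ngc workspace list --owned` on
# stdin and print the Size column for the workspace named $1.
workspace_size_from_table() {
    awk -F'|' -v name="$1" '
        NF >= 10 {
            ws = $3; size = $10
            gsub(/^[ \t]+|[ \t]+$/, "", ws)
            gsub(/^[ \t]+|[ \t]+$/, "", size)
            if (ws == name) { print size; exit }
        }'
}

# Usage sketch (not executed here):
# size="$(ngc workspace list --owned | workspace_size_from_table workspace-from-cli)"
# [ "$size" = "0 B" ] || echo "workspace still holds data: $size" >&2
```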

$ ngc workspace list --owned
+------------------------+--------------------+-------------+----------------------+------------------+--------+---------------------+-------+------+
| Id                     | Name               | Description | ACE                  | Creator Username | Shared | Created Date        | Owned | Size |
+------------------------+--------------------+-------------+----------------------+------------------+--------+---------------------+-------+------+
| JyrRwRmXQCuptjLaCFumkQ | workspace-from-cli |             | nv-launchpad-bc-sjc- | Joseph Handzik   | No     | 2022-05-02 19:34:40 | Yes   | 0 B  |
|                        |                    |             | ace                  |                  |        | UTC                 |       |      |
| a6CxaJmmQgK1OIcXR8-d5A | s3-demo-workspace  |             | nv-launchpad-bc-sjc- | Joseph Handzik   | No     | 2022-04-19 19:29:12 | Yes   | 0 B  |
|                        |                    |             | ace                  |                  |        | UTC                 |       |      |
+------------------------+--------------------+-------------+----------------------+------------------+--------+---------------------+-------+------+

We will convert the previous job’s results directory into a dataset as we have before, giving it a name that makes clear how it was created.

$ ngc dataset convert --from-result 2873764 linux-kernel-from-workspace
Dataset with ID: '99967' created in ACE: 'nv-launchpad-bc-sjc-ace'.
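If the new dataset ID is needed by a later script step, it can be pulled out of the confirmation message. This is a sketch that assumes the message keeps the wording shown here. The helper name dataset_id_from_convert_output is hypothetical.

```shell
# Hypothetical helper: read `ngc dataset convert` output on stdin and print
# the ID quoted after "Dataset with ID:".
dataset_id_from_convert_output() {
    sed -n "s/^Dataset with ID: '\([^']*\)'.*/\1/p"
}

# Usage sketch (not executed here):
# dataset_id="$(ngc dataset convert --from-result 2873764 linux-kernel-from-workspace \
#     | dataset_id_from_convert_output)"
```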

A final ngc dataset list run will confirm the new dataset exists as expected.

$ ngc dataset list --owned
+--------------------+------------+--------------------+--------------------+------------------+--------+-----------+-----------+--------------+-------+---------+
| Id                 | Integer Id | Name               | Description        | ACE              | Shared | Size      | Status    | Created Date | Owned | Pre-pop |
+--------------------+------------+--------------------+--------------------+------------------+--------+-----------+-----------+--------------+-------+---------+
| 8brma-E4RX6xVMmpP7 | 99967      | linux-kernel-from- |                    | nv-launchpad-bc- | No     | 1.02 GB   | COMPLETED | 2022-05-03   | Yes   | No      |
| OfpQ               |            | workspace          |                    | sjc-ace          |        |           |           | 20:45:49 UTC |       |         |
| r3KPjT13TgCFw3a8Jv | 99885      | s3-dl-from-cli     |                    | nv-launchpad-bc- | No     | 260.82 MB | COMPLETED | 2022-05-02   | Yes   | No      |
| ljow               |            |                    |                    | sjc-ace          |        |           |           | 17:29:25 UTC |       |         |
| D9xMj6VNR-         | 99209      | linux-kernel-from- |                    | nv-launchpad-bc- | No     | 1.03 GB   | COMPLETED | 2022-04-19   | Yes   | No      |
| OitmadRHwFig       |            | web-ui             |                    | sjc-ace          |        |           |           | 17:27:47 UTC |       |         |
| cHCMiESgRVeduwq8vF | 99164      | linux-kernel-from- |                    | nv-launchpad-bc- | No     | 1.03 GB   | COMPLETED | 2022-04-18   | Yes   | No      |
| fIRQ               |            | cli                |                    | sjc-ace          |        |           |           | 21:11:38 UTC |       |         |
| CzgFjq__SSqAb46hSj | 99150      | linux-nvbug-triage |                    | nv-launchpad-bc- | No     | 1.03 GB   | COMPLETED | 2022-04-18   | Yes   | No      |
| k5Nw               |            |                    |                    | sjc-ace          |        |           |           | 14:47:47 UTC |       |         |
| a5LuVPZPQhO7SUWMIz | 98828      | linux-kernel       | My favorite kernel | nv-launchpad-bc- | No     | 1.03 GB   | COMPLETED | 2022-04-14   | Yes   | No      |
| cpZw               |            |                    |                    | sjc-ace          |        |           |           | 22:32:50 UTC |       |         |
+--------------------+------------+--------------------+--------------------+------------------+--------+-----------+-----------+--------------+-------+---------+

To accomplish the migration of data from a workspace to a dataset in the Web UI, start at the Base Command Dashboard again, and click the Create Job button.

step-07-image-001.png

Populate the Create Job form as follows.

  1. Select your ACE from the list.

  2. Select dgxa100.80g.1.norm in the INSTANCE table.

  3. Select the WORKSPACES tab - click the box next to workspace-from-web-ui, and type /mount/workspace for its mount point, ensuring that the mount mode is RW.

  4. Enter /results for the Data Output field.

  5. Select the nvidia/pytorch container.

  6. Select the tag value of 22.04-py3.

  7. For the command, copy the following:

mv /mount/workspace/* /results


  8. The remainder of the form can be left with the default values.
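One detail of the command in step 7 is worth knowing. Assuming a standard shell expands it inside the container, the * pattern does not match names that begin with a dot, so hidden files would stay behind in the workspace. The sketch below demonstrates this locally with throwaway directories. The file names are made up for illustration, and nothing here touches a real workspace.

```shell
# Local demonstration with temporary directories; nothing here calls ngc.
demo_root="$(mktemp -d)"
mkdir "$demo_root/workspace" "$demo_root/results"
touch "$demo_root/workspace/data.txt" "$demo_root/workspace/.hidden"

# Same pattern as the job command: move everything matched by "*".
mv "$demo_root/workspace"/* "$demo_root/results"

# data.txt has moved; .hidden is still in the workspace directory.
left_behind="$(ls -A "$demo_root/workspace")"
moved="$(ls -A "$demo_root/results")"
echo "left behind: $left_behind"
echo "moved: $moved"

rm -rf "$demo_root"
```

If a workspace does contain dotfiles that belong in the dataset, the job command would need to move them explicitly.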

To verify that the correct options were selected, compare the ngc batch run command on your screen with the one in the screenshot below (the --name argument will differ, and the --ace and --org arguments may differ too). When ready, click the Launch Job button near the top right corner of the page.

step-07-image-002.png

The job shouldn’t take long to run - around 5 minutes assuming the ACE is relatively idle. You may need to refresh the page to verify that the job status is Finished Success. Click on the newly-completed job.

step-07-image-003.png

Navigate to the Results tab to verify that the data has been moved out of the workspace.

step-07-image-004.png

To convert the results directory to a dataset, click on the vertical ellipsis near the top right corner of the page and click the Convert Results button.

step-07-image-005.png

A dialog box will appear - enter s3-dl-from-workspace in the Name field, and click the Convert button.

step-07-image-006.png

After a short period of time, the Results directory for the job used to create the dataset will become empty, which indicates the dataset conversion has been executed. Click the Datasets link in the navigation pane, just below the Jobs link that has been previously used.

step-07-image-007.png

The new dataset should appear in the top data row of the table on the Datasets page.

step-07-image-008.png

© Copyright 2022-2023, NVIDIA. Last updated on Jan 10, 2023.