Base Command Platform

1. Introduction to NVIDIA Base Command Platform

NVIDIA Base Command Platform is a comprehensive platform for businesses, their data scientists, and IT teams. It is offered as a ready-to-use, cloud-hosted solution that manages the end-to-end lifecycle of AI development, including AI workflows and resource management.

NVIDIA Base Command Platform provides:

  • A set of cloud-hosted tools that let data scientists access the AI infrastructure without interfering with each other.

  • A comprehensive cloud-based UI and a complete command-line API for efficiently executing AI workloads with right-sized resources, from a single GPU to a multi-node cluster, along with dataset management for quick delivery of production-ready models and applications.

  • A built-in telemetry feature to validate deep learning techniques, workload settings, and resource allocations as part of a constant improvement process.

  • Reporting and showback capabilities for business leaders who want to measure AI projects against business goals, and for team managers who need to set project priorities and forecast compute capacity needs.

1.1. NVIDIA Base Command Platform Terms and Concepts

The following describes common NVIDIA Base Command Platform terms used in this document.

NVIDIA Base Command Platform Terms

Term

Definition

Accelerated Computing Environment (ACE)

An ACE is a cluster or an availability zone. Each ACE has separate storage, compute, and networking.

NGC Catalog

NGC Catalog is a curated set of GPU-optimized software maintained by NVIDIA and accessible to the general public.

It consists of containers, pre-trained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits with software development kits (SDKs).

Container Images

All applications running in NGC are containerized as Docker containers and execute in the Base Command Platform runtime environment. Containers are stored in the NGC Container Registry, nvcr.io, which is accessible from both the CLI and the Web UI.

Container Port

Opening a port when creating a job creates a URL that can be used to reach the container on that port using web protocols. The security of web applications (e.g., JupyterLab) accessed this way is the user’s responsibility. See the Security Note below.

Dataset

Datasets are the data inputs to a job, mounted as read-only to the location specified in the job. They can contain data or code. Datasets are covered in detail in the Datasets section.

Data Results

A result is a read-write mount specified by the job and captured by the system. All data written to the result is available once the job completes, along with the contents of stdout and stderr.

Instance

The instance determines the number of CPU cores, RAM size, and the type and number of GPUs available to the job. Instance types from one to eight GPUs are available depending on the ACE.

Job

A job is the fundamental unit of computation: a container running on an NVIDIA Base Command Platform instance in an ACE. A job is defined by the set of attributes specified at submission.

Job Definition

The attributes that define a job.

Job Command

Each Job can specify a command to run inside the container. The command can be as simple or as complex as needed, as long as quotes are properly escaped.

Jobs - Multinode

A job that is run on multiple nodes.

Models

NGC offers a collection of state-of-the-art pre-trained deep learning models that can be easily used out of the box, re-trained, or fine-tuned.

Org

The enterprise organization with its own registry space. Users are assigned to (or belong to) an org.

Team

A sub-unit within an organization with its own registry space. Only members of the same team have access to that team’s registry space.

Users

Anyone with a Base Command Platform account. Users are assigned to an org.

Private Registry

The NGC private registry provides you with a secure space to store and share custom containers, models, resources, and Helm charts within your enterprise.

Quota

Every user is assigned a default GPU quota and a default initial storage quota. The GPU quota defines the maximum number of GPUs a user account can use concurrently. Your storage assets (datasets, results, and workspaces) count towards your storage quota.

Resources

NGC offers step-by-step instructions and scripts for creating deep learning models that you can share within teams or the org.

Telemetry

Base Command Platform provides time-series metric data collected from various system components such as GPU, Tensor Cores, CPU, Memory, and I/O.

Workspaces

Workspaces are shareable, read-write, persistent storage that can be mounted in jobs for concurrent use. Mounting a workspace in read-write mode (the default) works well for use as a checkpoint folder. Workspaces can also be mounted in read-only mode, making them ideal for configuration, code, or input use cases, with the assurance that the job cannot modify or corrupt the data.
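The following CLI sketch shows how several of these concepts (instance, dataset, workspace, result, container port, and job command) come together when submitting a job. The ACE, dataset ID, workspace name, and image are placeholders, and exact flag names may vary slightly between CLI versions.

$ ngc base-command run \
    --name "example-job" \
    --ace <ace-name> \
    --instance dgxa100.40g.1.norm \
    --image "nvidia/pytorch:21.02-py3" \
    --datasetid <dataset-id>:/data \
    --workspace <workspace-name>:/workspace:RW \
    --result /result \
    --port 8888 \
    --commandline "python /workspace/train.py"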

1.1.1. Security Note

The security of web applications (e.g. JupyterLab) hosted by user jobs and containers is the customer’s responsibility. The Base Command Platform provides a unique URL to access this web application, and ANY user with that URL will have access to that application. Here are a few recommendations to protect your web applications:

  1. Implement appropriate authentication mechanisms to protect your application (see the example following this list).

  2. By default, a subdomain under nvbcp.com, which is a shared domain, is used. If you use cookie-based authentication, you are advised to set the cookie against your FQDN, not just the subdomain.

  3. If internal users access the application, you may limit access only from your corporate network, behind the firewall and VPN.

  4. Consider the URL confidential, and only share it with authorized users (unless you have appropriate authentication controls implemented, as in item 1 above).
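For example, if you expose JupyterLab through a container port, one minimal mitigation (a sketch only, reusing the JupyterLab options shown elsewhere in this document) is to require a token rather than disabling authentication:

$ jupyter lab --NotebookApp.token="<choose-a-strong-token>" --notebook-dir=/ &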

2. Onboarding and Signing Up

This chapter walks you through the process of setting up your NVIDIA Base Command Account. In this chapter you will learn about signing up, signing in, installing and configuring CLI, and selecting and switching your team context.

2.1. Inviting Users

This section is for org or team administrators (with User Admin role) and describes the process for inviting (adding) users to NVIDIA Base Command Platform.

As the organization administrator, you must create user accounts to allow others to use the NVIDIA Base Command Platform within the organization.

  1. Log on to the NGC web UI and select the NGC Org associated with NVIDIA Base Command Platform.

  2. Click Organization > Users from the left navigation menu.

    _images/image38.png

    This capability is available only to User Admins.

  3. Click Invite New User on the top right corner of the page.

    _images/new-ngc-invite-user.png
  4. On the new page, fill out the User Information section. Enter a display name in the First Name field and the email address that will receive the invitation email.

    _images/add-user.png
  5. In the Roles section, select the appropriate context (either the organization or a specific team) and the available roles shown in the boxes below. Click Add Role to the right to save your changes. You can add or remove multiple roles before creating the user.

    _images/user-roles.png

    The following are brief descriptions of the user roles:

    NVIDIA Base Command Platform Roles

    Role

    Description

    Base Command Admin

    Admin persona with the capabilities to manage all artifacts available in Base Command Platform. The capabilities of the Admin role include resource allocation and access management.

    Base Command Viewer

    Admin persona with read-only access to jobs, workspaces, datasets, and results within the user’s org or team.

    Registry Admin

    Registry Admin persona for managing NGC Private Registry artifacts and with the capability for Registry User Management. The capabilities of the Registry Admin role include the capabilities of all Registry roles.

    Registry Read

    Registry User persona with capabilities to only consume the Private Registry artifacts.

    Registry User

    Registry User persona with the capabilities to publish and consume the Private Registry artifacts.

    User Admin

    User Admin persona with the capabilities to only manage users.

    Refer to the section Assigning Roles for additional information.

  6. After adding roles, double-check all the fields and then click Create User on the top right. An invitation email will automatically be sent to the user.

    _images/create-user-btn.png
  7. Users that still need to accept their invitation emails are displayed in the Pending Invitations list on the Users page.

    _images/users-pending-invitations.png

2.2. Joining an NGC Org or Team

Before using NVIDIA Base Command Platform, you must have an NVIDIA Base Command Platform account created by your organization administrator. You need an email address to set up an account. Activating an account depends on whether your email domain is mapped to your organization’s single sign-on (SSO). Choose one of the following processes depending on your situation for activating your NVIDIA Base Command Platform account.

2.2.1. Joining an NGC Org or Team Using Single Sign-on

This section describes activating an account where the domain of your email address is mapped to an organization’s single sign-on.

After NVIDIA or your organization administrator adds you to a new org or team within the organization, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the link in the email to open your organization’s single sign-on page.

  2. Sign in using your single sign-on credentials.

    The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  3. Select the organization and team under which you want to log in and then click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

2.2.2. Joining an Org or Team with a New NVIDIA Account

This section describes activating a new account where the domain of your email address is not mapped to an organization’s single sign-on.

After NVIDIA or your organization administrator sets up your NVIDIA Base Command account, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the Sign In link to open the sign in dialog in your browser.

    _images/create-an-account.png
  2. Fill out your information, create a password, agree to the Terms and Conditions, and click Create Account.

    You will need to verify your email.

    _images/image6.png

    The verification email is sent.

    _images/image3.png
  3. Open the email and then click Verify Email Address.

    _images/image11.png
    _images/image24.png
  4. Select your options for using recommended settings and receiving developer news and announcements, and then click Submit.

  5. Agree to the NVIDIA Account Terms of Use, select desired options, and then click Continue.

    _images/account-tou.png
  6. Click Accept at the NVIDIA GPU Cloud Terms of Use screen.

    _images/image32.png
  7. The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  8. Select the organization and team under which you want to log in and click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

2.2.3. Joining an Org or Team with an Existing NVIDIA Account

This section describes activating an account where the domain of your email address is not mapped to an organization’s single sign-on (SSO).

After NVIDIA or your organization administrator adds you to a new org or team within the organization, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the Sign In link to open the sign in dialog in your browser.

    _images/image42.png
  2. Enter your password and then click Log In.

    The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  3. Select the organization and team under which you want to log in and click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

3. Signing in to Your Account

During the initial account setup, you are signed into your NVIDIA Base Command Platform account on the NGC website. This section describes the sign-in process for subsequent sessions. It also describes the web UI sections of NVIDIA Base Command Platform at a high level, including the UI areas for accessing available artifacts and the actions available to various user roles.

  1. Open https://ngc.nvidia.com and click Continue next to one of the sign-in options, depending on your account.

    • NVIDIA Account: Select this option if single sign-on (SSO) is not available.

    • Single Sign-on (SSO): Select this option to use your organization’s SSO. You may need to verify with your organization or Base Command Platform administrator whether SSO is enabled.

    _images/login-selection.png
  2. Continue to sign in using your organization’s single sign-on.

  3. Set the organization you wish to sign in under, then click Continue.

You can always change to a different org or team that you are a member of after logging in.

The following image and table describe the main features in the left navigation menu of the web site, including the controls for changing the org or team.

_images/image31.png
NGC Web UI Sections

  1. CATALOG: Click this menu to access a curated set of GPU-optimized software. It consists of containers, pre-trained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits with software development kits (SDKs) that are periodically released by NVIDIA and are read-only for a Base Command Platform user.

  2. PRIVATE REGISTRY: Click this menu to access the secure space to store and share custom containers, models, resources, and Helm charts within your enterprise.

  3. BASE COMMAND: Click this menu to access controls for creating and running Base Command Platform jobs.

  4. ORGANIZATION: (User Admins only) Click this menu to manage users and teams.

  5. User Info: Select this drop-down list to view user information, select the org to operate under, and download the NGC CLI and API key, described later in this document.

  6. Team Selection: Select this drop-down list to select which team to operate under.

4. Introduction to the NGC CLI

This chapter introduces the NGC Base Command Platform CLI, which you can install on your workstation to interface with Base Command Platform. In this section you will learn about generic features of the CLI that apply to all commands, as well as the CLI modules that map to the Web UI areas covered in the previous chapter.

The NGC Base Command Platform CLI is a command-line interface for managing content within the NGC Registry and for interfacing with the NVIDIA Base Command Platform. The CLI operates within a shell and lets you use scripts to automate commands.

With NGC Base Command Platform CLI, you can connect with:

  • NGC Catalog

  • NGC Private Registry

  • User Management (available to org or team User Admins only)

  • NVIDIA Base Command Platform workloads and entities

4.1. About NGC CLI for NVIDIA Base Command Platform

The NGC CLI is available to you if you are logged in with your own NGC account or with an NVIDIA Base Command Platform account, and with it you can:

  • View a list of GPU-accelerated Docker containers available to you as well as detailed information about each container image.

  • See a list of deep-learning models and resources as well as detailed information about them.

  • Download container images, models, and resources.

  • Upload and optionally share container images, models, and resources.

  • Create and manage users and teams (available to administrators).

  • Launch and manage jobs from the NGC registry.

  • Download, upload and optionally share datasets for jobs.

  • Create and manage workspaces for use in jobs.

4.2. Generating Your NGC API Key

This section describes how to obtain an API key needed to configure the CLI application so you can use the CLI to access locked container images from the NGC Catalog, access content from the NGC Private Registry, manage storage entities, and launch jobs.

The NGC API key is also used for docker login to manage container images in the NGC Private Registry with the docker client.

  1. Sign in to the NGC web UI.

    1. From a browser, go to the NGC sign-in page and then enter your email.

    2. Click Continue next to the Sign in with Enterprise option.

    3. Enter the credentials for your organization.

  2. In the top right corner, click your user account icon and then select an org that belongs to the NVIDIA Base Command Platform account.

  3. Click your user account icon again and select Setup.

    _images/image13.png
  4. Click Get API key to open the Setup > API Key page.

  5. Click Get API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.

  6. Click Confirm to generate the key.

    Your API key appears.

    You only need to generate an API key once. NGC does not save your key, so store it in a secure place. (You can copy your API key to the clipboard by clicking the copy icon to the right of the API key.)

    Should you lose your API key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.
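As a sketch of the docker login usage mentioned at the beginning of this section, the conventional pattern is to authenticate against nvcr.io with $oauthtoken as the username and your API key as the password:

$ docker login nvcr.io
Username: $oauthtoken
Password: <your NGC API key>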

4.3. Installing NGC CLI

To install NGC CLI, perform the following:

  1. Log in to your NVIDIA Base Command Platform account on the NGC website (https://ngc.nvidia.com).

  2. In the top right corner, click your user account icon and select an org that belongs to the Base Command Platform account.

  3. From the user account menu, select Setup, then click Downloads under CLI from the Setup page.

  4. From the CLI Install page, click the Windows, Linux, or macOS tab, according to the platform from which you will be running NGC CLI.

  5. Follow the Install instructions that appear on the OS section that you selected.

  6. Verify the installation by entering ngc --version. The output should be NGC CLI x.y.z where x.y.z indicates the version.

4.4. Getting Help Using NGC CLI

This section describes how to get help using NGC CLI.

Note

The ngc batch commands have been replaced with ngc base-command or simply ngc bc. The new commands provide the same functionality as their predecessors. Note that the old ngc batch commands are now deprecated and will be phased out in a future release.
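For example, a listing command written in the deprecated form maps directly to the new forms:

$ ngc batch list          # deprecated
$ ngc base-command list   # replacement
$ ngc bc list             # short alias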

4.4.1. Getting Help from the Command Line

To run an NGC CLI command, enter ngc followed by the appropriate options.

To see a description of available options and command descriptions, use the option -h after any command or option.

Example 1: To view a list of all the available options for the ngc command, enter

$ ngc -h

Example 2: To view a description of all ngc base-command commands and options, enter

$ ngc base-command -h

Example 3: To view a description of the dataset commands, enter

$ ngc dataset -h

4.4.2. Viewing NGC CLI Documentation Online

The NGC Base Command Platform CLI documentation provides a reference for all the NGC Base Command Platform CLI commands and arguments. You can also access the CLI documentation from the NGC web UI by selecting Setup from the user drop down list and then clicking Documentation from the CLI pane.

4.5. Configuring the CLI for your Use

To make full use of NGC Base Command Platform CLI, you must configure it with your API key using the ngc config set command.

While there are options you can use for each command to specify org and team, as well as the output type and debug mode, you can also use the ngc config set command to establish these settings up front.
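For instance, assuming your CLI version supports the global --org and --team options, you can override the saved configuration for a single command:

$ ngc base-command list --org <org-name> --team <team-name>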

If you have a pre-existing setup, you can check the current configuration using:

$ ngc config current

To configure the CLI for your use, issue the following:

$ ngc config set
Enter API key. Choices: [<VALID_APIKEY>, 'no-apikey']:
Enter CLI output format type [ascii]. Choices: [ascii, csv, json]:
Enter org [nv-eagledemo]. Choices: ['nv-eagledemo']:
Enter team [nvtest-repro]. Choices: ['nvtest-repro', 'no-team']:
Enter ace [nv-eagledemo-ace]. Choices: ['nv-eagledemo-ace', 'no-ace']:
Successfully saved NGC configuration to C:\Users\jsmith\.ngc\config

If you are a member of several orgs or teams, be sure to select the ones associated with NVIDIA Base Command Platform.

4.5.1. Configuring the Output Format

You can configure the output format when issuing a command by using the --format_type <fmt> argument. This is useful if you want to use a different format than the default ascii, or different from what you set when running ngc config set.

The following are examples of each output format.

ASCII

$ ngc base-command list --format_type ascii
+---------+----------+------------+------+------------------+----------+----------------+
| Id      | Replicas | Name       | Team | Status           | Duration | Status Details |
+---------+----------+------------+------+------------------+----------+----------------+
| 1893896 | 1        | helloworld | ngc  | FINISHED_SUCCESS | 0:00:00  |                |

CSV

$ ngc base-command list --format_type csv
Id,Replicas,Name,Team,Status,Duration,Status Details
1893896,1,helloworld ml-model.exempt-qsg,ngc,FINISHED_SUCCESS,0:00:00,

JSON

$ ngc base-command list --format_type json
[{
    "aceId": 257,
    "aceName": "nv-us-west-2",
    "aceProvider": "NGN",
    "aceResourceInstance": "dgx1v.16g.1.norm",
    "createdDate": "2021-04-08T01:20:05.000Z",
    "id": 1893896,
    "jobDefinition": {},
    "jobStatus": {},
    "submittedByUser": "John Smith",
    "submittedByUserId": 28166,
    "teamName": "ngc"
}]

4.6. Running the Diagnostics

Diagnostic information is available to assist in isolating issues. You can provide this information to NVIDIA support when reporting issues with the CLI.

The following diagnostic information is available for the NGC Base Command Platform CLI user:

  • Current time

  • Operating system

  • Disk usage

  • Current directory size

  • Memory usage

  • NGC CLI installation

  • NGC CLI environment variables (whether set or not set)

  • NGC CLI configuration values

  • API gateway connectivity

  • API connectivity to the container registry and model registry

  • Data storage connectivity

  • Docker runtime information

  • External IP

  • User information (ID, name, and email)

  • User org roles

  • User team roles

Syntax

$ ngc diag [all,client,install,server,user]

where

all

Produces the maximum amount of diagnostic output.

client

Produces diagnostic output only for the client machine.

install

Produces diagnostic output only for the local installation.

server

Produces diagnostic output only for the remote server.

user

Produces diagnostic output only for the user configuration.
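For example, to collect diagnostics for the client machine only:

$ ngc diag client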

4.7. Specifying List Columns

Some commands provide lists, such as a list of registry images or a list of batch jobs.

Examples:

ngc base-command list

ngc dataset list

ngc registry image list

ngc registry model list

ngc registry resource list

ngc workspace list

The default output includes several columns of information, which can appear cluttered, especially if you are not interested in all of the information.

For example, the ngc base-command list command provides the following columns:

+----+----------+------+------+--------+----------+----------------+
| Id | Replicas | Name | Team | Status | Duration | Status Details |
+----+----------+------+------+--------+----------+----------------+

You can restrict the output to display only the columns that you specify using the --column argument.

For example, to display only the Name, Team, and Status, enter

$ ngc base-command list --column name --column team --column status
+----+------+------+--------+
| Id | Name | Team | Status |
+----+------+------+--------+

Note

The Id column will always appear and does not need to be specified.

Consult the help for the --column argument to determine the exact values to use for each column.

4.8. Other Useful Command Options

4.8.1. Automatic Interactive Command Process

Use the -y argument to insert a yes (y) response to all interactive questions.

Example:

$ ngc workspace share --team <team> -y <workspace>

4.8.2. Testing a Command

Some commands support the --dry-run argument. This argument produces output that describes what to expect with the command.

Example:

$ ngc result remove 1893896 --dry-run
Would remove result for job ID: 1893896 from org: <org>

Use the -h argument to see if a specific command supports the --dry-run argument.

5. Using NGC APIs

This section provides an example of how to use NGC Base Command Platform APIs. For a detailed list of the APIs, refer to the NGC API Documentation.

5.1. Example of Getting Basic Job Information

This example shows how to get basic job information. It shows the API method for performing the steps that correspond to the NGC Base Command Platform CLI command

ngc base-command get-json {job-id}

5.1.1. Using Get Request

The following is the flow using API GET requests.

  1. Get valid authorization.

    Send a GET request to https://authn.nvidia.com/token to get a valid token.

  2. Get the job information.

    Send a GET request to https://api.ngc.nvidia.com/v2/org/{org-name}/jobs/{job-id} with the token returned from the first request.

  3. Reuse the token for any additional GET requests, such as the telemetry request shown later in this section.

5.1.2. Code Example of Getting a Token

The following is a code example of getting valid authorization (token).

Note

API_KEY is the key obtained from the NGC web UI and should be present in your NGC config file if you have used the CLI.

#!/usr/bin/python3
import os, base64, json, requests

def ngc_get_token(org='nv-eagledemo', team=None):
   '''Use the API_KEY environment variable to generate an auth token.'''

   scope = f'group/ngc:{org}'
   if team: #shortens the token if included
       scope += f'/{team}'

   querystring = {"service": "ngc", "scope": scope}
   auth = '$oauthtoken:{0}'.format(os.environ.get('API_KEY'))

   headers = {
       'Authorization': 'Basic {}'.format(base64.b64encode(auth.encode('utf-8')).decode('utf-8')),
       'Content-Type': 'application/json',
       'Cache-Control': 'no-cache',
   }

   url = 'https://authn.nvidia.com/token'

   response = requests.request("GET", url, headers=headers, params=querystring)

   if response.status_code != 200:
       raise Exception("HTTP Error %d: from '%s'" % (response.status_code, url))
   return json.loads(response.text.encode('utf8'))["token"]

Example output of the auth response:

{
  "token": "eyJraW...",
  "expires_in": 600
}

5.1.3. Code Example of Getting Job Information

The token is the output of the function in the Getting a Token section.

def ngc_get_jobinfo(token=None, jobid=None, org=None):

   url = f'https://api.ngc.nvidia.com/v2/org/{org}/jobs/{jobid}'

   headers = {
     'Content-Type': 'application/json',
     'Authorization': f'Bearer {token}'
   }

   response = requests.request("GET", url, headers=headers)

   if response.status_code != 200:
       raise Exception("HTTP Error %d: from '%s'" % (response.status_code, url))

   return response.json()

Example output of the job information:

{
  "job": {
    "aceId": 357,
    "aceName": "nv-eagledemo-ace",
    "aceProvider": "NGN",
    "aceResourceInstance": "dgxa100.40g.1.norm",
    "createdDate": "2021-06-04T16:14:31.000Z",
    "datasets": [],
    "gpuActiveTime": 1,
    "gpuUtilization": 0,
    "id": 2039271,
    "jobDefinition": {
      "aceId": 357,
      "clusterId": "eagle-demo.nvk8s.com",
      "command": "set -x; jupyter lab --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin='*' & date; nvidia-smi; echo $NVIDIA_BUILD_ID; sleep 1d",
      "datasetMounts": [],
      "dockerImage": "nvidia/pytorch:21.02-py3",
      "jobDataLocations": [
        {
          "accessRights": "RW",
          "mountPoint": "/result",
          "protocol": "NFSV3",
          "type": "RESULTSET"
        },
        {
          "accessRights": "RW",
          "mountPoint": "/result",
          "protocol": "NFSV3",
          "type": "LOGSPACE"
        }
      ],
      "jobType": "BATCH",
      "name": "NVbc-jupyterlab",
      "portMappings": [
        {
          "containerPort": 8888,
          "hostName": "https://kpog9271.eagle-demo.proxy.ace.ngc.nvidia.com",
          "hostPort": 0
        }
      ],
      "replicaCount": 1,
      "resources": {
        "cpuCores": 30,
        "gpus": 1,
        "name": "dgxa100.40g.1.norm",
        "systemMemory": 124928
      },
      "resultContainerMountPoint": "/result",
      "runPolicy": {
        "minTimesliceSeconds": 3600,
        "preemptClass": "RESUMABLE",
        "totalRuntimeSeconds": 72000
      },
      "useImageEntryPoint": false,
      "workspaceMounts": []
    },
    "jobStatus": {
      "containerName": "6a977c9461f228b875b800acd6ced1b9a14905a46fca62c5bdbc393409bebe2d",
      "createdDate": "2021-06-04T20:05:19.000Z",
      "jobDataLocations": [
        {
          "accessRights": "RW",
          "mountPoint": "/result",
          "protocol": "NFSV3",
          "type": "RESULTSET"
        },
        {
          "accessRights": "RW",
          "mountPoint": "/result",
          "protocol": "NFSV3",
          "type": "LOGSPACE"
        }
      ],
      "portMappings": [
        {
          "containerPort": 8888,
          "hostName": "https://kpog9271.eagle-demo.proxy.ace.ngc.nvidia.com",
          "hostPort": 0
        }
      ],
      "resubmitId": 0,
      "selectedNodes": [
        {
          "ipAddress": "ww.x.yy.zz",
          "name": "node-02",
          "serialNumber": "ww.x.yy.zz"
        }
      ],
      "startedAt": "2021-06-04T16:14:42.000Z",
      "status": "RUNNING",
      "statusDetails": "",
      "statusType": "OK",
      "totalRuntimeSeconds": 14211
    },
    "lastStatusUpdatedDate": "2021-06-04T20:05:19.000Z",
    "orgName": "nv-eagledemo",
    "resultset": {
      "aceName": "nv-eagledemo-ace",
      "aceStorageServiceUrl": "https://nv-eagledemo.dss.ace.ngc.nvidia.com",
      "createdDate": "2021-06-04T16:14:31.000Z",
      "creatorUserId": "99838",
      "creatorUserName": "K Kris",
      "id": "2039271",
      "orgName": "nv-eagledemo",
      "owned": true,
      "shared": false,
      "sizeInBytes": 2662,
      "status": "COMPLETED",
      "updatedDate": "2021-06-04T20:05:19.000Z"
    },
    "submittedByUser": "K Kris",
    "submittedByUserId": 99838,
    "teamName": "nvbc-tutorials",
    "workspaces": []
  },
  "jobRequestJson": {
    "dockerImageName": "nvidia/pytorch:21.02-py3",
    "aceName": "nv-eagledemo-ace",
    "name": "NVbc-jupyterlab",
    "command": "set -x; jupyter lab --NotebookApp.token\\u003d\\u0027\\u0027 --notebook-dir\\u003d/ --NotebookApp.allow_origin\\u003d\\u0027*\\u0027 \\u0026 date; nvidia-smi; echo $NVIDIA_BUILD_ID; sleep 1d",
    "replicaCount": 1,
    "publishedContainerPorts": [
      8888
    ],
    "runPolicy": {
      "minTimesliceSeconds": 3600,
      "totalRuntimeSeconds": 72000,
      "preemptClass": "RESUMABLE"
    },
    "workspaceMounts": [],
    "aceId": 357,
    "datasetMounts": [],
    "resultContainerMountPoint": "/result",
    "aceInstance": "dgxa100.40g.1.norm"
  },
  "jobStatusHistory": [
    {
      "containerName": "6a977c9461f228b875b800acd6ced1b9a14905a46fca62c5bdbc393409bebe2d",
      "createdDate": "2021-06-04T20:05:19.000Z",
      "jobDataLocations": [],
      "portMappings": [
        {
          "containerPort": 8888,
          "hostName": "https://kpog9271.eagle-demo.proxy.ace.ngc.nvidia.com",
          "hostPort": 0
        }
      ],
      "resubmitId": 0,
      "selectedNodes": [
        {
          "ipAddress": "10.0.66.70",
          "name": "node-02",
          "serialNumber": "10.0.66.70"
        }
      ],
      "startedAt": "2021-06-04T16:14:42.000Z",
      "status": "RUNNING",
      "statusDetails": "",
      "statusType": "OK",
      "totalRuntimeSeconds": 14212
    },
    {
      "createdDate": "2021-06-04T16:14:39.000Z",
      "jobDataLocations": [],
      "portMappings": [
        {
          "containerPort": 8888,
          "hostName": "",
          "hostPort": 0
        }
      ],
      "resubmitId": 0,
      "selectedNodes": [
        {
          "ipAddress": "10.0.66.70",
          "name": "node-02",
          "serialNumber": "10.0.66.70"
        }
      ],
      "status": "STARTING",
      "statusDetails": "",
      "statusType": "OK"
    },
    {
      "createdDate": "2021-06-04T16:14:36.000Z",
      "jobDataLocations": [],
      "portMappings": [
        {
          "containerPort": 8888,
          "hostName": "",
          "hostPort": 0
        }
      ],
      "resubmitId": 0,
      "selectedNodes": [],
      "status": "QUEUED",
      "statusDetails": "Resources Unavailable",
      "statusType": "OK"
    },
    {
      "jobDataLocations": [],
      "selectedNodes": [],
      "status": "CREATED"
    }
  ],
  "requestStatus": {
    "requestId": "f7fbc3ff-36cf-4676-84a0-3d332b4091b1",
    "statusCode": "SUCCESS"
  }
}

5.1.4. Code Example of Getting Telemetry Data

The token is the output from the Get Token section.

#!/usr/bin/python3
# INFO: Before running this you must run 'export API_KEY=<ngc api key>' in your terminal
import os, json, base64, requests
def get_token(org='nv-eagledemo', team=None):
   '''Use the API_KEY environment variable to generate an auth token.'''
   scope = f'group/ngc:{org}'
   if team: #shortens the token if included
       scope += f'/{team}'
   querystring = {"service": "ngc", "scope": scope}
   auth = '$oauthtoken:{0}'.format(os.environ.get('API_KEY'))
   auth = base64.b64encode(auth.encode('utf-8')).decode('utf-8')
   headers = {
       'Authorization': f'Basic {auth}',
       'Content-Type': 'application/json',
       'Cache-Control': 'no-cache',
   }
   url = 'https://authn.nvidia.com/token'
   response = requests.request("GET", url, headers=headers, params=querystring)
   if response.status_code != 200:
       raise Exception("HTTP Error %d: from '%s'" % (response.status_code, url))
   return json.loads(response.text.encode('utf8'))["token"]
def get_job(job_id, org, team, token):
   '''Get general information for a specific job'''
   url = f'https://api.ngc.nvidia.com/v2/org/{org}/jobs/{job_id}'
   headers = {
       'Content-Type': 'application/json',
       'Authorization': f'Bearer {token}'
   }
   response = requests.request("GET", url, headers=headers)
   if response.status_code != 200:
       raise Exception("HTTP Error %d: from '%s'" % (response.status_code, url))
   return response.json()
def get_telemetry(job_id, start, end, org, team, token):
   '''Get telemetry information for a specific job'''
   url = f'https://api.ngc.nvidia.com/v2/org/{org}/jobs/{job_id}/telemetry'
   # INFO: See the docs for full list of telemetry
   vals = {
       'measurements': [
       {
           "type":"APPLICATION_TELEMETRY",
           "aggregation":"MEAN",
           "toDate": end,
           "fromDate": start,
           "period":60
       },{
           "toDate": end,
           "period": 60,
           "aggregation": "MEAN",
           "fromDate": start,
           "type": "GPU_UTILIZATION"
       }]
   }
   params = {'q': json.dumps(vals)}
   headers = {
       'Content-Type': 'application/json',
       'Authorization': f'Bearer {token}'
   }
   response = requests.request("GET", url, params=params, headers=headers)
   if response.status_code != 200:
       raise Exception("HTTP Error %d: from '%s'" % (response.status_code, url))
   return response.json()
# Get org/team information from account setup
org = 'nv-eagledemo'
team='nvbc-tutorials'
# Get job ID from GUI, CLI, or other API calls
job_id = 'TODO'
# Generate a token
token = get_token(org, team)
print(token)
# Get general job info for the job of interest
job_info = get_job(job_id, org, team, token)
print(json.dumps(job_info, indent=4, sort_keys=True))
# Get all job telemetry for the job of interest
telemetry = get_telemetry(job_id,
                         job_info['job']['createdDate'],
                         job_info['job']['jobStatus']['endedAt'],
                         org, team, token)
print(json.dumps(telemetry, indent=4, sort_keys=True))

5.2. List of API Endpoints

By using the --debug flag in the CLI you can see what endpoints and arguments are used for a given command.
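For example, appending --debug to a listing command prints the REST endpoints and arguments the CLI uses (the exact output format may vary by CLI version):

$ ngc base-command list --debug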

The listed endpoints are all for GET requests, but other methods (POST, PATCH, etc.) are supported for different functions. More information can be found here: https://docs.ngc.nvidia.com/api/

The endpoints below are grouped by section.

User Management

  • /v2/users/me: Get information pertaining to your user, such as roles in all teams, datasets, and workspaces that you can access.

  • /v2/org/{org-name}/teams/{team-name}: Get the description and ID of {team-name}.

  • /v2/org/{org-name}/teams: Get a list of your teams in {org-name}.

  • /v2/orgs: Get a list of orgs that you can access.

Jobs

  • /v2/org/{org-name}/jobs/{id}: Get detailed information about the job, including all create-job options and status history.

  • /v2/org/{org-name}/jobs: Get a list of jobs.

  • /v2/org/{org-name}/jobs/*: Many more job endpoints that allow you to control jobs are documented at the link above.

Datasets

  • /v2/org/{org-name}/datasets: Get a list of accessible datasets in {org-name}.

  • /v2/org/{org-name}/datasets/{id}: Get information about a dataset, including a list of its files.

  • /v2/org/{org-name}/datasets/{id}/file/**: Download a file from the dataset.

Telemetry

  • /v2/org/{org-name}/jobs/{id}/telemetry: Get telemetry information about the job.

  • /v2/org/{org-name}/measurements/jobs/{id}/[cpu|gpu|memory]/[allocation|utilization]: Individual endpoints for specific types of telemetry information.

Workspaces

  • /v2/org/{org-name}/workspaces: Get a list of accessible workspaces.

  • /v2/org/{org-name}/workspaces/{id-or-name}: Get basic information about the workspace.

  • /v2/org/{org-name}/workspaces/{id-or-name}/file/**: Download a file from the workspace.

Job Templates

  • /v2/org/{org-name}/jobs/templates/{id}: Get information about a job template.
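As a minimal sketch of calling one of these endpoints directly, assuming a token obtained through the authentication flow shown earlier is stored in the TOKEN environment variable:

$ curl -s -H "Authorization: Bearer $TOKEN" "https://api.ngc.nvidia.com/v2/users/me"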

6. NGC Catalog

This chapter describes the NGC Catalog features of Base Command Platform. NGC Catalog, a collection of software published regularly by NVIDIA and Partners, is accessible through Base Command Platform Web UI and CLI. In this chapter you will learn how to identify and use the published artifacts with Base Command Platform either as is or as a basis for building and publishing your own container images and models.

NGC provides a catalog of NVIDIA and partner published artifacts optimized for NVIDIA GPUs.

This is a curated set of GPU-optimized software consisting of containers, pre-trained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits with software development kits (SDKs).

Artifacts from NGC Catalog are periodically updated and can be used as a basis for building custom containers for Base Command Platform jobs.

6.1. Accessing NGC Catalog

After logging into the NGC website, click CATALOG from the left-side menu then click one of the options from the top ribbon menu.

_images/image10.png
  • Collections: Presents collections of deep learning and AI applications.

  • Containers: Presents the list of NGC container images.

  • Helm Charts: Presents a list of Helm charts.

  • Models: Presents the list of pre-trained deep learning models that can be easily re-trained or fine-tuned.

  • Resources: Provides a list of step-by-step instructions and scripts for creating deep learning models.

You can also use the filter bar to build a search filter and sorting preference.

6.2. Viewing Detailed Application Information

Each card displays the container name and a brief description.

  • Click the Pull Tag or Fetch Helm Chart link (depending on the artifact) to copy the pull or fetch command to your clipboard. Artifacts with a Download link will be downloaded to your local disk when the link is clicked.

  • Click the artifact name to open the detailed page.

    The top portion of the detailed page shows basic publishing information for the artifact.

    The bottom portion of the detailed page shows additional details about the artifact.

6.3. Using the CLI

To see a list of container images using the CLI, issue the following command.

$ ngc registry image list
+------+--------------+---------------+------------+--------------+------------+
| Name | Repository   | Latest Tag    | Image Size | Updated Date | Permission |
+------+--------------+---------------+------------+--------------+------------+
| CUDA | nvidia/cuda  | 11.2.1-devel- | 2.18 GB    | Feb 17, 2021 | unlocked   |
|      |              | ubuntu20.04   |            |              |            |
...

Other Examples

To see a list of container images for PyTorch, issue the following.

$ ngc registry image list nvidia/pytorch*
+---------+----------------+------------+------------+--------------+------------+
| Name    | Repository     | Latest Tag | Image Size | Updated Date | Permission |
+---------+----------------+------------+------------+--------------+------------+
| PyTorch | nvidia/pytorch | 21.03-py3  | 5.89 GB    | Mar 26, 2021 | unlocked   |
+---------+----------------+------------+------------+--------------+------------+

To see a list of container images under the partners registry space, issue the following.

$ ngc registry image list partners/*
+-------------------+---------------------+--------------+------------+------------+----------+
| Name              | Repository          | Latest Tag   | Image Size |Updated Date|Permission|
+-------------------+---------------------+--------------+------------+------------+----------+
| OmniSci (MapD)    | partners/mapd       | None         | None       |Sep 24, 2020| unlocked |
| H2O Driverless AI | partners/h2oai-     | latest       | 2 GB       |Sep 24, 2020| unlocked |
|                   | driverless          |              |            |            |          |
| PaddlePaddle      | partners/paddlepadd | 0.11-alpha   | 1.28 GB    |Sep 24, 2020| unlocked |
|                   | le                  |              |            |            |          |
| Chainer           | partners/chainer    | 4.0.0b1      | 963.75 MB  |Sep 24, 2020| unlocked |
| Kinetica          | partners/kinetica   | latest       | 5.35 GB    |Sep 24, 2020| unlocked |
| MATLAB            | partners/matlab     | r2020b       | 9.15 GB    |Jan 08, 2021| unlocked |
...

7. NGC Private Registry

This chapter describes the Private Registry, a dedicated registry space allocated and accessible just for your organization, which is available to you as a Base Command Platform user. In this chapter, you will learn how to identify your team or org space, how to share container images and models with your team or org, and how to download and use those in your workloads on Base Command.

NGC Private Registry offers the same set of artifact types and features available in NGC Catalog. The Private Registry provides the space for you to upload, publish, and share your custom artifacts with your team and org, with the ability to control access based on team and org membership. The Private Registry enables your org to have its own catalog, accessible only to your org’s users.

7.1. Accessing the NGC Private Registry

Set your org and team from the User and Select a Team drop-down menus, then click Private Registry from the left-side menu.

_images/image58.png

Click the menu item to view a list of the corresponding artifacts available to your org or team.

Click Create to open the screen where you can create the corresponding artifact and save it to your org or team.

Example of Container Create page

_images/image37.png

Example of Model Create page

_images/image7.png

7.2. Building and Sharing Private Registry Container Images

This section describes how to use a Dockerfile to customize a container from the NGC Private Registry and then push it to a shared registry space in the private registry.

Note

These instructions describe how to select a container image from your org and team registry space, but you can use a similar process for modifying container images from the NGC Catalog.

  1. Select a container image to modify.

    1. Log into the NGC website, selecting the org and team under which you want to obtain a container image.

    2. Click PRIVATE REGISTRY > Containers from the left-side menu, then click either ORGANIZATION CONTAINERS or TEAM CONTAINERS, depending on who you plan to share your container image with.

    3. Locate the container to pull, then click Pull tag to copy the pull command to the clipboard.

  2. Pull the container image using the command copied to the clipboard.

  3. You can use any method to edit or create containers to push to the NGC Private Registry, as long as the image name follows the naming conventions. For example, you can run the container and change it from the inside.

    1. Run the container with the Docker run command:

      $ docker run -it --name=pytorch nvcr.io/<org>/<team>/<container-name>:<tag> bash
      
    2. Make any changes to the container (install packages or create/download files).

    3. Commit the changes into a new image.

      $ docker commit pytorch nvcr.io/<org>/<team>/<container-name>:<new-tag>
      
  4. Alternatively, you can use a Dockerfile to make changes.

    1. On your workstation with Docker installed, create a subdirectory called mydocker. This is an arbitrary directory name.

    2. Inside this directory, create a file called Dockerfile (capitalization is important). This is the default name that Docker looks for when creating a container. The Dockerfile should look similar to the following:

      $ mkdir mydocker
      $ cd mydocker
      $ vi Dockerfile
      $ more Dockerfile
      # This is the base container for the new container.
      FROM nvcr.io/<org>/<team>/<container-name>:<tag>
      # Update the apt-get database
      RUN apt-get update
      # Install the package octave with apt-get
      RUN apt-get install -y octave
      $
      
    3. Build the docker container image.

      $ docker build -t nvcr.io/<org>/<team>/<container-name>:<new-tag> .
      

      Note

      This command uses the default file Dockerfile for creating the container. The command starts with docker build. The -t option creates a tag for this new container. Notice that the tag specifies the org and team registry spaces in the nvcr.io repository where the container will be stored.

  5. Verify that Docker successfully created the image.

    $ docker images
    
  6. Push the image into the repository, creating a container.

    $ docker push nvcr.io/<org>/<team>/<container-name>:<new-tag>
    
  7. At this point, you should log into the NGC container registry at https://ngc.nvidia.com and look under your team space to see if the container is there.

    If the container supports multi-node:

    1. Open the container details page, click the menu icon from the upper right corner, then click Edit Details.

    2. Click the Multi-node Container check box.

    3. Click the menu icon and then click Save.

If you don’t see the container in your team space, make sure that the tag on the image matches the location in the repository. If, for some reason, the push fails, try it again in case there was a communication issue between your system and the container registry (nvcr.io).

8. NGC Secrets

NGC Secrets is a secure vault/repository for storing sensitive information that allows you to easily identify or authenticate with external systems. It provides a reliable and straightforward way to create, manage, and add hidden environment variables to your jobs. Some primary use cases include storing API keys, tokens, usernames and passwords, and encryption keys.

Additional Information

  • Secret names

    • Can be up to 64 characters long and can include alphanumeric characters and the following symbols: ^._-+:#&

    • Names starting with “_” are reserved for special use cases

    • Names starting with “__” are reserved for use by system admins

    • Names cannot be changed once created; to rename a secret, delete it and recreate it

  • One user can have up to 100 secrets

  • Secret keys, values, and descriptions are each limited to 256 characters

  • Individual keys and values cannot be edited, but they can be individually removed and re-added

8.1. Setting up Secrets in the Web UI

To manage secrets in the Base Command Platform web application, click your user account icon on the top right of the page and select Setup.

_images/bcp-user-setup.png

Then click on View Secrets to go to the secrets page.

_images/bcp-setup-panel.png

In the initial Secrets page, click on Add Secret to bring up the Secret Details pane.

_images/bcp-secret-details-panel-small.png

When creating a secret, the Name will be the identifier for a collection of key-value pairs and the Key will be the name of the environment variable created in the job.

Using Secrets in a Job

When creating a job in the web UI, you can add secrets in the Secrets section. There, you can select an entire secret with all of its key-value pairs, or only a subset. Additionally, mousing over the rightmost portion of a row reveals the option to override the key. Secrets are made available to the job as environment variables.

_images/bcp-secret-job-creation.png

8.2. Setting up Secrets in the CLI

You can use the NGC CLI to perform all the same actions as in the Base Command Platform web application. CRUD operations are supported with the ngc user secret [create|info|update|delete|list] commands.

To see a description of available options and command descriptions, use the option -h after any command or option.

Example 1: Creating a secret.

$ ngc user secret create WANDB_SECRET --desc "Wandb secret" \
    --pair "WANDB_API_KEY:ABC123"

Example 2: Creating a secret with multiple pairs.

$ ngc user secret create AWS_SECRET --desc "AWS secret" --pair "USERNAME:XYZ123" --pair "PASSWORD:ABC456" --pair "API_KEY:KEY_123"

You can add secrets to jobs with the --secret flag. You can access them from inside the job as an environment variable accessed by their key names.

Example 1: Adding a secret by name will add all its keys to the job.

$ ngc base-command run … --secret WANDB_SECRET

Example 2: To add only a specific key within a secret, specify the key name as below.

$ ngc base-command run … --secret "GITHUB_SECRET:USERNAME"

Example 3: It is also possible to override keys for individual secrets.

$ ngc base-command run … --secret "WANDB_SECRET" \
        --secret "GITHUB_SECRET:USERNAME:GITHUB_USERNAME" \
        --secret "GITHUB_SECRET:PASSWORD:GITHUB_PASSWORD" \
        --secret "AWS_SECRET:USERNAME:AWS_USERNAME" \
        --secret "AWS_SECRET:PASSWORD:AWS_PASSWORD"

9. Org, Team, and User Management

This chapter applies to organization and team administrators, and explains the tasks that an organization or team administrator can perform from the NGC website or CLI. In this chapter, you will learn about the different user roles along with their associate scopes and permissions available in Base Command Platform, and the features to manage users and teams.

9.1. Org and Team Overview

Every enterprise is assigned an “org”, whose name is determined by the enterprise when the account is set up. NVIDIA Base Command Platform provides each org with its own private registry space and resources for running jobs, including storage and workspaces.

One or more teams can be created within the org to provide private access for groups within the enterprise. Individual users can be members of any number of teams within the org.

As the NVIDIA Base Command Platform administrator for your organization, you can invite other users to join your organization’s NVIDIA Base Command Platform account. Users can then be assigned as members of teams within your organization. Teams are useful for keeping custom work private within the organization.

The following table illustrates the interrelationship between orgs, teams, and users:

ORG

  • Registry space: <org>/

  • Org Admin: Can add users to the org or to any team within the org. Can create teams.

  • Org User: Can access resources and launch jobs within the org, but not within teams.

  • Org Viewer: Can read resources and jobs within the org.

TEAM 1, TEAM 2, TEAM 3

  • Registry spaces: <org>/<team1>, <org>/<team2>, <org>/<team3>

  • Team Admin: Can add users to the corresponding team (for example, org/team1).

  • Team User: Can access and share resources and launch jobs within the corresponding team.

  • Team Viewer: Can read resources and jobs within the corresponding team.

The general workflow for building teams of users is as follows:

  1. The organization admin invites users to the organization’s NVIDIA Base Command account.

  2. The organization admin creates teams within the organization.

  3. The organization admin adds users to appropriate teams, and typically assigns at least one user to be the team admin.

  4. The organization or team admin can then add other users to the team.

9.2. NVIDIA Base Command Platform User Roles

Prior to adding users and teams, familiarize yourself with the following descriptions of each role.

9.2.1. Base Command Admin

The Base Command Admin (BASE_COMMAND_ADMIN) is the role assigned to the Base Command Platform org administrator for the enterprise.

The following is a summary of the capabilities of the org administrator:

  • Access to all read-write and appropriate share commands involving the following features:

    Jobs, workspaces, datasets, and results within the org.

  • Team administrators have the same capabilities as the org administrator with the following limits:

    Capabilities are limited to the specific team.

9.2.2. Base Command User Role

The Base Command User role (BASE_COMMAND_USER) can make use of all NVIDIA Base Command Platform tasks. This includes all read, write, and appropriate sharing capabilities for jobs, workspaces, datasets, and results within the user’s org or team.

9.2.3. Base Command Viewer Role

The Base Command Viewer user (BASE_COMMAND_VIEWER) has the same scope as the Base Command User but with read-only access to all jobs, workspaces, datasets, and results within the scope of the role (org or team).

9.2.4. Registry Admin Role

The Registry Admin (REGISTRY_USER_ADMIN) is the role assigned to the initial org administrator for the enterprise.

The following is a summary of the capabilities of the Registry Admin as an org administrator:

  • Access to all read-write and appropriate share commands involving the following features:

    Containers, models, and resources within the org

Team administrators have the same capabilities as the org administrator with the following limits:

  • Capabilities are limited to the specific team.

  • Team administrators cannot create other teams or delete teams

9.2.5. Registry Read Role

The Registry Read (REGISTRY_READ) role has read-only access to containers, models, and resources within the user’s org or team.

9.2.6. Registry User Role

The Registry User (REGISTRY_USER) can make full use of all Private Registry features. This includes all read, write, and appropriate sharing capabilities for containers, models, and resources within the user’s org or team.

9.2.7. User Admin Role

The User Admin (USER_ADMIN) user manages users within the org or team. The User Admin for an org can create teams within that org.

9.2.8. User Read Role

The User Read (USER_READ) user can view details within the org or team.

9.3. Assigning Roles

Each role is targeted for specific capabilities. When assigning roles, keep in mind all the capabilities you want the user or admin to achieve. Most users and admins will need to be assigned multiple roles. Use the following tables for guidance:

9.3.1. Assigning Admin Roles

Refer to the following table for a summary of the capabilities of each admin role. You may need to assign multiple roles depending on the capabilities you want the admin to have.

+---------------------+----------------+-------------------------------------+------------------------------+
| Role                | Users or Teams | Jobs, Workspaces, datasets, results | Container, models, resources |
+---------------------+----------------+-------------------------------------+------------------------------+
| Base Command Admin  | N/A            | Read/Write                          | N/A                          |
| Base Command Viewer | N/A            | Read Only                           | N/A                          |
| Registry Admin      | N/A            | N/A                                 | Read/Write                   |
| User Admin          | Read/Write     | N/A                                 | N/A                          |
+---------------------+----------------+-------------------------------------+------------------------------+

Example: To add an admin for user management, registry management, and job management, issue the following:

$ ngc org add-user <email> <name> --role USER_ADMIN --role REGISTRY_USER_ADMIN --role BASE_COMMAND_ADMIN

9.3.2. Assigning User Roles

Refer to the following table for a summary of the capabilities of each user role. You may need to assign multiple roles depending on the capabilities you want the user to have.

+-------------------+-------+-------------------------------------+------------------------------+
| Role              | Users | Jobs, Workspaces, datasets, results | Container, models, resources |
+-------------------+-------+-------------------------------------+------------------------------+
| Base Command User | N/A   | Read/Write                          | N/A                          |
| Registry Read     | N/A   | N/A                                 | Read Only                    |
| Registry User     | N/A   | N/A                                 | Read/Write                   |
+-------------------+-------+-------------------------------------+------------------------------+

Example: To add a user who can run jobs using custom containers, issue the following:

$ ngc org add-user <email> <name> --role BASE_COMMAND_USER --role REGISTRY_USER

9.4. Org and Team Administrator Tasks

For org or team admins, the most common task is adding users. The following is the typical process for adding users using the CLI.

  1. Add a user to an org:

    $ ngc org add-user <email> <name> --role <user-role>
    
  2. Create a team:

    $ ngc org add-team <name> <description>
    
  3. Add a User to a team (and to the org if they are not already a member):

    $ ngc team add-user --team <team> <email> <name> --role <user-role>
    

Other commands, such as those to list users or add additional admins, can be looked up with

ngc org --help

or

ngc team --help

or in the CLI documentation.

9.4.1. Managing Teams

You can create and remove teams using the web interface.

9.4.1.1. Creating Teams Using the Web UI

Creating teams is useful for allowing users to share images within a team while keeping them invisible to other teams in the same organization. Only organization administrators can create teams.

To create a team, do the following:

  1. Log on to the NGC website.

  2. Select Organization > Teams from the left navigation menu.

  3. Click the Create Team menu on the top right of the page.

    _images/new-ngc-create-team.png
  4. In the Create Team dialog, enter a team name and description, then click Create Team.

    _images/image9.png
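
The same team can also be created from the CLI with the ngc org add-team command shown earlier. For example (the team name and description below are placeholders):

$ ngc org add-team ml-research "Team for the ML research group"
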
9.4.1.2. Removing Teams Using the Web UI

Deleting a team will revoke access to resources shared within the team. Any resources not associated with the team will remain unaffected. Only organization administrators can delete teams.

To remove a team, do the following:

  1. Log on to the NGC website.

  2. Select Organization > Teams from the left navigation menu.

  3. From the list, select the team you wish to delete to go to its page.

  4. Click the vertical ellipsis in the top right corner and select Delete Team.

    _images/ngc-delete-team.png
  5. Confirm your choice.

9.4.2. Managing Users

You can create and remove users using the web interface.

9.4.2.1. Creating Users Using the Web UI

As the organization administrator, you must create user accounts to allow others to use the NVIDIA Base Command Platform within the organization.

  1. Log on to the NGC website.

  2. Click Organization > Users from the left navigation menu.

  3. Click Invite New User on the top right corner of the page.

    _images/new-ngc-invite-user.png
  4. On the new page, fill out the User Information section. Enter the user’s name for First Name and the email address that will receive the invitation email.

    _images/add-user.png
  5. In the Roles section, select the appropriate context (either the organization or a specific team) and the available roles shown in the boxes below. Click Add Role to the right to save your changes. You can add or remove multiple roles before creating the user.

    _images/user-roles.png
  6. After adding roles, double-check all the fields and then click Create User on the top right. An invitation email will automatically be sent to the user.

    _images/create-user-btn.png
9.4.2.2. Removing Users Using the Web UI

An organization administrator might need to remove a user if that user leaves the company.

Deleting a user will disable any shared resources and revoke access to the user’s shared workspaces and datasets for all team members.

To remove a user, do the following:

  1. Log on to the NGC website.

  2. Click Organization > Users from the left navigation menu.

  3. From the list, select the user you wish to delete to go to their page.

  4. Click Remove User on the top right corner of the page.

  5. Confirm your choice.

    _images/ngc-remove-user.png

10. NVIDIA Base Command Platform Data Concepts

This chapter describes the storage data entities available in NVIDIA Base Command Platform. In this chapter, you will learn about datasets, workspaces, results, and the storage space local to a computing instance, along with their use cases. You will also learn about actions that you can perform on these data storage entities from within a computing instance and from your workstation, using both the Web UI and the CLI.

10.1. Data Types

NVIDIA Base Command Platform has the following data types on network storage within the ACE:

  • Dataset: Shareable read-only artifact, mountable to a job. Data persists after job completion, and is identical for each replica.

  • Workspace: Shareable read-write artifact, mountable to a job. Data persists after job completion, and is identical for each replica.

  • Result: Private to a job, read-write artifact, automatically generated for each replica in a job. Data persists after job completion, and is unique for each replica.

Tip

If shared storage that is the same across all replicas is necessary for a multi-replica job’s custom result data, use a Workspace for this purpose.

  • Local scratch space: Private to a replica, read-write local scratch space. Data does not persist after job completion, and is unique for each replica.

  • Secrets: Encrypted tokens and passwords for 3rd-party authentication. Data persists after job completion, and is identical for each replica.

Important

As with local scratch space, any other storage path within a container that is not a Workspace or Result mount will not persist new or modified data once a job completes.

For example, if a user writes data to /mnt/ in a job, and /mnt was not used as a path for a Workspace or a Result, the written data will not be present in future job runs, even if the job is an exact clone of the previous job.
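
A minimal sketch illustrating this rule (the ACE name is a placeholder; the container and instance type are reused from other examples in this guide):

$ ngc base-command run -i nvidia/pytorch:23.08-py3 -in dgx1v.16g.1.norm --ace <ace-name> \
  -n persistence-demo --result /result \
  --commandline 'echo kept > /result/kept.txt ; echo lost > /mnt/lost.txt'

After the job completes, /result/kept.txt is available in the job result, while /mnt/lost.txt is discarded because /mnt was not a Workspace or Result mount point.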

10.2. Managing Datasets

Datasets are intended for read-only data suitable for production workloads with repeatability, provenance, and scalability. They can be shared with your team or entire organization.

10.2.1. Determining Datasets by Org or Team

To view a list of datasets using the NGC website, click Datasets from the left-side menu, then select one of the tabs from the ribbon menu, depending on whether you want to view all datasets available to you, only datasets available to your org, or only datasets available to your team.

_images/image47.png

10.2.2. Mounting Datasets in a Job

Datasets are a critical part of a deep learning training job. They are intended as performant, shareable, read-only data suitable for production workloads with repeatability and scalability. Multiple datasets can be mounted to the same job, and multiple jobs and users can mount a dataset concurrently.

To mount one or more datasets, specify the datasets and mount points from the NGC Job Creation page when you create a new job.

Mounting datasets in a job
  1. From the Data Input section, select the Datasets tab and then search for a dataset to mount using the available search criteria.

  2. Select one or more datasets from the list.

  3. Specify a unique mount point for each dataset selected.
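
Datasets can also be mounted when launching a job from the CLI by passing the --datasetid flag in the form <dataset-id>:<mount-point> (see Using Workspace in a Job for a complete run example). A hedged sketch, assuming multiple datasets are attached by repeating the flag, and using dataset IDs that appear elsewhere in this guide:

$ ngc base-command run -i nvidia/tensorflow:18.10-py3 -in dgx1v.16g.1.norm --ace nv-us-west-2 \
  -n mount-datasets-demo --result /result --commandline 'sleep 1h' \
  --datasetid 8181:/data/first --datasetid 5586:/data/second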

10.2.3. Downloading a Dataset Using the Web UI

To download a dataset using the NGC website, select a dataset from the list to open the details page for the selected dataset.

Click the File Browser tab, then select one of the files to download.

The file will download to your Download folder.

10.2.4. Managing Datasets Using the NGC CLI

10.2.4.1. Uploading and Sharing a Dataset

Creating, uploading, and optionally sharing a dataset is done in one step:

$ ngc dataset upload --source <dir> --desc "my data" <dataset_name> [--share <team_name>]

Example:

$ ngc dataset upload --source mydata/ --desc "mnist is great" mnist --share my_team1

To share with multiple teams, use multiple --share arguments.

Example:

$ ngc dataset upload --source mydata/ --desc "mnist is great" mnist --share my_team1 --share my_team2

Tip

While the --share argument is optional, using the --share argument when uploading the dataset is a convenient way to make sure your datasets are shared so you don’t have to remember to share them later.

Important

Never reuse the name of a dataset because your organization will lose the ability to repeat and validate experiments.

10.2.4.2. Sharing a Dataset with your Team

In order for your team members to use a dataset, you must share it with your team. If you did not use the --share argument when uploading the dataset, you can share it with your team afterwards:

$ ngc dataset share --team <team_name> <dataset_id>

Example:

$ ngc dataset share --team my_team 5586

To share with your entire org, use --team no-team. Coordinate with your org admin before sharing a dataset with the entire org, since such datasets should be documented and published first.

Example:

$ ngc dataset share --team no-team 5586
10.2.4.3. Listing Datasets

Listing existing datasets available:

$ ngc dataset list

This lists all the datasets available to the configured org and team.

Example output:

$ ngc dataset list

+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| Id          | Integer Id | Name        | Description | ACE        | Shared | Size      | Status    | Created    | Owned | Pre-pop |
|             |            |             |             |            |        |           |           | Date       |       |         |
+-------------+------------+-------------+-------------+------------+--------+-----------+-----------+------------+-------+---------+
| Qo-D942jRZ6 | 91107      | BraTS21     |             | nv-        | Yes    | 14.69 GB  | COMPLETED | 2021-11-11 | No    | No      |
| qMTM2MMOrvQ |            |             |             | eagledemo- |        |           |           | 00:19:22   |       |         |
|             |            |             |             | ace        |        |           |           | UTC        |       |         |

Use the -h option with the list command to show all context-based options, including --owned, which lists only the datasets owned by you.

Listing Datasets Owned by you

$ ngc dataset list --owned

Listing Datasets Within a Team

$ ngc dataset list --team <teamname>
10.2.4.4. Downloading a Dataset

To download a dataset, determine the dataset ID from the NGC website, then issue the following command to download the dataset to the current folder.

$ ngc dataset download <datasetid>

To download to a specific existing folder, specify the path in the command.

$ ngc dataset download <datasetid> --dest <destpath>
10.2.4.5. Deleting a Dataset

To delete a dataset from NGC on an ACE:

$ ngc dataset remove <datasetid>

10.2.5. Importing and Exporting Datasets

Datasets can be imported from and exported to S3-compatible object storage, as well as through pre-authenticated URLs (currently OCI only), using the NGC CLI. To do so, you must set up Secrets with specific keys.

10.2.5.1. Prerequisites
  • NGC CLI version >= 3.2x.0

  • Have a secret with the name “ngc” and the key: “ngc_api_key”

    $ ngc user secret create ngc --pair ngc_api_key:<your NGC API key>
    
  • For S3 instances:

    • Note: The following examples are for AWS, but any S3-compatible instance will work.

    • A secret with the keys: “aws_access_key_id”, “aws_secret_access_key”

      $ ngc user secret create my_aws_secret \
      --pair aws_access_key_id:<AWS_ACCESS_KEY_ID> \
      --pair aws_secret_access_key:<AWS_SECRET_ACCESS_KEY>
      
  • For pre-authenticated URLs (currently OCI only):

    • A secret with the key name: “oci_preauth_url”

      $ ngc user secret create my_oci_secret \
      --pair oci_preauth_url:<Authenticated URL from OCI>
      
10.2.5.2. Importing a Dataset

You can import a dataset with the following command.

$ ngc dataset import start --protocol s3 --secret my_aws_secret --instance <instance type> --endpoint https://s3.amazonaws.com --bucket <s3 bucket name> --region <region of bucket>

----------------------------------------------------------------
Dataset Import Job Details
  Id: 1386055
  Source: s3:https://s3.amazonaws.com/<s3 bucket name>/
  Destination: resultset 1386055
  Status: QUEUED
  Start time: 2023-04-19 04:29:36 UTC
  Finish time:
  Directories found: 1
  Directories traversed: 0
  Files found: 0
  Files copied: 0
  Files skipped: 0
  Total bytes copied: 0
----------------------------------------------------------------

This will start a job with the same ID that will download the contents of the bucket into the results folder of that job.

When working with an OCI instance, the source/destination URLs do not need to be specified since the secret already contains that information. So the command will look like this:

$ ngc dataset import start --protocol url --secret my_oci_secret --instance <instance type> <dataset id>

To check on the status of a submitted job, run the following:

$ ngc dataset import info <job_id>

The job status will progress from QUEUED > RUNNING > FINISHED_SUCCESS, or stop at FAILED if it encounters an unrecoverable error.

To quickly check on all import jobs use:

$ ngc dataset import list

Once the job’s status is FINISHED_SUCCESS, convert the results of that job into a new dataset with the next command:

$ ngc dataset import finish <job_id> --name <dataset_name> --desc <dataset_description>

Alternatively, copy the name, description, and sharing permissions of another dataset on the same ACE:

$ ngc dataset import finish <job_id> --from-dataset <dataset_id>
10.2.5.3. Exporting a Dataset

You can export a dataset with the following command.

$ ngc dataset export run --protocol s3 --secret my_aws_secret --instance <instance type> --endpoint https://s3.amazonaws.com/ --bucket <s3 bucket name> --region <region of bucket> <dataset_id>

----------------------------------------------------------------
Dataset Export Job Details
  Id: 1386056
  Source: dataset 515151
  Destination: s3:https://s3.amazonaws.com/<s3 bucket name>/
  Status: QUEUED
  Start time: 2023-04-20 04:23:31 UTC
  Finish time:
  Directories found: 1
  Directories traversed: 0
  Files found: 0
  Files copied: 0
  Files skipped: 0
  Total bytes copied: 0
----------------------------------------------------------------

This will start a job that copies the contents of a dataset to the target object storage.

When working with an OCI instance, the source/destination URLs do not need to be specified since the secret already contains that information. So the command will look like this:

$ ngc dataset export run --protocol url --secret my_oci_secret --instance <instance type> <dataset id>

Just like with importing datasets, export jobs can be monitored with the following command:

$ ngc dataset export list

And for detailed information about a single export job:

$ ngc dataset export info <job_id>
10.2.5.4. Building a Dataset from External Sources

Many deep learning training jobs use publicly available datasets from the internet, licensed for specific use cases. If you need to use such datasets, and they are not compatible with the above dataset import commands, NVIDIA recommends cloning the dataset into BCP storage to avoid repeatedly downloading files from external sources on every run.

To build a dataset using only BCP resources:

  1. Run an interactive job on a CPU or 1-GPU instance.

  2. Execute the commands to download and pre-process your files and put them in the Result mount.

  3. Finish the job and use Converting /result to a Dataset Using the CLI to convert the processed files from Result into a new dataset.
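
A hedged sketch of this workflow (the container, instance type, job name, and download commands are placeholders; substitute the commands appropriate for your data source and its license terms):

$ ngc base-command run -i nvidia/pytorch:23.08-py3 -in dgx1v.16g.1.norm --ace <ace-name> \
  -n build-external-dataset --result /result \
  --commandline 'cd /result && <download and pre-process the files here>'

Once the job finishes, convert its result into a dataset:

$ ngc dataset convert my-external-dataset --from-result <job-id>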

10.2.6. Converting a Checkpoint to a Dataset

For some workflows, such as for use with Transfer Learning Toolkit (TLT), you may need to save a checkpoint for a duration longer than that of the current project. These can then be shared with your team.

NVIDIA Base Command Platform lets you save checkpoints from a training job as a dataset for long-term storage and for sharing with a team. Depending on the job configuration, checkpoints are obtained from the job’s /result mount or from the job’s workspace mount.

10.2.6.1. Converting /result to a Dataset Using the NGC Web UI

Caution

This operation will remove the original files in the /result directory to create the dataset and cannot be undone.

You can convert /result to a dataset from the NGC web UI.

  1. From either the Base Command > Dashboard or Base Command > Jobs page, click the menu icon for the job containing the /result files to convert, then select Convert Results.

    _images/image22.png
  2. Enter a name and (optionally) a description in the Convert Results to Dataset dialog.

    _images/image61.png
  3. Click Convert when done. The dataset is created, and you can view it from the Base Command > Datasets page.

10.2.6.2. Converting /result to a Dataset Using the CLI

Caution

This operation will remove the original files in the /result directory to create the dataset and cannot be undone.

You can convert /result to a dataset using the NGC Base Command Platform CLI as follows:

$ ngc dataset convert <new-dataset-name> --from-result <job-id>
10.2.6.3. Saving a Checkpoint from the Workspace

To save a checkpoint from your workspace, download the workspace and then upload it as a dataset as follows:

  1. Download the workspace to your local disk.

    $ ngc workspace download <workspace-id> --dest <download-path>
    

    You can also specify paths within the workspace to only download the necessary files.

    $ ngc workspace download --dir path/within/workspace <workspace-id> --dest <download-path>
    

    Use the -h option to view options for specifying folders and files within the workspace for downloading. The downloaded contents will be placed in a folder labeled <workspace-id>.

  2. Upload the file(s) to a dataset.

    $ ngc dataset upload <dataset-name> --source <path-to-files>
    

    The files are uploaded to the set ACE.

10.3. Managing Workspaces

Workspaces are shareable read-write persistent storage mountable in a job for concurrent use. They are intended as a tool for read-write volumes providing scratch space between jobs or users. They have an ID and can be named. They count towards your overall storage quota.

The primary use case for a workspace is to share persistent data between jobs; for example, to use for checkpoints or for retraining.

Workspaces also provide an easy way for users in a team to work together in a shared storage space. Workspaces are a good place to store code, can easily be synced with git, or even updated while a job is running, especially an interactive job. This means you can experiment rapidly in interactive mode without uploading new containers or datasets for each code change.
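
For example, with a workspace mounted read-write at /ws-demo in an interactive job (and assuming git is available in the container), code can be pulled directly into the shared space; the repository URL below is a placeholder:

$ cd /ws-demo
$ git clone https://github.com/<org>/<repo>.git

Any later job that mounts the same workspace sees the updated code without rebuilding the container or re-uploading a dataset.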

10.3.1. Workspace Limitations

  • No repeatability or other production workflow guarantees, auditing, provenance, etc.

  • Read/write race conditions, with undefined write ordering.

  • File locking behavior is undefined.

  • Bandwidth and IOPS performance are limited like any shared file system.

10.3.2. Examples of Workspace Use Cases

  • Multiple jobs can write to a workspace and be monitored with TensorBoard.

  • Users can use a Workspace as a network home directory.

  • Teams can use a Workspace as a shared storage area.

  • Code that is still being iterated on can be put in a Workspace instead of the container and used by multiple jobs during experimentation (see the limitations above).

10.3.3. Mounting Workspaces from the Web UI

Workspaces provide an easy solution for many use cases.

To mount one or more workspaces, specify the workspaces and mount points from the NGC Job Creation page when you create a new job.

  1. From the Data Input section, select the Workspaces tab and then search for a workspace to mount using the available search criteria.

  2. Select one or more workspaces from the list.

  3. Specify a unique mount point for each workspace selected.

10.3.4. Creating a Workspace

10.3.4.1. Creating a Workspace Using the Web UI
  1. Select Base Command > Workspaces from the left navigation menu, then click the Create Workspace menu on the top right corner of the page.

    _images/image53.png
  2. In the Create a Workspace dialog, enter a workspace name and select an ACE to associate with the workspace.

  3. Click Create.

    The workspace is added to the workspace list.

10.3.4.2. Creating a Workspace Using the Base Command Platform CLI

Creating a workspace involves a single command which outputs the resulting Workspace ID:

$ ngc workspace create --name <workspace-name>

Workspaces can be named for easy reference. A workspace can be named only once; it cannot be renamed. You can name the workspace when it is created, or name it afterwards.

10.3.4.3. Using Unique Workspace Names

Since a workspace can be specified by either name or ID, those values must be unique across both names and IDs. The workspace ID is generated by the system, whereas the name is specified by the user. Workspace IDs are always 22 characters long. To ensure that a user-specified name never matches a future workspace ID, workspace names of exactly 22 characters are not allowed.

Workspace names must follow these constraints:

  • The name cannot be exactly 22 characters long.

  • The name must start with an alphanumeric.

  • The name can contain alphanumeric, -, or _ characters.

  • The name must be unique within the org.

These restrictions are also captured in regex ^(?![-_])(?![a-zA-Z0-9_-]{22}$)[a-zA-Z0-9_-]*$.
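
As an illustrative convenience only (the platform enforces this rule at creation time), a candidate name can be checked locally against the pattern using GNU grep’s PCRE mode:

$ echo "ws-demo" | grep -qP '^(?![-_])(?![a-zA-Z0-9_-]{22}$)[a-zA-Z0-9_-]*$' && echo valid || echo invalid
valid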

10.3.4.4. Naming the Workspace When it is Created
$ ngc workspace create --name ws-demo
Successfully created workspace with id: XB1Cym98QWmsX79wf0n3Lw
  Workspace Information
    ID: XB1Cym98QWmsX79wf0n3Lw
    Name: ws-demo
    Created By: John Smith
    Size: 0 B
    ACE: nv-us-west-2
    Org: nvidian
    Description:
    Shared with: None
10.3.4.5. Naming the Workspace after it is Created

Example of creating a workspace without naming it.

$ ngc workspace create

Successfully created workspace with id: s67Bcb_GQU6g75XOglOn8g

If you created a workspace without naming it, you can name it later by specifying the id and using the set -n <name> option.

$ ngc workspace set -n ws-demo s67Bcb_GQU6g75XOglOn8g -y
Workspace name for workspace with id s67Bcb_GQU6g75XOglOn8g has been set.
$ ngc workspace info ws-demo
----------------------------------------------------
  Workspace Information
    ID: s67Bcb_GQU6g75XOglOn8g
    Name: ws-demo
    ACE: nv-us-west-2
    Org: nvidian
    Description:
    Shared with: None
---------------------------------------------------

10.3.5. Listing Workspaces

You can list the workspaces you have access to, and get the details of a specific workspace:

$ ngc workspace list

+-----------------+------------+--------------+--------------+----------------+---
| Id              | Name       | Description  | ACE          | Creator        |
|                 |            |              |              | Username       |
+-----------------+------------+--------------+--------------+----------------+---
| s67Bcb_GQU6g75X | ws-demo    |              | nv-us-west-2 | Sabu Nadarajan |
| OglOn8g         |            |              |              |                |
|-----------------+------------+--------------+--------------+----------------+---


$ ngc workspace info ws-demo
----------------------------------------------------
  Workspace Information
    ID: s67Bcb_GQU6g75XOglOn8g
    Name: ws-demo
    ACE: nv-us-west-2
    Org: nvidian
    Description:
    Shared with: None
----------------------------------------------------

10.3.6. Using Workspace in a Job

Caution

Most NVIDIA DL images already have a /workspace directory that contains NVIDIA examples. When specifying a mount point for your workspace in the job definition, take care that it does not conflict with an existing directory in the container. Use a directory name that is unique and does not exist in the container. In the examples below, the name of the workspace is used as the mount point.

Access to a workspace is made available in a job by specifying the workspace and its mount point on the command line when running the job.

$ ngc base-command run -i nvidia/tensorflow:18.10-py3 -in dgx1v.16g.1.norm --ace nv-us-west-2 \
  -n HowTo-workspace --result /result --commandline 'sleep 5h' \
  --datasetid 8181:/dataset --workspace ws-demo:/ws-demo
----------------------------------------------------
 Job Information
 Id: 223282
 Name: HowTo-workspace
...
 Datasets, Workspaces and Results
 Dataset ID: 8181
 Dataset Mount Point: /dataset
 Workspace ID: s67Bcb_GQU6g75XOglOn8g
 Workspace Mount Point: /ws-demo
 Workspace Mount Mode: RW
 Result Mount Point: /result
...
----------------------------------------------------

A workspace is mounted in Read-Write (RW) mode by default. Mounting in Read-Only (RO) mode is also supported. In RO mode, it functions similarly to a dataset.

$ ngc base-command run -i nvidia/tensorflow:18.10-py3 -in dgx1v.16g.1.norm --ace nv-us-west-2 \
  -n HowTo-workspace --result /result --commandline 'sleep 5h' \
  --datasetid 8181:/dataset --workspace ws-demo:/ws-demo:RO
----------------------------------------------------
 Job Information
 Id: 223283
 Name: HowTo-workspace
...
 Datasets, Workspaces and Results
 Dataset ID: 8181
 Dataset Mount Point: /dataset
 Workspace ID: s67Bcb_GQU6g75XOglOn8g
 Workspace Mount Point: /ws-demo
 Workspace Mount Mode: RO
 Result Mount Point: /result

...
----------------------------------------------------

The following shows how to specify a workspace in a job using a JSON file. The example is derived from the first job definition shown in this section.

{
 "aceId": 357,
 "aceInstance": "dgxa100.40g.1.norm",
 "aceName": "nv-eagledemo-ace",
 "command": "sleep 5h",
 "datasetMounts": [
 {
 "containerMountPoint": "/dataset",
 "id": 8181
 }
 ],
 "dockerImageName": "nvidia/tensorflow:18.10-py3",
 "name": "HowTo-workspace",
 "resultContainerMountPoint": "/result",
 "runPolicy": {
 "preemptClass": "RUNONCE"
 },
 "workspaceMounts": [
 {
 "containerMountPoint": "/ws-demo",
 "id": "ws-demo",
 "mountMode": "RW"
 }
 ]
}

10.3.7. Accessing Workspaces Using SFTP

Secure File Transfer Protocol (SFTP) is a commonly used network protocol for secure data access and transfer to and from network-accessible storage. Base Command Platform Workspaces interoperate with SFTP-compliant tools to provide a standard and secure access method to storage in a BCP environment.

NGC CLI can be used to query a workspace and expose the port, hostname, and token to be used with SFTP clients. Running ngc base-command workspace info with the --show-sftp flag will return all information necessary to communicate with the workspace via SFTP, along with a sample command for using the sftp CLI tool.

$ ngc base-command workspace info X7xHfMZISZOfUbKKtGnMng --show-sftp
-------------------------------------------------------------------------------
 Workspace Information
   ID: X7xHfMZISZOfUbKKtGnMng
   Name: sftp-test
   Created By: user@company.com
   Size: 0 B
   ACE: example-ace
   Org: nvidia
   Description: My workspace for using SFTP to move data
   Shared with:
-------------------------------------------------------------------------------
 SFTP Information
   Hostname: example-ace.dss.stg-ace.ngc.nvidia.com
   Port: 443
   Token: ABCDEFGHIJBObk5sWVhBemNXZzBOM05tY2pkMFptSTNiRzFsWVhVME9qQmpOamMzTWpFNExUaGlZVEV0TkRkbU1pMDVZakUzTFdZME9USTVORGN4TVRnMk5BLCwsWDd4SGZNWklTWk9mVWJLS3RHbk1uZywsLG52aWRpYQ==
   Example: sftp -P<Port> <Token>@<Hostname>:/
-------------------------------------------------------------------------------
10.3.7.1. Connecting to a Workspace Using the SFTP Tool

The sftp tool, available in Linux, WSL, and macOS shells, can be used with the example provided in the NGC CLI output above. The following uses the output from the previous example.

sftp -P443 ABCDEFGHIJBObk5sWVhBemNXZzBOM05tY2pkMFptSTNiRzFsWVhVME9qQmpOamMzTWpFNExUaGlZVEV0TkRkbU1pMDVZakUzTFdZME9USTVORGN4TVRnMk5BLCwsWDd4SGZNWklTWk9mVWJLS3RHbk1uZywsLG52aWRpYQ==@example-ace.dss.stg-ace.ngc.nvidia.com:/
Connected to example-ace.dss.stg-ace.ngc.nvidia.com.
Changing to: /
sftp>

The commands supported by sftp can be viewed by entering ? at the prompt:

sftp> ?
Available commands:
bye                                  Quit sftp
cd path                         Change remote directory to 'path'
chgrp grp path                  Change group of file 'path' to 'grp'
chmod mode path                 Change permissions of file 'path' to 'mode'
chown own path                  Change owner of file 'path' to 'own'
df [-hi] [path]                 Display statistics for current directory or
                                filesystem containing 'path'
exit                            Quit sftp
get [-afPpRr] remote [local]    Download file
reget [-fPpRr] remote [local]   Resume download file
reput [-fPpRr] [local] remote   Resume upload file
help                            Display this help text
lcd path                        Change local directory to 'path'
lls [ls-options [path]]         Display local directory listing
lmkdir path                     Create local directory
ln [-s] oldpath newpath         Link remote file (-s for symlink)
lpwd                            Print local working directory
ls [-1afhlnrSt] [path]          Display remote directory listing
lumask umask                    Set local umask to 'umask'
mkdir path                      Create remote directory
progress                        Toggle display of progress meter
put [-afPpRr] local [remote]    Upload file
pwd                             Display remote working directory
quit                            Quit sftp
rename oldpath newpath          Rename remote file
rm path                         Delete remote file
rmdir path                      Remove remote directory
symlink oldpath newpath         Symlink remote file
version                         Show SFTP version
!command                        Execute 'command' in local shell
!                               Escape to local shell
?                               Synonym for help

The following is an example of using the put command.

sftp> put large-file
Uploading large-file to /large-file
large-file                                     16% 2885MB  21.9MB/s   11:07 ETA

When finished using sftp, end the active session with either the bye, quit, or exit command:

sftp> bye
10.3.7.2. Connecting to a Workspace Using WinSCP

WinSCP is a common application used for SFTP file transfers on the Windows operating system. Once WinSCP has been downloaded and installed on a user’s workstation, the same data used with the sftp CLI tool can be entered into the WinSCP user interface. Switch the file protocol to SFTP, and populate the host name and port number. Do not populate the user name or password. Click Login to proceed.

Connecting Workspace Using WinSCP

The user interface will prompt for a user name value. Paste the token from the workspace’s NGC CLI output and click OK.

Connecting Workspace Using WinSCP Username

The local file system and workspace contents will now be visible side by side. Users can now drag and drop files between the two file systems as necessary.

Connecting Workspace Using WinSCP Filesystem

10.3.8. Bulk File Transfers for Workspaces

10.3.8.1. Uploading and Downloading Workspaces

Mounting a workspace works well for accessing or transferring a few files. For bulk transfers of many files, such as populating an empty workspace at the start of a project or downloading an entire workspace for archiving, the workspace upload and download commands work better.

Uploading a directory to workspace is similar to uploading files to a dataset.

$ ngc workspace upload --source ngc140
                  s67Bcb_GQU6g75XOglOn8g
Total number of files is 6459.
Uploaded 170.5 MB, 6459/6459 files in 9s, Avg Upload speed: 18.82 MB/s, Curr
                    Upload Speed: 25.9 KB/s
----------------------------------------------------
Workspace: s67Bcb_GQU6g75XOglOn8g Upload: Completed.
Imported local path (workspace): /home/ngccli/ngc140
Files transferred: 6459
Total Bytes transferred: 178777265 B
Started at: 2018-11-17 18:26:33.399256
Completed at: 2018-11-17 18:26:43.148319/
Duration taken: 9.749063 seconds
----------------------------------------------------

Downloading workspace to a local directory is similar to downloading results from a job.

$ ngc workspace download --dest temp s67Bcb_GQU6g75XOglOn8g
Downloaded 56.68 MB in 41s, Download speed: 1.38 MB/s
----------------------------------------------------
Transfer id: s67Bcb_GQU6g75XOglOn8g Download status: Completed.
Downloaded local path: /home/ngccli/temp/s67Bcb_GQU6g75XOglOn8g
Total files downloaded: 6459
Total downloaded size: 56.68 MB
Started at: 2018-11-17 18:31:03.530342
Completed at: 2018-11-17 18:31:45.592230
Duration taken: 42s seconds
----------------------------------------------------
10.3.8.2. Exporting Workspaces

Workspaces can also be exported directly to S3 and OCI instances. Refer to Importing and Exporting Datasets for details about the prerequisites for exporting datasets.

The following command will export all the files in a given workspace to an s3 bucket in AWS:

$ ngc workspace export run --protocol s3 --secret my_aws_secret \
--instance <instance type> --endpoint https://s3.amazonaws.com \
--bucket <s3 bucket name> --region <region of bucket> <workspace_id>

To export a workspace to an OCI storage instance, use the following arguments:

$ ngc workspace export run --protocol url --secret my_oci_secret --instance <instance type> <workspace_id>

Similar to exporting datasets, you can check on the status of the export job with the following:

$ ngc workspace export info <job_id>

Or check on all past and current workspace export jobs with the following:

$ ngc workspace export list

10.3.9. Workspace Sharing and Revoking Sharing

Workspaces can be shared with a team or with the entire org.

Important

Each workspace is private to the user who creates it until you decide to share it with your team. Once you share it, all team members have the same rights in that workspace, so establish a sharing protocol before sharing. For instance, one way of organizing a workspace is to have a common area that only the owner updates, plus one directory per user where each user writes their own data.

Sharing a workspace with a team:

$ ngc workspace info ws-demo
----------------------------------------------------
 Workspace Information
 ID: s67Bcb_GQU6g75XOglOn8g
 Name: ws-demo
 ACE: nv-us-west-2
 Org: nvidian
 Description:
 Shared with: None
----------------------------------------------------
$ ngc workspace share --team nves -y ws-demo
Workspace successfully shared
$ ngc workspace info ws-demo
----------------------------------------------------
 Workspace Information
 ID: s67Bcb_GQU6g75XOglOn8g
 Name: ws-demo
 ACE: nv-us-west-2
 Org: nvidian
 Description:
 Shared with: nvidian/nves
----------------------------------------------------

Revoking a shared workspace:

$ ngc workspace revoke-share --team nves -y ws-demo
Workspace share successfully revoked
$ ngc workspace info ws-demo
----------------------------------------------------
 Workspace Information
 ID: s67Bcb_GQU6g75XOglOn8g
 Name: ws-demo
 ACE: nv-us-west-2
 Org: nvidian
 Description:
 Shared with: None
----------------------------------------------------

10.3.10. Removing Workspaces

10.3.10.1. Using the Web UI

You can remove an unshared workspace using the Web UI:

  1. Select Base Command > Workspaces from the left navigation menu and click on a workspace from the list.

  2. Click the vertical ellipsis menu on the top right corner of the page and select Delete Workspace.

    _images/workspace-delete.png

Shared workspaces are not removable using the Web UI. The following example shows that the Delete Workspace command is disabled for a workspace shared with the nv-test team.

_images/workspace-shared-removing.png
10.3.10.2. Using the CLI

Removing an unshared workspace involves a single command:

$ ngc workspace remove ws-demo

Are you sure you would like to remove the workspace with ID or name: 'ws-demo' from org: '<org_name>'? [y/n]y
Successfully removed workspace with ID or name: 'ws-demo' from org: '<org_name>'.

Shared workspaces are not removable using the CLI. You will see the following message if you attempt to remove a shared workspace:

$ ngc workspace remove test-shared-workspace

Are you sure you would like to remove the workspace with ID or name: 'test-shared-workspace' from org: '<org_name>'? [y/n]y
Removing of workspace with ID or name: 'test-shared-workspace' failed: Client Error: 422
Response: Workspace '<workspace_id>' can't be deleted while it is shared.
It is shared with: <org_name/team_name> - Request Id: None. Url: <workspace_url>.

10.4. Managing Results

A job result consists of a joblog.log file and all other files written to the result mount. In the case of multi-node jobs, each node is allocated a unique result mount and joblog.log file. Consequently, result mounts are not suitable for synchronization across nodes.

10.4.1. joblog.log

For jobs run with array-type “MPI,” the output of STDOUT and STDERR is consolidated into the joblog.log file within the result directory. In the case of a multi-node job, the default behavior is to stream the output of STDOUT and STDERR from all nodes to the joblog.log file on the first node (replica 0). As a result, the remaining log files on the other nodes will be empty.

For jobs run with array-type “PYTORCH,” the output of STDOUT and STDERR will be written to separate per-node, per-rank files in the job’s result directory. For example, STDOUT and STDERR for node 0 rank 0 will be written to /result/node_0_local_rank_0_stdout, /result/node_0_local_rank_0_stderr, respectively. The joblog.log for each worker node will then contain aggregated logs of the following format, containing the log content from the per-node, per-rank files:

{"date":"DATE_TIMESTAMP","file":"FILE_NAME","log":"LOG_FROM_FILE"}

These job logs can be viewed in the NGC Web UI. See Monitoring Console Logs (joblog.log) for instructions on how to do so.
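
Because each aggregated line is a single JSON object with a "file" field, standard text tools can filter a downloaded joblog.log for one rank. An illustrative example, assuming the "file" value follows the per-node, per-rank naming pattern described above:

$ grep '"file":"node_0_local_rank_0_stdout"' joblog.log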

10.4.2. Downloading a Result

To download the result of a Job, use the following command:

$ ngc result download <job-id>

For multi-node jobs, this command will retrieve the results for the first node/replica. To obtain the results for other nodes, you need to specify the replica ID as follows:

$ ngc result download <job-id>:<replica-id>

The content is downloaded to a folder named <job-id>. In the case of multi-node jobs, if a replica ID is specified, the folder will be named <job-id>_<replica-id>.

10.4.3. Removing a Result

Results will continue to occupy the system quota until you remove them. To remove the results, use the following command:

$ ngc result remove <job-id>

10.4.4. Converting Results into Datasets

If you wish to convert the results into a dataset, follow these steps:

  1. Select Jobs from the left-hand navigation.

  2. Locate the job from which you want to convert the results and click on the menu icon.

  3. Select Convert Results to Dataset.

  4. In the Convert Results to Dataset dialog box, provide a name and description for your dataset.

  5. Click Convert to initiate the conversion process.

  6. Once the conversion is complete, your dataset will appear on the Dataset page.

Remember to share your dataset with others in your team or org by following the instructions in Sharing a Dataset with your Team.
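
The same conversion can also be performed from the CLI with the command described in Converting /result to a Dataset Using the CLI:

$ ngc dataset convert <new-dataset-name> --from-result <job-id>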

10.5. Local Scratch Space (/raid)

All Base Command Platform nodes come with several SSD drives configured as a RAID-0 array for cache storage. This scratch space is mounted in every full-node job at /raid.

A typical use of this /raid scratch space can be to store temporary results/checkpoints that are not required to be available after a job is finished or killed. Using this local storage for intermediate results/logs will avoid heavy network storage access (such as results and workspaces) and should improve job performance. The data on this scratch space is cleared (and not automatically saved/backed-up to any other persistent storage) after a job is finished. Consider /raid to be a temporary scratch space available during the lifetime of the job.

Since the /raid volume is local to a node, the data in it is not backed-up and transferred when a job is preempted and resumed. It is the responsibility of the job/user to periodically backup the required checkpoint data to the available network storage (results or workspaces) to enable resuming a job (which is almost certainly on a different node) after a preemption.

Example Use Case: Copying a mounted dataset to /raid to remove network latency.

… --commandline "cp -r /mount/data/ /raid ; bash train.sh /raid/"

This works well for jobs with many epochs using datasets that are reasonable in size to replicate to local storage. Note that contents of /raid volume are not carried over to the new node when a job is preempted and resumed and that the required info must be saved in an available network storage space for resuming the job using the data.
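
A hedged sketch of such a periodic backup, following the partial-command style above (the checkpoint directory, interval, and training script are placeholders):

… --commandline "while true; do sleep 600; cp -r /raid/checkpoints /result/ ; done & bash train.sh /raid/"

Here the background loop copies checkpoints from local scratch to the persistent /result mount every 10 minutes, and the job ends when the training script exits.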

11. Jobs and GPU Instances

This chapter describes Base Command Platform features for submitting jobs to GPU instances, and for managing and interacting with those jobs. In this chapter, you will learn how to identify the GPU instances and their attributes available to you, how to define jobs with their associated storage entities, and how to manage the jobs using either the Web UI or the CLI.

11.1. Quick Start Jobs

The Quick Start feature of Base Command Platform provides a simplified option for launching interactive jobs.

Using Quick Start, administrators can create templates with pre-selected ACEs, compute instances, containers, workspaces, datasets, and more.

Users can easily launch these templates through the Web UI or the CLI, removing the requirement to configure individual jobs, and providing them quick and easy access to launch pre-configured jobs with an interactive JupyterLab session.

There are two Quick Start templates created by default:

  • JupyterLab - This simple template creates a single-node job that launches JupyterLab from within a specified container. By default, either PyTorch or TensorFlow base containers can be used.

  • Dask & RAPIDS - This template launches a more complex multi-node MPI job using a RAPIDS container and initiates a cluster of Dask workers on these nodes. JupyterLab is launched as the interaction point for this cluster.

See the sections below for how to launch jobs using these templates.

Important

Security Note: Launching a Quick Start Job will create a URL to access JupyterLab that ANYONE CAN USE. For more details and security recommendations, refer to the note in NVIDIA Base Command Platform Terms and Concepts.

11.1.1. JupyterLab Quick Start Jobs (Single-node)

The following shows how to launch a JupyterLab job using the Quick Start feature as a Base Command Platform User.

11.1.1.1. Using the NGC Web UI
  1. From the Base Command Platform Dashboard, click Launch on the JupyterLab card under the Quick Start header.

    Launch JupyterLab from Dashboard

    Details of the type of job to be launched are shown across the bottom of the card. From left to right, you can see:

    • The number of GPUs available for the job upon launch

    • The container used by the environment

    • The number of datasets mounted to the container and whether a workspace has been selected for use in the job.

      Note

      If you don’t select a Workspace, a custom workspace will automatically be created when you launch the job.

  2. After launching the job, you will be taken to the job page, where you can see the job details, including the number of GPUs allocated and the available memory for your job. When the JupyterLab instance is ready, the status will read ‘RUNNING’, and the Launch JupyterLab button in the top right will turn green.

  3. Click Launch JupyterLab in the top right corner of the page. A JupyterLab environment running inside the container listed on the card will be launched in a new tab.

    Launch JupyterLab

Note

The default run time for jobs launched through Quick Start is 60 minutes.

There are many ways to modify the Quick Start job before launch. You can specify a different workspace, add or remove datasets, change the container the job will use, and select a different ACE.

11.1.1.2. Using the NGC CLI

The NGC CLI supports creating and managing Quick Start Jobs via the following command:

$ ngc base-command quickstart cluster

You can launch a JupyterLab job using the Quick Start CLI with the following command syntax:

$ ngc base-command quickstart cluster create --name <cluster name> --ace <ace name> --cluster-lifetime 3600s \
--cluster-type jupyterlab --container-image <container image> --data-output-mount-point /results \
--scheduler-instance-type <instance type> --job-order 50 --job-priority NORMAL --min-time-slice 0s \
--nworkers 1 --org <org> --label quick_start_jupyterlab --workspace-mount <workspace>

Example: To launch a JupyterLab job:

$ ngc base-command quickstart cluster create --name "Quick Start jupyterlab tensorflow ffb4a" \
--ace ceph-sjc-4-ngc-wfs0 --cluster-lifetime 3600s --cluster-type jupyterlab \
--container-image "nvidia/tensorflow:23.08-tf2-py3" --data-output-mount-point /results \
--scheduler-instance-type dgx1v.32g.4.norm --job-order 50 --job-priority NORMAL \
--min-time-slice 0s --nworkers 1 --org nvidia --label quick_start_jupyterlab \
--workspace-mount ZNqskFA0SC2uMGUa4q-5Vg:/bcp/workspaces/49529_quick-start-jupyterlab-workspace_ceph-sjc-4-ngc-wfs0:RW

To see a complete list of options for the cluster create command, issue the following:

$ ngc base-command quickstart cluster create -h

For more information on the Quick Start ‘cluster’ command, refer to the NGC CLI documentation.

11.1.2. Dask and RAPIDS Quick Start Jobs (Multi-node)

All clusters have a Dask & RAPIDS Quick Start launch enabled by default. (However, this may have been disabled by your account admin.) The RAPIDS libraries provide a range of open-source GPU-accelerated Data Science libraries. For more information, refer to RAPIDS Documentation and Resources. Dask allows you to scale out workloads across multiple GPUs. For more information, refer to the documentation on Dask. When used together, Dask and RAPIDS allow you to scale your workloads both up and out.

11.1.2.1. Using the NGC Web UI
  1. From the Base Command Platform Dashboard, click Launch on the Dask & RAPIDS card under the Quick Start header.

    Quick Start Dask and Rapids

    The job will be launched with the number of GPUs (per node), Dask workers, and container images shown on the card. Upon launch, the job will create a workspace that will be used in the job.

  2. After launching the job, you will be taken to the job page, where you can see the job details, including the number of GPUs allocated and the amount of memory available for your job. When the JupyterLab instance is ready, the status will read ‘RUNNING’, and the Launch JupyterLab button in the top right will turn green.

    Note

    This may take up to 10 minutes to be ready.

  3. Click Launch JupyterLab in the top right corner of the page. A JupyterLab environment running inside the Dask & RAPIDS container will be launched in a new tab.

11.1.2.2. Using the NGC CLI

The NGC CLI supports creating and managing Quick Start Jobs via the following command:

$ ngc base-command quickstart cluster

You can launch a Dask and RAPIDS JupyterLab job using the Quick Start CLI with the following command syntax:

$ ngc base-command quickstart cluster create --name <cluster name> --ace <ace name> \
--cluster-lifetime 3600s --cluster-type dask --container-image <container image> \
--data-output-mount-point /results --scheduler-instance-type <instance type> --job-order 50 \
--job-priority NORMAL --min-time-slice 0s --nworkers 1 --org <org> --label quick_start_dask \
--workspace-mount <workspace>

Example: To launch a Dask and RAPIDS JupyterLab job:

$ ngc base-command quickstart cluster create --name "Quick Start dask rapidsai-core b3f45" \
--ace ceph-sjc-4-ngc-wfs0 --cluster-lifetime 3600s --cluster-type dask \
--container-image "nvidia/rapidsai-core:cuda11.8-runtime-ubuntu22.04-py3.10" \
--data-output-mount-point /results --scheduler-instance-type dgx1v.32g.8.norm \
--worker-instance-type dgx1v.32g.8.norm --job-order 50 --job-priority NORMAL \
--min-time-slice 0s --nworkers 2 --org nvidia --preempt-class RUNONCE --label quick_start_dask \
--workspace-mount XaoQAFeTQKui6nB0Fr_J7A:/bcp/workspaces/49529_quick-start-dask-workspace_ceph-sjc-4-ngc-wfs0:RW

To see a complete list of options for the cluster create command, issue the following:

$ ngc base-command quickstart cluster create -h

For more information on the Quick Start ‘cluster’ command, refer to the NGC CLI documentation.

11.1.3. Customizing your Workspace and Datasets for a Quick Start Job

If necessary, datasets and workspaces beyond those configured in the template can be mounted to your Quick Start Job, so you can access your own data and launch the job in your individual workspace.

Note

This customization is temporary and will not be saved if you navigate away from the dashboard. For permanent changes, work with your Base Command administrator to create a template for the Quick Start Job.

  1. From the Base Command Platform Dashboard, click the dataset and workspace indicator, (in this example, 0 DS / 0 WS) on the JupyterLab Quick Start card you wish to use. The Data Input page will open.

    JupyterLab Dataset and Workspace
  2. From the Data Input page, select any Datasets and/or a Workspace you wish to use with your Quick Start job. You can also specify a Mount Point for your Datasets.

    Once you have made your selection, click Save Changes at the bottom of the page.

    Data Input

    The DS / WS count on the Quick Start card will now be updated to show the number of Datasets and Workspaces selected. For example, the card below shows that we selected two datasets and one workspace.

    JupyterLab Dataset and Workspace Count
  3. Click Launch. The job will use the workspace selected (or create a default if no Workspace was chosen) and mount any chosen datasets to the corresponding Mount Point.

    Once the job has been created, you will be taken to the job page, where you can see details, including the number of GPUs allocated and the available memory for your job. When the JupyterLab instance is ready, the status will read ‘RUNNING’, and the Launch JupyterLab button in the top right will turn green.

  4. Click Launch JupyterLab in the top right of the job page once it turns from grey to green. A JupyterLab environment running inside the container listed on the card will be launched in a new tab.

11.1.3.1. Customizing Number of Workers for Dask and RAPIDS Quick Start Job

The default Dask & RAPIDS Quick Start job is launched with a cluster of 14 workers.

These workers are Dask workers, each consuming one GPU of a node (replica). By default, two GPUs are used for JupyterLab and the Dask scheduler (one each), and the Dask workers use 14 (one each), for a total of 16 GPUs. As a result, this default job will span two nodes (assuming eight GPUs per node/replica). Every additional node supports up to eight workers; for example, 15-22 workers use three nodes, and 23-30 workers use four.
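
Under the same assumptions (eight GPUs per node, plus one GPU each for JupyterLab and the Dask scheduler), the node count can be estimated as nodes = ceil((workers + 2) / 8). A quick shell check of the examples above:

$ workers=14; echo $(( (workers + 2 + 7) / 8 ))
2
$ workers=15; echo $(( (workers + 2 + 7) / 8 ))
3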

To change the number of Dask workers for the job:

  1. From the Base Command Platform Dashboard, click Workers along the bottom of the Dask & RAPIDS Quick Start card.

  2. Use the + and - buttons to select the number of Dask workers you wish to use. Once selected, click Save Changes.

    Choose Workers
  3. The Quick Start card will display the updated number of workers. Click Launch to launch the job.

11.1.4. Launching a Quick Start Job from a Template

Templates can be made available to users by the Organization Administrator. These allow users to quickly launch Quick Start environments with different defaults for ACE, container, datasets, and workspace mounts.

  1. From the Base Command Platform Dashboard, click the vertical ellipses in the top right corner of the Quick Start Job you’d like to run, and select Launch from Templates.

    Quick Start on the Dashboard
  2. In the window, you will see a list of templates available to you, including details about the Container, Data Inputs, and Computing Resources used for each template. Select the template you wish to use, then click Launch with Template to launch a JupyterLab Quick Start from that template.

    Quick Start Templates

    You will be taken to the job page once it has been created. When ready, you can click Launch JupyterLab in the top right corner.

    Note

    Only platform administrators can create new templates and make them available to Base Command Platform Users. For details on how to create a new template, see the instructions below.

11.1.5. Launching a Custom Quick Start Job

Custom Quick Start Jobs allow you to launch a job from either template while specifying an ACE, a launch container, and any additional ports you wish to expose.

  1. From the Base Command Platform dashboard, for the Quick Start template you wish to start from, click the vertical ellipses in the top right corner of the template and select Custom Launch.

  2. You will be guided through a multi-step Custom Launch menu. To move to the next stage, click the green ‘Next’ button in the bottom right corner.

    1. First, select an ACE. Once you choose an ACE, the associated instances will be displayed. Select the instance you wish to use.

    2. Next, if using the Dask & RAPIDS (multi-node) template, you will be prompted to select the number of workers. This step will not be present in the JupyterLab (single-node) template.

    3. Next, you can select a container and protocol. Use the drop-down menu to choose a container. You must also select a container tag.

      Note

      Only containers listed as ‘Quick Start Validated’ have been tested to work with the Quick Start custom launch. You may select a different container; however, it may result in the failure of your job. We validate the penultimate release of the containers. To use the latest containers, we recommend you launch a custom job.

      Custom Launch

      You can also select a protocol and container port to expose from within the running job. When using the Quick Start Validated containers, you should not expose port 8888 for JupyterLab as this is automatically exposed.

    4. Next, select any datasets you wish to mount within your container and a workspace you want to use.

  3. Click Launch JupyterLab to launch the job.

    Important

    Security Note: When opening a port to the container, it will create a URL that ANYONE CAN USE. For more details and security recommendations, refer to the note in NVIDIA Base Command Platform Terms and Concepts. To launch a secure job, follow the instructions for Running a Simple Job.

11.1.6. Creating New Quick Start Templates

This section is for administrators (with an org-level BASE_COMMAND_ADMIN role) and describes the process for creating and activating templates for NVIDIA Base Command Platform users.

11.1.6.1. Using the NGC Web UI
  1. From the Base Command Platform Dashboard, click the vertical ellipses in the top right corner of any existing Quick Start card. Click Launch From Templates.

    Launch from templates
  2. Click + Create New Template in the top left of the menu.

    Create new template
  3. You will be guided through a multi-step Create New Template menu. To move to the next stage, click the green ‘Next’ button in the bottom right corner.

    1. First, select an ACE. Once you choose an ACE, the associated instances will be displayed. Select the instance you wish to use.

    2. Next, if using the Dask & RAPIDS (multi-node) template, you will be prompted to select the number of workers. This step will not be present in the JupyterLab (single-node) template.

    3. Next, select a container and (optionally) a protocol. Use the drop-down menu to select a container. You must also select a container tag.

      Note

      Only containers listed as ‘Quick Start Validated’ have been tested to work with the Quick Start custom launch. You may select a different container; however, it may result in the failure of your job. We validate the penultimate release of the containers. To use the latest containers, we recommend you launch a custom job.

      Select Container and Protocol
    4. Next, select any datasets you wish to mount within the container and a workspace you may wish to use (if applicable).

  4. Click Create JupyterLab template.

    This template will now be available to users and can be found in the list of templates under the Launch From Templates menu, accessed from the vertical ellipses in the top right corner of the Quick Start card.

11.1.6.2. Using the NGC CLI

The NGC CLI supports creating and managing Quick Start Templates via the following command:

$ ngc base-command quickstart project

You can create a JupyterLab template using the Quick Start CLI with the following command syntax:

$ ngc base-command quickstart project create-template \
         --name <template name> \
         --description <template description> \
         --display-image-url <template image URL> \
         --ace <ace name> \
         --cluster-lifetime 3600s \
         --cluster-type jupyterlab \
         --container-image <container image> \
         --data-output-mount-point /results \
         --scheduler-instance-type <instance type> \
         --job-order 50 \
         --job-priority NORMAL \
         --min-time-slice 1s \
         --nworkers 2 \
         --org <org name> \
         --label <job labels> \
         --workspace-mount <workspace mountpoint>

Example: To create a TensorFlow JupyterLab template:

$ ngc base-command quickstart project create-template \
            --name "demo tensorflow template" \
            --description "demo" \
            --display-image-url "https://demo/demo-image.png" \
            --ace ceph-sjc-4-ngc-wfs0 \
            --cluster-lifetime 3600s \
            --cluster-type jupyterlab \
            --container-image "nvidia/tensorflow:23.08-tf2-py3" \
            --data-output-mount-point /results \
            --scheduler-instance-type dgx1v.32g.4.norm \
            --job-order 50 \
            --job-priority NORMAL \
            --min-time-slice 1s \
            --nworkers 2 \
            --org nvidia \
            --label "tf template" \
            --workspace-mount ZNqskFA0SC2uMGUa4q-5Vg:/bcp/workspaces/49529_quick-start-jupyterlab-workspace_ceph-sjc-4-ngc-wfs0:RW

Example: To create a PyTorch Jupyter template:

$ ngc base-command quickstart project create-template \
            --name "demo pytorch template" \
            --description "demo" \
            --display-image-url "https://demo/demo-image.png" \
            --ace ceph-sjc-4-ngc-wfs0 \
            --cluster-lifetime 3600s \
            --cluster-type jupyterlab \
            --container-image "nvidia/pytorch:23.08-py3" \
            --data-output-mount-point /results \
            --scheduler-instance-type dgx1v.32g.4.norm \
            --job-order 50 \
            --job-priority NORMAL \
            --min-time-slice 1s \
            --nworkers 2 \
            --org nvidia \
            --label "tf template" \
            --workspace-mount ZNqskFA0SC2uMGUa4q-5Vg:/bcp/workspaces/49529_quick-start-jupyterlab-workspace_ceph-sjc-4-ngc-wfs0:RW

To see a complete list of options for the template command, issue the following:

$ ngc base-command quickstart project -h

For more information on the Quick Start ‘project’ command, refer to the NGC CLI documentation.

11.1.7. Changing Default Quick Start Templates

This section is for administrators (with an org-level BASE_COMMAND_ADMIN role) and describes the process for changing the default template for each Quick Start Job card that’s shown on the Base Command Platform Dashboard.

  1. From the Base Command Platform Dashboard, click the vertical ellipses in the top right corner of any existing Quick Start card. Click Launch From Templates.

  2. Click on the vertical ellipses on the right-hand side of the template you wish to set as default.

    Launch JupyterLab from Templates - Set as Default Template
  3. Click Set as Default Template. The default will be updated for all users upon refreshing the dashboard.

11.1.8. Updating Existing Quick Start Templates

This section is for administrators (with an org-level BASE_COMMAND_ADMIN role) and describes the process for updating templates for users of the NVIDIA Base Command Platform.

You can update existing Quick Start templates, which are available for users to select as additional launch options as described in Launching a Quick Start Job from a Template.

  1. From the Base Command Platform Dashboard, click the vertical ellipses in the top right corner of any existing Quick Start card. Click Launch From Templates.

  2. Click on the vertical ellipses on the right-hand side of the template you wish to edit.

    Launch JupyterLab from Templates - Edit Template
  3. Click Edit Template. Follow the steps, similar to Creating New Quick Start Templates Using the NGC Web UI, to update the existing template.

11.2. Running a Simple Job

This section describes how to run a simple “Hello world” job.

  1. Login to the NGC portal and click BASE COMMAND > Jobs from the left navigation menu.

    _images/jobs-nav.png
  2. In the upper right select Create Job.

  3. Select your Accelerated Computing Environment and Instance type from the ACE dropdown menu.

    _images/create-job-ace.png
  4. Under Data Output, choose a mount point to access results.

    The mount point can be any path that isn’t already in the container. The result mount point is typically /result or /results.

    _images/result-mount-point.png
  5. Under the Container Selection area:

    1. Select a container image and tag from the dropdown menus, such as nvidia/tensorflow:22.12-tf1-py3

    2. Enter a bash command under Run Command; for example, echo 'Hello from NVIDIA'.

    _images/container-selection.png
  6. At the bottom of the screen, enter a name for your job.

    You may optionally add a custom label for your job.

    _images/launch-job-custom-label.png
  7. Click Launch Job in the top right corner of the page.

    Alternatively, click the copy icon in the command box and then paste the command into the command line if you have the NGC CLI installed; an example of the copied command appears after this procedure.

  8. After launching the job, you will be taken to the jobs page and see your new job at the top of the list in either a Queued or Starting state.

    _images/job-starting.png
  9. This job will run the command (the output can be viewed in the Log tab). The Status History tab reports the following progress with the timestamps: Created -> Queued -> Starting -> Running -> Finish.

    _images/status-history.png
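For reference, the NGC CLI command copied in step 7 is roughly equivalent to the following sketch; the ACE name, instance type, and container tag shown here are placeholders, and the web UI fills in the actual values you selected:

$ ngc base-command run --name "hello-world" \
  --ace <ace-name> --instance dgx1v.32g.1.norm \
  --image "nvidia/tensorflow:22.12-tf1-py3" \
  --result /results \
  --commandline "echo 'Hello from NVIDIA'"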

11.3. Running JupyterLab in a Job

This section describes how to run a simple ‘Hello world’ job incorporating JupyterLab.

NGC containers include JupyterLab within the container image. Using JupyterLab is a convenient way to run notebooks, get shell access (multiple sessions), run tensorboard, and have a file browser and text editor with syntax coloring all in one browser window. Running it in the background in your job is non-intrusive without any additional performance impact or effort and provides you an easy option to peek into your job at any time.

Important

Security Note: When opening a port to the container, it will create a URL that ANYONE CAN USE. For more details and security recommendations, refer to the note in NVIDIA Base Command Platform Terms and Concepts.

11.3.1. Example of Running JupyterLab in a Job

The following is an example of a job that takes advantage of JupyterLab.

$ ngc base-command run --name "jupyterlab" --instance <INSTANCE_NAME> \
--commandline "jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' \
--notebook-dir=/ --NotebookApp.allow_origin='*'" \
--result /result --image "nvidia/pytorch:23.01-py3" --port 8888

These are some key aspects to using JupyterLab in your job.

  • Specify --port 8888 in the job definition.

    The JupyterLab port (8888 by default) must be exposed by the job.

  • The JupyterLab command must begin with ‘jupyter lab’.

  • Total runtime should be set to a value that gives you enough time to access the container before the job finishes and closes; see the example below.
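    For example, a total runtime can be appended to the example above; the eight-hour value here is only illustrative:

    $ ngc base-command run --name "jupyterlab" --instance <INSTANCE_NAME> \
    --commandline "jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' \
    --notebook-dir=/ --NotebookApp.allow_origin='*'" \
    --result /result --image "nvidia/pytorch:23.01-py3" --port 8888 --total-runtime 8h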

11.3.2. Connecting to JupyterLab

While the job is in a running state, you can connect to JupyterLab through the mapped URL as follows.

  • From the website, click the URL presented in the Mapped Port section of the job details page.

  • From the CLI, run $ ngc base-command info <job-id>, then copy the URL from the Port Mappings line and paste it into a browser.

Example of JupyterLab:

_images/image36.png

11.4. Cloning an Existing Job

You can clone jobs, which is useful when you want to start with an existing job and make small changes for a new job.

  1. Click Jobs from the left navigation menu, then click the ellipsis menu for the job you want to copy and select Clone Job from the menu.

    _images/clone-job.png

    The Create a Job page opens with the fields populated with the information from the cloned job.

  2. Edit fields as needed to create a new job, enter a unique name in the Name field, then click Launch.

    The job should appear in the job dashboard.

To clone jobs via the CLI, use the --clone flag and add other flags to override any parameters being copied from the original job.

$ ngc base-command run --clone <job-id> --instance dgx1v.32g.8.norm

11.5. Launching a Job from a Template File

  1. Click Base Command > Jobs > Create from the left-side menu and then click Create From Templates from the ribbon menu.

    _images/image27.png
  2. Click the menu icon for the template to use, then select Apply Template.

    _images/image21.png

    The Create a Job page opens with the fields populated with the information from the job template.

  3. Edit fields as needed to create a new job or leave the fields as is, then click Launch.

11.6. Launching a Job Using a JSON File

When running jobs repeatedly from the CLI, sometimes it is easier to use a template file than the command line flags. This is currently supported in JSON. The following sections describe how to generate a JSON file from a job template and how to use it in the CLI.

11.6.1. Generating the JSON Using the Web UI

Perform the following to generate a JSON file using the NGC web UI.

  1. Click Dashboard from the left-side menu, click the table view icon next to the search bar, then click the menu icon for the job you want to copy and select Copy to JSON.

    The JSON is copied to your clipboard.

  2. Open a blank text file, paste the contents into the file and then save the file using the extension .json.

    Example: test-json.json

  3. To run a job from the file, issue the following:

    $ ngc base-command run -f <file.json>
    

11.6.2. Generating the JSON Using the CLI

Alternatively, you can get the JSON using the CLI if you know the job ID as follows:

$ ngc base-command get-json <job-id> > <path-to-json-file>

The JSON is copied to the specified path and file.

Example:

$ ngc base-command get-json 1234567 > ./json/test-json.json

To run a job from the file, issue the following:

$ ngc base-command run -f <file.json>

Example:

$ ngc base-command run -f ./json/test-json.json

11.6.3. Overriding Fields in a JSON File

The following is an example JSON:

{
  "dockerImageName": "nvidia/tensorflow:19.11-tf1-py3",
  "aceName": "nv-us-west-2",
  "name": "test.exempt-demo",
  "command": "jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin='*' & date; sleep 1h",
  "description": "sample command description",
  "replicaCount": 1,
  "publishedContainerPorts": [
    8888,
    6006
  ],
  "runPolicy": {
    "totalRuntimeSeconds": 3600,
    "premptClass": "RUNONCE"
  },
  "workspaceMounts": [
    {
      "containerMountPoint": "/mnt/democode",
      "id": "KUlaYYvXT56IhuKpNqmorQ",
      "mountMode": "RO"
    }
  ],
  "aceId": 257,
  "networkType": "ETHERNET",
  "datasetMounts": [
    {
      "containerMountPoint": "/data/imagenet",
      "id": 59937
    }
  ],
  "resultContainerMountPoint": "/result",
  "aceInstance": "dgx1v.32g.8.norm.beta"
}

You can specify other arguments on the command line; if an argument is also present in the JSON file, the value given on the command line overrides the value in the JSON file.

See the table below for the mapping between command-line options and JSON keys.

CLI option              JSON Key
--commandline           command
--description           description
--file                  none
--help                  none
--image                 dockerImageName
--instance              aceInstance
--name                  name
--port                  port (pass in a list of ports [8888,6006])
--workspace             workspaceMounts (pass in a list of objects)
--ace                   ace
--array-type            none
--coscheduling          none
--datasetid             datasetMounts (pass in a list of objects)
--debug                 none
--entrypoint            none
--format_type           none
--min-availability      none
--min-timeslice         none
--network               networkType
--org                   none
--preempt               runPolicy[preemptClass]
--replicas              replicaCount
--result                resultContainerMountPoint
--shell                 none
--start-deadline        none
--team                  none
--topology-constraint   none
--total-runtime         runPolicy[totalRuntimeSeconds]
--use-image-entrypoint  none
--waitend               none
--waitrun               none

Example:

Assuming the file pytorch.json is the example JSON file mentioned earlier, the following command will use the instance dgx1v.16g.2.norm instead of the instance specified in the JSON file.

$ ngc base-command run -f pytorch.json --instance dgx1v.16g.2.norm

Here are some more examples of overriding JSON arguments:

$ ngc base-command run -f pytorch.json --instance dgx1v.16g.4.norm --name "Jupyter Lab repro ml-model.exempt-repro"

$ ngc base-command run -f pytorch.json --image nvcr.io/nvidia/pytorch:20.03-py3

11.7. Exec into a Running Job using CLI

To exec into a running container, issue the following:

$ ngc base-command exec <job_id>

To exec a command in a running container, issue the following:

$ ngc base-command exec --commandline "command" <job_id>

Example using bash

$ ngc base-command exec --commandline "bash -c 'date; echo test'" <job_id>

11.8. Attaching to the Console of a Running Job

When a job is in the running state, you can attach to the console of the job from both the Web UI and the CLI. The console logs display output from both STDOUT and STDERR. These logs are also saved to the joblog.log file in the results mount location.

$ ngc base-command attach <job_id>

11.9. Managing Jobs

This section describes various job management tasks.

11.9.1. Checking Job Name, ID, Status, and Results

11.9.1.1. Using the NGC Web UI

Log into the NGC website, then click Base Command > Jobs from the left navigation menu.

The Jobs page lists all the jobs that you have run and shows the status, job name and ID.

The Status column reports the following progress along with timestamps: Created -> Queued -> Starting -> Running -> Finish.

When a job is in the Queued state, the Status History tab in the Web UI shows the reason for the queued state. The job info command on CLI also displays this detail.

When finished, click on your job entry from the JOBS page. The Results and Log tab both show the output produced by your job.

11.9.1.2. Using the CLI

After launching a job using the CLI, the output confirms a successful launch and shows the job details.

Example:

--------------------------------------------------
 Job Information
 Id: 1854152
 Name: ngc-batch-simple-job-raid-dataset-mnt
 Number of Replicas: 1
 Job Type: BATCH
 Submitted By: John Smith
 Job Container Information
 Docker Image URL: nvidia/pytorch:21.02-py3
 ...
 Job Status
 Created at: 2021-03-19 18:13:12 UTC
 Status: CREATED
 Preempt Class: RUNONCE
----------------------------------------

The Job Status of CREATED indicates a job that was just launched.

You can monitor the status of the job by issuing:

$ ngc base-command info <job-id>

This returns the same job information that is displayed after launching the job, with updated status information.

To view the stdout/stderr of a running job, issue the following:

$ ngc base-command attach <job-id>

All the NGC Base Command Platform CLI commands have additional options; issue ngc --help for details.

11.9.2. Monitoring Console Logs (joblog.log)

Job output (both STDOUT and STDERR) is captured in the joblog.log file.

For more information about result logging behavior, see Managing Results.

11.9.2.1. Using the NGC Web UI

To view the logs for your job, select the job from the Jobs page, then select the Log tab. From here, you can view the joblog.log for each node:

Viewing job logs

Note

If a multi-node job was run with array-type “MPI”, only the log from the first node (replica 0) will contain content. The default behavior is to stream the output of STDOUT and STDERR from all nodes to the joblog.log file on the first node (replica 0). As a result, the remaining log files on the other nodes will be empty.

11.9.2.2. Using the CLI

Issue the following command:

$ ngc result download <job-id>

The joblog.log files and STDOUT/STDERR from all nodes are included with the results, which are downloaded to the current directory on your local disk in a folder labeled job-id.

To view the STDOUT/STDERR of a running job, issue the following:

$ ngc base-command attach <job-id>

11.9.3. Downloading Results (interim and after completion)

11.9.3.1. Using the NGC Web UI

To download job results, do the following:

  1. Select the job from the Jobs page, then select the Results tab.

  2. From the Results page, select the file to download.

The file is downloaded to your Download folder.

11.9.3.2. Using the CLI

Issue the following:

$ ngc result download <job_id>

The results are downloaded to the current directory on your local disk in a folder labelled <job_id>.

11.9.4. Terminating Jobs

11.9.4.1. Using the NGC Web UI

To terminate a job from the NGC website, wait until the job appears in the Jobs page, then click the menu icon for the job and select Kill Job.

_images/image51.png
11.9.4.2. Using the CLI

Note the job ID after launching the job, then issue the following:

$ ngc base-command kill <job-id>

Example:

$ ngc base-command kill 1854178

Submitted job kill request for Job ID: '1854178'

You can also kill several jobs with one command by listing multiple job IDs as a combination of comma-separated IDs and ranges; for example ‘1-5’, ‘333’, ‘1, 2’, ‘1,10-15’.
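For example, the following command (job IDs are illustrative) kills one job and a range of jobs in a single request:

$ ngc base-command kill 1854178,1854180-1854185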

11.9.5. Deleting Results

Results remain in the system and consume quota until removed. To delete the results of a job, issue the following:

$ ngc result remove <job_id>

11.10. Labeling Jobs

This section describes how to create custom labels when submitting a job and ways to use these labels thereafter.

Labels can be used to group or categorize similar jobs, or to search and filter on them.

Labels have the following requirements and restrictions:

  • Labels can be made with alphanumeric characters and “_” (underscore) and can be up to 256 characters long.

  • Labels that start with an “_” (underscore) are reserved for special purposes. Special purpose features are planned for a future release.

  • There is a maximum of 20 labels per job.

11.10.1. Creating Labels

Label categories:

  • Normal: Can be generated by any user with access to the job. Expected values: alphanumeric characters and "_" (underscore), up to 256 characters long; cannot start with "_".

  • Admin Labels: Can only be generated, added, and removed by admins. Expected values: labels that begin with a double underscore "__".

  • System Labels: Labels that define a system behavior, chosen from a pre-generated list and added or removed by anyone with access to the job. Expected values: labels that begin with a single underscore "_".

System labels currently defined:

  • _locked_labels: If present, disallows adding or removing any other labels by anyone.

11.10.1.1. Using the NGC Web UI

In the Launch Job section of the Create Job page, enter a label in the Custom Labels field. Press Enter to apply the changes.

You can also specify more than one label to categorize one job into multiple groups, provided you add the labels one at a time (that is, press Enter after entering each label).

Example:

Create a custom label “nv_test_job_label_1001”

Creating labels
11.10.1.2. Using the CLI

You can assign job labels dynamically when submitting jobs using the CLI.

Issue the following for a single label:

$ ngc base-command run .. --label <label_1>

For multiple labels, issue the following:

$ ngc base-command run .. --label <label_1> --label <label_2>

System admins may create labels beginning with the __ (double underscore).

$ ngc base-command run .. --label <__some_label>

11.10.2. Modifying Labels

Labels for a job can be changed at any time during the lifetime of a job, as long as they are not locked.

11.10.2.1. Using the NGC Web UI

To modify a job label, do the following:

  • In the Custom Labels field, click on the “X” on the label to delete.

  • Add a new label and press Enter.

Modifying a job label
11.10.2.2. Using the CLI

The following examples show ways to modify labels in a job.

  • Clear (remove) all labels from a job

    $ ngc base-command update .. --clear-label <job-id>
    
  • Add a label to a job

    $ ngc base-command update .. --label "__bad" <job-id>
    
  • Lock all labels currently assigned to a job

    $ ngc base-command update .. --lock-label <job-id>
    
  • Unlock all labels currently assigned to a job

    $ ngc base-command update .. --unlock-label <job-id>
    
  • Remove a specific label from a job

    $ ngc base-command update .. --remove-label "test*" --remove-label "try" <job-id>
    

Admin system labels (starting with __ double underscores) can only be removed by users with admin privileges.

11.10.3. Searching/Sorting Labels

You can search on labels using the wildcard characters * and ? and filter using include/exclude patterns. Reserved labels are searchable by all users. Searching with multiple labels will return jobs with any of the listed labels. Search patterns are also case-insensitive.

11.10.3.1. Using the NGC Web UI

Enter a search term in the search field and press Enter.

Example:

Search on jobs with a label that starts with “nv_test_job_label*”

Search for job label

The results of the search are as follows:

Search results for job label
11.10.3.2. Using the CLI

You can exclude certain labels from a search.

  • Here is an example to list all jobs with “Pytorch” label but not with “bad” label:

    $ ngc base-command list --label "Pytorch" --exclude-label "bad"
    
  • Here are some additional examples using the exclude options:

    $ ngc base-command list --label "__tutorial" --exclude-label "qsg"
    
    $ ngc base-command list --label "delete" --exclude-label "publish"
    
  • Here is an example of listing all labels except for label “aaa”:

    $ ngc base-command list --label "*" --exclude-label "aaa"
    
  • Here is an example to list multiple labels with a comma separator, which will list jobs with the labels “Pytorch” and/or “active” (case-insensitive):

    $ ngc base-command list --label "Pytorch","active"
    

11.10.4. Viewing Labels

You can view job labels using the following methods.

11.10.4.1. Using the CLI

Example:

To view a list of all the labels defined or used within an org, issue the following:

$ ngc base-command list --column labels

Example:

To view a label for a particular job:

$ ngc base-command info <jobid>

The list of labels is returned in the following order:

  • system-defined labels (starting with a single underscore “_”)

  • labels added by an administrator (starting with a double underscore “__”)

  • other labels (sorted alphabetically)

11.10.5. Cloning/Templating Jobs

When jobs are cloned or created from a template, the custom labels are retained while the system or reserved labels are removed by default.

Refer to Cloning an Existing Job in the user guide for more information.

11.10.5.1. Using the NGC Web UI

In the Base Command > Jobs page, click the “…” menu and select Clone Job.

Clone Job command in job menu

Note that custom labels are retained in the newly cloned job.

Cloning a job with custom labels
11.10.5.2. Using the CLI

Here is an example using the cloning options:

$ ngc base-command run .. -f jobdef.json --label "copy","rerun"

11.11. Scheduling Jobs

By default, jobs will run in the order they are submitted if resources and quota are available. Sometimes, there is a need to submit a high-priority job ahead of others. Two flags, order and priority, can be set to allow for greater control over when jobs are run.

  • Priority can be HIGH, NORMAL, or LOW.

  • Order can be an integer between 1 and 99, with lower numbers executing first.

  • By default, the priority is NORMAL and the order is 50.

Flag       Values                 Default   Description
Order      [1-99]                 50        Affects the execution order of only your jobs.
Priority   [HIGH, NORMAL, LOW]    NORMAL    Affects the execution order of all jobs on the cluster.

11.11.1. Job Order

Jobs can be assigned an order number ranging from 1 to 99 (default 50), with lower numbers executing first. The order number only changes the order of your jobs with the same priority and does not affect the execution of another user’s jobs. Order will not affect preemption behavior.

11.11.2. Job Priority

Priority can be HIGH, NORMAL (default), or LOW. Each priority is effectively its own queue on the cluster. All jobs in the higher priority queue will be run before jobs in the lower priority queues and will even preempt lower priority jobs if they are submitted as RESUMABLE. Since this can lead to NORMAL priority jobs being starved in an oversubscribed cluster, the ability for you to change your job priority must be enabled by your team or org admin.

In this example queue for a single user, jobs will be executed from top to bottom.

Priority   Order
HIGH       1
HIGH       50
NORMAL     10
NORMAL     50
NORMAL     50
NORMAL     99
LOW        50

The following shows how to set the order and priority when submitting a job. Appending -h or --help to a command will provide more information about its flags.

$ ngc base-command run --name test-order ... --order 75 --priority HIGH
--------------------------------------------------------
 Job Information
   Id: 1247749
   Name: test-order
   ...
   Order: 75
   Priority: HIGH

You can also see the order and priority values when listing jobs.

$ ngc base-command list --column order --column priority
+---------+-------+----------+
| Id      | Order | Priority |
+---------+-------+----------+
| 1247990 | 75    | HIGH     |
| 1247749 | 75    | HIGH     |
| 1247714 | 12    | HIGH     |
| 1247709 | 50    | NORMAL   |
| 1247638 | 99    | HIGH     |
| 1247598 | 35    | NORMAL   |
+---------+-------+----------+

# Filtering only the high priority jobs
$ ngc base-command list --priority HIGH --column order --column priority
+---------+-------+----------+
| Id      | Order | Priority |
+---------+-------+----------+
| 1247990 | 75    | HIGH     |
| 1247749 | 75    | HIGH     |
| 1247714 | 12    | HIGH     |
| 1247638 | 99    | HIGH     |
+---------+-------+----------+

Note: Due to limitations of the current release, use the following steps to change the order or priority of a job (an example follows the list).

  • Clone the job.

  • Before submitting, set the order and priority of the cloned job.

  • Delete the old job.
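A minimal sketch of this workflow using the flags shown earlier in this section (the job ID is a placeholder):

# Clone the existing job with the new order and priority
$ ngc base-command run --clone <job-id> --order 10 --priority HIGH

# Terminate the original job
$ ngc base-command kill <job-id>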

11.11.3. Configuring Job Preemption

Support for job preemption is an essential requirement for clusters to enable priority-based task scheduling and execution and improve resource utilization, fitness, fairness, and starvation handling. This is especially true in smaller clusters, which tend to operate under high load conditions, and where scheduling becomes a critical component impacting both revenue and user experience.

Job preemption in NGC clusters combines user-driven preempt and resume support, scheduler-driven system preemption, and operations-driven automatic node-drain support. Job preemption targets a specific class of jobs called resumable jobs ( --preempt RESUMABLE ). Resumable jobs in NGC have the advantage of being allowed longer total runtimes on the cluster than “run once” jobs.

11.11.3.1. Enabling Preemption in a Job

To enable the preemption feature, users need to launch the job with the following flags:

--preempt
--min-timeslice XX
11.11.3.2. Using the --preempt flag

The --preempt flag takes the following arguments.

--preempt <RUNONCE | RESUMABLE | RESTARTABLE>

Where

  • RUNONCE: the default condition; specifies that the job will not be restarted. This condition may be required to avoid adverse actions from re-running a failed job.

  • RESUMABLE: allows the job to resume where it left off after preemption, using the same command that started the job. This typically applies to week-long simulations with periodic checkpoints, nearly all HPC applications and DL frameworks, and stateless jobs.

  • RESTARTABLE: (currently not supported) specifies that the job must be restarted from the initial state if preempted. This typically applies to short jobs where resuming is more work than restarting, software with no resume capability, or jobs without workspaces.

11.11.3.3. Using the --min-timeslice flag

Users must provide an additional option specifying a minimum timeslice, the minimum amount of time that a resumable job is guaranteed to run once it reaches a running state. This option lets the user define a time window during which the job can make enough progress and checkpoint its state, so that the job can resume if it is preempted. Specifying a smaller timeslice may help the user get their job scheduled faster during high-load conditions.

11.11.3.4. Managing Checkpoints

Users are responsible for managing their checkpoints in workspaces.

They can accomplish this by building the following behavior into the job script (a sketch follows the list).

  1. The training script saves checkpoints at regular intervals.

  2. On resuming, the script reads the existing checkpoint and continues training from the latest saved checkpoint.
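The following is a minimal shell sketch of this pattern; the checkpoint directory on a mounted workspace and the --resume-from argument of the training script are illustrative assumptions, not fixed platform behavior:

# Resume from the most recent checkpoint if one exists; otherwise start fresh.
if ls /mnt/workspace/checkpoints/*.ckpt >/dev/null 2>&1; then
    LATEST=$(ls -t /mnt/workspace/checkpoints/*.ckpt | head -n 1)
    python train.py --resume-from "$LATEST"
else
    python train.py
fi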

11.11.3.5. Preempting a Job

To preempt a job, use the ngc base-command preempt command.

Syntax

$ ngc base-command preempt <job_id>
11.11.3.6. Resuming a Preempted Job

To resume a preempted job, use the ngc base-command resume command.

Syntax

$ ngc base-command resume <job_id>

Example Workflow

  1. Launch a job with preempt set to “RESUMABLE.”

    $ ngc base-command run --name "preemption-test" --preempt RESUMABLE
    --min-timeslice 300s --commandline python train.py --total-runtime 72000s
    --ace nv-eagledemo-ace --instance dgxa100.40g.1.norm --result /results
    --image "nvidia/pytorch:21.02-py3"
    --------------------------------------------------
       Job Information
          Id: 1997475
          Name: preemption-test
          Number of Replicas: 1
          Job Type: BATCH
          Submitted By: John Smith
          ...
    

    This workload uses the PyTorch container and runs a dummy training script, train.py.

  2. Once the job is running, you can preempt it.

    $ ngc base-command preempt 1997475
    

    Submitted job preempt request for Job ID: ‘1997475’

  3. To resume the preempted job, issue the ngc base-command resume command.

    $ ngc base-command resume 1997475
    
    Submitted job resume request for Job ID: '1997475'
    

    The Status History for the job on the NGC Base Command Platform web application shows its progression.

    _images/job-status-history.png

12. Telemetry

This chapter describes the system telemetry feature of Base Command Platform. In this chapter, you will learn about the different metrics collected from a workload and plotted in the UI, enabling you to monitor the efficiency of a workload in near real time (updated approximately every 30 seconds). The telemetry can be accessed using both the web UI and the CLI.

NVIDIA Base Command Platform provides system telemetry information for jobs and also allows jobs to send telemetry to Base Command Platform to be recorded. This information (graphed in the Base Command Platform dashboard and also available from the CLI in a future release) is useful for providing visibility into how jobs are running. This lets users

  • Optimize jobs.

  • Debug jobs.

  • Analyze job efficiency.

Job telemetry is automatically generated by Base Command Platform and provides GPU, Tensor Core, CPU, GPU Memory, and IO usage information for the job.

The following table provides a description of all the metrics that are measured and tracked in the Base Command Platform telemetry feature:

Note

The single numbers given for attributes that are measured for each GPU will be the mean by default.

Metric

Definition

Job Runtime

How long the job has been in the RUNNING state (HH:MM:SS)

Time GPUs Active

The percentage of time over the entire job that the graphics engine on the GPUs have been active (GPU Active % > 0%).

GPU Utilization

One of the primary metrics to observe. It is defined as the percentage of time one or more GPU kernels are running over the last second, which is analogous to a GPU being utilized by a job.

GPU Active %

Percent of GPU cores that are active. The graphics engine is active if a graphics/compute context is bound and the graphics pipe or compute pipe is busy. Effectively the GPU utilization for each GPU.

Tensor Cores Active %

The percentage of cycles the tensor (HMMA) pipe is active (off the peak sustained elapsed cycles).

GPU Memory Active

This metric represents the percentage of time that the GPU’s memory controller is utilized to either read or write from memory.

GPU Power

Shows the power used by each GPU in Watts, as well as the percentage of its total possible power draw.

GPU Memory Used (GB)

This metric shows how much of the GPU’s video memory has been used.

PCIe Read/Write BW

This metric specifies the number of bytes of active PCIe read/transmit data including both header and payload.

Note that this is from the perspective of the GPU, so copying data from host to device (HtoD) or device to host (DtoH) would be reflected in these metrics.

CPU Usage

This metric gives the % CPU usage over time.

System Memory

Total amount of system memory being used by the job in GB.

Raid File System

Amount of data in the /raid folder. By default the max is 2 TB. More info at Local Scratch Space.

[Dataset | Workspace | Results] IOPS Read

Number of read operations per second accessing the mounted [Dataset | Workspace | Results] folders.

[Dataset | Workspace | Results] IOPS Write

Number of write operations per second accessing the mounted [Dataset | Workspace | Results] folders.

[Dataset | Workspace | Results] BW Read

Shows the total amount of data (in GB) read from the mounted [Dataset | Workspace | Results] folders.

[Dataset | Workspace | Results] BW Write

Shows the total amount of data written to the mounted [Dataset | Workspace | Results] folders.

Network BW [TX | RX]

Shows the total amount of data transmitted from the job (TX) and received by the job (RX).

NV Link BW [TX | RX]

Shows NVLink bandwidth being used in GB/s. NVLink direct is a GPU-GPU interconnect for GPUs on the same node. This is a per replica metric for Multi Node Jobs and a per node metric for partial node workloads.

12.1. Viewing Telemetry Information from the NGC Web UI

Click Jobs, select one of your jobs, then click the Telemetry tab.

The following are example screenshots of the Telemetry tab.

Note

The screenshot is presented for example purposes only - the exact look may change depending on the NGC release.

_images/image59.png

The floating window gives a breakdown of the telemetry metrics at each time slice for a more informative walkthrough of the metrics.

The single numbers given for attributes that are measured for each GPU are the mean by default, but you can also visualize minimum or maximum statistics using the drop-down menu.

_images/image55.png

Viewing the telemetry in Min Statistics:

_images/image23.png

Viewing the telemetry in Max Statistics:

_images/image55.png

We can see the per-GPU metrics in the floating window as shown below.

_images/image20.png

The telemetry shows the Overall GPU Utilization and GPU Active Percentage along with the Job Runtime on top. Following that we have more detailed information in each section of the telemetry.

GPU Active, Tensor Cores Active, GPU Memory Active and GPU Power:

_images/image54.png

GPU memory Used:

_images/image48.png

PCIe Read and Write BW:

_images/image25.png

NVLink BW:

_images/image29.png

CPU Usage and System Memory:

_images/image41.png

12.2. Telemetry for Multinode Jobs

By default, the telemetry is shown averaged across all nodes. To switch between replicas, click Select Node and choose the node whose metrics you want to see.

The metrics can then be seen for each replica, as shown below:

_images/image15.png

Replica 0:

_images/image28.png

Replica 1:

_images/image60.png

13. Advanced Base Command Platform Concepts

This chapter describes the more advanced features of Base Command Platform. In this chapter, you will learn about in-depth use cases of a special feature or in-depth attributes of an otherwise common feature.

13.1. Multi-node Jobs

NVIDIA Base Command Platform supports MPI-based distributed multi-node jobs in a cluster. This lets you run the same job on multiple nodes simultaneously, subject to the following requirements.

  • All GPUs in a node must be used.

  • Container images must include components such as OpenMPI 3.0+ and Horovod as needed.

13.1.1. Defining Multi-node Jobs

For a multi-node job, NVIDIA Base Command Platform schedules (reserves) all nodes as specified by the --replicas option. The specified command line in the job definition is executed only on the parent node (launcher), which is identified by replica ID 0. It is the responsibility of the user to execute commands on the child nodes (replica ID >0) by using the mpirun command, as shown in the examples in this section.

NVIDIA Base Command Platform provides the required information, mostly by exporting relevant environment variables, to enable invocation of commands on all replicas and to enable multi-node training using distributed PyTorch or Horovod.

The multi-node job command line must address the following two levels of inter-node interaction for a successful multi-node training job.

  1. Invoke the command on replicas, typically all, using mpirun.

  2. Include node details as args to distributed training scripts (such as parent node address or host file).

For this need, NVIDIA Base Command Platform sets the following variables in the job container runtime shell.

ENV Var

Definition

NGC_ARRAY_INDEX

Set to the index of the replica. Set to 0 for the Parent node.

NGC_ARRAY_SIZE

Set to the number of replicas in the job definition.

NGC_MASTER_ADDR

Address (DNS service) to reach the Parent node or Launcher. Set on all replicas. For replica 0, it points to localhost.

For use with distributed training (such as PyTorch).

NGC_REPLICA_ID

Same as NGC_ARRAY_INDEX.

OMPI_MCA_orte_default_hostfile

This is only valid on the Parent node, or replica 0.

Set to the host file location for use with distributed training (like Horovod).

13.1.2. Understanding the --replicas argument

The following table shows the corresponding node count and replica IDs for the --replicas argument.

--replicas        Number of nodes                Replica IDs
--replicas 0      Not applicable                 Not applicable
--replicas 1      Not applicable                 Not applicable
--replicas 2      2 (1x parent, 1x child)        0, 1
--replicas 3      3 (1x parent, 2x child)        0, 1, 2
--replicas 4      4 (1x parent, 3x child)        0, 1, 2, 3
--replicas N      N (1x parent, (N-1)x child)    0, 1, 2, ... (N-1)

13.1.3. Starting a Multi-node Job from the NGC Web UI

Multi-node jobs can also be started and monitored with the NGC Web UI.

Note

In order for a container to be selected for a multi-node job, it must first be tagged as a Multi-node Container in the Web UI.

Private registry users can tag the container from the container page: Click the menu icon, select Edit, then check the Multi-node Container checkbox and save the change. Public containers that are multi-node capable must also be tagged accordingly by the publisher.

  1. Login to the NGC Dashboard and select Jobs from the left-side menu.

  2. In the upper right select Create a job.

  3. Click the Create a Multi-node Job tab.

    _images/image44.png
  4. Under the Accelerated Computing Environment section, select your ACE and Instance type.

    _images/image16.png
  5. Under the Multi-node section, select the replica count to use.

    _images/image14.png
  6. Under the Data Input section, select the Datasets and Workspaces as needed.

  7. Under the Data Output section, enter the result mount point.

  8. Under the Container Selection section, select the container and tag to run, any commands to run inside the container, and an optional container port.

  9. Under the Launch Job section, provide a name for the job and enter the total run time.

  10. Click Launch.

13.1.4. Viewing Multi-node Job Results from the NGC Web UI

  1. Click Jobs from the left-side menu.

    _images/image26.png
  2. Select the Job that you want to view.

  3. Select one of the tabs - Overview, Telemetry, Status History, Results, or Log. The following example shows Status History. You can view the history for the overall job or for each individual replica.

    _images/image56.png

13.1.5. Launching Multi-node Jobs Using the NGC CLI

​Along with other arguments required for running jobs, the following are the required arguments for running multi-node jobs.

Syntax:

$ ngc base-command run \
...
--replicas <num>
--total-runtime <t>
--preempt RUNONCE
...

Where:

  • --replicas: specifies the number of nodes (including the primary node) upon which to run the multi-node parallel job.

  • --total-runtime: specifies the total time the job can run before it is gracefully shut down. Format: [nD] [nH] [nM] [nS].

    Note

    To find the maximum run time for a particular ACE, use the following command:

    $ ngc ace info <ace name> --org <org id> --format_type json
    

    The field “maxRuntimeSeconds” in the output contains the maximum run time.

  • --preempt RUNONCE: specifies the RUNONCE job class for preemption and scheduling.

Example 1: To run a Jupyterlab instance on node 0

$ ngc base-command run \
--name "multinode-jupyterlab" \
--total-runtime 3000s \
--instance dgxa100.80g.8.norm \
--array-type "MPI" \
--replicas "2" \
--image "nvidia/tensorflow:21.03-tf1-py3" \
--result /result \
--port 8888 \
--commandline "set -x && date && nvidia-smi && \
jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin=*"

The mpirun and bcprun commands can then be run from within JupyterLab after launching.

Example 2: Using mpirun

$ ngc base-command run \
--name "multinode-simple-test" \
--total-runtime 3000s \
--instance dgxa100.80g.8.norm \
--array-type "MPI" \
--replicas "2" \
--image "nvidia/tensorflow:21.03-tf1-py3" \
--result /result \
--port 8888 \
--commandline "mpirun --allow-run-as-root -x IBV_DRIVERS=/usr/lib/libibverbs/libmlx5 -np \${NGC_ARRAY_SIZE} -npernode 1 bash -c 'hostname'"

Note that mpirun is used to execute the commands on all the replicas, specified via NGC_ARRAY_SIZE. The actual command to run on each replica (here, hostname) is included as a bash command input, with special characters escaped as needed.

Example 3: Using mpirun with PyTorch

Note the use of NGC_ARRAY_SIZE, NGC_ARRAY_INDEX, and NGC_MASTER_ADDR.

$ ngc base-command run \
--name "multinode-pytorch" \
--total-runtime 3000s \
--instance dgxa100.80g.8.norm \
--array-type "MPI" \
--replicas "2" \
--image "nvidia/pytorch:22.11-py3" \
--result /result \
--port 8888 \
--commandline "python3 -m torch.distributed.launch \
--nproc_per_node=8 \
--nnodes=\${NGC_ARRAY_SIZE} \
--node_rank=\${NGC_ARRAY_INDEX} \
--master_addr=\${NGC_MASTER_ADDR} train.py"
13.1.5.1. Targeting Commands to a Specific Replica

The CLI can be used to execute a command in a running job container by using the following command.

$ ngc base-command exec <job_id>

For a multi-node workload, there are multiple replicas running containers. The replicas are numbered with zero-based indexing. The above command, specifying just the job id, targets the exec command to the first replica, which is indexed at 0 (zero). You may need to run a command on a different replica in a multi-node workload, which can be achieved by the following option.

$ ngc base-command exec <job_id>:<replica-id>

When omitted, the first replica (id 0) is targeted for the command.
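For example, to open an interactive shell on the second replica of a multi-node job (the job ID is illustrative):

$ ngc base-command exec 1070707:1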

13.1.5.2. Viewing Multi-node Job Status and Information

The status of the overall job can be checked with the following command:

$ ngc base-command info <job_id>

To check the status of one of the replicas, issue:

$ ngc base-command info <job_id>:<replica_id>

where <replica_id> is from 0 (zero) to (number of replicas)-1.

The following example shows the status of each replica of a two-replica job:

$ ngc base-command info 1070707:0
--------------------------------------------------
 Replica Information
 Replica: 1070707:0
 Created At: 2020-03-04 22:39:00 UTC
 Submitted By: John Smith
 Team: swngc-mnpilot
 Replica Status
 Status: CREATED
--------------------------------------------------
$ ngc base-command info 1070707:1
--------------------------------------------------
 Replica Information
 Replica: 1070707:1
 Created At: 2020-03-04 22:39:00 UTC
 Submitted By: John Smith
 Team: swngc-mnpilot
 Replica Status
 Status: CREATED
--------------------------------------------------

To get information about the results of each replica, use:

$ ngc result info <job_id>:<replica_id>

13.1.6. Launching Multi-node Jobs with bcprun

When launching multi-node jobs, NGC installs bcprun, a multi-node application launcher utility, on Base Command Platform clusters. The primary benefits of bcprun are the following:

  • Removes dependency on mpirun in the container image.

  • Provides srun equivalence to allow users to easily migrate jobs between Slurm and Base Command Platform clusters.

  • Provides a unified launch mechanism by abstracting a framework-specific environment needed by distributed DL applications.

  • Allows users to submit commands as part of a batch script.

Syntax:

$ bcprun --cmd '<command-line>'

Where:

  • <command-line> is the command to run

Example:

$ bcprun --cmd 'python train.py'

Optional Arguments

-n <n>, --nnodes <n>

Number of nodes to run on. (type: integer)

Range: min value: 1, max value: R,

where R is the max number of replicas requested by the NGC job.

Default value: R

Example:--nnodes 2

-p <p>, --npernode <p>

Number of tasks per node to run. (type: integer)

Range: min value: 1, max value: (none)

Default value: environment variable NGC_NTASKS_PER_NODE, if set; otherwise 1.

Example:--npernode 8

-e <e>, --env <e>

Environment variables to set with format ‘key=value’.

(type: string)

Each variable assignment requires a separate -e or --env flag.

Default value: (none)

Example:--env 'var1=value1' --env 'var2=value2'

-w <w>, --workdir <w>

Base directory from which to run <cmd >. (type: string)

May include environment variables defined with --env.

Default value: environment variable PWD (current working directory)

Example:

--workdir '$WORK_HOME/scripts' --env
                'WORK_HOME=/mnt/workspace'

-l <l>, --launcher <l>

Run <cmd > using an external launcher program. (type: string)

Supported launchers: mpirun, horovodrun

  • mpirun: maps to OpenMPI options

(https://www.open-mpi.org/)

  • horovodrun: maps to Horovod options

(https://horovod.ai/)

Note: This option assumes the launcher exists and is in PATH.

Launcher-specific arguments (not part of bcprun options) can be provided as a suffix.

Example:--launcher 'mpirun --allow-run-as-root'

Default value: (none)

-a, --async

Run with asynchronous failure support enabled, i.e. a child process of bcprun can exit on failure without halting the program.

The program will continue while at least one child is running.

The default semantics of bcprun is to halt the program when any child process launched by bcprun exits with error.

-d, --debug

Print debug info and enable verbose mode.

This option also sets the following environment variables for additional debug logs:

NCCL_DEBUG=INFO

TORCH_DISTRIBUTED_DEBUG=INFO

-log, --logdir

Note: For jobs with array-type “PYTORCH”.

Override the default location for saving job logs. This location will contain the STDOUT and STDERR logs for every worker-node.

The -d or --debug argument must also be enabled for this argument to function.

Example: bcprun --npernode 8 -d --logdir "/workspace" -c "python3 train.py"

-no_redirect, --no_redirect

When this flag is used, bcprun will print the logs to terminal stdout/stderr instead of redirecting to joblog.log and per-rank per-node files.

-j, --jsonlogs

Note: For jobs with array-type “PYTORCH”.

When this flag is used, bcprun will print the logs using fluent-bit’s json wrapper with timestamp and filename apart from the logs.

Without this flag, it would write raw output. This flag is only applicable when process stdout/stderr are being redirected to logs.

-v, --version

Print version info.

-h, --help

Print this help message.

13.1.6.1. Basic Usage

The following multi-node job submission command runs the hostname command on two nodes using bcprun.

ngc base-command run --name "getting-started" \
--image "nvidia/pytorch:20.06-py3" --commandline "bcprun --cmd hostname" \
--preempt RUNONCE --result /result --ace nv-us-west-2 --org nvidian \
--team swngc-mnpilot --instance dgx1v.32g.8.norm --total-runtime 1m \
--replicas 2 --array-type MPI

The job will print the hostnames of each replica and will be similar to the following output.

1174493-worker-0
1174493-worker-1

  • bcprun is only available inside a running container in Base Command Platform clusters. Hence, the bcprun command and its arguments can be specified (either directly or within a script) only as part of the --commandline argument of the ngc job.

  • Multi-node ngc jobs have to specify the --array-type argument to define the kind of environment required inside the container. The following array-types are supported:

    • MPI: This is the legacy array-type for ngc jobs to launch multi-node applications from a single launcher node (aka mpirun launch model)

    • PYTORCH: This will set up the environment to launch distributed PyTorch applications with a simple command. Example: bcprun --npernode 8 --cmd 'python train.py'

  • bcprun requires the user application command (and its arguments) to be specified as a string argument of flag --cmd (or -c in short form)

13.1.6.2. Using --nnodes / -n

This option specifies how many nodes to launch the command on to. While the maximum number of nodes allocated to a ngc job is specified by --replicas, the user can launch the application on a subset of nodes using --nnodes (or -n in the short form). In the absence of this option, the default behavior of bcprun is to launch the command on all the replica nodes.

ngc base-command run --name "getting-started" --image "nvidia/pytorch:20.06-py3" \
--commandline "bcprun --nnodes 3 --cmd hostname"--preempt RUNONCE --result /result \
--ace nv-us-west-2 --org nvidian --team swngc-mnpilot --instance dgx1v.32g.8.norm \
--total-runtime 1m --replicas 4 --array-type MPI

In this example, although four replicas are allocated, bcprun runs hostname on only three nodes, producing the following output.

1174495-worker-0
1174495-worker-1
1174495-worker-2
13.1.6.3. Using --npernode / -p

Multiple instances of an application task can be run on each node by specifying the --npernode (or -p in the short form) option as follows:

ngc base-command run --name "getting-started" --image "nvidia/pytorch:20.06-py3" \
--commandline "bcprun --npernode 2 --cmd hostname"--preempt RUNONCE --result /result \
--ace nv-us-west-2 --org nvidian --team swngc-mnpilot --instance dgx1v.32g.8.norm \
--total-runtime 1m --replicas 2 --array-type MPI

In this case, two instances of hostname are run on each node, which produces the following output:

1174497-worker-0
1174497-worker-0
1174497-worker-1
1174497-worker-1
13.1.6.4. Using --workdir / -w

The user can specify the path of the executable using the --workdir option (or -w in the short form). This example shows the use of bcprun for a PyTorch DDP model training job on 2 nodes with 8 GPUs per node, and illustrates usage of the --workdir option:

ngc base-command run --name "pytorch-job" --image "nvidia/pytorch:21.10-py3" \
--commandline "bcprun --npernode 8 --cmd 'python train.py' --workdir /workspace/test" \
--workspace MLumas39SZmqY8z2NAqoHw:/workspace/test:RW --result /result --preempt RUNONCE \
--ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm --replicas 2 --array-type "PYTORCH" \
--total-runtime 30m
13.1.6.5. Using --env / -e

The user can set environment variables that are passed to rank processes and used by the launched command using the --env option (or -e in the short form). The following example shows how to set the debug level of NCCL output to INFO.

ngc base-command run --name "pytorch-job" --image "nvidia/pytorch:21.10-py3" \
--commandline "bcprun --npernode 8 --cmd 'python train.py' --workdir /workspace/test \
--env NCCL_DEBUG=INFO" --workspace MLumas39SZmqY8z2NAqoHw:/workspace/test:RW \
--result /result --preempt RUNONCE --ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm \
--replicas 2 --array-type "PYTORCH" --total-runtime 30m
13.1.6.6. Using bcprun in a Script

bcprun commands can be chained together into a batch script and invoked by the job commandline as follows.

ngc base-command run --name "pytorch-job" --image "nvidia/pytorch:21.10-py3" \
--commandline "bcprun.sub" --workspace MLumas39SZmqY8z2NAqoHw:/workspace/test:RW \
--result /result --preempt RUNONCE --ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm \
--replicas 2 --array-type "PYTORCH" --total-runtime 30m

where bcprun.sub is an executable script containing many bcprun commands as follows:

#!/bin/bash
bcprun --npernode 8 --cmd "python train.py --phase=1"
bcprun --npernode 8 --cmd "python train.py --phase=2"
13.1.6.7. PyTorch Example

bcprun greatly simplifies the launching of distributed PyTorch applications on BCP clusters by automatically abstracting the environment required by torch.distributed. A multi-node PyTorch Distributed Data Parallel (DDP) training job using a python training script (train.py) could be launched by mpirun as follows:

mpirun -np 2 -npernode 1 python -m torch.distributed.launch --nproc_per_node=8 \
--nnodes=${NGC_ARRAY_SIZE} --node_rank=${NGC_ARRAY_INDEX} --master_addr=${NGC_MASTER_ADDR} train.py

In contrast, the command using bcprun would look something like this:

bcprun -p 8 -c 'python train.py'

With bcprun, we have two advantages:

  1. The container has no dependence on MPI or mpirun

  2. Distributed PyTorch-specific parameters are now abstracted to a unified launch mechanism

Combined with the --array-type PYTORCH ngc job parameter, the complete job specification is shown below:

ngc base-command run --name "pytorch-test" --image "nvidia/pytorch:21.10-py3" \
--commandline "bcprun -d -p 8 -c 'python train.py' -w /workspace/test" \
--workspace MLumas39SZmqY8z2NAqoHw:/workspace/test:RW --result /result --preempt RUNONCE \
--ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm --replicas 2 --array-type "PYTORCH" \
--total-runtime 30m

Environment Variables

The NGC job parameter --array-type PYTORCH is used by bcprun to set the environment variables required for the PyTorch training rank processes and conforms to the requirements of torch.distributed. A PyTorch distributed application can depend on the following environment variables to be set by bcprun when launching the training script:

LOCAL_RANK

RANK

GROUP_RANK

LOCAL_WORLD_SIZE

WORLD_SIZE

ROLE_WORLD_SIZE

MASTER_ADDR

MASTER_PORT

NGC_RESULT_DIR

Optionally, if the -d, --debug argument is enabled in the bcprun command, the following environment variables will be set:

NCCL_DEBUG=INFO

TORCH_DISTRIBUTED_DEBUG=INFO

PyTorch local rank: ‘--local-rank’ flag vs ‘LOCAL_RANK’ env var

bcprun always sets the environment variable LOCAL_RANK regardless of PyTorch version.

bcprun also passes the --local-rank flag argument by default as of this release.

The --local-rank flag has been deprecated starting from PyTorch version 1.9. Training scripts are expected to use the environment variable LOCAL_RANK instead.

bcprun will pass the --local-rank flag argument only for PyTorch versions < 1.10. For PyTorch versions >= 1.10, the --local-rank flag argument will NOT be passed to the training script by default. If you depend on parsing --local-rank in your training script for PyTorch versions >= 1.10, you can override the default behavior by setting the environment variable NGC_PYTORCH_USE_ENV=0. Conversely, setting the environment variable NGC_PYTORCH_USE_ENV=1 for PyTorch versions < 1.10 will suppress passing the --local-rank flag argument.
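
As a concrete illustration of the behavior described above, a version-agnostic training script can accept the legacy flag while preferring the environment variable. This is only a sketch; the argument handling and fallback logic are an assumption about how a user's script might be written, not part of bcprun itself.

import argparse
import os

parser = argparse.ArgumentParser()
# Accept the legacy flag passed by bcprun for PyTorch < 1.10 (harmless otherwise)
parser.add_argument("--local-rank", "--local_rank", type=int, default=-1, dest="local_rank")
args, _ = parser.parse_known_args()

# Prefer the LOCAL_RANK environment variable, which bcprun always sets
local_rank = int(os.environ.get("LOCAL_RANK", args.local_rank))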

13.1.6.8. BERT Example

The following example illustrates the use of bcprun to run a training job for the PyTorch BERT model.

ngc base-command run --name "bert_example" --image "nvidia/dlx_bert:21.05-py3" \
--commandline "cd /workspace/bert && BATCHSIZE=\$(expr 8192 / \$NGC_ARRAY_SIZE) LR=6e-3 GRADIENT_STEPS=\$(expr 128 / \$NGC_ARRAY_SIZE) PHASE=1 NGC_NTASKS_PER_NODE=8 ./bcprun.sub && BATCHSIZE=\$(expr 4096 / \$NGC_ARRAY_SIZE) LR=4e-3 GRADIENT_STEPS=\$(expr 256 / \$NGC_ARRAY_SIZE) PHASE=2 NGC_NTASKS_PER_NODE=8 ./bcprun.sub" \
--workspace MLumas39SZmqY8z2NAqoHw:/workspace/bert:RW --datasetid 208137:/workspace/data \
--result /result --preempt RUNONCE --ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm \
--replicas 2 --array-type "PYTORCH" --total-runtime 2D
13.1.6.9. SSD Example

The following example runs a multinode SSD training job by invoking the ssd_bcprun.sub script from the mounted workspace.

ngc base-command run --name "SSD_example" --image "nvidia/dlx_ssd:latest" \
--commandline "cd /workspace/ssd; ./ssd_bcprun.sub" --workspace SSD_dev6:/workspace/ssd:RW \
--result /result --preempt RUNONCE --ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm \
--replicas 2 --array-type "PYTORCH" --total-runtime 10h
13.1.6.10. PyTorch Lightning Example

An example of a PyTorch Lightning training job is shown below. Note that the PYTORCH array type is used for PyTorch Lightning (PTL) jobs.

ngc base-command run --name "ptl-test" --image "nvidia/nemo_megatron:pyt21.10" \
--commandline "bcprun -p 8 -d -c 'python test_mnist_ddp.py'" \
--workspace MLumas39SZmqY8z2NAqoHw:/workspace/bert:RW --result /result --preempt RUNONCE \
--ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm --replicas 2 --array-type "PYTORCH" \
--total-runtime 30m

Note: bcprun sets the environment variables “RANK”, “GROUP_RANK”, “LOCAL_RANK”, and “LOCAL_WORLD_SIZE”, which allow PyTorch Lightning to infer the torchelastic environment.
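
For reference, the following is a minimal sketch of how a PyTorch Lightning script might be configured for this kind of job. The Trainer arguments shown are an assumption based on recent PyTorch Lightning APIs (older versions use gpus instead of devices) and are not taken from the tutorial script itself.

import os
import pytorch_lightning as pl

trainer = pl.Trainer(
    accelerator="gpu",
    devices=int(os.environ.get("LOCAL_WORLD_SIZE", "8")),  # GPUs per node, matching bcprun -p 8
    num_nodes=int(os.environ.get("NGC_ARRAY_SIZE", "1")),  # number of replicas in the job
    strategy="ddp",
)
# trainer.fit(model, datamodule)  # model and datamodule are defined elsewhere in the script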

13.1.6.11. MPI Example

For applications that require MPI and mpirun, bcprun supports them through the --launcher="mpirun" option. An example of an MPI multinode job using bcprun follows.

ngc base-command run --name "bcprun-launcher-mpirun" --image "nvidia/mn-nccl-test:sharp" \
--commandline "bcprun -l mpirun -p 8 -c 'all_reduce_perf -b 1G -e 1G -g 1 -c 0 -n 200'" \
--result /result --preempt RUNONCE --ace netapp-sjc-4-ngc-dev6 --instance dgxa100.40g.8.norm \
--replicas 2 --array-type "MPI" --total-runtime 30m

The array-type here is set to “MPI”. bcprun invokes the multi-node job using the defined mpirun launcher. The equivalent mpirun command invoked by bcprun is as follows.

mpirun --allow-run-as-root -np 16 -npernode 8 all_reduce_perf -b 1G -e 1G -g 1 -c 0 -n 200

13.2. Job ENTRYPOINT

The NGC Base Command Platform CLI provides the option of incorporating the Docker ENTRYPOINT when running jobs.

Some NVIDIA deep learning framework containers rely on ENTRYPOINT to be called for full functionality. The following functions in these containers rely on ENTRYPOINT:

  • Version banner to be printed to logs

  • Warnings/errors if any platform prerequisites are missing

  • MPI set up for multi-node

The following is an example of the version header information that is returned after running a TensorFlow container with the incorporated ENTRYPOINT using the docker run command.

$ docker run --runtime=nvidia --rm -it nvcr.io/nvidia/tensorflow:21.03-tf1 nvidia-smi

================
== TensorFlow ==
================
NVIDIA Release 21.03-tf1 (build 20726338)
TensorFlow Version 1.15.5
Container image Copyright (c) 2021, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2021 The TensorFlow Authors.  All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.

Without using ENTRYPOINT in the CLI, no banner information appears in the output, as shown in the following example of using the NGC Base Command CLI to run nvidia-smi within the TensorFlow container without ENTRYPOINT.

$ ngc base-command run \
--name "TensorFlow Demo" \
--preempt RUNONCE \
--min-timeslice 0s \
--total-runtime 0s \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.1.norm \
--result /result \
--image "nvidia/tensorflow:21.03-tf1-py3" \
--commandline "nvidia-smi"

Initial lines of the output Log File (no TensorFlow header information is generated):

Thu Apr 15 17:32:02 2021
+-------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.2 |
|---------------------+----------------------+----------------------+
...

13.2.1. Example Using Container ENTRYPOINT

To use the container ENTRYPOINT, use the --use-image-entrypoint argument.

Example:

$ ngc base-command run \
--name "TensorFlow Entrypoint Demo" \
--preempt RUNONCE \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.1.norm \
--result /result \
--image "nvidia/tensorflow:21.03-tf1-py3" \
--use-image-entrypoint \
--commandline "nvidia-smi"

Output log file with TensorFlow header information, including initial lines of the nvidia-smi output.

================
== TensorFlow ==
================
NVIDIA Release 21.03-tf1 (build 20726338)
TensorFlow Version 1.15.5
Container image Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2021 The TensorFlow Authors. All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.

Thu Apr 15 17:42:37 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.2           |
|-------------------------------+----------------------+----------------------+
...

13.2.2. Example Using CLI ENTRYPOINT

You can also use the --entrypoint argument to specify an ENTRYPOINT to use that will override the container ENTRYPOINT.

The following example specifies an ENTRYPOINT in the ngc base-command run command to run nvidia-smi, instead of using the --commandline argument.

$ ngc base-command run \
--name "TensorFlow CLI Entrypoint Demo" \
--preempt RUNONCE \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.1.norm \
--result /result \
--image "nvidia/tensorflow:21.03-tf1-py3" \
--entrypoint "nvidia-smi"

Initial lines of the output file.

Thu Apr 15 17:52:53 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0           |
|-------------------------------+----------------------+----------------------+
...

14. Tutorials

This chapter describes tutorials that showcase various features of Base Command Platform (BCP). It covers ready-to-run tutorials available within the product that you can use to learn a workflow or as a basis for your own custom workflow, as well as tutorials with sample commands and templates that can serve as a starting point for new users or for new, complex workflows.

Note

The ready-to-run tutorials are delivered as templates in the nvbc-tutorials team context, along with the required container images and data entities. Your org admin must explicitly add you to that team before you can access these templates and run workloads based on them.

14.1. Launching a Job from Existing Templates

  1. Click BASE COMMAND > Jobs in the left navigation menu and then click Create Job.

  2. Click the Templates tab.

    _images/create-job-templates.png
  3. Click the menu icon for the template to use, then select Apply Template.

    _images/apply-template.png

    The job creation page opens with its fields populated from the job template.

  4. Verify the pre-filled fields, enter a unique name, then click Launch.

    _images/launch-job.png

14.2. Launching an Interactive Job with JupyterLab

From the existing templates, you can run the nvbc-jupyterlab template to pre-fill the job creation fields and launch an interactive job with JupyterLab. The following is the equivalent CLI command for the same job template.

$ ngc base-command run \
--name "NVbc-jupyterlab" \
--preempt RUNONCE \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.1.norm \
--commandline "set -x; jupyter lab --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin='*' & date; nvidia-smi; echo $NVIDIA_BUILD_ID; sleep 1d" \
--result /result \
--image "nvidia/pytorch:21.02-py3" \
--org nv-eagledemo \
--team nvbc-tutorials \
--port 8888

14.3. Launching a Multi Node Interactive Job with JupyterLab

From the existing templates, you can run the nvbc-jupyterlab-mn template to pre-fill the job creation fields and launch a multinode interactive job with 2 nodes. The following is the equivalent CLI command for the same job template.

$ ngc base-command run \
--name "nvbc-jupyterlab-mn" \
--preempt RUNONCE \
--min-timeslice 0s \
--total-runtime 36000s \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.8.norm \
--commandline "mpirun --allow-run-as-root -np 2 -npernode 1 bash -c ' set -x; jupyter lab --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin='*' & date; nvidia-smi; echo ; sleep 1d'" \
--result /result \
--array-type "MPI" \
--replicas "2" \
--image "nvidia/pytorch:21.02-py3" \
--org nv-eagledemo \
--team nvbc-tutorials \
--port 8888

14.4. Getting Started with Tensorboard

TensorBoard is installed by default in standard NGC containers. Perform the following steps to get started with TensorBoard.

  1. Start a TensorFlow job.

    The following is an example using the NGC CLI.

    $ ngc base-command run \
    --name "NVbc-tensorboard" \
    --preempt RUNONCE \
    --ace nv-eagledemo-ace \
    --instance dgxa100.40g.1.norm \
    --commandline "set -x; jupyter lab --allow-root --NotebookApp.token='' --NotebookApp.allow_origin=* --notebook-dir=/ & date; tensorboard --logdir /workspace/logs/fit ; sleep 1d" \
    --result /result \
    --image "nvidia/tensorflow:21.08-tf1-py3" \
    --org nv-eagledemo \
    --team nvbc-tutorials \
    --port 8888 \
    --port 6006
    

    Once the container is running, the job information page shows URLs mapped to ports 8888 and 6006.

  2. Login to the container via JupyterLab and open a terminal.

  3. Download the TensorBoard tutorial notebook.

    wget https://storage.googleapis.com/tensorflow_docs/tensorboard/docs/get_started.ipynb
    
  4. Open the downloaded notebook.

  5. Run the commands in the notebook until you get to command 6.

    tensorboard --logdir logs/fit
    
  6. Open the URL mapped to port 6006 on the container to open TensorBoard.

    The TensorBoard UI should appear similar to the following example.

    _images/tensorboard-ui.png

Refer to https://www.tensorflow.org/tensorboard/get_started for more information on how to use TensorBoard.
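
If you prefer to script the training step directly instead of stepping through the tutorial notebook, the following sketch (adapted from the TensorBoard getting-started example) writes its logs under /workspace/logs/fit so that the tensorboard --logdir /workspace/logs/fit process started by the job command can find them; the model definition itself is illustrative only.

import datetime
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Write event files where the job's tensorboard process is watching
log_dir = "/workspace/logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tb_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

model.fit(x_train, y_train, epochs=5,
          validation_data=(x_test, y_test), callbacks=[tb_callback])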

14.5. NCCL Tests

NCCL tests check both the performance and the correctness of NCCL operations, and you can test the performance between GPUs using the nvbc-MN-NCCL-Tests template. The following is the equivalent CLI command for the NCCL test template. The average bus bandwidth for a successful NCCL test is expected to be greater than 175 GB/s.

$ ngc base-command run \
--name "nvbc-MN-NCCL-Tests" \
--preempt RUNONCE \
--total-runtime 86400s \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.1.norm \
--commandline "bash -c 'for i in {1..20}; do echo \"******************** Run ********************\"; mpirun -np ${NGC_ARRAY_SIZE} -npernode 1 /nccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -t 8 -g 1; done'" \
--result /result \
--array-type "MPI" \
--replicas "2" \
--image "nv-eagledemo/mn-nccl-test:ibeagle" \
--org nv-eagledemo \
--team nvbc-tutorials

14.6. StyleGAN SingleNode Workload

From the existing templates, you can run the nvbc-stylegan-singlenode template to pre-fill the job creation fields and launch the job. The following is the equivalent CLI command for a StyleGAN single-node workload with 8 GPUs.

$ ngc base-command run \
--name "StyleGAN-singlenode" \
--preempt RUNONCE \
--min-timeslice 0s \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.8.norm \
--commandline "python -u -m torch.distributed.launch --nproc_per_node=8 /mnt/workspace/train.py --snap=25 --data=/dataset --batch-size=32 --lr=0.002" \
--result /output \
--image "nv-eagledemo/nvbc-tutorials/pytorch_stylegan:v1" \
--org nv-eagledemo \
--team nvbc-tutorials \
--datasetid 76731:/dataset

Here’s an example of the telemetry once the job is launched.

_images/ug-tut-stylegan-singlenode-workload-telemetry.png

14.7. StyleGAN MultiNode Workload

From the existing templates, you can run the nvbc-stylegan-multinode template to pre-fill the job creation fields and launch the job. The following is the equivalent CLI command for the multinode StyleGAN workload with 2 nodes.

$ ngc base-command run \
--name "StyleGAN-multinode" \
--preempt RUNONCE \
--min-timeslice 0s \
--total-runtime 230400s \
--ace nv-eagledemo-ace \
--instance dgxa100.40g.8.norm \
--commandline "mpirun --allow-run-as-root -np 2 -npernode 1 bash -c 'python -u -m torch.distributed.launch --nproc_per_node=8 --master_addr=${NGC_MASTER_ADDR} --nnodes=${NGC_ARRAY_SIZE} --node_rank=${NGC_ARRAY_INDEX} /mnt/workspace/train.py --snap=25 --data=/dataset --batch-size=64 --lr=0.002'" \
--result /output \
--array-type "MPI" \
--replicas "2" \
--image "nv-eagledemo/nvbc-tutorials/pytorch_stylegan3:pytorch.stylegan.v1" \
--org nv-eagledemo \
--team nvbc-tutorials \
--datasetid 76731:/dataset

Here’s an example of the telemetry once the job is launched.

_images/ug-tut-stylegan-multinode-workload-telemetry-1200.png

14.8. Building a Dataset from S3 Cloud Storage

This section details an example of building a dataset from a cloud storage bucket using the CLI and code.

Perform the following before starting.

  1. Identify credentials and location of the cloud storage bucket.

  2. Know the directory structure within the bucket.

  3. Create a workspace in Base Command Platform (typically dedicated as home workspace).

    Refer to Creating a Workspace Using the Base Command Platform CLI for instructions.

  4. Have a job running that you can exec into, or from which you can run the following example.

14.8.1. Running a Job

  1. Start a Jupyter notebook job.

    Replace the ACE, org, workspace, and team argument values. The job will run for one hour.

    ngc base-command run --name "demo-s3-cli" --preempt RUNONCE --ace {ace-name} \
    --instance {instance-type} --commandline "jupyter lab --ip=0.0.0.0 --allow-root \
    --no-browser --NotebookApp.token='' \
    --notebook-dir=/ --NotebookApp.allow_origin='*' & date; sleep 1h" --result /results \
    --workspace {workspace-name}:/{workspace-name}:RW --image "nvidia/pytorch:21.07-py3" \
    --org {org-name} --team {team-name} --port 8888
    
  2. Once the job has started, access the JupyterLab terminal.

    ngc base-command info {id}
    --------------------------------------------------
    Job Information
      Id: 2233490
      ...
    Job Container Information
      Docker Image URL: nvidia/pytorch:21.07-py3
      Port Mappings
        Container port: 8888 mapped to https://tnmy3490.eagle-demo.proxy.ace.ngc.nvidia.com
        ...
    Job Status
      ...
      Status: RUNNING
      Status Type: OK
    --------------------------------------------------
    

    Alternatively, exec into the job through NGC CLI.

    _images/dataset-s3-cloud-running-job-ui.png

14.8.2. Creating a Dataset using AWS CLI

  1. Obtain, unzip, and install the AWS CLI zip file.

     curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
    
  2. Ensure there is access to the AWS CLI.

    aws --version
    
  3. Run through the AWS Configuration by inputting the Access Key ID and Secret Access Key.

    These can be found underneath AWS’s IAM user panel. Refer to additional AWS CLI documentation.

    aws configure
    AWS Access Key ID [None]: <ACCESS_KEY>
    AWS Secret Access Key [None]: <SECRET_ACCESS_KEY>
    Default region name [None]: us-west-2
    Default output format [None]: json
    
  4. Sync a bucket to the results folder to be saved as a dataset.

    aws s3 sync 's3://<source-bucket>' '../results'
    

Results should now be ready to be saved as a dataset. Refer to Managing Datasets for more information.

14.8.3. Creating a Dataset using AWS Boto3

Boto3 is the AWS SDK for Python, which can be used to access S3 buckets. This section covers downloading a specific file from an S3 bucket and then saving it to a results folder. Refer to the Boto3 documentation for more details.

  1. Install Boto3 through pip and prepare imports in the first cell of the Jupyter notebook.

    !pip install boto3
    
    import boto3
    import io
    import os
    
  2. Initialize Boto3 with an AWS Access Key and Secret Access Key.

    Make sure IAM user settings has proper access and permissions to the needed S3 buckets.

     # Let's use Amazon S3 by initializing our Access Key and Secret Access Key
    s3 = boto3.resource('s3', aws_access_key_id=<ACCESS_KEY>,
    aws_secret_access_key=<SECRET_ACCESS_KEY>)
    
    bucket = s3.Bucket(<BUCKET_NAME>)
    

14.8.4. Downloading a File

Downloading a file is a function built into Boto3. It needs the bucket name, the object name (referred to as a key), and the output file name. Because the example above created an S3 resource, the underlying client is accessed through s3.meta.client. Refer to Amazon S3 Examples - Downloading files for additional information.

s3.meta.client.download_file(<BUCKET_NAME>, <OBJECT_NAME>, <FILE_NAME>)

14.8.5. Downloading a Folder

The following includes a function for downloading a single-directory depth from an S3 bucket to BCP storage, either to /results mount of the job or to a Base Command Platform workspace mounted in the job.

def download_s3_folder(s3_folder, local_dir='../results/s3_bucket'):
    # Iterate over every object under the given prefix in the bucket defined above
    for obj in bucket.objects.filter(Prefix=s3_folder):
        # Map the object key to a path under local_dir, preserving the folder layout
        target = obj.key if local_dir is None \
            else os.path.join(local_dir, os.path.relpath(obj.key, s3_folder))
        if not os.path.exists(os.path.dirname(target)):
            os.makedirs(os.path.dirname(target))
        # Skip "directory" placeholder keys; only download actual objects
        if obj.key[-1] == '/':
            continue
        print(obj.key)
        bucket.download_file(obj.key, target)

To save a dataset or checkpoint from the /results mount, download the contents and then upload as a dataset as described in Converting a Checkpoint to a Dataset.

14.9. Using Data Loader for Cloud Storage

This section details an example of using a data loader with a cloud storage bucket. It is recommended to try the CLI option before proceeding with the data loader, because the data loader does not preserve the folder hierarchy.

Perform the following before starting.

  1. Identify credentials and location of the cloud storage bucket.

  2. Know the directory structure within the bucket.

  3. Create a workspace in Base Command Platform (typically dedicated as home workspace).

    Refer to Creating a Workspace Using the Base Command Platform CLI for instructions.

14.9.1. Running and Opening JupyterLab

  1. Mount the workspace in the job.

  2. Replace ACE, org, workspace, and team arguments.

    ngc base-command run --name "demo-s3-dataloader" --preempt RUNONCE --ace {ace-name} \
    --instance {instance-type} --commandline "jupyter lab --ip=0.0.0.0 \
    --allow-root --no-browser --NotebookApp.token='' --notebook-dir=/ \
    --NotebookApp.allow_origin='*' & date; sleep 6h" --result /results \
    --workspace {workspace-name}:/mount/{workspace-name}:RW --image "nvidia/pytorch:21.07-py3" \
    --org {org-name} --team {team-name} --port 8888
    
  3. Open the link for the JupyterLab to access the UI.

    Do this by fetching the job’s information with the base-command info command. Below is an example response with the mapped port. You can Ctrl+click the mapped URL to open it in your browser.

    ngc base-command info {id}
    --------------------------------------------------
      Job Information
        Id: 2233490
        ...
      Job Container Information
        Docker Image URL: nvidia/pytorch:21.07-py3
        Port Mappings
           Container port: 8888 mapped to https://tnmy3490.eagle-demo.proxy.ace.ngc.nvidia.com
           ...
      Job Status
        ...
        Status: RUNNING
        Status Type: OK
    --------------------------------------------------
    

    You should now be prompted with options to create a file.

  4. Navigate into your workspace on the sidebar, and then click on Python 3 to create your file.

    _images/data-loader-jupyterlab.png

14.9.2. Utilizing the Cloud Data Loader for Training

Use the notebook code from Create a Jupyter Notebook, Including W&B Keys for Experiment Tracking, with these changes:

  1. Do not issue import wandb.

  2. Add the following imports:

    # Imports
    !pip install boto3
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config
    from io import BytesIO
    
  3. Change the first line of #3.2.

    From this:

    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
    

    To this:

    s3 = boto3.client('s3', config=Config(signature_version=UNSIGNED))
    bucket_name='mnist-testbucket'
    key='mnist_2.npz'
    
    s3_response_object = s3.get_object(Bucket=bucket_name, Key=key)
    object_content = s3_response_object['Body'].read()
    load_bytes = BytesIO(object_content)
    
    with np.load(load_bytes, allow_pickle=True) as f:
       x_train, y_train = f['x_train'], f['y_train']
       x_test, y_test = f['x_test'], f['y_test']
    
  4. Execute Step #3 through Step #6.

14.10. Launching an Interactive Job with Visual Studio Code

This tutorial section contains three options for installing and accessing Visual Studio Code for use with Base Command Platform:

  • Installing Visual Studio Code’s code-server in a container

  • Installing and running Visual Studio Code’s code-server at job runtime

  • Installing Visual Studio Code CLI in a job and starting a remote tunnel

14.10.1. Installing Visual Studio Code in a Container

This option details installing Visual Studio Code in a container, pushing the container to a private registry, then launching a job in Base Command Platform using the container so that VS Code is accessible using a web browser.

_images/vscode-job-overview.png
14.10.1.1. Building the Container

The following is a sample Dockerfile to create a container that can launch Visual Studio Code to be accessible via a web browser. It includes examples for downloading and installing extensions.

To build this container, you’ll need a system set up with Docker and the NVIDIA Container Toolkit. For more information, refer to the NVIDIA Container Toolkit documentation.

For more information, refer to the code-server documentation.

  1. Create a Dockerfile for the container and the extensions we’ll need to install. A sample Dockerfile is provided below. In this case, we’re starting from the base TensorFlow container from NGC, but any container of your choice can be used.

    ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:22.04-tf2-py3
    FROM ${FROM_IMAGE_NAME}
    
    # Install code-server to enable easy remote development on a container
    # More info about code-server can be found here: https://coder.com/docs/code-server/v4.4.0
    ADD https://github.com/coder/code-server/releases/download/v4.4.0/code-server_4.4.0_amd64.deb code-server_4.4.0_amd64.deb
    RUN dpkg -i ./code-server_4.4.0_amd64.deb && rm -f code-server_4.4.0_amd64.deb
    
    # Install extensions from the marketplace
    RUN code-server --install-extension ms-python.python
    
    # Can also download vsix files and install them locally
    ADD https://github.com/microsoft/vscode-cpptools/releases/download/v1.9.8/cpptools-linux.vsix cpptools-linux.vsix
    RUN code-server --install-extension cpptools-linux.vsix
    
    # Download vsix from: https://marketplace.visualstudio.com/items?itemName=NVIDIA.nsight-vscode-edition
    # https://marketplace.visualstudio.com/_apis/public/gallery/publishers/NVIDIA/vsextensions/nsight-vscode-edition/2022.1.31181613/vspackage
    COPY NVIDIA.nsight-vscode-edition-2022.1.31181613.vsix NVIDIA.nsight-vscode-edition.vsix
    RUN code-server --install-extension NVIDIA.nsight-vscode-edition.vsix
    
  2. From the directory containing the Dockerfile, run the following commands to build and push the container to the appropriate team and org.

    docker build -t nvcr.io/<org>/<team>/vscode-server:22.04-tf2 .
    docker push nvcr.io/<org>/<team>/vscode-server:22.04-tf2
    
14.10.1.2. Starting a Job
  1. Using the Web UI or NGC CLI, you can then run a job with the container. An example job command is provided below.

    This job command selects the VS Code container that we just built and pushed to our private registry. It provides a port mapping in BCP corresponding with the --bind-addr argument in the command, and provides the launch command with the necessary parameters to start VS Code. Note: The password to access the VS Code console is set as an environment variable in the commandline parameter. This environment variable should be set to a password of your choice.

    ngc base-command run \
      --name "run_vscode" \
      --ace <ace>\
      --org <org> \
      --team <team> \
      --instance dgxa100.40g.1.norm \
      --image "nvcr.io/<org>/<team>/vscode:22.04-tf2" \
      --port 8899 \
      --result /results \
      --total-runtime 1h \
      --commandline "\
    PASSWORD=mypass code-server --auth password --bind-addr 0.0.0.0:8899 /workspace & \
    sleep infinity"
    
  2. Once the job has been created and is running, open the Web UI for Base Command Platform. In the Overview page for the job, click the link mapped to the port for code-server (in the example it is 8899).

  3. Then in the new window, enter the password (mypass in the above example) to enter the Visual Studio Code IDE.

    _images/vscode-password-prompt.png
  4. VS Code should come up after the password prompt. It might require a few quick setup steps, such as trusting the files/directories added to VS Code and choosing a theme layout. Once VS Code is up and running, you can edit files, and with the Python, C/C++, and Nsight extensions already installed, IntelliSense should also work.

    _images/vscode-intellisense-demo.png

14.10.2. Adding Visual Studio Code Capability at Runtime

You can also install and run Visual Studio Code at runtime when launching an existing image.

The following example shows the NGC CLI command to install and launch Visual Studio Code as --commandline arguments for a Base Command job, using the nvidia/pytorch image.

ngc base-command run --image nvidia/pytorch:22.05-py3 --port 8899 \
--name "run_vscode" \
--ace <ace>\
--org <org> \
--team <team> \
--instance dgxa100.40g.1.norm \
--result /results \
--total-runtime 1h \
--commandline "wget -nc https://github.com/coder/code-server/releases/download/v4.4.0/code-server_4.4.0_amd64.deb -O code-server_4.4.0_amd64.deb && dpkg -i ./code-server_4.4.0_amd64.deb && PASSWORD=mypass code-server --auth password --bind-addr 0.0.0.0:8899"

14.10.3. Setting Up and Accessing Visual Studio Code via Remote Tunnel

This is the simplest and most straightforward option for setting up and accessing Visual Studio Code from an already running Base Command Platform job, as it does not require port mappings to be configured at job runtime.

It leverages VS Code’s Remote Tunnels functionality, where we will install VS Code CLI in the job’s container, then create a remote tunnel for VS Code to the job that can be accessed through a web browser or your own VS Code instance.

  1. Within the job, run the following commands to download and extract the VS Code CLI.

    curl -Lk 'https://code.visualstudio.com/sha/download?build=stable&os=cli-alpine-x64' --output vscode_cli.tar.gz
    
    tar -xf vscode_cli.tar.gz
    

    You can either create a Dockerfile to build your own container image with the VS Code CLI already installed, as described in the first example, or install it at runtime, as in the previous example.

    To install this in an already running job, you can exec into the job using the following command, then run the above commands.

    $ ngc base-command exec <job_id>
    
  2. Once the CLI has been installed in the container and/or job, exec into the job, then run the below command. Follow the prompts to authenticate, and open the link provided to access VS Code from your browser.

    root@5517702:/job_workspace# ./code tunnel
    *
    
    
          Visual Studio Code Server
    
          By using the software, you agree to
          the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
          the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
    
    
    ✔ How would you like to log in to Visual Studio Code? · Microsoft Account
    To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code EM2SACRJT to authenticate.
    
    ✔ What would you like to call this machine? · BCP-5517702
    [2023-11-28 17:29:46] info Creating tunnel with the name: bcp-5517702
    
    Open this link in your browser https://vscode.dev/tunnel/bcp-5517702/job_workspace
    

14.11. Running DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective. This section details how to launch a DeepSpeed example job on Base Command Platform.

14.11.1. Creating the DeepSpeed Container

The following is a sample Dockerfile to create a container image for a specific version of DeepSpeed. The NVIDIA PyTorch container image is used as the base image to provide the required PyTorch dependencies for DeepSpeed.

  1. Define the container image:

    # Example Dockerfile for building a DeepSpeed image
    ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:23.12-py3
    FROM ${FROM_IMAGE_NAME}

    ENV TORCH_CUDA_ARCH_LIST="8.0 8.6 9.0+PTX"

    # libaio-dev required for async-io
    # https://www.deepspeed.ai/docs/config-json/#asynchronous-io
    RUN apt update && \
       apt install -y --no-install-recommends libaio-dev

    RUN pip install --upgrade pip setuptools wheel && \
       pip config set global.disable-pip-version-check true

    RUN cd /opt && \
       pip list | \
          awk '{print$1"=="$2}' | \
          tail +3 > pip_constraints.txt

    RUN pip install --upgrade pip && \
       pip install \
          triton \
          ninja \
          hjson \
          py-cpuinfo

    RUN python -m pip install --no-cache-dir -i https://pypi.anaconda.org/mpi4py/simple mpi4py

    RUN cd /opt && \
       git clone https://github.com/microsoft/DeepSpeed.git && \
       cd DeepSpeed && \
       git checkout v0.12.6 && \
       find . -type f -not -path '*/\.*' -exec \
          sed -i 's%std=c++14%std=c++17%g' {} + && \
       pip install pydantic==1.10.13 && \
       pip install -c /opt/pip_constraints.txt deepspeed-kernels && \
       DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_EVOFORMER_ATTN=0 \
          pip install -vvv --no-cache-dir --global-option="build_ext" .
    
  2. Build then push the container image using your BCP org private registry identifier as necessary. For example:

    docker build -t nvcr.io/<your private org>/pytorch-deepspeed:0.12.6 -f Dockerfile .
    docker push nvcr.io/<your private org>/pytorch-deepspeed:0.12.6
    
  3. After building and storing the image in your org’s private registry, you’ll need a script to launch a DeepSpeed example. We recommend using the CIFAR-10 tutorial in the DeepSpeed examples repo on GitHub.

    #!/bin/bash
    # file: run_cifar10_deepspeed.sh

    # Example reference code:
    # https://github.com/microsoft/DeepSpeedExamples/blob/master/training/cifar/cifar10_deepspeed.py

    cd /deepspeed_scratch

    # tested using sha dd0f181
    # if necessary, do a deep clone then
    # git reset --hard dd0f181
    if [ ! -d DeepSpeedExamples ]; then
    git clone \
    --single-branch \
    --depth=1 \
    --branch=master \
    https://github.com/microsoft/DeepSpeedExamples.git ;
    fi

    export CODEDIR=/deepspeed_scratch/DeepSpeedExamples

    # Patch a bug:
    # https://github.com/microsoft/DeepSpeedExamples/issues/222
    sed -i 's%images, labels = dataiter.next()%images, labels = next(dataiter)%g' \
    ${CODEDIR}/training/cifar/cifar10_deepspeed.py && \

    deepspeed \
    --launcher openmpi \
    --launcher_args="--allow-run-as-root" \
    --hostfile="/etc/mpi/hostfile" \
    --master_addr launcher-svc-${NGC_JOB_ID} \
    --no_ssh_check \
    ${CODEDIR}/training/cifar/cifar10_deepspeed.py
    
  4. After creating the launch script, upload it to the designated workspace within the ACE that you’ve already created. For example:

    ngc workspace upload --ace <your ace> --org <your org> --team <your team> --source run_cifar10_deepspeed.sh <your workspace>
    

    Note

    An alternative technique would be to include the script as part of the container image build described earlier. By uploading to a workspace, you decouple the lifecycle of the launch script from that of the image, which is preferable in most cases.

  5. Now you are ready to create a BCP job to launch the DeepSpeed training example. Assuming you used the same mount point as prescribed in the launch script (“deepspeed_scratch”), you can create a new job using the NGC CLI tool with this command:

    ngc base-command run \
    --name "run_cifar10_deepspeed" \
    --org <your org> \
    --team <your team> \
    --ace <your ace> \
    --instance dgxa100.80g.8.norm \
    --array-type "PYTORCH" \
    --replicas <node count> \
    --image "<container with deepspeed installed>" \
    --result /results \
    --workspace <your workspace>:/deepspeed_scratch:RW \
    --total-runtime 15m \
    --commandline "bash /deepspeed_scratch/run_cifar10_deepspeed.sh"
    

    Alternatively, you can run the DeepSpeed example Python script using the bcprun tool. bcprun wraps the orchestration of MPI and distributed PyTorch jobs, reducing the number of arguments required for launch. For your DeepSpeed job, you would replace the previous --commandline argument with a variation of the following:

    bcprun \
    --nnodes $NGC_ARRAY_SIZE \
    --npernode $NGC_GPUS_PER_NODE \
    --env CODEDIR="/deepspeed_scratch/DeepSpeedExamples/training/cifar" \
    --cmd "python \${CODEDIR}/cifar10_deepspeed.py"
    

15. Using NVIDIA Base Command Platform with Weights & Biases

15.1. Introduction

NVIDIA Base Command™ Platform is a premium infrastructure solution for businesses and their data scientists who need a world-class artificial intelligence (AI) development experience without the struggle of building it themselves. Base Command Platform provides a cloud-hosted AI environment with a fully managed infrastructure.

In collaboration with Weights & Biases (W&B), Base Command Platform users now have access to the W&B machine learning (ML) platform to quickly track experiments, version and iterate on datasets, evaluate model performance, reproduce models, visualize results, spot regressions, and share findings with colleagues.

This guide explains how to get started with both Base Command Platform and W&B, and walks through a quick tutorial of an exemplary deep learning (DL) workflow on both platforms.

15.2. Setup

15.2.1. Base Command Platform Setup

  1. Set up a Base Command Platform account.

    Ask your team admin to add you to the team or org you want to join. After being added, you will receive an email invitation to join NVIDIA Base Command. Follow the instructions in the email invite to set up your account. Refer to the section Onboarding and Signup for more information on setting the context and configuring your environment.

  2. While logging in to the web UI, install and set up the CLI.

    Follow instructions at https://ngc.nvidia.com/setup/installers/cli. The CLI is supported for Linux, Windows, and MacOS.

  3. Generate an API key.

    Once logged into Base Command Platform, go to the API key page and select “Generate API Key”. Store this key in a secure place. The API key will also be used to configure the CLI to authenticate your access to NVIDIA Base Command Platform.

  4. Set the NGC context.

    Use the CLI to log in, enter your API key, and set your preferences. The key will be stored for future commands.

    ngc config set

    You will be prompted to enter your API key and then your context, which is your org/team (if teams are used), and the ace. Your context in NGC defines the default scope you operate in for collaboration with your team members and org.

15.2.2. Weights and Biases Setup

  1. Access Weights & Biases.

    Your Base Command Platform subscription automatically provides you with access to the W&B Advanced version. Create and set up credentials for your W&B account as your Base Command Platform account is not directly integrated with W&B – that is, W&B cannot be accessed with your Base Command Platform credentials.

  2. Create a private workspace on Base Command Platform.

    Using a private workspace is a convenient option to store your config files or keys so that you can access them in read-only mode from all your Base Command workloads. TIP: Name the workspace “homews-<accountname>” for consistency. Set your ACE and org name – here, “nv-eagledemo-ace” and “nv-eagledemo”.

    ngc workspace create --name homews-<accountname> --ace nv-eagledemo-ace --org nv-eagledemo
    
  3. Access your W&B API key.

    Once the account has been created, you can access your W&B API key via your name icon on the top of the page → “Settings” → “API keys”. Refer to the “Execution” section for additional details on storing and using the W&B API key in your runs.

15.2.3. Storing W&B Keys in Base Command Platform

Your workload running on Base Command Platform must specify the credentials and configuration for your W&B account in order to track jobs and experiments. Saving the W&B key in a Base Command Platform workspace needs to be performed only once. The home workspace can then be mounted to any Base Command Platform workload to access the previously recorded W&B key. This section shows how to generate and save the W&B API key to your workspace.

Users have two options for storing the W&B API key in the private home workspace.

15.2.3.1. Option 1 | Using a Jupyter Notebook
  1. Run an interactive JupyterLab job on Base Command Platform with the workspace mounted into the job.

    In our example, we use homews-demouser as workspace. Make sure to replace the workspace name and context accordingly for your own use.

    CLI:

    ngc base-command run --name 'wandb_config' --ace nv-eagledemo-ace --instance dgxa100.40g.1.norm \
    --result /results --image "nvidia/tensorflow:21.06-tf2-py3" --org nv-eagledemo \
    --team nvtest-demo --workspace homews-demouser:/homews-demouser:RW --port 8888 \
    --commandline "pip install wandb; jupyter lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/"
    

    Note that the home workspace (here, homews-demouser) is mounted in read / write mode.

  2. When the job is running, start a session by clicking on the JupyterLab URL (as displayed on the “Overview” tab within a job).

  3. Create a new Jupyter notebook (e.g., “config”) and copy the following script into the notebook.

    import wandb
    import os
    import requests
    # 1. Login to W&B interactively to specify the API key
    wandb.login()
    # 2. Create a directory for configuration files
    !mkdir -p /homews-demouser/bcpwandb/wandbconf
    # 3. Copy the file into the configuration folder
    !cp ~/.netrc /homews-demouser/bcpwandb/wandbconf/config.netrc
    # 4. Set the login key to the stored W&B API key
    os.environ["NETRC"]= "/homews-demouser/bcpwandb/wandbconf/config.netrc"
    # 5. Check current W&B login status and username. Validate the correct API key
    # The command will output {"email": "xxx@wandb.com", "username": "xxxx"}
    res = requests.post("https://api.wandb.ai/graphql", json={"query": "query Viewer { viewer { username email } }"}, auth=("api", wandb.api.api_key))
    res.json()["data"]["viewer"]
    

    The W&B API key is now stored in the home workspace (homews-demouser).

15.2.3.2. Option 2 | Using a Script (via curl Command)
  1. Run an interactive JupyterLab job on Base Command Platform with the workspace mounted into the job.

    In our example, we use homews-demouser as workspace. Make sure to replace the workspace name and context accordingly for your own use.

    CLI:

    ngc base-command run --name 'wandb_config' --ace nv-eagledemo-ace --instance dgxa100.40g.1.norm \
    --result /results --image "nvidia/tensorflow:21.06-tf2-py3" --org nv-eagledemo \
    --team nvtest-demo --workspace homews-demouser:/homews-demouser:RW --port 8888 \
    --commandline "pip install wandb; jupyter lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/"
    

    Note that the home workspace (here, homews-demouser) is mounted in read / write mode.

  2. When the job is running, start a session by clicking on the JupyterLab URL (as displayed on the “Overview” tab within a job).

  3. Start a terminal in JupyterLab and execute the following commands to create user credentials.

    Make sure to replace the workspace name and context accordingly for your own use.

    Terminal:

    $ pip install wandb
    $ curl -sL https://wandb.me/bcp_login | python - config <API key>
    $ mkdir -p /homews-demouser/bcpwandb/wandbconf
    $ cp config.netrc /homews-demouser/bcpwandb/wandbconf/config.netrc
    $ export NETRC=/homews-demouser/bcpwandb/wandbconf/config.netrc
    

    Terminal output: ‘API key written to config.netrc, use by specifying the path to this file in the NETRC environment variable’.

    This command will create a configuration directory in your home workspace and store the W&B API key in this workspace (homews-demouser) via a configuration file.

15.3. Using W&B with a JupyterLab Workload

After having followed the previous steps, the W&B API key is securely stored in a configuration file within your private workspace (here, homews-demouser). Now, this private workspace must be attached to a Base Command Platform workload to use the W&B account and features.

In the section below, you will create a JupyterLab notebook as an example that uses the stored API key. MNIST handwritten digit classification using a convolutional neural network with TensorFlow and Keras is an easily accessible, open-source model and dataset that we will use for this workflow (available via Keras here).

15.3.1. Create a Jupyter Notebook, Including W&B Keys for Experiment Tracking

Follow the first two steps in either option under Storing W&B Keys in Base Command Platform to create a job on Base Command Platform. After having accessed JupyterLab via the URL, start a new Jupyter notebook with the code below and save it as a file in your private workspace (/homews-demouser/bcpwandb/MNIST_example.ipynb).

The following exemplary script imports required packages, sets the environment, and initializes a new W&B run. Subsequently, it builds, trains, and evaluates the Convnet model with TensorFlow and Keras, as well as tracks several metrics with W&B.

# Imports
!pip install tensorflow
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import wandb
import os

# 1. Import the W&B API key from private config workspace by defining the NETRC file
os.environ["NETRC"] = "/homews-demouser/bcpwandb/wandbconf/config.netrc"

# 2. Initialize the W&B run
wandb.init(project = "nvtest-repro", id = "MNIST_run_epoch-128_bs-15", name = "NGC-JOB-ID_" + os.environ["NGC_JOB_ID"])

# 3. Prepare the data
# 3.1 Model / data parameters
num_classes = 10
input_shape = (28, 28, 1)

# 3.2 Split data between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255

# 3.3 Make sure images have the shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")

# 3.4 Convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# 4. Build the model
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)
model.summary()

# 5. Train the model
batch_size = 128
epochs = 15

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)

# 6. Evaluate the trained model
score = model.evaluate(x_test, y_test, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

# 7. Track metrics with wandb
wandb.log({'loss': score[0], 'accuracy': score[1]})

# 8. Track training configuration with wandb
wandb.config.batch_size = batch_size
wandb.config.epochs = epochs

After this step, your home workspace (homews-demouser) will include the configuration file and the exemplary Jupyter notebook created above.

  • Home workspace: /homews-demouser

  • Configuration file: /homews-demouser/bcpwandb/wandbconf/config.netrc

  • Jupyter notebook: /homews-demouser/bcpwandb/MNIST_example.ipynb

15.3.2. Running a W&B Experiment in Batch Mode

After having successfully completed all of the previous steps, including creating the notebook above, proceed to run a W&B experiment in batch mode. Make sure to replace the workspace name and context accordingly for your own use.

Run Command:

ngc base-command run --name "MNIST_example_batch" --ace nv-eagledemo-ace --instance dgxa100.40g.1.norm \
--commandline "pip install wandb; jupyter lab --allow-root --ip=0.0.0.0 --no-browser --NotebookApp.token='' --NotebookApp.allow_origin='*' --notebook-dir=/ & date; \
cp /homews-demouser/bcpwandb/MNIST_example.ipynb /results && \
touch /results/nb-executing && \
jupyter nbconvert --execute --to=notebook --inplace -y --no-prompt --allow-errors --ExecutePreprocessor.timeout=-1 /results/MNIST_example.ipynb; \
sleep 2h" \
--result /results --image "nvidia/tensorflow:21.06-tf2-py3" --org nv-eagledemo --team nvtest-demo \
--workspace homews-demouser:/homews-demouser:RO --port 8888
  • pip install wandb ensures that the wandb package is installed in the container before the notebook is executed.

  • The command jupyter nbconvert --execute ... in the --commandline arg will automatically execute the Jupyter notebook after the job launches.

After completion of the job, the results can be accessed on the W&B dashboard, which provides an overview of all projects of a given user (here, nv-testuser). Within a W&B project, users can compare the tracked metrics (here, accuracy and loss) between different runs.

_images/wab-1.png _images/wab-2.png

15.4. Best Practices for Running Multiple Jobs Within the Same Project

W&B only recognizes a new run upon a change in the run ID within the wandb.init() command. When only the run name is changed, W&B will simply overwrite the already existing run that has the same run ID. Alternatively, to log and track a new run separately, users can keep the same run ID but must define the new run within a new project. A sketch showing one way to generate unique run IDs follows the list below.

Runs can be customized within the wandb.init() command as follows:

wandb.init(project = "nvtest-demo", id = "MNIST_run_epoch-128_bs-15", name = "NGC-JOB-ID_" + os.environ["NGC_JOB_ID"])
  • Project: The W&B project name should correspond to your Base Command Platform team name. In this example, the Base Command Platform team name “nvtest-demo” is reflected as project name on W&B.

    Team name on Base Command Platform:

    _images/wab-3.png

    Project name on W&B:

    _images/wab-4.png
  • ID: The ID is unique to each run. It must be unique in a project and if a run is deleted, the ID can’t be reused. Refer to the W&B documentation for additional details. In this example, the ID is named after the Jupyter notebook and model configuration.

  • Name: The purpose of the run name is to identify each run in the W&B UI. In this example, we name each run after the related NGC job ID, so each individual run has a different name and runs are easy to differentiate.
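
The following is a minimal sketch (assuming the NGC_JOB_ID environment variable is available inside the job, as in the examples above) of one way to give every Base Command job its own run ID so that earlier runs are never overwritten:

import os
import wandb

job_id = os.environ["NGC_JOB_ID"]

wandb.init(
    project="nvtest-demo",            # mirrors the Base Command Platform team name
    id=f"MNIST_run_{job_id}",         # unique per job, so W&B records a new run
    name=f"NGC-JOB-ID_{job_id}",      # human-readable name shown in the W&B UI
)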

15.5. Supplemental Reading

Refer to other chapters in this document as well as the Weights & Biases documentation for additional information and details.

16. Deregistering

This chapter describes the features and procedures for de-registering users from the system.

Only org administrators can de-register users and remove artifacts (datasets, workspaces, results, container images, models, etc.). All artifacts owned by the user must be removed or archived before removing the user from the system.

Perform the following actions:

16.1. Remove all workspaces, datasets, and results

  • To archive, download each item:

    • ngc workspace download <workspace-id> --dest <path>
      
    • ngc dataset download <dataset-id> --dest <path>
      
    • ngc result download <result-id> --dest <path>
      
  • To remove the items:

    • ngc workspace remove <workspace-id>
      
    • ngc dataset remove <dataset-id>
      
    • ngc result remove <result-id>
      

16.2. Remove all container images, charts, and resources

  • To archive, download each item:

    • ngc registry image pull <repository-name>:<tag>
      
    • ngc registry chart pull <chart-name>:<version>
      
    • ngc registry resource download-version <resource-name>:<version>
      
  • To remove items:

    • ngc registry image remove <repository-name>:<tag>
      
    • ngc registry chart remove <chart-name>:<version>
      
    • ngc registry resource remove <resource-name>:<version>
      

16.3. Delete Users

  • List users in the current team:

    ngc team list-users
    
  • Remove each user from the team:

    ngc team remove-user <user-email>
    

16.4. Delete Teams

Once all users in a team have been removed, delete the team:

ngc org remove-team <team-name>

17. Best practices

This chapter contains best practices for working with Base Command Platform.

17.1. Data Management Best Practices

17.1.1. Understanding Data Movement Costs

The following is a guide to the different locations where data may reside:

Data Locations

Name

Definition

DGX Cloud

DGX Cloud is a service operated at one of our Cloud Service Provider (CSP) partner locations. Data is stored in the customer’s ACE on a high-speed parallel file system in the form of /datasets, /workspaces, and /results accessed via BCP and mounted during a job.

DGX Cloud Staging

DGX Cloud Staging is an NVIDIA-provisioned object storage blob colocated with the customer’s ACE provisioned for their DGX Subscription. It is provided to allow customers to begin uploading their data over the internet to the DGX Cloud data center before their subscription starts. Once the subscription has started, customers can import that data into the BCP /datasets and /workspaces.

It is not intended for long-term use and is only available for a short period at the start of the DGX Cloud subscription.

BCP On-Premises

A DGX SuperPOD at a customer’s premises or colocation facility with Base Command Platform deployed on it for management through the BCP interface. Storage on a SuperPOD is on one of the SuperPOD storage partner products.

3rd Party Different Cloud

Data that resides in a customer’s account on a CSP that differs from the CSP used for the DGX Cloud subscription location.

3rd Party Same Cloud

Data that resides in a customer’s account on a CSP that is the same CSP used for the DGX Cloud subscription location.

3rd Party On-Premises

Data that resides at a customer account colocated with their SuperPOD but not the primary storage of the SuperPOD.

3rd Party Off-Premises

Data not colocated with a BCP On-premises installation and unrelated to DGX Cloud or DGX Cloud Staging.

Please note the following data transfer cost considerations:

Data Transfer Cost (From -> To: Cost)

  • DGX Cloud -> DGX Cloud Staging: Free

  • DGX Cloud -> BCP On-premises: Included up to your DGX Cloud subscription egress limit

  • DGX Cloud -> 3rd Party Different Cloud: Included up to your DGX Cloud subscription egress limit

  • DGX Cloud -> 3rd Party Same Cloud: Same Cloud Inter-VPC fees / Same Cloud Multi-Region fees

  • DGX Cloud -> Customer provided location: Included up to your DGX Cloud subscription egress limit

  • DGX Cloud Staging (Onboarding) -> DGX Cloud: Included

  • DGX Cloud Staging (Onboarding) -> BCP On-prem: Not applicable

  • DGX Cloud Staging (Offboarding) -> Customer provided location: Arranged upon request

  • BCP On-Premises -> DGX Cloud: Customer’s internet service egress; no DGX Cloud ingress fees

  • BCP On-Premises -> DGX Cloud Staging: Customer’s internet service egress fee

  • BCP On-Premises -> 3rd Party On-premises: Customer internal

  • BCP On-Premises -> 3rd Party Off-premises: Customer’s internet service egress fee

  • 3rd Party On-premises -> DGX Cloud: Customer’s internet service egress fee; no DGX Cloud ingress fees

  • 3rd Party On-premises -> DGX Cloud Staging: Customer’s internet service egress fee; no DGX Cloud ingress fees

  • 3rd Party On-premises -> BCP On-premises: Customer internal

  • 3rd Party Different Cloud -> DGX Cloud: 3rd Party Different Cloud egress fees

  • 3rd Party Different Cloud -> DGX Cloud Staging: 3rd Party Different Cloud egress fees

  • 3rd Party Same Cloud -> DGX Cloud: Same Cloud Inter-VPC fees / Same Cloud Multi-Region fees

  • 3rd Party Same Cloud -> DGX Cloud Staging: Same Cloud Inter-VPC fees / Same Cloud Multi-Region fees

  • 3rd Party Off-premises -> BCP On-premises: 3rd Party Off-premises egress fees

When transferring data into (ingress) DGX Cloud, there is no fee from DGX Cloud. The customer may incur internet service egress fees from the provider hosting their data, whether that is a cloud with explicit egress charges or an on-premises location with the internet service provider’s egress-to-internet charges.

In some circumstances, a pre-authenticated URL in the same region as a customer’s DGX Cloud instance can be provided to facilitate Staging for bulk transfers to DGX Cloud. This is used for the initial upload of datasets prior to a DGX Cloud subscription. Further, this DGX Cloud Staging area can be used to do a DGX Cloud data migration:

(Region A) -> DGX Cloud Staging (Region B) -> DGX Cloud (Region B)

There is no cost in transferring data between the DGX Cloud Staging object store and DGX Cloud BCP storage. However, the DGX Cloud Staging object store is intended for short-term use and is provided for customers’ convenience during limited periods of onboarding and off-boarding.

A customer may provide their own object store and use it directly for an AI training job instead of using BCP datasets. The customer may also use their own object store to back up the training results if desired.
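The following is a minimal sketch of that pattern, not a prescribed workflow: the bucket name, container image, instance type, and training script are placeholders, the AWS CLI is assumed to be available inside the container, and credentials must be supplied by the customer (for example, through a mounted workspace). Confirm flag names against ngc batch run --help.

# Hypothetical job that copies inputs from a customer-owned S3 bucket at startup
# and backs the results up to the same bucket when training finishes.
$ ngc batch run \
    --name "train-from-customer-s3" \
    --image "nvidia/pytorch:24.01-py3" \
    --instance dgxa100.80g.1.norm \
    --result /results \
    --commandline "aws s3 cp --recursive s3://my-bucket/train-data /data && \
                   python train.py --data /data --out /results && \
                   aws s3 cp --recursive /results s3://my-bucket/backups/train-from-customer-s3"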

17.1.2. Deciding Whether to Import Data into BCP

Jobs can use datasets that are internal or external to BCP. For example, a job could run a container with direct access to the user’s S3 bucket in AWS.

Leaving data outside BCP

During experiment-based work, keeping your existing data in its current location may be cost-effective. The job runs inside BCP but accesses object-stored data elsewhere.

Bringing data into BCP

Customers may choose to bring their datasets into BCP for improved job performance. BCP-supporting environments (e.g., DGX Cloud, DGX SuperPOD) have performance-optimized filesystems to hold these datasets. This performance optimization supports cluster-wide parallel reads for large-scale training jobs.

During production work, or anytime more formal tracking is required, bringing the datasets into BCP provides several benefits: job tracking, job reproducibility, and in-platform, role-based dataset sharing.
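A minimal sketch of bringing data into BCP and mounting it in a job is shown below; the dataset name, paths, image, and instance type are placeholders, and <dataset-id> is the ID reported by ngc dataset list. Confirm flags against the NGC CLI help.

# Upload a local directory as a BCP dataset (name and description are examples).
$ ngc dataset upload --source ./my-training-data --desc "Curated training set" my-training-data

# Mount the dataset read-only into a job by its dataset ID.
$ ngc batch run \
    --name "train-with-bcp-dataset" \
    --image "nvidia/pytorch:24.01-py3" \
    --instance dgxa100.80g.1.norm \
    --datasetid <dataset-id>:/data \
    --result /results \
    --commandline "python train.py --data /data --out /results"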

Data Location Suggestions by Scenario

Scenario: Only some of the dataset consumers are using BCP
Suggested data location: Original, external storage location
Notes: Leaving data in the original location keeps it in a centralized location, preventing the need for synchronization mechanisms.

Scenario: Frequent data updates
Suggested data location: Original, external storage location
Notes: Leaving data in the original location ensures that everyone on the team is working with the most current and consistent data, preventing potential versioning issues.

Scenario: Large volume of data
Suggested data location: Original, external storage location
Notes: For exceptionally large datasets (e.g., petabytes), transferring data out of its current storage might be impractical or infeasible. However, since there is no cost to read a BCP dataset during a BCP job (i.e., no “GET” fee), transferring the dataset into BCP and accessing it repeatedly for free may be more cost-effective.

Scenario: Low-latency requirements
Suggested data location: Inside BCP
Notes: Having data colocated with compute hardware during a job offers the lowest possible latency for data access.

Scenario: Job reproducibility, validation, or auditability
Suggested data location: Inside BCP
Notes: Having a job’s data located in BCP means the full dataset information is logged, and the job is 100% reproducible and reportable.

Scenario: Shared “official” datasets
Suggested data location: Inside BCP
Notes: If the organization has official, unchanging datasets (perhaps even production-level datasets) they’d like to share across many users, moving them into BCP is efficient.

17.1.3. Cost-efficient Data Management Outside BCP

Transfers within the same region of the same cloud provider are generally lower cost than multi-region or multi-cloud transfers. If you need to leave your data outside of BCP, keeping it in the same region as your ACE may lower your cost.
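As an example (assuming the external data lives in AWS S3; bucket names and the region are placeholders), you could check an existing bucket’s region and, if needed, make a one-time copy into a bucket created in the same region as your ACE so that subsequent job reads stay in-region:

# Check which region the existing bucket lives in.
$ aws s3api get-bucket-location --bucket my-existing-bucket

# Create a bucket in the ACE's region (placeholder) and copy the data over once.
$ aws s3 mb s3://my-ace-region-bucket --region us-west-2
$ aws s3 sync s3://my-existing-bucket s3://my-ace-region-bucket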

17.1.4. Cost-efficient Data Management Inside BCP

If your workflow permits batching data transfers, you can reduce the number of egress requests and the associated per-request fees. You can monitor available storage in the BCP Dashboard.

17.1.5. Cost-efficient Data Retrieval from BCP

If you’d like to move the contents of /results outside BCP after a job is complete, the results data can be bundled and compressed (for example, with tar and gzip) to reduce the total amount of data to be transferred. Bundling reduces both the size and the number of transactions compared with transferring many small individual files, which lowers access costs (PUTs) against remote object stores as well as egress time and egress cost.
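For example, within a job’s command (or from a follow-up job), the results directory could be bundled before transfer. The subdirectory names and destination bucket are placeholders, and the AWS CLI is assumed to be available in the container:

# Bundle and compress the results into a single archive (subdirectories are examples).
$ tar czf /results/run-output.tar.gz -C /results ./checkpoints ./logs

# Transfer one large object instead of many small files (destination is a placeholder).
$ aws s3 cp /results/run-output.tar.gz s3://my-bucket/archives/run-output.tar.gz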

17.1.6. Exercising Caution When Editing Existing Datasets

Dataset names must be unique across an ACE. So, if you try to add a dataset with a name that already exists in your ACE, you can either “append” this second dataset to the first dataset or cancel the import/upload.

Appending datasets permanently alters the existing dataset resource. Repeating and validating experiments, however, often requires referencing the exact dataset that was originally used, so appending to an existing dataset can invalidate the reproducibility of downstream tasks from previous jobs; a versioned-upload alternative is sketched after the list below.

If you do choose to append to the existing dataset in BCP,

  • any files with names not already in the BCP dataset will be added.

  • any files with names already in the BCP dataset will overwrite the original.
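If reproducibility matters, one alternative (a sketch only; names and paths are placeholders) is to upload the changed data as a new, explicitly versioned dataset so that existing jobs continue to reference the original:

# Instead of appending to "my-training-data", upload the revised data under a new, versioned name.
$ ngc dataset upload --source ./my-training-data-2024-06 --desc "Training set, June 2024 revision" my-training-data-v2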

17.1.7. Leveraging ‘no-team’ for Resource Sharing

To share datasets and workspaces with your entire Organization, use the team argument “no-team” instead of a specific team. Omitting a specific team identifier in this way shares the resource at the Organization level.

Datasets shared with “no-team” will be available for all users in that Organization to view, mount during a job, and export.

Workspaces shared with “no-team” will be available for all users in that Organization to view, mount during a job, augment, and export.
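As an illustrative sketch (the dataset name and description are placeholders, and the exact flag usage should be confirmed with the NGC CLI help), “no-team” can be set as the default team context or passed on an individual command:

# Set the CLI defaults interactively; choose "no-team" when prompted for the team.
$ ngc config set

# Or (assumed flag usage) share a single dataset at the Organization level directly.
$ ngc dataset upload --source ./shared-data --desc "Org-wide reference data" --team no-team org-reference-data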

Please work with your BCP Organization Administrator if you have questions about your organization’s best practices around sharing datasets and workspaces to the Organization.

17.1.8. Monitoring Per-User Data Quota

Monitoring user storage usage helps ensure users don’t suddenly hit their storage quota and become constrained in moving datasets. Users can select the Request Storage button (if enabled by your Organization) to request an increase in storage.

Users can check storage quota using the BCP Dashboard in the Web UI:

Monitoring storage quota

Users can also check their storage quota using the CLI:

$ ngc user storage

References

NGC CLI Documentation

With NVIDIA GPU Cloud (NGC) CLI, you can perform many of the same operations that are available from the NGC web application, such as running jobs, viewing Docker repositories, and downloading AI models within your organization and team space.

NGC API Documentation

The NGC Web API is an interface for querying information from and enacting change in an NGC environment.

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks

NVIDIA, the NVIDIA logo, and Base Command are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.