Getting Started with NVIDIA Base Command Platform

1. Introduction

NVIDIA Base Command™ Platform is a comprehensive platform that helps businesses, data scientists, and IT teams accelerate return on investment for AI initiatives. It manages the end-to-end lifecycle of AI development, including workload management and resource sharing, through both a graphical user interface and command-line APIs, and provides integrated monitoring and reporting dashboards. Offered as a cloud-hosted solution that continuously delivers NVIDIA innovations directly into your AI workflow, NVIDIA Base Command Platform works across on-prem and cloud resources with a single-pane-of-glass view into your AI development process.

The following is a description of the primary concepts of NVIDIA Base Command Platform.

1.1. Container Images

All applications running in NGC are containerized as Docker containers and execute in the Base Command Platform runtime environment. Containers are stored in the NGC Registry at nvcr.io, accessible from both the command-line interface (CLI) and the web UI.
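
For example, once authenticated to nvcr.io (generating the required API key is covered later in this document), you can pull an image with the standard docker client. The image and tag below are the NGC PyTorch container used in the examples later in this guide:

  docker pull nvcr.io/nvidia/pytorch:23.10-py3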

1.2. Accelerated Computing Environment (ACE)

An ACE is a cluster composed of high performance GPU-based systems, networking, and network attached storage. The hardware resources of each ACE are isolated from each other - Datasets, Workspaces, and Results from one ACE cannot be used by the compute resources in a different ACE, for example.

1.3. Jobs

A Job is the fundamental method of task execution: a container running on an NVIDIA Base Command Platform instance in an accelerated computing environment (ACE). A job is defined by a set of attributes specified at submission time. Chapters 8 and 10 of the NVIDIA Base Command Platform User Guide provide details about the architecture of Base Command Platform.

For a multi-node job, each node is referred to as a replica.

1.4. Datasets

Datasets are the primary data inputs to a job, always mounted as read-only to the location specified in the job. Datasets can contain data or code. Datasets are covered in detail in the Datasets section.

The location and contents of a Dataset will be the same across all replicas in a multi-node job.

1.5. Workspaces

Workspaces are shareable, read-write persistent storage that can be mounted into jobs for concurrent use or as scratch space. A workspace can also be mounted into a job in read-only mode, which is ideal for configuration files, code, and input data because the job cannot corrupt or modify the data. Mounting a workspace in read-write mode (the default) works well for a checkpoint folder, or for hosting data that must be further augmented before being used.

The location and contents of a Workspace will be the same across all replicas in a multi-node job.

1.6. Results

Results are read-write persistent storage mounted into every job, used to preserve log output from a job as well as job artifacts such as new models and data. The contents of a job’s Result can be converted to a Dataset.
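
As a sketch, this conversion can also be done from the NGC CLI. The command below assumes the ngc dataset convert subcommand available in recent Base Command Platform CLI releases; the job ID and dataset name are placeholders, and you should confirm the exact flags with ngc dataset convert -h:

  ngc dataset convert --from-result <job-id> <new-dataset-name>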

The location of a Result will be the same across all replicas in a multi-node job, but the contents are unique to each replica.

2. Inviting Users

This section is for org or team administrators (with User Admin role) and describes the process for inviting (adding) users to NVIDIA Base Command Platform.

As the organization administrator, you must create user accounts to allow others to use the NVIDIA Base Command Platform within the organization.

  1. Log on to the NGC web UI and select the NGC Org associated with NVIDIA Base Command Platform.

  2. Click Organization > Users from the left navigation menu.

    _images/image38.png

    This capability is available only to User Admins.

  3. Click Invite New User on the top right corner of the page.

    _images/new-ngc-invite-user.png
  4. On the new page, fill out the User Information section. Enter the user's screen name in the First Name field, and the email address that will receive the invitation email.

    _images/add-user.png
  5. In the Roles section, select the appropriate context (either the organization or a specific team) and the available roles shown in the boxes below. Click Add Role to the right to save your changes. You can add or remove multiple roles before creating the user.

    _images/user-roles.png

    The following are brief descriptions of the NVIDIA Base Command Platform user roles:

    Base Command Admin
      Admin persona with the capabilities to manage all artifacts available in Base Command Platform, including resource allocation and access management.

    Base Command Viewer
      Admin persona with read-only access to jobs, workspaces, datasets, and results within the user's org or team.

    Registry Admin
      Admin persona for managing NGC Private Registry artifacts, including Registry user management. The Registry Admin role includes the capabilities of all other Registry roles.

    Registry Read
      Registry user persona that can only consume Private Registry artifacts.

    Registry User
      Registry user persona that can publish and consume Private Registry artifacts.

    User Admin
      User Admin persona with the capability to manage users only.

    Refer to the section Assigning Roles for additional information.

  6. After adding roles, double-check all the fields and then click Create User on the top right. An invitation email will automatically be sent to the user.

    _images/create-user-btn.png
  7. Users that still need to accept their invitation emails are displayed in the Pending Invitations list on the Users page.

    _images/users-pending-invitations.png

3. Joining an NGC Org or Team

Before using NVIDIA Base Command Platform, you must have an NVIDIA Base Command Platform account created by your organization administrator. You need an email address to set up an account. Activating an account depends on whether your email domain is mapped to your organization’s single sign-on (SSO). Choose one of the following processes depending on your situation for activating your NVIDIA Base Command Platform account.

3.1. Joining an NGC Org or Team Using Single Sign-on

This section describes activating an account where the domain of your email address is mapped to an organization’s single sign-on.

After NVIDIA or your organization administrator adds you to a new org or team within the organization, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the link in the email to open your organization’s single sign-on page.

  2. Sign in using your single sign-on credentials.

    The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  3. Select the organization and team under which you want to log in and then click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

3.2. Joining an Org or Team with a New NVIDIA Account

This section describes activating a new account where the domain of your email address is not mapped to an organization’s single sign-on.

After NVIDIA or your organization administrator sets up your NVIDIA Base Command account, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the Sign In link to open the sign in dialog in your browser.

    _images/create-an-account.png
  2. Fill out your information, create a password, agree to the Terms and Conditions, and click Create Account.

    You will need to verify your email.

    _images/image6.png

    The verification email is sent.

    _images/image3.png
  3. Open the email and then click Verify Email Address.

    _images/image11.png
    _images/image24.png
  4. Select your options for using recommended settings and receiving developer news and announcements, and then click Submit.

  5. Agree to the NVIDIA Account Terms of Use, select desired options, and then click Continue.

    _images/account-tou.png
  6. Click Accept at the NVIDIA GPU Cloud Terms of Use screen.

    _images/image32.png
  7. The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  8. Select the organization and team under which you want to log in and click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

3.3. Joining an Org or Team with an Existing NVIDIA Account

This section describes activating an account where the domain of your email address is not mapped to an organization’s single sign-on (SSO).

After NVIDIA or your organization administrator adds you to a new org or team within the organization, you will receive a welcome email that invites you to continue the activation and login process.

_images/image17.png
  1. Click the Sign In link to open the sign in dialog in your browser.

    _images/image42.png
  2. Enter your password and then click Log In.

    The Set Your Organization screen appears.

    _images/image33.png

    This screen appears any time you log in.

  3. Select the organization and team under which you want to log in and click Continue.

    You can always change to a different organization or team you are a member of after logging in.

    The NGC web UI opens to the Base Command dashboard.

    _images/bcp-dashboard.png

4. Signing in to Your Account

During the initial account setup, you are signed into your NVIDIA Base Command Platform account on the NGC web site. This section describes the sign in process that occurs at a later time. It also describes the web UI sections of NVIDIA Base Command Platform at a high level, including the UI areas for accessing available artifacts and actions available to various user roles.

  1. Open https://ngc.nvidia.com and click Continue by one of the sign-on choices, depending on your account.

    • NVIDIA Account: Select this option if single sign-on (SSO) is not available.

    • Single Sign-on (SSO): Select this option to use your organization’s SSO. You may need to verify with your organization or Base Command Platform administrator whether SSO is enabled.

    _images/login-selection.png
  2. Continue to sign in using your organization’s single sign-on.

  3. Set the organization you wish to sign in under, then click Continue.

You can always change to a different org or team that you are a member of after logging in.

The following image and table describe the main features in the left navigation menu of the web site, including the controls for changing the org or team.

_images/image31.png
NGC Web UI Sections

  1. CATALOG: Click this menu to access a curated set of GPU-optimized software. It consists of containers, pre-trained models, Helm charts for Kubernetes deployments, and industry-specific AI toolkits with software development kits (SDKs) that are periodically released by NVIDIA and are read-only for a Base Command Platform user.

  2. PRIVATE REGISTRY: Click this menu to access the secure space to store and share custom containers, models, resources, and Helm charts within your enterprise.

  3. BASE COMMAND: Click this menu to access controls for creating and running Base Command Platform jobs.

  4. ORGANIZATION: (User Admins only) Click this menu to manage users and teams.

  5. User Info: Select this drop-down list to view user information, select the org to operate under, and download the NGC CLI and API key, described later in this document.

  6. Team Selection: Select this drop-down list to select which team to operate under.

5. Introduction to the NGC CLI

This chapter introduces the NGC Base Command Platform CLI, which you can install on your workstation to interface with Base Command Platform. In this section you will learn about generic features of the CLI that apply to all commands, as well as the CLI modules that map to the web UI areas covered in a previous chapter.

The NGC Base Command Platform CLI is a command-line interface for managing content within the NGC Registry and for interfacing with the NVIDIA Base Command Platform. The CLI operates within a shell and lets you use scripts to automate commands.

With NGC Base Command Platform CLI, you can connect with:

  • NGC Catalog

  • NGC Private Registry

  • User Management (available to org or team User Admins only)

  • NVIDIA Base Command Platform workloads and entities

5.1. Installing NGC CLI

To install NGC CLI, perform the following:

  1. Log in to your NVIDIA Base Command Platform account on the NGC website (https://ngc.nvidia.com).

  2. In the top right corner, click your user account icon and select an org that belongs to the Base Command Platform account.

  3. From the user account menu, select Setup, then click Downloads under CLI from the Setup page.

  4. From the CLI Install page, click the Windows, Linux, or macOS tab, according to the platform from which you will be running NGC CLI.

  5. Follow the Install instructions that appear on the OS section that you selected.

  6. Verify the installation by entering ngc --version. The output should be NGC CLI x.y.z where x.y.z indicates the version.

5.2. Generating Your NGC API Key

This section describes how to obtain an API key needed to configure the CLI application so you can use the CLI to access locked container images from the NGC Catalog, access content from the NGC Private Registry, manage storage entities, and launch jobs.

The NGC API key is also used for docker login to manage container images in the NGC Private Registry with the docker client.

  1. Sign in to the NGC web UI.

    1. From a browser, go to NGC sign in page and then enter your email.

    2. Click Continue next to the Sign in with Enterprise sign-in option.

    3. Enter the credentials for your organization.

  2. In the top right corner, click your user account icon and then select an org that belongs to the NVIDIA Base Command Platform account.

  3. Click your user account icon again and select Setup.

    _images/image13.png
  4. Click Get API key to open the Setup > API Key page.

  5. Click Get API Key to generate your API key. A warning message appears to let you know that your old API key will become invalid if you create a new key.

  6. Click Confirm to generate the key.

    Your API key appears.

    You only need to generate an API key once. NGC does not save your key, so store it in a secure place. (You can copy your API key to the clipboard by clicking the copy icon to the right of the API key.)

    Should you lose your API key, you can generate a new one from the NGC website. When you generate a new API Key, the old one is invalidated.
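
With the API key in hand, you can authenticate the docker client against nvcr.io. The username is always the literal string $oauthtoken, and the password is your API key; the example below assumes the key is stored in an NGC_API_KEY environment variable:

  echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin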

5.3. Getting Help Using NGC CLI


5.3.1. Getting Help from the Command Line

To run an NGC CLI command, enter ngc followed by the appropriate options.

To see a description of available options and command descriptions, use the option -h after any command or option.

Example 1: To view a list of all the available options for the ngc command, enter

$ ngc -h

Example 2: To view a description of all ngc base-command commands and options, enter

$ ngc base-command -h

Example 3: To view a description of the dataset commands, enter

$ ngc dataset -h

5.3.2. Viewing NGC CLI Documentation Online

The NGC Base Command Platform CLI documentation provides a reference for all the NGC Base Command Platform CLI commands and arguments. You can also access the CLI documentation from the NGC web UI by selecting Setup from the user drop down list and then clicking Documentation from the CLI pane.

5.4. Configuring the CLI for your Use

To make full use of NGC Base Command Platform CLI, you must configure it with your API key using the ngc config set command.

While there are options you can use for each command to specify org and team, as well as the output type and debug mode, you can also use the ngc config set command to establish these settings up front.

If you have a pre-existing set up, you can check the current configuration using:

$ ngc config current

To configure the CLI for your use, issue the following:

$ ngc config set
Enter API key. Choices: [<VALID_APIKEY>, 'no-apikey']:
Enter CLI output format type [ascii]. Choices: [ascii, csv, json]:
Enter org [nv-eagledemo]. Choices: ['nv-eagledemo']:
Enter team [nvtest-repro]. Choices: ['nvtest-repro', 'no-team']:
Enter ace [nv-eagledemo-ace]. Choices: ['nv-eagledemo-ace', 'no-ace']:
Successfully saved NGC configuration to C:\Users\jsmith\.ngc\config

If you are a member of several orgs or teams, be sure to select the ones associated with NVIDIA Base Command Platform.
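
The saved configuration only provides defaults. Most commands also accept global --org, --team, --ace, and --format_type options to override the configuration for a single invocation; for example, using the org and team values from the transcript above:

  $ ngc dataset list --org nv-eagledemo --team nvtest-repro --format_type json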

5.5. Running the Diagnostics

Diagnostic information is available which provides details to assist in isolating issues. You can provide this information when reporting issues with the CLI to NVIDIA support.

The following diagnostic information is available for the NGC Base Command Platform CLI user:

  • Current time

  • Operating system

  • Disk usage

  • Current directory size

  • Memory usage

  • NGC CLI installation

  • NGC CLI environment variables (whether set or not set)

  • NGC CLI configuration values

  • API gateway connectivity

  • API connectivity to the container registry and model registry

  • Data storage connectivity

  • Docker runtime information

  • External IP

  • User information (ID, name, and email)

  • User org roles

  • User team roles

Syntax

$ ngc diag [all,client,install,server,user]

where

all

Produces the maximum amount of diagnostic output.

client

Produces diagnostic output only for the client machine.

install

Produces diagnostic output only for the local installation.

server

Produces diagnostic output only for the remote server.

user

Produces diagnostic output only for the user configuration.
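
For example, to gather only client-machine diagnostics before contacting support:

  $ ngc diag client

Use ngc diag all to produce the full report when filing a support ticket.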

6. End-to-End Example Using PyTorch Super Resolution

In this chapter, we’ll delve into practical problem-solving techniques using the NVIDIA Base Command Platform (BCP). To illustrate the approach, we’ll leverage the PyTorch Super Resolution example packaged inside the NGC PyTorch container.

The Base Command Platform concepts we’ll cover are:

  • Leveraging a JupyterLab Quick Start

  • Bringing data into a Workspace

  • Running basic training jobs and monitoring their performance

  • Cloning and modifying jobs

  • Using NGC Secrets

  • Uploading a model from a training job into your private NGC registry

This guide will focus on using the web interface and assumes that the user has been onboarded to the Org as both a Base Command Platform user and a Private Registry user. Let’s get started!

6.1. Launching a Quick Start Job

Early in the lifecycle of an AI project, it is common to begin with nothing more than a set of data that you need to work with. But how do we get that data inside BCP? More importantly, how do we ensure the data is saved?

  1. Let’s begin with a Quick Start job. Quick Start is a great way to experiment with Base Command Platform, as it provides default values for common job types. To launch the JupyterLab Quick Start job, click Launch in the JupyterLab Quick Start job card.

    _images/quickstart-job-image-001.png

    Using the default parameters of a Quick Start job automates several configuration parameters that we’ll delve into in more detail later in this guide:

    • Choosing an ACE.

    • Selecting a set of compute and memory resources to assign to the Quick Start job, referred to as an instance.

    • Choosing a container image to run the job within, using the assigned instance resources.

    • Creating a Workspace and determining the path at which the Workspace should be made available.

    Workspaces, capable of accepting read and write operations, are ideal for experimenting with code and data throughout a project’s lifecycle.

  2. The Overview page of the job will load automatically. Shortly thereafter, provided adequate resources are available in the targeted ACE, the job will be ready for use.

    _images/quickstart-job-image-002.png
  3. Scroll down to the bottom of the Overview tab. Inside the Data Input pane, select the Workspaces tab. Make a note of the Mount Point value for the Workspace provisioned as part of the Quick Start job.

    _images/quickstart-job-image-003.png
  4. Once you have noted down the Mount Point of your Workspace, click Launch JupyterLab in the top right corner of the page.

    _images/quickstart-job-image-004.png
  5. A new tab or window will open to display the JupyterLab web interface.

    _images/quickstart-job-image-005.png

The following section will pick up from the newly-opened JupyterLab web interface.

6.2. Downloading and Manipulating Data

Now that we have a running Quick Start job, we’ll work within it to copy our PyTorch Super Resolution example out of the NGC PyTorch container and prepare the necessary dataset for a future training job.

It is necessary to copy the PyTorch Super Resolution example outside of the container because its code is written to expect its dataset inside its directory structure. We want to download and prepare the dataset once and reuse it in multiple jobs, so making a copy of the PyTorch Super Resolution example inside our Workspace is the most straightforward way to enable that.

  1. Open a terminal from the JupyterLab Launcher and enter all subsequent commands inside the resulting terminal.

    _images/data-ingest-image-001.png
  2. We’ll copy the PyTorch Super Resolution example to our Workspace by referencing the path recorded in the previous section. Make the copy by running the following command, replacing the /bcp/workspaces/140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace argument with your Workspace’s path.

    cp -r /opt/pytorch/examples/upstream/super_resolution /bcp/workspaces/140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace
    

    Note

    When using your specific Workspace path, typing /bcp/workspaces/ followed by a Tab will automatically populate the desired Workspace path.

  3. We’ll change our working directory to the path that we copied the PyTorch Super Resolution example to (again, replacing 140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace with your specific value).

    cd /bcp/workspaces/140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace/super_resolution
    
  4. The PyTorch Super Resolution example can automatically download the data it requires to perform its specified training workload, but we’ll manually download the data and create directories to separate the download and training phases.

    Run the following code to create the expected data directory and enter it.

    mkdir dataset
    cd dataset
    
  5. Next, download the data into the dataset directory that we created, extract it there, and remove the unneeded .tgz file after extraction.

    wget http://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/BSDS300-images.tgz
    tar -xzf BSDS300-images.tgz
    rm BSDS300-images.tgz
    
  6. To validate that the data is ready for use, run the basic ls and du -sh commands as shown below (along with the expected output).

    root@5490537:/bcp/workspaces/140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace/super_resolution/dataset# ls
    BSDS300

    root@5490537:/bcp/workspaces/140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace/super_resolution/dataset# du -sh BSDS300/
    22M     BSDS300/
    

    All data written in your Workspace’s directory will be saved, even after our Quick Start job has ended.

    The following example screenshot shows all of the commands above run in a JupyterLab Quick Start job. Remember, your specific paths will differ, but the outcome of each command should be the same.

    _images/data-ingest-image-002.png
  7. There is also a /results directory that was automatically created and mounted in our Quick Start job. If you’d like, you can run the following command to investigate the log file generated during our Quick Start job.

    cat /results/joblog.log
    

    This log file will be saved after our Quick Start job has ended.

    It is important to remember that, in this job and other Base Command Platform jobs, only data written in Result and Workspace paths will be saved for future use. Data written to any other path during this job will not be saved, even if we leverage the same container image in a subsequent job.

  8. Close the JupyterLab tab or window and return to the Quick Start job’s page in the Base Command Platform. A Quick Start job is automatically killed after two hours - if you have completed this step in less than two hours, end the job by clicking Kill Job in the top right corner of the page.

    _images/data-ingest-image-003.png
  9. Navigate back to the Dashboard by using the left navigation menu.

    _images/data-ingest-image-004.png
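
For reference, the data preparation performed in this section can be collected into a single shell script, run from a terminal inside the Quick Start job. The WORKSPACE value is a placeholder; substitute your Workspace's mount point as recorded earlier:

  # Placeholder: substitute your Workspace mount point from the job's Overview tab
  WORKSPACE=/bcp/workspaces/<your-workspace-name>

  # Copy the example out of the container so it persists alongside its dataset
  cp -r /opt/pytorch/examples/upstream/super_resolution "$WORKSPACE"

  # Download and extract the BSDS300 dataset where the example expects it
  mkdir -p "$WORKSPACE/super_resolution/dataset"
  cd "$WORKSPACE/super_resolution/dataset"
  wget http://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/BSDS300-images.tgz
  tar -xzf BSDS300-images.tgz
  rm BSDS300-images.tgz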

6.3. Launching and Monitoring Training Jobs

In this section, we’ll move out of Quick Start into launching our custom jobs. Quick Start has pre-populated all the fields that we’ll manually fill in.

  1. To get started, click Create Job on the Dashboard page.

    _images/single-instance-training-image-001.png

    When filling out this form, our goal is to start a job, just like we did when using Quick Start. However, this job will not provide a JupyterLab interface, and will simply run commands that we specify in this form.

  2. Select an Accelerated Computing Environment (ACE). Each ACE is a discrete set of compute, network, and storage resources.

    Note

    If you have access to multiple ACEs, select the same ACE where your Quick Start job was run to ensure you can access your Workspace.

    _images/single-instance-training-image-002.png
  3. Once you select an ACE, you will be presented with several instance options. As mentioned in a previous step, an instance is a set of compute and memory resources that can be provisioned from an ACE for use as part of a Job.

    We’ll select the dgxa100.80g.1.norm option in this example. The resources available as part of that instance are denoted in the table contents to the right of the instance Name.

    _images/single-instance-training-image-003.png
  4. Scroll down to the Inputs pane containing two tabs - Datasets and Workspaces.

    Since we are trying to use a Workspace, we’ll click the Workspaces tab.

    _images/single-instance-training-image-004.png
  5. Select the Workspace created and used as part of the Quick Start job we launched. In our case, that Workspace is named 140373_quick-start-jupyterlab-workspace_launchpad-iad2-ace.

    _images/single-instance-training-image-005.png
  6. Fill in /example as the Mount point value once the Workspace is selected. All data written to this Workspace during our Quick Start job will be available by accessing the /example directory inside this new job. Note that this mount point value differs from what we used in the Quick Start job. Every time you launch a job, you can specify the mount point for Datasets, Workspaces, and Results according to your exact needs or preferences.

    _images/single-instance-training-image-006.png
  7. For now, we’ll leave the Data Output and Secrets sections at their default values.

  8. Scroll down to Container Selection. Select the nvidia/pytorch container from the Select a Container dropdown. You can type in this field to narrow down the container options instead of scrolling - all NGC containers are present in this list!

    _images/single-instance-training-image-007.png
  9. Select a container tag value of 23.10-py3 from the Select a Tag dropdown. The Tag OS and Tag Arch dropdowns will automatically populate.

    _images/single-instance-training-image-008.png
  10. In the Run Command field, paste the following, noting the importance of the Workspace path we chose above. The exact command depends on that path.

    cd /example/super_resolution; python main.py --upscale_factor 3 --batchSize 4 --testBatchSize 100 --nEpochs 30 --lr 0.001
    
    _images/single-instance-training-image-009.png
  11. We’ll leave the remainder of the Container Selection fields at their default values, along with the fields in Job Priority & Order.

  12. Scroll down to Launch Job, and update the Name field to super-resolution-example-job.

    _images/single-instance-training-image-010.png
  13. Leave the Preemption Options, Total Runtime, and Time Slice values set to their default values.

  14. In the Custom Labels (Optional) field, type example and press Enter/Return - this will add a Label to this job facilitating job searches in an active environment.

    _images/single-instance-training-image-011.png
  15. Verify that the output at the bottom of the Create Job page matches what is shown in the above screenshot, except for the --ace, --org, and --workspace arguments.

    If everything matches, click Launch Job in the top right corner of the page.

    _images/single-instance-training-image-012.png

Congratulations, you have submitted your first custom Base Command Platform job!
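
The web form generates an equivalent NGC CLI command, shown at the bottom of the Create Job page. A sketch of that command appears below; the --ace, --org, and --workspace values are placeholders for your environment, and the command printed by the web UI is authoritative if the flags differ in your CLI version:

  ngc base-command run \
    --name "super-resolution-example-job" \
    --ace <your-ace> \
    --org <your-org> \
    --instance dgxa100.80g.1.norm \
    --image "nvidia/pytorch:23.10-py3" \
    --workspace <your-workspace>:/example:RW \
    --result /results \
    --label example \
    --commandline "cd /example/super_resolution; python main.py --upscale_factor 3 --batchSize 4 --testBatchSize 100 --nEpochs 30 --lr 0.001"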

6.4. Performance Analysis and Job Experimentation

Now that we have launched a job, let’s find out how to interact with it.

  1. After submitting the super-resolution-example-job, you should have been automatically redirected to the Jobs page. If not, go to Base Command > Jobs from the left navigation menu.

    _images/job-analysis-image-001.png
  2. The launched job should be at or near the top of the list on the Jobs page. You can also filter based on labels by typing example into the search bar on the page and pressing Enter/Return.

    _images/job-analysis-image-002.png
  3. Click on the super-resolution-example-job row to navigate to its dedicated page. Here, you’ll find several tabs:

    • Overview (the default tab)

    • Telemetry

    • Status History

    • Results

    • Log

    Once the Status field on the Overview tab shows a Running value, we can start to investigate the job. We’ll wait until the job has been running for at least five minutes, according to the Duration field, to ensure we have some data to look at.

    _images/job-analysis-image-003.png
  4. Click on the Log tab to follow any text output being generated by the job.

    _images/job-analysis-image-004.png

    You will see a stream of text output being generated by the job, indicating the job’s progress through the Epoch quantity we specified in the job. That means the job is running and generating output!

  5. Click on the Telemetry tab to quickly assess the job’s performance.

    If you hover over the top graph, you might notice something concerning - the job is running, but the GPU does not appear active!

    _images/job-analysis-image-005.png

    If we scroll down to the graph that shows CPU Usage and hover over it, we’ll see that the CPUs are quite active. What happened?

    _images/job-analysis-image-006.png

    If you were to look at the README.md file for the PyTorch Super Resolution example, you would see that we forgot to add the --cuda argument, which enables CUDA (and, therefore, GPU) use.

  6. If you let the super-resolution-example-job run to completion, it will take quite a while - in our example run on A100 instances, around one hour and 25 minutes!

    You can either let the job run to completion or cancel it to save resources. To cancel the job, click the ellipsis in the top right corner of the page, and select the Kill Job option.

    _images/job-analysis-image-007.png
  7. Once the job is either complete or canceled, click the ellipsis again and select the Clone Job option.

    _images/job-analysis-image-008.png

    This retains the job’s entire configuration without submitting it, leaving us free to make small changes as needed. This is a crucial part of the experimentation loop with Base Command Platform: iterating across several jobs until everything behaves exactly as we would like.

  8. Scroll down to the Container Selection pane, and make the following changes to the Run Command field:

    • Change the --nEpochs value to 300 (it will be 30 before your change).

    • Add the --cuda argument - remember to add a space before the argument.

    It should look like this:

    cd /example/super_resolution; python main.py --upscale_factor 3 --batchSize 4 --testBatchSize 100 --nEpochs 300 --lr 0.001 --cuda
    
  9. Scroll down to the Launch Job pane and add the cuda label to the Custom Labels list, pressing Enter/Return to save the label to the job.

    _images/job-analysis-image-009.png
  10. Verify that the output at the bottom of the Create Job page matches what is shown in the above screenshot, except for the --ace, --org, and --workspace arguments.

    If everything matches, click Launch Job in the top right corner of the page.

    _images/job-analysis-image-010.png
  11. You will again be redirected to the Jobs page, where you can now search for two tags to narrow your list further - example and cuda.

    _images/job-analysis-image-011.png
  12. Click on the new job, and watch for the Status field on the Overview tab to show a value of Running. This time we’ll look at the job a bit earlier (around the 2-minute mark). You’ll see why in a moment!

    _images/job-analysis-image-012.png
  13. Navigate to the Telemetry tab to see if your job uses a GPU resource.

    _images/job-analysis-image-013.png

    It does! Both the TIME GPUS ACTIVE and GPU UTILIZATION values will be above zero, as will the GPU Active values inside the graph.
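The fix in step 5 hinged on a single missing flag. As a rough sketch (assumed, not copied from the example’s source), this is the common PyTorch pattern in which a --cuda store-true argument gates device selection - omit it, and training silently runs on the CPU no matter how many GPUs the instance has:

```python
# Sketch (assumed, not the actual example code) of the usual PyTorch
# pattern where a --cuda flag gates GPU use.
import argparse

parser = argparse.ArgumentParser(description="training-script sketch")
parser.add_argument("--nEpochs", type=int, default=30)
parser.add_argument("--cuda", action="store_true", help="use CUDA")

# Simulate the corrected run command's arguments:
opt = parser.parse_args(["--nEpochs", "300", "--cuda"])

# Without --cuda, this evaluates to "cpu" and every tensor stays on the CPU.
device = "cuda" if opt.cuda else "cpu"
print(device)  # cuda
```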

Within 4-5 minutes, this job will complete. Remember, we not only enabled CUDA usage in this version of the job, but we also increased the Epoch count by a factor of 10. Compared to the job that only used CPU resources, we did ten times the amount of work in around 6% of the time!
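The comparison above can be checked with quick arithmetic, using the (rounded) durations from this tutorial’s two runs:

```python
# Rough comparison of the two tutorial runs (rounded figures from this guide).
cpu_minutes = 85   # CPU-only job: ~1 hour 25 minutes for 30 epochs
gpu_minutes = 5    # GPU job: ~4-5 minutes for 300 epochs
work_ratio = 300 / 30                      # 10x the epochs
time_fraction = gpu_minutes / cpu_minutes  # fraction of the wall time used
print(f"{work_ratio:.0f}x the epochs in {time_fraction:.0%} of the time")
# 10x the epochs in 6% of the time
```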

In the next section, we’ll learn how to use secrets to embed credentials in our jobs securely.

6.5. Creating Secrets

This section will focus on creating and using NGC Secrets. At a minimum, you should have your NGC API key available for use as part of a secret.

If you still need to create an NGC API key, refer to this section of the getting started guide.

A secret is a way to store sensitive information to authenticate with external systems inside a Base Command Platform job.
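When a secret is attached to a job, its key/value pairs are made available to the job’s environment - later in this guide, the pair keyed ngc_api_key is referenced as the shell variable $ngc_api_key. A minimal sketch of reading it from inside a job (the variable name follows this tutorial’s secret; outside a job with the secret attached, it is simply unset):

```python
# Sketch: read the secret this tutorial creates from inside a job.
# The key "ngc_api_key" matches the secret pair defined below; it is
# only populated when the secret is selected for the job.
import os

api_key = os.environ.get("ngc_api_key")
if api_key:
    print("NGC API key is available to this job")
else:
    print("no 'ngc_api_key' secret attached to this environment")
```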

  1. To create a secret, click on the dropdown menu in the top right corner of the Base Command Platform web interface and select the Setup option.

    _images/secrets-image-001.png
  2. On the Setup page, click View Secrets in the Secrets pane.

    _images/secrets-image-002.png
  3. To create a new secret, click Add Secret in the top right corner of the Secrets page.

    _images/secrets-image-003.png
  4. Set the secret Name as ngc, and the Description to ngc api key. Under Secret Pairs, set the Key field to ngc_api_key, and set the Value field to your API key. Click + to add the key-value pair.

    Ensure the Enable Secret toggle is set to on (green), and click Create Secret.

    _images/secrets-image-004.png
  5. Return to the Jobs page by clicking on the link on the left navigation menu.

    _images/secrets-image-005.png

We are now ready to use our secret to upload a model!

6.6. Uploading Model to NGC Private Registry

  1. To get started, search for the example tag in the search box. Clone the top job on the page by clicking the ellipsis to the right of it and selecting Clone Job.

    _images/upload-model-image-001.png
  2. Scroll down to the Secrets section, and select the secret we just created for use in this job.

    _images/upload-model-image-002.png
  3. Scroll down to the Container Selection pane and delete the current contents of the Run Command field. Replace it with the following:

    jupyter lab --ip=0.0.0.0 --allow-root --no-browser --NotebookApp.token='' --notebook-dir=/ --NotebookApp.allow_origin='*'
    
    _images/upload-model-image-003.png
  4. You will also need to expose the JupyterLab web interface outside the job so that you can access it from your web browser. The JupyterLab Quick Start handled this for us automatically.

    In the Protocol dropdown, select HTTPS. Once a Protocol is selected, the Container Port field will become available for user input. Enter 8888 (this is the port at which JupyterLab is accessible by default).

    Click Add to the right of this row of inputs.

    _images/upload-model-image-004.png
  5. Scroll down to the Launch Job pane, remove the cuda label if present, and add the model, upload, and ngc labels to the Custom Labels list, pressing Enter/Return to save each label to the job.

    Update the Job Name field to upload-super-resolution-model-to-ngc.

    _images/upload-model-image-005.png
  6. Verify that the output at the bottom of the Create Job page matches what is shown in the above screenshot, except for the --ace, --org, and --workspace arguments.

    If everything matches, click Launch Job in the top right corner of the page.

    _images/upload-model-image-006.png
  7. You will again be redirected to the Jobs page, where you can search for several tags to narrow your list further. To search multiple tags simultaneously, type all tags in the same query, separating each by a comma and space. Try entering upload, ngc in the search bar.

    _images/upload-model-image-007.png
  8. Again, watch for the Status field on the Overview tab to show a value of Running.

    Once the job is running, click the URL that appears in the Service Mapped Ports pane. This opens a JupyterLab web interface, just as it did for our Quick Start job.

    _images/upload-model-image-008.png
  9. Open a terminal via the JupyterLab Launcher.

    _images/upload-model-image-009.png
  10. In the resulting terminal, run the following commands.

    export NGC_CLI_API_KEY=$ngc_api_key
    wget --content-disposition https://api.ngc.nvidia.com/v2/resources/nvidia/ngc-apps/ngc_cli/versions/3.34.0/files/ngccli_linux.zip -O ngccli_linux.zip && unzip ngccli_linux.zip
    

    NGC CLI is now downloaded, and the secret we created has been used to set the environment variable for NGC CLI authentication.

  11. To create a model entry in your org’s private registry for this model, run the following (replacing your-org with your specific Base Command Platform org):

    ./ngc-cli/ngc registry model create --org your-org --application super-resolution --framework pytorch --format ascii --precision fp16 --short-desc "pytorch super resolution model" your-org/super-resolution
    

    Expect the following output from the terminal:

    Successfully created model 'your-org/super-resolution'.
    ----------------------------------------------------
    Model Information
    Name: super-resolution
    Application: super-resolution
    Framework: pytorch
    Model Format: ascii
    Precision: fp16
    Short Description: pytorch super resolution model
    Display Name:
    Logo:
    Org: your-org
    Team:
    Built By:
    Publisher:
    Created Date: 2023-11-21T20:21:04.013Z
    Updated Date: 2023-11-21T20:21:04.013Z
    Labels
    Latest Version ID:
    Latest Version Size (bytes): 0
    Public Dataset Used
        Name:
        Link:
        License:
    ----------------------------------------------------
    
  12. To upload the 300th epoch’s model from our PyTorch Super Resolution example training to this Base Command Platform org’s private registry, run the following (replacing your-org with your specific Base Command Platform org):

    ./ngc-cli/ngc registry model upload-version --org your-org --source /example/super_resolution/model_epoch_300.pth your-org/super-resolution:1.0
    

    Expect the following output from the terminal:

    Starting upload of 1 files (238.34 KB)
        ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • 238.3/238.3 KiB • Remaining: 0:00:00 • ? • Elapsed: 0:00:00 • Total: 0 - Completed: 1 - Failed: 0

    ---------------------------------------------------------------------------
        Model ID: super-resolution[version=1.0]
        Upload status: Completed
        Uploaded local path model: /example/super_resolution/model_epoch_300.pth
        Total files uploaded: 1
        Total transferred: 238.34 KB
        Started at: 2023-11-21 20:23:31
        Completed at: 2023-11-21 20:23:33
        Duration taken: 2s
    ---------------------------------------------------------------------------
    

    We can now close this JupyterLab web interface.

  13. From the job page, navigate to the Models page under Private Registry in the left navigation menu.

    _images/upload-model-image-010.png
  14. A card for our PyTorch Super Resolution model will load. Click on it.

    _images/upload-model-image-011.png
  15. Click on the Version History tab to verify that our 1.0 model version is present.

    _images/upload-model-image-012.png
  16. Click on the File Browser tab to see the uploaded .pth file.

    _images/upload-model-image-013.png

Congratulations! You have uploaded a model for use by yourself and others in your Base Command Platform Org!

For more examples of ways to use Base Command Platform, visit the Tutorials section of the Base Command Platform User Guide.

Or, if you’re ready to get started on your own, explore the NGC Catalog and start building in your Base Command Platform environment.

Notices

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

Trademarks

NVIDIA, the NVIDIA logo, and Base Command are trademarks and/or registered trademarks of NVIDIA Corporation in the United States and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.