Advanced Features

This chapter describes the more advanced features of Fleet Command including administrative settings, security overrides, and remote management.

Settings

This section describes the Fleet Command settings. To view the settings, navigate to the Settings page in Fleet Command.

managing-your-setup-14.png

  • System: Get the latest Fleet Command system image and source code for the system image.

  • Remote Management: Enable or disable the remote console and system reboot options, and modify the remote console timeouts. Enable or disable remote application access, and modify the remote application access timeouts. For more information on remote console or remote application access, refer to Remote Management.

  • Deployment Security Overrides: Enable or disable selecting security overrides when creating deployments. For more information, refer to Deploying an Application.

    Important

    Enabling or disabling deployment security overrides in settings only affects new deployments created after the setting is applied. Disabling security overrides does not affect existing deployments that have overrides configured. These deployments must be deleted and recreated with security overrides disabled.


  • Logs: Enable All Logs to capture all system and location logs and enable application and deployment logging. For more information on logs, refer to Fleet Command Logs.

    By default, both logging options are disabled and can be enabled by an admin user.

Remote Management

Fleet Command provides the following ways to remotely access and manage your systems and applications deployed at edge sites.

  • Remote Console: enables starting remote shell sessions.

  • Remote Application Access: enables access to the web-based services for applications.

Remote Console

Fleet Command allows you to access your systems remotely through the remote console feature.

Remote console allows Fleet Command administrators to start and access shell sessions for each system. One console can be created for each system, up to 20 consoles per organization. Multiple administrators can access the system through the remote console. Remote console starts a shell session within the system for each user.

To access the console remotely, an outbound TLS connection on port 443 must be allowed from the system. Refer to Edge Site Requirements for additional information.
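
As a quick check from a shell on the system, you can verify outbound TLS connectivity before starting a console. This is a generic sketch; the placeholder must be replaced with an endpoint listed in Edge Site Requirements:

# Verify that an outbound TLS connection on port 443 succeeds.
# Replace <fleet_command_endpoint> with an endpoint from Edge Site Requirements.
$ nc -zvw5 <fleet_command_endpoint> 443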

Enabling Remote Console

  1. Select Fleet Command > Settings.

  2. Enable the Enable Remote Console and system reboot switch.

    troubleshooting-fleet-command-02.png

  3. Configure the timeouts as required for your use case.

    • Max is the maximum time allowed for remote console sessions; after the maximum time is reached, all active remote console sessions are closed automatically.

    • Inactivity is the timeout for idle shell sessions. Each user gets their own shell session, and inactivity time is tracked independently for each user.

    Session times can range from 2 minutes to 1 hour. NVIDIA recommends setting the inactivity timeout to a value less than or equal to the session time.

Accessing Remote Console

  1. Select Fleet Command > Locations and then select the location from the table.

  2. On the location details page, click the options button on the system and select Start Remote Console.

    fc-qsg-manage-edge-system-menu.png

  3. The console starts in a new browser tab or window.

    ***********************************************************************
    Current user 'rcuser' is not a sudoer, please 'su admin' to run sudo.
    ***********************************************************************
    rcuser@demo-system-0:~$

  4. To switch to the root user for troubleshooting, perform the following steps.

    1. Switch user to the admin user:

      $ su admin

      Enter the Fleet Command Administrator password.

    2. Switch user to the root user:

      $ sudo su -

      Enter the Fleet Command Administrator password.

    • The rcuser and admin users are basic users that can run nvidia-smi and basic Linux commands without sudo.

    • The root user can run commands like kubectl to debug the system.

  5. Right-click on the shell to display the available options for remote console.

If you don’t have a physical keyboard, you can use the Onscreen Keyboard option with remote console.

For container runtime troubleshooting, you can use the ctr and crictl commands.
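
For example, a few common starting points (run as the root user; the container ID is a placeholder):

$ crictl ps                             # list running containers
$ crictl logs <container_id>            # view logs for a container
$ ctr --namespace k8s.io images list    # list container images in the k8s.io namespace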

The command history of the remote console is purged when the console shell session is closed or the maximum session timeout is reached.

Reconnecting to a Remote Console

  1. If the remote session is closed for any reason, you can click Open on the Remote console banner under the location to reopen the session. To close all active remote console sessions for the system, click End the active remote consoles.

    troubleshooting-fleet-command-08.png

For each Fleet Command organization, a maximum of 20 active remote console sessions are allowed. If more than 20 remote console sessions are attempted, you receive a message that indicates the limit is reached.

Disabling Remote Console

Disabling remote console within the Fleet Command settings immediately closes all remote consoles within the organization.

  1. Select Fleet Command > Settings.

  2. Disable the Enable Remote Console and system reboot switch.

If there are any active remote consoles, you are prompted to confirm before remote management is disabled.

Rebooting a System

While a system is rebooting, applications stop running; they restart after the reboot completes.

  1. Select Fleet Command > Locations and then select the location from the table.

  2. On the location details page, click the options button on the system and select Reboot System.

    system-menu-reboot-system.png

    On the Reboot System window, click Reboot.

    If a remote console session is active, you are prompted to confirm the reboot.

After the reboot, the location and system banners indicate a successful reboot.

Remote Application Access

Fleet Command allows you to access web-based application services running on edge systems remotely from your local machine. Remote application access is available with a unique URL to users with Fleet Command administrator and operator roles. Remote application access features the following:

  • A configurable time allowance to access services. When the time expires, remote access to services for that application will automatically end. This greatly simplifies resource management and frees up available remote sessions for other services.

  • Accessible by multiple users in multiple locations. Regardless of the remote access origin, all admin and operator users in the same organization have access to the remote services.

Configuring your Remote Application

To support remote application access, you must explicitly configure applications to allow remote access via the appProtocol field in the Kubernetes service.
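
For illustration, a minimal sketch of a NodePort service that sets appProtocol. The names, labels, and container port are hypothetical; the node port matches the example mapping later in this section:

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: videoanalytics-web   # hypothetical service name
spec:
  type: NodePort
  selector:
    app: videoanalytics      # hypothetical pod label
  ports:
    - name: http
      port: 80
      targetPort: 8080       # hypothetical container port
      nodePort: 31115        # matches the example mapping in this section
      appProtocol: http      # marks the service as a web-based service
EOF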

Fleet Command invokes your application using a mapping to the root location of your web service. If your application requires additional paths for access, Fleet Command recommends that you configure a redirect from the application root location to the full path of your application.

The following example shows a mapping of the full path of a web application to the resulting remote application service URL:

remote-app-access-sample-app-mapping.png

  • The web application root location http://<ip_address_of_node>:31115 is mapped to https://<fc_remaac_location>-31115.rmsession.<org>.egx.nvidia.com where <fc_remaac_location> is the name of your system and <org> is your organization name.

  • Fleet Command will invoke your application using the mapped root location, https://<fc_remaac_location>-31115.rmsession.nvidia.egx.nvidia.com in this example.

  • If you have already set up a redirect from http://<ip_address_of_node>:31115 to http://<ip_address_of_node>:31115/WebRTCApp/play.html?name=videoanalytics, your application will launch automatically. Fleet Command recommends using NGINX Ingress to set up the redirect and to access the application via the Ingress NodePort, as shown in the sketch after this list.

  • If the redirect is not in place, you will need to manually append the rest of the application path to the mapped root location in the browser.
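
One way to set up the recommended redirect is the NGINX Ingress app-root annotation, which redirects requests for the root path to the application path. This is a sketch with hypothetical names; the annotation belongs to the NGINX Ingress controller, and query-string handling can vary by controller version:

$ kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: videoanalytics-ingress   # hypothetical name
  annotations:
    # Redirect requests for "/" to the full application path.
    nginx.ingress.kubernetes.io/app-root: /WebRTCApp/play.html?name=videoanalytics
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: videoanalytics-web   # hypothetical service name
                port:
                  number: 80
EOF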

Refer to Fleet Command Helm Chart Requirements for more information on configuring the application.

Enabling Remote Application Access

  1. Select Fleet Command > Settings.

  2. Enable the Enable remote application access switch.

    remote-app-access-settings-enabled.png

  3. Optional: Configure the access timeout.

Starting Remote Application Access

This section is for Fleet Command users with administrator or operator roles and describes the steps for starting remote application access.

  1. Select Fleet Command > Deployments and then click the row for the deployment with the application to access.

  2. On the deployment details page, find the location to access and click the Application Services tab.

    remote-app-access-deployments-view-locations.png

    The tab displays the application services deployed at this location. The list shows all services for this application, but only web-based services with unique URLs are remotely accessible.

    To start remote application access for a service, click a Service Name link.

    If the application service is not explicitly enabled for external access or the appProtocol field is not applied to the application service, the service is still listed with the default ‘http’ protocol. However, only web-based services can offer remote application access, and accessing the link for a non-web-based service fails.

    If an application does not have services, you will see the message No application services found in the deployment Details pane for that location.

    remote-app-access-no-services.png

    The Remote Application Access confirmation dialog appears with the allotted time for accessing the remote service. When the time expires, users are unable to access the services, and all running sessions end.

    You can click Open to start or view the service, or End to stop remote application access.

    remote-app-access-started-banner-badge.png


The application service opens in a new browser tab. The following image shows an example remote service:

remote-app-access-sample-app.png

In some cases, the remote application may not open automatically in a browser tab. If the window does not open automatically, check the pop-up blocker for your web browser, and configure it to allow pop-ups from Fleet Command. Click the Open button to open the window manually.

If your application does not load correctly, make sure you follow the Configuring your Remote Application instructions.

Maximum Services

You can enable remote access for up to twenty services within an organization. If the maximum is reached, you will see a Remote Application Access Limit Reached dialog when starting a remote service. You must close a service that is not in use before you can open another.

Ending Remote Application Access

Fleet Command users with administrator or operator roles can end access to services for all users in the org.

To end access to remote application services, follow these steps:

  1. Select Fleet Command > Deployments and then click the row for the deployment with the application.

  2. In the Remote Application Access banner, click End to stop the service.

    remote-app-access-started-banner-badge.png

    A confirmation window appears warning that the action ends the connection for other users. Click End to proceed or Cancel to exit.

    If there are no remaining open access sessions for this deployment, the banner is dismissed automatically.

Disabling Remote Application Access

Administrators can disable remote application access for all users in the organization.

  1. Select Fleet Command > Settings.

  2. Disable the Enable remote application access switch.

If remote services are in use, a confirmation window appears. Click Disable to end access for all users or Cancel to exit.

After remote access is disabled, users are unable to start any remote access for any deployments. The Service Name links in the Application Services tab are not clickable.

remote-app-access-deployment-details-links-unclickable.png

Security Overrides

If the option to allow security overrides is enabled in Fleet Command settings, additional configuration options are available that disable enforcement of specific security policies and enable the application to access more system hardware and software. These security overrides reduce system security and should be used at the administrator’s own risk. NVIDIA recommends limiting the use of security overrides and applying them only for testing and troubleshooting to ensure maximum security of the system.

The following security overrides are available:

  • Enable all overrides

  • Allow system device access: This allows access to devices mounted at /dev.

  • Allow HostPaths: This enables access to any hostPath. If this option is disabled (the default), only /mnt, /opt, /tmp, and /etc/localtime are allowed. See the sketch after this list.

  • Allow HostPath mount via PersistentVolumes: Enable the application to use hostPath for Persistent Volumes.

  • Allow HostNamespace (HostIPC and HostPID): Enable application access to the host system processes and inter-process interactions.

  • Allow Linux capabilities: Enable applications to use any Linux capabilities.

  • Allow PrivilegedContainers: This option allows privileged containers, which run as root, to run.
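
As a concrete illustration, the following hypothetical pod volume would require the Allow HostPaths override, because /var/log is outside the default allowlist:

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: log-reader                 # hypothetical pod name
spec:
  containers:
    - name: reader
      image: busybox               # hypothetical image
      command: ["sh", "-c", "tail -f /host-logs/syslog"]
      volumeMounts:
        - name: host-logs
          mountPath: /host-logs
  volumes:
    - name: host-logs
      hostPath:
        path: /var/log             # outside the default allowlist; needs Allow HostPaths
EOF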

Security overrides cannot be changed when you edit an existing deployment. To change security overrides, remove the deployment and recreate it.

About Deploying Applications with Argo CD

In addition to deploying applications with Helm and Helm charts, you can deploy applications declaratively using Git repositories. Fleet Command uses Argo CD to synchronize application state between a Git repository and edge systems.

Fleet Command configures Argo CD for the following abilities:

  • HTTPS and SSH connections to repositories.

  • Public and private repositories. For private repositories and SSH protocol access, Fleet Command stores credentials.

  • Support for the following tools:

    • Kustomize

    • Jsonnet

    • Directories of Kubernetes manifests in YAML or JSON

    • Helm charts

For both standard locations and high-availability locations, Fleet Command deploys a single replica of the required Argo CD components to conserve computing resources.
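
For illustration, a hypothetical repository layout that Argo CD can sync as a directory of Kubernetes manifests, with an optional kustomization file:

$ tree deploy/videoanalytics   # hypothetical repository path
deploy/videoanalytics
├── deployment.yaml
├── service.yaml
└── kustomization.yaml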

Enabling Argo CD

Support for Argo CD is initially disabled for all locations. You can enable support for Argo CD at a location at any time, but you cannot disable it for the location afterward.

  1. Select Fleet Command > Locations.

  2. On the Locations page, click the actions menu and select Edit Location.

  3. On the Edit Location window, expand Advanced Settings, enable Argo CD, and click Save.

Adding a Git Application

Before you begin, check the following prerequisites:

  • You have at least one location configured with Argo CD enabled.

  • You have a Git repository with a directory path and branch that represents the target state of your application.

To add a Git application, perform the following steps:

  1. Select Fleet Command > Applications.

  2. On the Applications page, click Add Application.

    apps-add-app-button.png

  3. Fill in the details for the application and then click Add Application.

    apps-add-app-git.png

    • Display Name: Enter a unique name for the application.

    • Description: Enter a description for the new application.

    • Source: Select Git Application.

    • Git Repository URL: Specify an HTTPS, SSH, or GIT protocol URL to the Git repository.

    • Git Repository Path: Specify the directory in the repository with the application to deploy.

    • Git Reference: Specify a branch, tag, or commit.

Deploying a Git Application

After you add a Git application, you can deploy the application to one or more locations that have Argo CD enabled.

  1. Select Fleet Command > Deployments.

  2. On the Deployments page, click Create Deployment.

    deps-create-button.png

  3. On the Create Deployment page, fill in the details for the deployment.

    deps-create-git.png

    • Deployment Name: Enter a name for the deployment.

    • Source: Select Git.

    • Application: Select an application for the deployment.

    • Target Namespace: Specify a Kubernetes namespace for the deployment. The default value is default. You cannot edit this value after you create the deployment.

    • Select Locations: Select the locations to deploy to.

    • Security Overrides (Optional): Use this feature to override the default security settings for a deployment. Security overrides can only be selected during the initial deployment creation and cannot be modified afterwards without recreating the deployment. If you choose to apply security overrides, a message is shown on the Locations and Deployments pages. For detailed information on the options, refer to Security Overrides.

  4. Select the Before Deploy checkbox and click Deploy.

    Note

    If multiple deployments with the same application are assigned to a single location, you must confirm that they do not conflict. Some examples to keep in mind are duplicate service ports, oversubscribed resources, duplicate applications, or other application conflicts within the same namespace. Ensure you search for the application rather than the container, model, or collection.

Managing Credentials for Git Repositories

You need to configure credentials for access to a Git repository if either of the following is true:

  • You want to use an SSH connection to the repository.

  • You want to use an HTTPS connection to a private repository that requires user and password or TLS client certificate authentication.

Credentials are associated with the repository URL. When you add an application and specify a repository URL that is associated with credentials, Argo CD uses the credentials to connect to the repository.

  1. Select Fleet Command > Settings.

  2. On the Settings page, navigate to the Git Repositories tile and click View Repositories.

  3. On the Git Repositories page, click Configure Git Repositories to add credentials for a repository.

  4. On the Configure Git Repository window, select the type of connection from the Repository Connection menu and then fill in the details about the credentials.

    • Name: Enter a unique name for the repository credentials.

    • Git Repository URL: Enter an HTTPS, SSH, or GIT protocol URL to the Git repository.

    • SSH Private Key Data (SSH Connection): Enter a private key as plain text. See the key-generation sketch after this list.

    • Username (HTTPS): Enter the user name for accessing the repository.

    • Password (HTTPS): Enter the password for the user.

    • TLS Client Certificate (HTTPS): Enter the client certificate.

    • TLS Client Certificate Key (HTTPS): Enter the private key.

    • Enable LFS Support: Select the checkbox if the repository uses large file support (LFS).

    • Proxy (HTTPS): Enter a proxy value, such as https://proxy.example.com:8888.
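
If you need an SSH key pair for a private repository, you can generate one with standard tooling; the file name and comment below are arbitrary:

$ ssh-keygen -t ed25519 -C "fleet-command-argocd" -f ./fc-repo-key -N ""
$ cat ./fc-repo-key       # private key: paste into SSH Private Key Data
$ cat ./fc-repo-key.pub   # public key: register as a deploy key with your Git host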

Understanding Deployment Status for Git Applications

Deployments for Git applications use two status fields.

deps-status-git.png

The Sync Status field indicates the current state of the application at the location compared to the target state of the application in the Git repository.

Sync Status   Description
Unknown       The synchronization status could not be reliably determined.
Synced        The current state of the application matches the target state in the Git repository.
OutOfSync     The target state in the Git repository has drifted from the current state of the application.

The App Health field indicates the ability for the application to provide service.

App Health    Description
Unknown       The health assessment failed and the health is unknown.
Progressing   The application is not healthy, but might reach a healthy state.
Healthy       The application is 100% healthy and able to provide service.
Suspended     The resource is suspended or paused. This status typically applies to resources other than applications.
Degraded      The application did not reach a healthy state within a deadline threshold. One threshold to check is the spec.progressDeadlineSeconds value for the Deployment resource.
Missing       The resource is missing from the cluster.
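
For example, you can inspect the deadline threshold mentioned for the Degraded status from a root shell in the remote console; the namespace and deployment name are placeholders:

$ kubectl -n <namespace> get deployment <name> -o jsonpath='{.spec.progressDeadlineSeconds}'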

Multi-Instance GPU (MIG)

Important

If NVIDIA Multi-Instance GPU (MIG) is enabled on systems with an active deployment, the currently running applications are stopped during MIG configuration. If Application Compatibility mode is On, the running applications restart after MIG is enabled. If Application Compatibility mode is Off, the applications might not restart successfully and require reconfiguration and redeployment. More information on Application Compatibility mode appears later in this section, and information about configuring applications for MIG is in the Application Guide.

NVIDIA Multi-Instance GPU (MIG) enables partitioning a single GPU into multiple GPU instances so that workloads can be distributed across them. Fleet Command can enable and configure MIG on supported GPUs from the web interface.

MIG capability is only supported on NVIDIA A30 and A100 GPUs. For unsupported GPUs, the MIG tab of the system details displays that no MIG-capable GPUs are present.

For information about MIG, refer to the NVIDIA Multi-Instance GPU User Guide.

To view additional information about MIG, click the Multi-Instance GPU (MIG) tab.

mig-select-config.png

To configure MIG, select the GPU, select the MIG profile from the menu, select the checkbox to acknowledge the application disruption, and click Save to apply the changes. After you apply the MIG configuration, wait several minutes for the web interface to reflect the changes on the system.

mig-select-config-save-ann.png

Fleet Command allows you to configure specific MIG profiles for A30 and A100. The supported MIG profiles are listed in the following table.

MIG Available Options   Supported GPUs
2 MIGs of 3c.20gb       A100
3 MIGs of 2c.10gb       A100
7 MIGs of 1c.5gb        A100
2 MIGs of 2c.12gb       A30
4 MIGs of 1c.6gb        A30
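
After you apply a profile, you can verify the MIG state from a remote console session. These are standard nvidia-smi queries; the exact output depends on your GPU and profile:

$ nvidia-smi --query-gpu=mig.mode.current --format=csv   # confirm MIG mode is Enabled
$ nvidia-smi -L                                          # list the GPU and its MIG devices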

Introduction to the Application Compatibility Mode

Different configuration combinations for systems with more than one MIG-capable GPU can impact application compatibility mode.

  • If MIG is not enabled on all GPUs in a multi-GPU system, the application compatibility mode is set to Off, and a specific application configuration is required to use MIG.

  • If MIG is enabled on all GPUs, but with different MIG profile sizes, the application compatibility mode is set to Off.

  • If MIG is enabled on all GPUs and with the same MIG profile sizes, the application compatibility mode is set to On.

Refer to the Application Guide for additional information about the application compatibility mode and configuring applications for MIG.

mig-app-compat-ann.png

mig-profile-ann.png

Advanced Storage Configuration

Note

This feature is available only on NVIDIA-Certified x86 systems.

While installing the Fleet Command Discovery Image (ISO) onto edge systems, administrators can customize the utilization of physical storage devices attached to the system and designate storage areas where data can persist across OTA updates, system reboots, and application lifecycle stages.

Note

Admins must customize configurations during the initial provisioning of the system. If any changes are required after provisioning, they must rerun the installation to choose a different configuration. Any changes made outside of the installation process are overwritten on a system reboot or an OTA update, or result in the system being quarantined (the system is disabled and requires reinstallation to use again).

Note

If admins choose the default installer options, the installer selects the largest drive for OS installation and combines all of the drives into the same data partition, which could degrade performance.

Here are the available customization options:

  • Select the disk to format and install the Fleet Command Stack and system OS.

  • Select the disk to format and store persistent application deployment data.

  • Select additional drives to format and mount to the system as separate mount points.

  • Specify custom aliases for mounted drives.

  • Choose alternative destinations for logs and container resources (running images, source images).

Unused space on the system OS disk is used for application data; any additional drives you add extend the application data across those disks in a single logical mount.
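
From a remote console shell, you can inspect how the installer laid out the drives using standard Linux tools:

$ lsblk -o NAME,SIZE,TYPE,MOUNTPOINT   # show block devices, sizes, and mount points
$ df -h                                # show file system usage, including the data partition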

The following steps describe each customization option in detail. The drive serial numbers and attributes shown in the dialogs are sample data.

  1. Installing Discovery Image (ISO)

    You can select from the available drives to install the Fleet Command OS. The Fleet Command OS will take up approximately 9 GB of space; unused space on this drive will be used as the data partition.

    adv-storage-config-stack.png


  2. Selecting additional drives for the data partition

    You can designate additional drives to extend the data partition onto. These drives will be added to a logical mount point along with the unused space from Step 1.

    adv-storage-config-lvm.png

  3. Mounting additional drives in separate mount points

    You can mount additional drives as separate mount points to be used for application deployment data, logs, and container images. In the following steps, you can allocate specific drives for the various data types.

    adv-storage-config-mount-points.png

  4. Creating custom drive aliases

    You can create unique aliases (similar to symbolic or soft links) for drives you have specified in the prior steps.

    adv-storage-config-cust-labels.png

  5. Storing system logs

    You can choose to store system logs in a drive you selected in Step 3.

    adv-storage-config-logs.png

  6. Storing application container resources

    You can choose to store application container images in a drive you selected in Step 3.

    adv-storage-config-container-images.png

  7. Viewing storage configuration summary

    In this dialog, you will find a summary of your drive selections and allocations.

    adv-storage-config-summary.png

Viewing Storage Configuration

There are two ways to view the storage configuration: the Fleet Command web interface and the NGC command-line interface (CLI).

To view the advanced storage configuration for a system, go to Fleet Command > Locations. Choose a location and click on the Details tab for a system:

adv-storage-config-details-ui.png


The Storage Configuration pane shows the log and container storage paths and mount points for the additional drives you have selected during installation.

To view the advanced storage configuration using the CLI, issue the following command:


$ ngc fleet-command location info <location>[:<system>]


The following example shows the storage configuration for location fc-test-location and system fc-test-node with two mount points under Storage Configuration: /drives/02000000000000000001 and /drives/VMwareNVME_0000.


$ ngc fleet-command location info fc-test-location:fc-test-node --format_type ascii
--------------------------------------------------------------------------------
  System Information
    Name: fc-test-node
    Status: READY
    Marked For Delete: False
    Config: controller-worker
    Local IP: 172.31.44.243
    Description: nvidia
    Advanced Networking Details
      Default Gateway: 172.31.32.1
      Default Interface: ens33
      Host Name: fc-test-node.fc.nvda.co
      HTTP Proxy:
      HTTPS Proxy:
      No Proxy:
      Interface 1
        Name: ens33
        IP Addresses: 172.31.44.243/20, fe80::457:18ff:fea9:4db7/64
    Storage Configuration
      Additional Data Mount: /drives/02000000000000000001
        Type: HDD(sata)
        Used: 0.1GB
        Available: 5.4GB
      Additional Data Mount: /drives/VMwareNVME_0000
        Type: SSD(nvme)
        Used: 3.5GB
        Available: 5.4GB


Kubernetes Node Labels

Kubernetes clusters deployed at Fleet Command edge sites support node labels through the NVIDIA GPU Feature Discovery software component. This component, which leverages Node Feature Discovery, generates Kubernetes node labels for the set of GPUs available on a node. Fleet Command applications can use these labels to steer and organize specific workloads using Kubernetes node selectors in the edge application pod specification.

A sample list of supported labels follows:

  • Feature labels

    These labels are prefixed with ‘feature.node.kubernetes.io/’.


    { "feature.node.kubernetes.io/cpu-<feature-name>": "true", "feature.node.kubernetes.io/custom-<feature-name>": "true", "feature.node.kubernetes.io/kernel-<feature name>": "<feature value>", "feature.node.kubernetes.io/memory-<feature-name>": "true", "feature.node.kubernetes.io/network-<feature-name>": "true", "feature.node.kubernetes.io/pci-<device label>.present": "true", "feature.node.kubernetes.io/storage-<feature-name>": "true", "feature.node.kubernetes.io/system-<feature name>": "<feature value>", "feature.node.kubernetes.io/usb-<device label>.present": "<feature value>", "feature.node.kubernetes.io/<file name>-<feature name>": "<feature value>" }


  • GPU-specific labels

    These labels are prefixed with ‘nvidia.com/gpu’ and ‘nvidia.com/cuda’.


    { "nvidia.com/cuda.driver.major": "<cuda-driver-major-version>", "nvidia.com/cuda.driver.minor": "<cuda-driver-minor-version>", "nvidia.com/cuda.driver.rev": "<cuda-driver-revision>", "nvidia.com/cuda.runtime.major": "<cuda-runtime-major>", "nvidia.com/cuda.runtime.minor": "<cuda-runtime-minor>", "nvidia.com/gpu.compute.major": "<gpu-compute-major>", "nvidia.com/gpu.compute.minor": "<gpu-compute-minor>", "nvidia.com/gpu.count": "<gpu-count>", "nvidia.com/gpu.family": "<gpu-family>", "nvidia.com/gpu.machine": "<gpu-machine>", "nvidia.com/gpu.memory": "<gpu-memory-MiB>", "nvidia.com/gpu.product": "<gpu-product>" }


You can find a set of recommended labels on the Kubernetes website.
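
As an illustration, the following sketch steers a workload to A100 nodes using a GPU label as a node selector. The pod name, image, and label value are hypothetical; check the labels reported on your own nodes first:

$ kubectl get nodes --show-labels | tr ',' '\n' | grep nvidia.com/gpu
$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference                              # hypothetical pod name
spec:
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-PCIE-40GB  # hypothetical label value
  containers:
    - name: app
      image: registry.example.com/inference:latest # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: 1
EOF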

Viewing Labels

Kubernetes labels are available on the Fleet Command locations page.

  1. To view the Kubernetes labels, go to Fleet Command > Locations.

  2. Click on a location from the list to view the location details page.

  3. Click on the Details tab in the system details pane.

You will see a list of default Kubernetes labels.

k8s-labels.png

Searching Labels

To search for a label, enter a search term in the search field and press Enter.

k8s-labels-search.png

To search on a label by key or value, click on the filter icon, enter a search term in the search field, and press Enter.

k8s-labels-search-by.png

This example shows searching by the key “beta”:

k8s-labels-search-by-key.png

This example shows searching by the value “true”:

k8s-labels-search-by-value.png

High Availability

Fleet Command supports high availability Kubernetes clusters at edge locations. High availability ensures systems operate continuously by eliminating single points of failure and employing redundancy. As a result, users of your applications and services experience fewer disruptions and minimal downtime. When the high availability option is set for a Fleet Command location, the first three systems created are assigned the controller-worker role, and any additional systems are assigned the worker role.

High availability takes effect as resources become available. Once a location is set to high availability, the setting remains in effect for the lifetime of the location.

Note

If a system goes offline, you must wait for the system to come back online before deploying to the location. NVIDIA recommends using static IP assignment in DHCP for systems in high-availability locations to minimize downtime when systems encounter IP changes.
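
One common way to implement static assignment is a host reservation on the DHCP server; this is a sketch using ISC dhcpd syntax with hypothetical values, configured outside Fleet Command:

# Append a reservation to the DHCP server configuration (not on the Fleet Command system).
$ cat >> /etc/dhcp/dhcpd.conf <<'EOF'
host edge-system-01 {
  hardware ethernet 00:11:22:33:44:55;  # MAC address of the edge system
  fixed-address 192.168.10.50;          # IP address reserved for the system
}
EOF
$ systemctl restart isc-dhcp-server     # or dhcpd, depending on the distribution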
