Deployment Best Practices
IT infrastructure is highly complex, involving multiple server types with varying CPUs, memory, storage, and networking resources. Deployments often involve a geographically dispersed user base, multiple data centers, and a blend of cloud-based compute and storage resources. It is crucial to define the scope of your deployment around these variables and conduct a proof of concept (POC) for each deployment type.
Other factors to consider include the NVIDIA vGPU certified OEM server you’ve selected, the supported NVIDIA GPUs for that platform, and any power and cooling constraints in your data center. For further information regarding installation and server configuration steps, please refer to the NVIDIA vGPU on VMware vSphere or Citrix Hypervisor Deployment Guide.
The most successful deployments balance user density (scalability) with quality user experience. This balance is achieved by using NVIDIA RTX vWS virtual machines in production while gathering objective measurements and subjective feedback from end users.
| Objective Measurements | Subjective Feedback |
|---|---|
| Loading time of application | Overall user experience |
| Loading time of dataset | Application performance |
| Utilization (CPU, GPU, network) | Zooming and panning experience |
| Frames Per Second (FPS) | Video streaming |
As discussed in the previous chapter, several NVIDIA-specific and third-party industry tools can help validate your deployment and ensure it delivers an acceptable end-user experience at optimal density. Failing to leverage these tools introduces unnecessary risk and can result in a poor end-user experience.
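As a starting point, several of the objective measurements above can be captured directly on the hypervisor host with nvidia-smi. The commands below are a minimal sketch, assuming the NVIDIA vGPU host driver is installed; the exact query fields and vGPU subcommands available depend on your driver release:

# Per-GPU utilization and memory, sampled every 5 seconds
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,utilization.memory,memory.used,memory.total --format=csv -l 5

# Per-vGPU details and engine utilization as reported by the host driver
nvidia-smi vgpu -q
nvidia-smi vgpu -u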
Another benefit of performing a POC prior to deployment is that it enables more accurate categorization of user behavior and GPU requirements for each virtual application. Customers often segment their end users into user types for each application and bundle similar user types on a host. Light users can be supported on a smaller vGPU profile size, while heavy users require more GPU resources and a larger profile size, such as those available on the L40S.
Note that while the NVIDIA A16 board has a total framebuffer of 64 GB, each GPU on the A16 has 16 GB, so the largest profile size supported on an A16 is 16Q. The L40S, by contrast, has a single GPU per board and supports up to a 48Q profile size. Work with your application ISV and NVIDIA representative to determine the correct license(s) and NVIDIA GPUs for your deployment needs.
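During a POC, the profile sizes available on a given host can be confirmed from the hypervisor with nvidia-smi. This is a minimal sketch, assuming the vGPU host driver is loaded; output columns and supported subcommands vary by release:

# List the vGPU types supported by the installed GPUs (profile name, framebuffer, maximum instances)
nvidia-smi vgpu -s

# List the vGPU types that can still be created, given the vGPU instances already running
nvidia-smi vgpu -c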
Benchmark tools like SPECviewperf are valuable for sizing deployments but have limitations. These benchmarks simulate peak workloads, representing periods of highest GPU demand across all virtual machines. They do not account for times when the system is underutilized, nor for hypervisor features like best-effort scheduling, which can enhance user density while maintaining consistent performance.
The graph below illustrates that user workflows are often interactive, characterized by frequent short idle periods when users require fewer hypervisor and NVIDIA vGPU resources. The extent to which scalability is increased depends on typical user activities such as meetings, breaks, multitasking, and other factors.

Figure 12 Comparison of benchmarking versus typical end user
For accurate benchmarking, it is recommended to disable the Frame Rate Limiter (FRL). For detailed instructions on how to disable the FRL, please refer to the release notes for your chosen hypervisor in the NVIDIA Virtual GPU Software Documentation.
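As an illustration, on a KVM-based hypervisor the FRL is typically controlled through the frame_rate_limiter vGPU plugin parameter, set via the same vgpu_params interface shown later in this section. Treat the parameter name and path as release-dependent and confirm them against the release notes before use:

# Disable the frame rate limiter for a single vGPU instance (KVM example)
echo "frame_rate_limiter=0" > /sys/bus/mdev/devices/<UUID>/nvidia/vgpu_params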
NVIDIA RTX vWS offers three GPU scheduling options tailored to meet various Quality of Service (QoS) requirements. Additional information regarding GPU scheduling can be found in the NVIDIA Virtual GPU Software Documentation.
Fixed Share Scheduling: Ensures consistent, dedicated QoS to each vGPU on the same physical GPU, based on predefined time slices. This simplifies POCs by enabling the use of common benchmarks like SPECviewperf to compare performance between physical and virtual workstations.
Best Effort Scheduling: Provides consistent performance at higher scalability, reducing Total Cost of Ownership (TCO) per user. This scheduler uses a round-robin algorithm to share GPU resources based on real-time demand, allocating time slices dynamically. It optimizes resource utilization during idle periods for enhanced user density and QoS.
Equal Share Scheduling: Allocates equal GPU resources to each running VM by distributing time slices evenly. This approach adjusts resource allocation dynamically as vGPUs are added or removed, boosting performance during low utilization periods and balancing resources during high demand.
Organizations typically adopt the best effort GPU scheduler policy to maximize GPU utilization, supporting more users per server with lower QoS and better TCO per user.
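The scheduling policy is selected on the hypervisor host through the RmPVMRL registry key of the NVIDIA host driver. The sketch below assumes a Linux KVM host and the documented default values (0x00 best effort, 0x01 equal share, 0x11 fixed share); verify the values and procedure against the GPU scheduling documentation for your hypervisor and release:

# Linux KVM host example: add the following entry to /etc/modprobe.d/nvidia.conf, then reboot the host
# 0x00 = best effort (default), 0x01 = equal share, 0x11 = fixed share
options nvidia NVreg_RegistryDwords="RmPVMRL=0x11"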

Figure 13 Comparison of VMs-per-GPU performance utilization: dedicated performance vs. best effort configurations
GPU channels are dedicated communication pathways that allow applications and system processes to interact with the GPU for accelerated computing. Each vGPU instance is allocated a specific number of GPU channels, which are consumed as applications request GPU resources.
Channel allocation per vGPU is designed based on the maximum number of vGPU instances that can run concurrently on a physical GPU. Larger profiles typically have more GPU channels, which can help improve multitasking and application stability.
When all vGPU channels are consumed, applications may fail to launch or may crash, and system responsiveness degrades. When running vGPU 16.10, 17.6, or 18.1 and later releases, the following error is reported on the hypervisor host or in an NVIDIA bug report:
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): Guest attempted to allocate channel above its max channel limit 0xfb
Monitoring GPU Channel Usage During POC
By monitoring GPU channel allocation, users can prevent channel exhaustion, optimize vGPU deployments, and ensure stable performance in virtualized environments.
The channel_usage_threshold_percentage plugin parameter helps detect when workloads approach GPU channel exhaustion on specific hypervisors. Setting a threshold allows administrators to receive warnings when channel usage surpasses the defined percentage.
By default, channel usage warnings are disabled, but administrators can enable them by setting a threshold percentage. For KVM hypervisors, the plugin parameter can be configured as follows:
echo "channel_usage_threshold_percentage=<percentage>" >
/sys/bus/mdev/devices/<UUID>/nvidia/vgpu_params
For example, to set the GPU channel usage warning threshold to 80%, run the following command:
echo "channel_usage_threshold_percentage=80" >
/sys/bus/mdev/devices/<UUID>/nvidia/vgpu_params
The above instructions are specific to KVM hypervisors; the path and configuration method differ on other hypervisors. For more information, follow the hypervisor-specific setup steps in the NVIDIA Virtual GPU Software Documentation.
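On VMware vSphere, for example, vGPU plugin parameters are generally passed as advanced VM configuration options using the pciPassthruX.cfg.<parameter> convention rather than through sysfs. The line below is an assumption-based illustration of that convention; confirm in the hypervisor-specific documentation whether this particular parameter is exposed on your platform:

# vSphere example (VM advanced configuration), where 0 refers to the first vGPU assigned to the VM
pciPassthru0.cfg.channel_usage_threshold_percentage = "80"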
When running vGPU 16.10, 17.6, or 18.1 and later releases, warning messages appear in the logs once usage surpasses the threshold, indicating that channel utilization is approaching exhaustion. Example log output:
Sep 10 08:39:52 smc120-0003 nvidia-vgpu-mgr[313728]: notice: vmiop_log: Guest current channel usage 81% on engine 0x1 exceeds threshold channel usage 80%
This feature is particularly useful during proof-of-concept (POC) deployments to observe and optimize resource allocation before moving to production. Through proactive monitoring, administrators can detect potential channel exhaustion early, identifying workloads that consume excessive GPU channels and preventing system crashes and performance degradation. This insight allows for timely adjustments before issues escalate.
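During a POC it can also be convenient to watch the hypervisor logs for these warnings while users work. The command below is a minimal sketch assuming a systemd-based KVM host; log locations and tooling differ on other hypervisors and distributions:

# Follow host log messages and surface channel-usage warnings as they occur
journalctl -f | grep -i "channel usage"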
Note that this channel threshold monitoring cannot be carried out on Ada and Hopper vGPU types.