NVIDIA RTX vWS: Sizing and GPU Selection Guide for Virtualized Workloads

Deployment Best Practices

IT infrastructure is highly complex, involving multiple server types with varying CPUs, memory, storage, and networking resources. Deployments often involve a geographically dispersed user base, multiple data centers, and a blend of cloud-based compute and storage resources. It is crucial to define the scope of your deployment around these variables and conduct a proof of concept (POC) for each deployment type.

Other factors to consider include the NVIDIA vGPU certified OEM server you’ve selected, the supported NVIDIA GPUs for that platform, and any power and cooling constraints in your data center. For further information regarding installation and server configuration steps, please refer to the NVIDIA vGPU on VMware vSphere or Citrix Hypervisor Deployment Guide.

The most successful deployments balance user density (scalability) with quality user experience. This balance is achieved by using NVIDIA RTX vWS virtual machines in production while gathering objective measurements and subjective feedback from end users.

Table 20 - Metrics for Balancing User Density and User Experience

Objective Measurements              Subjective Feedback
Loading time of application         Overall user experience
Loading time of dataset             Application performance
Utilization (CPU, GPU, network)     Zooming and panning experience
Frames Per Second (FPS)             Video streaming

As discussed in the previous chapter, several NVIDIA-specific and third-party industry tools can help validate your deployment and ensure it provides an acceptable end-user experience and optimal density. Failing to leverage these tools introduces unnecessary risk and can lead to a poor end-user experience.
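For example, host GPU utilization can be sampled during a POC with nvidia-smi. The short Python sketch below is one possible approach, not a required tool; the output file name, poll interval, and sample count are arbitrary choices, and on a vGPU host the per-vGPU views offered by nvidia-smi vgpu may be more appropriate.

# Minimal sketch: poll host GPU utilization during a POC with nvidia-smi and
# append the samples to a CSV for later review.

import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=timestamp,name,utilization.gpu,memory.used",
         "--format=csv,noheader"]

def sample_gpu_utilization(out_path: str = "gpu_samples.csv",
                           interval_s: int = 5,
                           samples: int = 12) -> None:
    """Append periodic GPU utilization samples to a CSV file."""
    with open(out_path, "a") as out:
        for _ in range(samples):
            result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
            out.write(result.stdout)
            time.sleep(interval_s)

if __name__ == "__main__":
    sample_gpu_utilization()  # one minute of samples at 5-second intervals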

Another benefit of performing a POC prior to deployment is that it enables more accurate categorization of user behavior and GPU requirements for each virtual application. Customers often segment their end users into user types for each application and bundle similar user types on a host. Light users can be supported on a smaller vGPU profile size, while heavy users require more GPU resources and a larger profile size, such as those available on the L40S.

Note that while the NVIDIA A16 board has a total frame buffer size of 64 GB, each GPU on the A16 has 16 GB, so the largest profile size supported on an A16 is 16Q. The L40S, by contrast, has a single GPU per board and supports profile sizes up to 48Q. Work with your application ISV and NVIDIA representative to help you determine the correct license(s) and NVIDIA GPUs for your deployment needs.
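As a rough illustration of that sizing logic, the sketch below estimates how many instances of a given profile fit on a board. The helper name and the simple rule that a profile must fit within a single GPU's frame buffer are assumptions for illustration; the board specifications are those described above.

# Minimal sizing sketch: how many vGPU instances of a given profile fit on a board.
# Board definitions reflect the GPUs discussed above (A16: 4 GPUs x 16 GB each,
# L40S: 1 GPU x 48 GB); profile sizes are given as frame buffer in GB (e.g. 16Q -> 16).

BOARDS = {
    "A16":  {"gpus_per_board": 4, "fb_per_gpu_gb": 16},
    "L40S": {"gpus_per_board": 1, "fb_per_gpu_gb": 48},
}

def max_instances(board: str, profile_fb_gb: int) -> int:
    """Return the maximum vGPU instances of the given profile per board."""
    spec = BOARDS[board]
    if profile_fb_gb > spec["fb_per_gpu_gb"]:
        return 0  # a profile cannot span GPUs, so it must fit in one GPU's frame buffer
    per_gpu = spec["fb_per_gpu_gb"] // profile_fb_gb
    return per_gpu * spec["gpus_per_board"]

if __name__ == "__main__":
    print(max_instances("A16", 16))   # 4  -> largest profile an A16 supports is 16Q
    print(max_instances("A16", 48))   # 0  -> 48Q does not fit on a single A16 GPU
    print(max_instances("L40S", 48))  # 1  -> L40S supports up to a 48Q profile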

Benchmark tools like SPECviewperf are valuable for sizing deployments but have limitations. These benchmarks simulate peak workloads, representing periods of highest GPU demand across all virtual machines. They do not account for times when the system is underutilized, nor for hypervisor features like best-effort scheduling, which can enhance user density while maintaining consistent performance.

The graph below illustrates that user workflows are often interactive, characterized by frequent short idle periods when users require fewer hypervisor and NVIDIA vGPU resources. The extent to which scalability is increased depends on typical user activities such as meetings, breaks, multitasking, and other factors.


Figure 13 - Comparison of benchmarking versus typical end user

Note

For accurate benchmarking, it is recommended to disable the Frame Rate Limiter (FRL). For detailed instructions on how to disable the FRL, please refer to the release notes for your chosen hypervisor in the NVIDIA Virtual GPU Software Documentation.

NVIDIA RTX vWS offers three GPU scheduling options tailored to meet various Quality of Service (QoS) requirements. Additional information regarding GPU scheduling can be found in the NVIDIA Virtual GPU Software documentation; a simplified sketch of how each policy divides GPU time follows the list below.

  • Fixed Share Scheduling: Ensures consistent, dedicated QoS to each vGPU on the same physical GPU, based on predefined time slices. This simplifies POCs by enabling the use of common benchmarks like SPECviewperf to compare performance between physical and virtual workstations.

  • Best Effort Scheduling: Provides consistent performance at higher scalability, reducing Total Cost of Ownership (TCO) per user. This scheduler uses a round-robin algorithm to share GPU resources based on real-time demand, allocating time slices dynamically. It optimizes resource utilization during idle periods for enhanced user density and QoS.

  • Equal Share Scheduling: Allocates equal GPU resources to each running VM by distributing time slices evenly. This approach adjusts resource allocation dynamically as vGPUs are added or removed, boosting performance during low utilization periods and balancing resources during high demand.
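The following sketch is a conceptual model only, not NVIDIA's scheduler implementation; the function names and the normalized "share of GPU time" values are illustrative assumptions meant to show how the three policies differ.

# Conceptual sketch of how the three vGPU scheduling policies divide GPU time.
# "demand" marks whether a VM currently has work to submit; returned values are
# each VM's notional share of GPU time.

def fixed_share(max_vgpus_on_gpu: int, running_vms: list[str]) -> dict[str, float]:
    # Each vGPU gets a predefined slice sized for the maximum vGPU count,
    # whether or not the other slots are occupied or busy.
    return {vm: 1.0 / max_vgpus_on_gpu for vm in running_vms}

def equal_share(running_vms: list[str]) -> dict[str, float]:
    # GPU time is split evenly across the VMs that are currently running,
    # so shares grow or shrink as VMs are added or removed.
    return {vm: 1.0 / len(running_vms) for vm in running_vms}

def best_effort(demand: dict[str, bool]) -> dict[str, float]:
    # Round-robin over VMs that actually have work; idle VMs yield their time,
    # which is what raises achievable user density.
    active = [vm for vm, busy in demand.items() if busy]
    if not active:
        return {vm: 0.0 for vm in demand}
    return {vm: (1.0 / len(active) if vm in active else 0.0) for vm in demand}

if __name__ == "__main__":
    vms = ["vm1", "vm2", "vm3"]
    print(fixed_share(max_vgpus_on_gpu=8, running_vms=vms))      # 12.5% each
    print(equal_share(vms))                                       # ~33% each
    print(best_effort({"vm1": True, "vm2": False, "vm3": True}))  # 50%, 0%, 50%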

Organizations typically adopt the best effort GPU scheduler policy to maximize GPU utilization, supporting more users per server and a better TCO per user, at the cost of the stricter QoS guarantees provided by the other policies.


Figure 14 - Comparison of VMs Per GPU performance Utilization Based on Dedicated Performance vs Best Effort Configs

A physical GPU has a fixed number of channels. The number of channels allocated to each vGPU profile is proportional to the profile’s size relative to the GPU. Issues occur when the channels allocated to a vGPU are exhausted, and the guest VM to which the vGPU is assigned fails to allocate a channel to the vGPU.

To avoid channel exhaustion errors, use a vGPU profile with a larger frame buffer. A larger profile means fewer vGPU instances can be supported on the physical GPU, so the number of channels allocated to each vGPU increases.

Ampere and later GPUs have 2048 channels per GPU engine (compute/graphics, copy engine, video encoders/decoders).

The number of channels per vGPU on Ampere and later is given by this formula:

(largest power of 2 not greater than (2048 / max instances for the vGPU's type)) - 3

For example, the maximum number of 1B vGPU instances on an NVIDIA L4 is 24. Dividing the 2048 channels by 24 gives 85, and the largest power of 2 not greater than 85 is 64, which provides 64 channels per VM before overhead. Subtracting the 3 channels of overhead leaves 61 channels per L4-1B vGPU instance.

(largest power of 2 not greater than (2048 / 24)) - 3 = 64 - 3 = 61 channels per L4-1B vGPU instance.
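Expressed as code, the same calculation reproduces the per-profile values in Table 21 below. This is a minimal sketch with a hypothetical helper name, assuming the 2048 channels per engine and the maximum instance counts stated in this section.

# Sketch of the channel formula above for Ampere and later GPUs:
# channels_per_vgpu = (largest power of 2 not greater than 2048 / max_instances) - 3

CHANNELS_PER_ENGINE = 2048

def channels_per_vgpu(max_instances: int) -> int:
    """Channels available to each vGPU instance after the 3-channel overhead."""
    per_instance = CHANNELS_PER_ENGINE // max_instances
    power_of_two = 1
    while power_of_two * 2 <= per_instance:
        power_of_two *= 2
    return power_of_two - 3

if __name__ == "__main__":
    print(channels_per_vgpu(24))  # L4-1B : 24 instances -> 61 channels
    print(channels_per_vgpu(12))  # L4-2B : 12 instances -> 125 channels
    print(channels_per_vgpu(16))  # A16-1B: 16 instances -> 125 channels
    print(channels_per_vgpu(8))   # A16-2B:  8 instances -> 253 channels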

Example channel allocation per GPU Profile:

Table 21 - Example Channel Allocation per GPU Profile

GPU Profile   Channels per vGPU Instance
A16-1B        125 channels per vGPU instance (16 instances per GPU x 4 GPUs per board = 64 max per board)
A16-2B        253 channels per vGPU instance
L4-1B         61 channels per vGPU instance
L4-2B         125 channels per vGPU instance

Channel utilization depends on the single-threaded and multi-threaded applications running in the vGPU VM, in addition to OS and boot overhead. Different applications will therefore have different channel utilization.

When the channels allocated to a vGPU are exhausted, and the guest VM fails to allocate a channel, the following errors are reported on the hypervisor host or in an NVIDIA bug report:


Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): Guest attempted to allocate channel above its max channel limit 0xfb
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): VGPU message 6 failed, result code: 0x1a
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0xc1d004a1, 0xff0e0000, 0xff0400fb, 0xc36f,
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x1, 0xff1fe314, 0xff1fe038, 0x100b6f000, 0x1000,
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x80000000, 0xff0e0200, 0x0, 0x0, (Not logged),
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x1, 0x0
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): , 0x0
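If you need to watch for these messages across a long log, a simple scan such as the sketch below can help. The log path and matched substrings are assumptions based on the sample output above; adjust them for your hypervisor's actual log location and format.

# Minimal sketch: scan a hypervisor log for the vGPU channel-exhaustion messages shown above.

LOG_PATH = "/var/log/messages"  # hypothetical path; varies by hypervisor
PATTERNS = (
    "Guest attempted to allocate channel above its max channel limit",
    "VGPU message 6 failed",
)

def find_channel_errors(path: str = LOG_PATH) -> list[str]:
    """Return log lines that match the channel-exhaustion patterns."""
    hits = []
    with open(path, errors="replace") as log:
        for line in log:
            if any(pattern in line for pattern in PATTERNS):
                hits.append(line.rstrip())
    return hits

if __name__ == "__main__":
    for entry in find_channel_errors():
        print(entry)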
