Deployment Best Practices
The most successful deployments strike a balance between user density (scalability) and quality of user experience. This balance is achieved by running vPC virtual machines in production while gathering both objective measurements for density sizing and subjective feedback on the end-user experience.
NVIDIA vPC on NVIDIA GPUs provides extensive monitoring features, enabling IT to monitor the utilization of the various engines of an NVIDIA GPU. The utilization of the compute engine, the frame buffer, the encoder, and the decoder can all be monitored and logged through the command-line tool nvidia-smi, run on the hypervisor or within the virtual machine. In addition, NVIDIA vGPU metrics are integrated with Windows Performance Monitor (PerfMon) and with management packs such as VMware vRealize Operations.
To identify bottlenecks for individual end users or for the physical GPU serving multiple end users, execute the following nvidia-smi commands on the hypervisor.
Virtual Machine Frame Buffer Utilization:
nvidia-smi vgpu -q -l 5 | grep -e "VM ID" -e "VM Name" -e "Total" -e "Used" -e "Free"
Virtual Machine GPU, Encoder and Decoder Utilization:
nvidia-smi vgpu -q -l 5 | grep -e "VM ID" -e "VM Name" -e "Utilization" -e "Gpu" -e "Encoder" -e "Decoder"
Physical GPU, Encoder and Decoder Utilization:
nvidia-smi -q -d UTILIZATION -l 5 | grep -v -e "Duration" -e "Number" -e "Max" -e "Min" -e "Avg" -e "Memory" -e "ENC" -e "DEC" -e "Samples"
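For longer measurement runs during a POC, the same data can be sampled at an interval and written to a file for later analysis. The one-liner below is a minimal sketch that uses common nvidia-smi query fields; field availability can vary by driver version, and the output file name is arbitrary.
# Sketch: log physical GPU utilization and frame buffer usage every 5 seconds to a CSV file
nvidia-smi --query-gpu=timestamp,name,utilization.gpu,memory.used,memory.total --format=csv -l 5 >> gpu_utilization_log.csv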
Another benefit of performing a POC before deployment is that it enables a more accurate categorization of user behavior and GPU requirements for each virtual workstation. Customers often segment their end users into user types for each application and group similar user types on a host. Light users can be supported on a smaller GPU with a smaller profile size, while heavy users require more GPU resources and a larger profile size, and may be best served by an upgraded vGPU license such as NVIDIA RTX Virtual Workstation (RTX vWS).
Benchmarks like nVector can be used to help size a deployment, but they have some limitations. The nVector benchmarks simulate peak workloads with the highest demand for GPU resources across all virtual machines. The benchmark does not account for the times when the system is not fully utilized. Hypervisors and the best effort scheduling policy can be leveraged to achieve higher user densities with consistent performance.
The graphic below demonstrates that the workflows end users process are typically interactive, meaning there are multiple short idle breaks during which users demand less performance and fewer resources from the hypervisor and the NVIDIA vGPU. The degree to which higher scalability is achieved depends on your users' typical day-to-day activities, such as the number of meetings, the length of lunch or other breaks, multi-tasking, etc. It is recommended to test and validate your internal workloads to confirm they meet the needs of your users.
Figure: Benchmark testing vs. typical end-user GPU utilization
NVIDIA used the nVector benchmarking engine to conduct vGPU testing at scale. This benchmarking engine automates the testing process, from provisioning virtual machines and establishing remote connections to executing the knowledge worker (KW) workflow and analyzing the results across all virtual machines. The test results shown in this application guide are based on the nVector KW benchmark run in parallel on all virtual machines, with the metrics averaged.
When done well, benchmarking can improve an organization's processes and overall performance. In the context of benchmarking, a channel refers to the specific path or method through which data or workloads are processed by the GPU; understanding and defining channels is important because it directly impacts the performance outcomes measured in benchmark tests, making channels a significant sizing factor for GPU deployments. However, there are many pitfalls to benchmark testing, especially social benchmarking, because an organization's internal POC may yield different results than the social benchmark. This can happen for a variety of reasons, such as configuration differences, unreliable or incomplete data, or the lack of a proper framework for standardized testing. Conducting an internal POC with benchmark testing is valuable, but it provides an incomplete picture when not paired with organizational strategy and goals.
NVIDIA vPC provides three GPU scheduling options to accommodate a variety of customer QoS requirements. Additional information regarding GPU scheduling can be found in the NVIDIA Virtual GPU Software documentation.
- Best effort scheduler (default)
- Equal share scheduler
- Fixed share scheduler
Equal share scheduler: The physical GPU is shared equally amongst the running vGPUs that reside on it. As vGPUs are added to or removed from a GPU, the share of the GPU's processing cycles allocated to each vGPU changes accordingly. As a result, the performance of a vGPU may increase as other vGPUs on the same GPU are stopped, or decrease as other vGPUs are started on the same GPU.
Fixed share scheduler: Each vGPU is given a fixed share of the physical GPU's processing cycles, the amount of which depends on the vGPU type, which in turn determines the maximum number of vGPUs per physical GPU. For example, the maximum number of T4-4Q vGPUs per physical GPU is 4. When the scheduling policy is fixed share, each T4-4Q vGPU is given one quarter (25%) of the physical GPU's processing cycles. As vGPUs are added to or removed from a GPU, the share of the GPU's processing cycles allocated to each vGPU remains constant. As a result, the performance of a vGPU remains unchanged as other vGPUs are stopped or started on the same GPU.
In addition to the default best effort scheduler, GPUs based on NVIDIA GPU architectures after the Maxwell architecture support equal share and fixed share vGPU schedulers.
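As a hedged example of how the policy is typically selected, the NVIDIA host driver exposes the RmPVMRL registry key, which on VMware ESXi can be set through the driver module parameters as sketched below. The values shown (0x00 best effort, 0x01 equal share, 0x11 fixed share) and the exact syntax are assumptions that should be verified against the NVIDIA Virtual GPU Software documentation for your release.
# Sketch: select the equal share scheduler on an ESXi host, then verify (host reboot or driver reload required)
esxcli system module parameters set -m nvidia -p "NVreg_RegistryDwords=RmPVMRL=0x01"
esxcli system module parameters list -m nvidia | grep NVreg_RegistryDwords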
A physical GPU has a fixed number of channels. The number of channels allocated to each vGPU is inversely proportional to the maximum number of vGPUs allowed on the physical GPU: the more vGPUs a GPU supports, the fewer channels each vGPU receives. Issues occur when the channels allocated to a vGPU are exhausted and the guest VM to which the vGPU is assigned fails to allocate a new channel.
To resolve channel exhaustion errors, use a vGPU type with more frame buffer, which reduces the maximum number of vGPUs allowed on the physical GPU and therefore increases the number of channels allocated to each vGPU.
Ampere and later GPUs have 2048 channels per GPU engine (compute/graphics, copy engine, video encoders/decoders).
The number of channels per vGPU on Ampere and later is given by this formula:
(largest power of 2 smaller than (2048 / max instances for the vGPU's type)) - 3
For example, the maximum number of 1B vGPU profiles on an NVIDIA L4 is 24. Dividing the 2048 channels by 24 gives 85 (rounded down), and the largest power of 2 smaller than 85 is 64, which provides 64 channels per VM before overhead. Subtracting the 3 overhead channels from 64 leaves 61 channels per VM.
(biggest power of 2 smaller than (2048/24)) - 3 = 61 channels per VM
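The calculation can also be scripted when comparing profile choices. The snippet below is a minimal sketch of the formula above (the 3-channel overhead is as stated here and may differ by driver release); it reproduces the 61-channel result for the 24-instance example.
# Sketch: channels per vGPU = (largest power of 2 smaller than 2048 / max instances) - 3
MAX_INSTANCES=24                              # 1B profiles on an NVIDIA L4, per the example above
QUOTIENT=$(( 2048 / MAX_INSTANCES ))          # 85
POW2=1
while [ $(( POW2 * 2 )) -lt "$QUOTIENT" ]; do POW2=$(( POW2 * 2 )); done   # 64
echo "$(( POW2 - 3 )) channels per VM"        # prints: 61 channels per VM
Running the same calculation with 12 instances (an assumed maximum for a 2B profile on an L4, since 24 GB / 2 GB = 12) yields 125 channels per VM, consistent with the table below and with the guidance above: a larger profile reduces the maximum number of vGPUs and increases the channels available to each VM.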
Example channel allocation per GPU Profile:
| GPU profile | Channels per VM |
|---|---|
| A16-1B | 125 channels per VM (16 per GPU × 4 GPUs per card = 64 max per card) |
| A16-2B | 252 channels per VM |
| L4-1B | 61 channels per VM |
| L4-2B | 125 channels per VM |
Channel utilization depends on the single-threaded and multi-threaded applications running in a vGPU VM, in addition to OS and boot overhead. Therefore, different applications will have different channel utilization.
When the channels allocated to a vGPU are exhausted, and the guest VM fails to allocate a channel, the following errors are reported on the hypervisor host or in an NVIDIA bug report:
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): Guest attempted to allocate channel above its max channel limit 0xfb
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): VGPU message 6 failed, result code: 0x1a
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0xc1d004a1, 0xff0e0000, 0xff0400fb, 0xc36f,
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x1, 0xff1fe314, 0xff1fe038, 0x100b6f000, 0x1000,
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x80000000, 0xff0e0200, 0x0, 0x0, (Not logged),
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): 0x1, 0x0
Jun 26 08:01:25 srvxen06f vgpu-3[14276]: error: vmiop_log: (0x0): , 0x0
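To check whether a host is already running into this limit, the vmiop_log messages can be searched in the hypervisor's system log. The log path below assumes a Citrix Hypervisor / XenServer host (as suggested by the hostname in the sample output above) and may differ on other platforms.
# Sketch: look for channel exhaustion messages in the hypervisor system log
grep -e "vmiop_log" -e "max channel limit" /var/log/messages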