When to Use MPS#

The Benefits of MPS#

GPU Utilization#

A single process may not utilize all the compute and memory-bandwidth capacity available on the GPU. MPS allows kernel and memcopy operations from different processes to overlap on the GPU, achieving higher utilization and shorter running times.

Reduced On-GPU Context Storage#

Without MPS, each CUDA process using a GPU allocates separate storage and scheduling resources on the GPU. In contrast, the MPS server allocates one copy of GPU storage and scheduling resources shared by all its clients. Volta MPS supports increased isolation between MPS clients, so the resource reduction applies to a much lesser degree.

Reduced GPU Context Switching#

Without MPS, when processes share the GPU their scheduling resources must be swapped on and off the GPU. The MPS server shares one set of scheduling resources between all of its clients, eliminating the overhead of swapping when the GPU is scheduling between those clients.

Identifying Candidate Applications#

MPS is useful when each application process does not generate enough work to saturate the GPU. Multiple processes can be run per node using MPS to enable more concurrency. Applications like this are identified by having a small number of blocks-per-grid.

Further, if the application shows low GPU occupancy because of a small number of threads-per-grid, performance improvements may be achievable with MPS. Using fewer blocks-per-grid and more threads-per-block in the kernel invocation to increase the occupancy per block is recommended. MPS allows the leftover GPU capacity to be occupied by CUDA kernels running from other processes.

These cases arise in strong-scaling situations, where the compute capacity (node, CPU core and/or GPU count) is increased while the problem size is held fixed. Though the total amount of computation work stays the same, the work per process decreases and may underutilize the available compute capacity while the application is running. With MPS, the GPU will allow kernel launches from different processes to run concurrently and remove an unnecessary point of serialization from the computation.
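
As a concrete illustration, the sketch below (with an arbitrary kernel and problem size) shows a strong-scaled rank whose per-process grid is far smaller than the GPU's SM count; on its own it cannot fill the device, which is exactly the situation where MPS lets kernels from other processes occupy the leftover capacity.

// Minimal sketch (hypothetical kernel and sizes): a strong-scaled rank whose
// per-process grid is much smaller than the GPU's SM count.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void rank_work(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder per-element work
}

int main() {
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);

    const int n = 32 * 1024;                          // small per-rank problem size
    float *d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;   // only 128 blocks
    printf("launching %d blocks on a GPU with %d SMs\n", blocks, sm_count);
    rank_work<<<blocks, threads>>>(d, n);             // likely cannot saturate the GPU alone

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}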

Considerations#

System Considerations#

Limitations#

  • MPS is only supported on the Linux and QNX operating systems. The MPS server will fail to start when launched on any other operating system.

  • Only Volta MPS is supported on Tegra platforms.

  • MPS requires a GPU with compute capability version 3.5 or higher. The MPS server will fail to start if one of the GPUs visible after applying CUDA_VISIBLE_DEVICES is not of compute capability 3.5 or higher.

  • The Unified Virtual Addressing (UVA) feature of CUDA must be available, which is the default for any 64-bit CUDA program running on a GPU with compute capability version 2.0 or higher. If UVA is unavailable, the MPS server will fail to start.

  • The amount of page-locked host memory that can be allocated by MPS clients is limited by the size of the tmpfs filesystem (/dev/shm for Linux and /dev/shmem for QNX).

  • Exclusive-mode restrictions are applied to the MPS server, not MPS clients. GPU compute modes are not supported on Tegra platforms.

  • Only one user on a system may have an active MPS server.

  • The MPS control daemon will queue MPS server activation requests from separate users, leading to serialized exclusive access of the GPU between users regardless of GPU exclusivity settings.

  • All MPS client behavior will be attributed to the MPS server process by system monitoring and accounting tools (for example, nvidia-smi, NVML API).

GPU Compute Modes#

Three Compute Modes are supported via settings accessible in nvidia-smi:

  • PROHIBITED – the GPU is not available for compute applications.

  • EXCLUSIVE_PROCESS – the GPU is assigned to only one process at a time, and individual process threads may submit work to the GPU concurrently.

  • DEFAULT – multiple processes can use the GPU simultaneously. Individual threads of each process may submit work to the GPU simultaneously.

Using MPS effectively causes EXCLUSIVE_PROCESS mode to behave like DEFAULT mode for all MPS clients. MPS will always allow multiple clients to use the GPU via the MPS server.

When using MPS it is recommended to use EXCLUSIVE_PROCESS mode to ensure that only a single MPS server is using the GPU, which provides additional insurance that the MPS server is the single point of arbitration between all CUDA processes for that GPU.
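
As a quick check, a CUDA program can query the compute mode of a device through the cudaDevAttrComputeMode attribute; the minimal sketch below prints the mode of device 0 (the mode itself is configured by an administrator, typically via nvidia-smi).

// Minimal sketch: query the current compute mode of device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int mode = 0;
    cudaDeviceGetAttribute(&mode, cudaDevAttrComputeMode, 0);
    switch (mode) {
        case cudaComputeModeDefault:          printf("DEFAULT\n");           break;
        case cudaComputeModeProhibited:       printf("PROHIBITED\n");        break;
        case cudaComputeModeExclusiveProcess: printf("EXCLUSIVE_PROCESS\n"); break;
        default:                              printf("other (%d)\n", mode);  break;
    }
    return 0;
}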

Application Considerations#

  • The NVIDIA Video Codec SDK (https://developer.nvidia.com/nvidia-video-codec-sdk) is not supported under MPS on pre-Volta MPS clients.

  • Only 64-bit applications are supported. The MPS server will fail to start if the CUDA application is not 64-bit. The MPS client will fail CUDA initialization.

  • If an application uses the CUDA driver API, then it must use headers from CUDA 4.0 or later (that is, it must not have been built by setting CUDA_FORCE_API_VERSION to an earlier version). Context creation in the client will fail if the context version is older than 4.0.

  • Dynamic parallelism is not supported. CUDA module load will fail if the module uses dynamic parallelism features.

  • The MPS server only supports clients running with the same UID as the server. The client application will fail to initialize if the server is not running with the same UID. Volta MPS may be launched in -multiuser-server mode to allow clients under different UIDs to connect to a single MPS server launched under the root user, at the cost of dropping isolation between users. Refer to Server for details regarding -multiuser-server mode.

  • Stream callbacks are not supported on pre-Volta MPS clients. Calling any stream callback APIs will return an error.

  • CUDA graphs with host nodes are not supported under MPS on pre-Volta MPS clients.

  • The amount of page-locked host memory that pre-Volta MPS client applications can allocate is limited by the size of the tmpfs filesystem (/dev/shm for Linux and /dev/shmem for QNX). Attempting to allocate more page-locked memory than the allowed size using any of the relevant CUDA APIs will fail.

  • Terminating an MPS client without synchronizing with all outstanding GPU work (via Ctrl-C / program exception such as segfault / signals, etc.) can leave the MPS server and other MPS clients in an undefined state, which may result in hangs, unexpected failures, or corruptions.

  • CUDA IPC between CUDA contexts which are created by processes running as MPS clients and CUDA contexts which are created by processes not running as MPS clients is supported under Volta MPS. CUDA IPC is not supported on Tegra platforms.

  • Launching cooperative group kernel with MPS is not supported on Tegra platforms.

Memory Protection and Error Containment#

MPS is only recommended for running cooperative processes effectively acting as a single application, such as multiple ranks of the same MPI job, such that the severity of the following memory protection and error containment limitations is acceptable.

Memory Protection#

Volta MPS client processes have fully isolated GPU address spaces.

Pre-Volta MPS client processes allocate memory from different partitions of the same GPU virtual address space. As a result:

  • An out-of-range write in a CUDA Kernel can modify the CUDA-accessible memory state of another process and will not trigger an error.

  • An out-of-range read in a CUDA Kernel can access CUDA-accessible memory modified by another process, and will not trigger an error, leading to undefined behavior.

This pre-Volta MPS behavior is constrained to memory accesses from pointers within CUDA kernels. CUDA APIs restrict MPS clients from accessing any resources outside of that MPS client’s memory partition. For example, it is not possible to overwrite another MPS client’s memory using the cudaMemcpy() API.

Error Containment#

Volta MPS supports limited error containment:

  • A fatal GPU fault generated by a Volta MPS client process will be contained within the subset of GPUs that are shared, through common clients, with the GPU on which the fault occurred.

  • A fatal GPU fault generated by a Volta MPS client process will be reported to all the clients running on the subset of GPUs in which the fatal fault is contained, without indicating which client generated the error. Note that it is the responsibility of the affected clients to exit after being informed of the fatal GPU fault.

  • Clients running on other GPUs remain unaffected by the fatal fault and will run as normal until completion.

  • Once a fatal fault is observed, the MPS server will wait for all the clients associated with the affected GPUs to exit, prohibiting new clients from connecting to those GPUs. The status of the MPS server changes from ACTIVE to FAULT. When all the existing clients associated with the affected GPUs have exited, the MPS server will recreate the GPU contexts on the affected GPUs and resume processing client requests to those GPUs. The MPS server status changes back to ACTIVE, indicating that it is able to process new clients.

For example, suppose your system has devices 0, 1, and 2, and four clients (client A, client B, client C, and client D) are connected to the MPS server: client A runs on device 0, client B runs on devices 0 and 1, client C runs on device 1, and client D runs on device 2. If client A triggers a fatal GPU fault:

  • Since device 0 and device 1 share a common client, client B, the fatal GPU fault is contained within devices 0 and 1.

  • The fatal GPU fault will be reported to all the clients running on device 0 and 1, that is, client A, client B, and client C.

  • Client D, running on device 2, remains unaffected by the fatal fault and continues to run as normal.

  • The MPS server will wait for client A, client B, and client C to exit; any new client requests will be rejected with error CUDA_ERROR_MPS_SERVER_NOT_READY while the server status is FAULT. After client A, client B, and client C have exited, the server recreates the GPU contexts on device 0 and device 1 and then resumes accepting client requests on all devices. The server status becomes ACTIVE again.

Information about the fatal GPU fault containment will be logged, including:

  • If the fatal GPU fault is a fatal memory fault, the PID of the client which triggered the fatal GPU memory fault.

  • The device IDs of the devices which are affected by this fatal GPU fault.

  • The PIDs of the clients which are affected by this fatal GPU fault. The status of each affected client becomes INACTIVE and the status of the MPS server becomes FAULT.

  • The messages indicating the successful recreation of the affected devices after all the affected clients have exited.

Pre-Volta MPS client processes share on-GPU scheduling and error reporting resources. As a result:

  • A GPU fault generated by any client will be reported to all clients, without indicating which client generated the error.

  • A fatal GPU fault triggered by one client will terminate the MPS server and the GPU activity of all clients.

CUDA API errors generated on the CPU in the CUDA Runtime or CUDA Driver are delivered only to the calling client.
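
The sketch below illustrates the client-side pattern implied by the containment rules above: check the result of synchronization (where a sticky error such as a fatal GPU fault surfaces) and exit promptly so the MPS server can recover once all affected clients are gone. The helper function is illustrative, not part of any MPS API.

// Minimal sketch: check for (possibly sticky) errors and exit so the MPS
// server can recover the affected GPUs once all affected clients are gone.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static void check(cudaError_t status, const char *what) {
    if (status != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(status));
        exit(EXIT_FAILURE);   // exit promptly; the server waits for affected clients to leave
    }
}

int main() {
    // ... launch work on the default stream ...
    check(cudaGetLastError(), "kernel launch");
    check(cudaDeviceSynchronize(), "device synchronize");
    return 0;
}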

MPS on Multi-GPU Systems#

The MPS server supports using multiple GPUs. On systems with more than one GPU, you can use CUDA_VISIBLE_DEVICES to enumerate the GPUs you would like to use. Refer to Environment Variables for more details.

On systems with a mix of Volta / pre-Volta GPUs, if the MPS server is set to enumerate any Volta GPU, it will discard all pre-Volta GPUs. In other words, the MPS server will either operate only on the Volta GPUs and expose Volta capabilities or operate only on pre-Volta GPUs.

Performance#

Client-Server Connection Limits#

The pre-Volta MPS server supports up to 16 client CUDA contexts per device concurrently. The Volta MPS server supports 60 client CUDA contexts per device, an increase from the 48-context per-device limit in CUDA 13.0 and prior. These contexts may be distributed over multiple processes. If the connection limit is exceeded, the CUDA application will fail to create a CUDA context and return an API error from cuCtxCreate() or the first CUDA Runtime API call that triggers context creation. Failed connection attempts will be logged by the MPS server.
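
A client can detect a failed connection attempt by checking the result of context creation; the minimal runtime-API sketch below forces context creation up front and reports whichever error the driver returns (the specific error code for an exceeded connection limit is not spelled out here).

// Minimal sketch (runtime API): force context creation early and report the
// error if it fails, for example because the connection limit has been reached.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaError_t status = cudaFree(0);   // triggers context creation on the current device
    if (status != cudaSuccess) {
        fprintf(stderr, "context creation failed: %s\n", cudaGetErrorString(status));
        return 1;   // the failed connection attempt is also logged by the MPS server
    }
    printf("MPS client context created successfully\n");
    return 0;
}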

Volta MPS Execution Resource Provisioning#

Volta MPS supports limited execution resource provisioning. The client contexts can be set to only use a portion of the available threads. The provisioning capability is commonly used to achieve two goals:

  • Reduce client memory footprint: Since each MPS client process has a fully isolated address space, each client context allocates independent context storage and scheduling resources. Those resources scale with the number of threads available to the client. By default, each MPS client can use all available threads. Because MPS is usually used with multiple processes running simultaneously, making all threads accessible to every client is often unnecessary, and allocating full context storage for each client is therefore wasteful. Reducing the number of available threads effectively reduces the context storage allocation size.

  • Improve QoS: The provisioning mechanism can be used as a classic QoS mechanism to limit available compute bandwidth. Reducing the portion of available threads will also concentrate the work submitted by a client to a set of SMs, reducing destructive interference with other clients’ submitted work.

Setting the limit does not reserve dedicated resources for any MPS client context. It simply limits how many resources can be used by a client context. Kernels launched from different MPS client contexts may execute on the same SM, depending on load balancing.

By default, each client is provisioned to have access to all available threads. This will allow the maximum degree of scheduling freedom, but at a cost of higher memory footprint due to wasted execution resource allocation. The memory usage of each client process can be queried through nvidia-smi.

The provisioning limit can be set via a few different mechanisms for different effects. These fall into two categories: active thread percentage and the programmatic interface. Partitioning via active thread percentage is further divided into two strategies: uniform partitioning and non-uniform partitioning.

The limit constrained by the uniform active thread percentage is configured for a client process when it starts and cannot be changed for the client process afterwards. The executed limit is reflected through the device attribute cudaDevAttrMultiProcessorCount, whose value remains unchanged for the lifetime of the client process.

  • The MPS control utility provides two sets of commands to set/query the limit of all future MPS clients. Refer to nvidia-cuda-mps-control for more details.

  • Alternatively, the limit for all future MPS clients can be set by setting the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for the MPS control process. Refer to MPS Control Daemon Level for more details.

  • The limit can be further constrained for new clients by solely setting the environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE for a client process. Refer to Client Process Level for more details.
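
Regardless of which of the mechanisms above sets the uniform limit, a client can observe the limit applied to it by reading the multiprocessor-count attribute, as in this minimal sketch.

// Minimal sketch: read the SM count visible to this MPS client, which reflects
// the active thread percentage configured through the mechanisms listed above.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int sm_count = 0;
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, 0);
    printf("SMs available to this MPS client: %d\n", sm_count);
    return 0;
}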

The limit constrained by the non-uniform active thread percentage is configured for every client CUDA context and can be changed throughout the client process. The executed limit is reflected through device attribute cudaDevAttrMultiProcessorCount whose value returns the portion of available threads that can be used by the client CUDA context current to the calling thread.

The limit constrained by the programmatic partitioning is configured for a client CUDA context created via cuCtxCreate_v3() with the execution affinity CUexecAffinityParam which specifies the number of SMs that the context is limited to use. The executed limit of the context can be queried through cuCtxGetExecAffinity(). Refer to Best Practice for SM Partitioning for more details.
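
A minimal driver-API sketch of programmatic partitioning, assuming an arbitrary request of 16 SMs: the context is created with an SM-count execution affinity and the granted limit is then read back with cuCtxGetExecAffinity().

// Minimal sketch (driver API): create a context limited to 16 SMs via
// execution affinity, then query the limit that was actually granted.
#include <cstdio>
#include <cuda.h>

int main() {
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    CUexecAffinityParam param;
    param.type = CU_EXEC_AFFINITY_TYPE_SM_COUNT;
    param.param.smCount.val = 16;                      // requested SM limit (example value)

    CUcontext ctx;
    if (cuCtxCreate_v3(&ctx, &param, 1, 0, dev) != CUDA_SUCCESS) {
        fprintf(stderr, "context creation with execution affinity failed\n");
        return 1;
    }

    CUexecAffinityParam granted;
    cuCtxGetExecAffinity(&granted, CU_EXEC_AFFINITY_TYPE_SM_COUNT);
    printf("granted SM count: %u\n", granted.param.smCount.val);

    cuCtxDestroy(ctx);
    return 0;
}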

A common provisioning strategy is to uniformly partition the available threads equally among the expected MPS client processes (i.e., set the active thread percentage to 100% / n for n expected MPS client processes). This strategy will allocate close to the minimum amount of execution resources, but it can restrict performance for clients that could occasionally make use of idle resources.

A more optimal strategy is to uniformly partition by half the number of expected clients (i.e., set the active thread percentage to 100% / 0.5n), giving the load balancer more freedom to overlap execution between clients when there are idle resources.

A near-optimal provisioning strategy is to non-uniformly partition the available threads based on the workload of each MPS client (e.g., set the active thread percentage to 30% for client 1 and 70% for client 2 if the ratio of the client 1 workload to the client 2 workload is 30:70). This strategy concentrates the work submitted by different clients onto disjoint sets of SMs and effectively minimizes interference between work submissions by different clients.

The most optimal provisioning strategy is to precisely limit the number of SMs used by each MPS client based on the known execution resource requirements of each client (e.g., 24 SMs for client 1 and 60 SMs for client 2 on a device with 84 SMs). This strategy provides finer-grained and more flexible control over the set of SMs the work will run on than the active thread percentage.

If the active thread percentage is used for partitioning, the limit will be internally rounded down to the nearest hardware supported thread count limit. If the programmatic interface is used for partitioning, the limit will be internally rounded up to the nearest hardware supported SM count limit.

Threads and Linux Scheduling#

On pre-Volta GPUs, launching more MPS clients than there are available logical cores on your machine will incur increased launch latency and will generally slow down client-server communication due to how the threads get scheduled by the Linux CFS (Completely Fair Scheduler). For setups where multiple GPUs are used with an MPS control daemon and server started per GPU, we recommend pinning each MPS server to a distinct core. This can be accomplished by using the utility taskset, which allows binding a running program to multiple cores or launching a new one on them. To accomplish this with MPS, launch the control daemon bound to a specific core, for example, taskset -c 0 nvidia-cuda-mps-control -d. The process affinity will be inherited by the MPS server when it starts up.

Volta MPS Device Memory Limit#

On Volta MPS, users can enforce a limit on the amount of device memory that clients are allowed to allocate. This mechanism provides a facility to fractionalize GPU memory across the MPS clients running on a specific GPU, which enables scheduling and deployment systems to make decisions based on the memory usage of the clients. If a client attempts to allocate memory beyond the preset limit, the CUDA memory allocation calls will return an out-of-memory error. The specified memory limit also accounts for CUDA internal device allocations, which helps users make scheduling decisions for optimal GPU utilization. A hierarchy of control mechanisms is provided to limit the pinned device memory of MPS clients: the default limit setting enforces a device memory limit on all MPS clients of all future MPS servers spawned, while the per-server limit setting allows finer-grained control, whereby users can set the memory limit selectively using the server PID and thus for all clients of that server. Additionally, MPS clients can further constrain the memory limit inherited from the server by using the CUDA_MPS_PINNED_DEVICE_MEM_LIMIT environment variable.
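
From the client's point of view, exceeding the preset limit simply looks like an out-of-memory condition; the sketch below (with an arbitrary 1 GiB request) shows the kind of handling a client should be prepared for.

// Minimal sketch: an allocation that would exceed this client's device memory
// limit fails with an out-of-memory error instead of assuming full capacity.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    void *buf = nullptr;
    size_t bytes = 1ull << 30;                 // 1 GiB request (example size)
    cudaError_t status = cudaMalloc(&buf, bytes);
    if (status == cudaErrorMemoryAllocation) {
        fprintf(stderr, "allocation exceeds this client's device memory limit\n");
        return 1;
    }
    cudaFree(buf);
    return 0;
}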

Interaction with Tools#

Debugging and CUDA-GDB#

On Volta MPS, GPU coredumps can be generated and debugged using CUDA-GDB. Refer to the CUDA-GDB documentation (https://docs.nvidia.com/cuda/cuda-gdb/index.html) for usage instructions.

Under certain conditions applications invoked from within CUDA-GDB (or any CUDA-compatible debugger, such as Allinea DDT) may be automatically run without using MPS, even when MPS automatic provisioning is active. To take advantage of this automatic fallback, no other MPS client applications may be running at the time. This enables debugging of CUDA applications without modifying the MPS configuration for the system.

Here’s how it works:

  1. CUDA-GDB attempts to run an application and recognizes that it will become an MPS client.

  2. The application running under CUDA-GDB blocks in cuInit() and waits for all of the active MPS client processes to exit, if any are running.

  3. Once all client processes have terminated, the MPS server will allow CUDA-GDB and the application being debugged to continue.

  4. Any new client processes attempting to connect to the MPS daemon will be provisioned a server normally.

memcheck#

The memcheck tool is supported on MPS. Refer to the memcheck documentation for usage instructions.

Profiling#

CUDA profiling tools (such as nvprof and the NVIDIA Visual Profiler) and CUPTI-based profilers are supported under MPS.

Refer to Migrating to Nsight Tools from Visual Profiler and nvprof for more details.

Client Early Termination#

Terminating an MPS client via CTRL-C or signals is not supported and will lead to undefined behavior. The user must guarantee that the MPS client is idle, by calling either cudaDeviceSynchronize or cudaStreamSynchronize on all streams, before the MPS client is terminated. Early termination of an MPS client without synchronizing all outstanding GPU work may leave the MPS server in an undefined state and result in unexpected failures, corruptions, or hangs; in that case, the affected MPS server and all its clients must be restarted.
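
A minimal sketch of the clean-exit pattern described above, assuming the application keeps its streams in an array: every stream is drained and then the device is synchronized before the process exits.

// Minimal sketch: drain all streams so no GPU work is outstanding when the
// MPS client process terminates.
#include <cuda_runtime.h>

void shutdown_client(cudaStream_t *streams, int num_streams) {
    for (int i = 0; i < num_streams; ++i) {
        cudaStreamSynchronize(streams[i]);   // wait for all work queued on this stream
        cudaStreamDestroy(streams[i]);
    }
    cudaDeviceSynchronize();                 // ensure nothing is left in flight
}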

On Volta MPS, users can instruct the MPS server to terminate the CUDA contexts of an MPS client process, regardless of whether those contexts are idle, by using the control command terminate_client <server PID> <client PID>. This mechanism enables the user to terminate the CUDA contexts of a given MPS client process, even when they are non-idle, without affecting the MPS server or its other MPS clients.

The control command terminate_client sends a request to the MPS server, which terminates the CUDA contexts of the target MPS client process on behalf of the user and returns after the MPS server has completed the request. The return value is CUDA_SUCCESS if the CUDA contexts of the target MPS client process have been successfully terminated; otherwise, a CUDA error describing the failure state is returned. When the MPS server starts handling the request, each MPS client context running in the target MPS client process becomes INACTIVE; the status changes will be logged by the MPS server.

Upon successful completion of the client termination, the target MPS client process will observe a sticky error CUDA_ERROR_MPS_CLIENT_TERMINATED, and it becomes safe to kill the target MPS client process with signals such as SIGKILL without affecting the rest of the MPS server and its MPS clients. Note that the MPS server is not responsible for killing the target MPS client process after the sticky error is set, because the target MPS client process might want to:

  • Perform clean-up of its GPU or CPU state. This may include a device reset.

  • Continue remaining CPU work.

If the user wants to terminate the GPU work of a MPS client process that is running inside a PID namespace different from the MPS control’s PID namespace, such as an MPS client process inside a container, the user must use the PID of the target MPS client process translated into the MPS control’s PID namespace. For example, the PID of an MPS client process inside the container is 6, and the PID of this MPS client process in the host PID namespace is 1024; the user must use 1024 to terminate the GPU work of the target MPS client process.

The common workflow for terminating the client application nbody:

Use the control command ps to get the status of the current active MPS clients.

$ echo "ps" | nvidia-cuda-mps-control

PID ID SERVER DEVICE NAMESPACE COMMAND

9741 0 6472 GPU-cb1213a3-d6a4-be7f 4026531836 ./nbody

9743 0 6472 GPU-cb1213a3-d6a4-be7f 4026531836 ./matrixMul

Terminate using the PID of nbody in the host PID namespace as reported by ps:

$ echo "terminate_client 6472 9741" | nvidia-cuda-mps-control

#wait until terminate_client to return

#upon successful termination 0 is returned

0

Now it is safe to kill nbody:

$ kill -9 9741

MPS client termination is not supported on Tegra platforms.

Client Priority Level Control#

Users are normally only able to control the GPU priority level of their kernels by using the cudaStreamCreateWithPriority() API while the program is being written. On Volta MPS, the user can use the control command set_default_client_priority <Priority Level> to map the stream priorities of a given client to a different range of internal CUDA priorities. Changes to this setting do not take effect until the next client connection to the server is opened. The user can also set the CUDA_MPS_CLIENT_PRIORITY environment variable before starting the control daemon or any given client process to set this value.

In this release, the allowed priority level values are 0 (normal) and 1 (below normal). Lower numbers map to higher priorities to match the behavior of the Linux kernel scheduler.

Note

CUDA priority levels are not guarantees of execution order; they are only a performance hint to the CUDA Driver.

For example:

  • Process A is launched at Normal priority and only uses the default CUDA Stream, which has the lowest priority of 0.

  • Process B is launched at Below Normal priority and uses streams with custom Stream priority values, such as -3.

Without this feature, the streams from Process B would be executed first by the CUDA driver. However, with the Client Priority Level feature, the streams from Process A will take precedence.
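
For reference, the minimal sketch below shows how a client creates prioritized streams with cudaStreamCreateWithPriority(); the client priority level configured through MPS then shifts how these per-stream priorities are mapped to internal CUDA priorities.

// Minimal sketch: create high- and low-priority streams within one client.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);   // lower number = higher priority
    printf("stream priority range: least=%d greatest=%d\n", least, greatest);

    cudaStream_t high, low;
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);

    // ... launch latency-sensitive work on `high`, background work on `low` ...

    cudaStreamDestroy(high);
    cudaStreamDestroy(low);
    return 0;
}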

Memory Locality Optimized Partitions#

On some Blackwell and newer GPUs, users are able to create memory locality optimized partitions (MLOPart) by using the control command start_server -pid <PID> -mlopart.

When a server is created with this option, GPUs that are capable of creating MLOPart devices will do so. MLOPart devices are CUDA devices that are derived from another GPU, and have been optimized for lower-latency and/or higher-bandwidth by ensuring that compute and memory resources are physically colocated.

When using MLOPart devices, users will note that there are multiple MLOPart devices for each underlying device, with the MLOPart devices having fewer compute resources and available memory, and being optimized for memory locality.

MLOPart Device Enumeration#

Because a client created by an MLOPart enabled server may have more CUDA devices than it would otherwise, there is ambiguity in the device enumeration. This has consequences for other MPS control commands, as well as CUDA_VISIBLE_DEVICES. To resolve this ambiguity, the device_query command can be used to determine the device enumeration after taking MLOPart into account.

$ echo device_query | nvidia-cuda-mps-control
Default
Device Ordinal  PCI IDs        UUID               Name                   Attributes
0               0000:20.00.00  GPU-1d925fce-3b7d  NVIDIA B300
1               0000:48.00.00  GPU-468af2de-d4f0  NVIDIA B300
2               0000:57.00.00  GPU-a74f1c76-1ca2  NVIDIA B300
3               0000:66.00.00  GPU-4fe2ee0e-71e4  NVIDIA B300
4               0000:a2.00.00  GPU-ced3eaa1-26b0  NVIDIA B300
5               0000:c8.00.00  GPU-118a6ac9-86d2  NVIDIA B300
6               0000:d6.00.00  GPU-1ec93f24-564f  NVIDIA B300
Server 908527
Device Ordinal  PCI IDs        UUID               Name                   Attributes
N/A             0000:20.00.00  GPU-1d925fce-3b7d  NVIDIA B300            M
0               0000:20.00.00  GPU-3b861d38-5e0c  NVIDIA B300 MLOPart 0  MD
1               0000:20.00.00  GPU-d74bb67a-2db6  NVIDIA B300 MLOPart 1  MD
N/A             0000:48.00.00  GPU-468af2de-d4f0  NVIDIA B300            M
2               0000:48.00.00  GPU-5b669fd8-1170  NVIDIA B300 MLOPart 0  MD
3               0000:48.00.00  GPU-1e489f1a-6f85  NVIDIA B300 MLOPart 1  MD
N/A             0000:57.00.00  GPU-a74f1c76-1ca2  NVIDIA B300            M
4               0000:57.00.00  GPU-0384bea6-6424  NVIDIA B300 MLOPart 0  MD
5               0000:57.00.00  GPU-be428d68-2cea  NVIDIA B300 MLOPart 1  MD
N/A             0000:66.00.00  GPU-4fe2ee0e-71e4  NVIDIA B300            M
6               0000:66.00.00  GPU-2b3a2dae-3c97  NVIDIA B300 MLOPart 0  MD
7               0000:66.00.00  GPU-405be3d0-f1b5  NVIDIA B300 MLOPart 1  MD
N/A             0000:a2.00.00  GPU-ced3eaa1-26b0  NVIDIA B300            M
8               0000:a2.00.00  GPU-2b3fa6b7-1e62  NVIDIA B300 MLOPart 0  MD
9               0000:a2.00.00  GPU-ddf3743f-bed6  NVIDIA B300 MLOPart 1  MD
N/A             0000:c8.00.00  GPU-118a6ac9-86d2  NVIDIA B300            M
10              0000:c8.00.00  GPU-2dcf8afb-6f79  NVIDIA B300 MLOPart 0  MD
11              0000:c8.00.00  GPU-1458f72b-8fdb  NVIDIA B300 MLOPart 1  MD
N/A             0000:d6.00.00  GPU-1ec93f24-564f  NVIDIA B300            M
12              0000:d6.00.00  GPU-9cfe5596-8fec  NVIDIA B300 MLOPart 0  MD
13              0000:d6.00.00  GPU-6bef62f3-40d9  NVIDIA B300 MLOPart 1  MD

The device_query command displays important information regarding MLOPart, including the device ordinals after taking MLOPart into account, the PCI IDs (which can be used to determine which devices belong to the same GPU), the MLOPart UUIDs, device names, and attributes. In the above example, the attribute MD indicates an MLOPart device, while M indicates the underlying GPU.

The ps command is also capable of showing whether a process is using an MLOPart device.

$ while1 -a &

[1] 52845

$ echo ps | nvidia-cuda-mps-control

PID       ID    SERVER    DEVICE             NAMESPACE      COMMAND     ATTRIBUTES

52845     1     52837     GPU-b13add01-c28c  4026531836     while1      MD

Again, the attribute MD indicates that this is an MLOPart device.

P2P with MLOPart#

MLOPart devices that are derived from the same underlying GPU have the special property of sharing GPU memory. They are capable of directly addressing each other’s memory without enabling peer-to-peer access.

When interacting with memory on another device (MLOPart or otherwise), the access modifier set by cuMemSetAccess is shared by all MLOPart devices belonging to the same underlying GPU, and the effective access is the least restrictive access type that has been set. For example, if a pointer should be read-only for an MLOPart device, that access type should be set for all MLOPart devices belonging to the same underlying GPU; if it is set for only a single MLOPart device, it will have no effect, since the other MLOPart devices still have the default read/write permission. Similarly, if there are two MLOPart devices and one of them has read-only permission to a pointer while the other has no access, both will have read-only permission.
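
A minimal driver-API sketch of this rule, assuming ptr/size describe a range mapped with the CUDA virtual memory management APIs (cuMemCreate/cuMemMap) and that device ordinals 0 and 1 are the two MLOPart devices derived from the same underlying GPU: the read-only flag is applied to both ordinals in one cuMemSetAccess() call so that neither retains the less restrictive read/write access.

// Minimal sketch (driver API): set read-only access for both MLOPart devices
// of one underlying GPU on a VMM-mapped range.
#include <cuda.h>

CUresult make_read_only_for_mlopart_pair(CUdeviceptr ptr, size_t size) {
    CUmemAccessDesc desc[2];
    for (int i = 0; i < 2; ++i) {
        desc[i].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        desc[i].location.id   = i;                        // assumed MLOPart device ordinals 0 and 1
        desc[i].flags         = CU_MEM_ACCESS_FLAGS_PROT_READ;
    }
    // Apply the access change to both MLOPart devices in one call.
    return cuMemSetAccess(ptr, size, desc, 2);
}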

Limitations of MLOPart#

There are some limitations when using MLOPart devices:

  • nvidia-smi is not aware of MLOPart devices. For example, using nvidia-smi -L will not show MLOPart devices, their UUIDs, or ordinals.

  • Host allocations or allocations created through cuMemAllocManaged will see no benefit from using MLOPart devices.

  • MLOPart does not support NVLink Multicast.

  • MLOPart devices have less total memory than their underlying device. Each MLOPart device tracks its free memory (queried through cuMemGetInfo) independently; however, allocating memory on one MLOPart device may affect the remaining free memory on another MLOPart device. Additionally, using cuMemAllocManaged may result in less free memory on any or all MLOPart devices.

Additionally, there are some configurations in which MLOPart is not supported:

  • MLOPart is only supported on x86-based systems. Support for ARM-based systems will be available in a future release.

  • MLOPart cannot be used in conjunction with MPS static partitioning. The -mlopart option of start_server will be ignored if static partitioning is enabled.

  • MLOPart cannot be used in conjunction with NVIDIA vGPU.

  • MIG devices do not support MLOPart. Using MIG on one GPU does not prevent using MLOPart on another GPU.

Static SM Partitioning#

On NVIDIA Ampere architecture and newer GPUs, users can enable static SM partitioning mode to create SM partitions for MPS clients. This feature provides deterministic resource allocation and spatial isolation between clients by allowing users to explicitly control which SMs each client can access. Static partitioning mode is enabled at MPS control daemon launch time using the -S or --static-partitioning command line parameter.

nvidia-cuda-mps-control -d -S

Static partitioning mode uses “chunks” as the partitioning unit. The size of a chunk depends on the GPU architecture:

  • dGPU: 4 SMs on pre-Hopper GPUs, 8 SMs on Hopper and newer GPUs (chunks on Hopper and newer GPUs are cluster capable)

  • iGPU: 2 SMs on all architectures

Creating and Managing Partitions#

Users create SM partitions by specifying the number of chunks using the sm_partition add command. Either the full device UUID or a unique partial UUID of the device may be used. Upon successful creation, a unique partition ID is returned that can be used to assign clients to that partition:

echo "sm_partition add GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65 7" | nvidia-cuda-mps-control
GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Attempting to oversubscribe will result in an error message indicating the requested and available SM counts.

echo "sm_partition add GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65 20" | nvidia-cuda-mps-control
Failed to fulfill the requested SM partition of 20 chunks, error CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION

To assign this partition to a client application, set the CUDA_MPS_SM_PARTITION environment variable:

export CUDA_MPS_SM_PARTITION=GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
./nbody

One or more MPS clients can run on the same partition. However, assigning more than one partition from the same device to the same client is prohibited and will fail with CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION. When static partitioning mode is enabled, MPS client applications must set this environment variable before initializing the MPS client. Attempting to run a client without setting the SM partition will fail on client initialization with CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION. Similarly, if the partition ID is invalid, client initialization will also fail with CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION.

Users can view the current partitioning configuration using the lspart command.

echo "lspart" | nvidia-cuda-mps-control
GPU           Partition                             free    used    free  used  clients
                                                    chunk   chunk   SM    SM
GPU-74d43ed3  -                                     0       8       74    56    -
GPU-74d43ed3  Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA  -       7        -    56    Yes

Users remove a partition by specifying the device UUID and the partition ID using the sm_partition rm command.

echo "sm_partition rm GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" | nvidia-cuda-mps-control
Partition GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA removed

Note that the command will fail if the partition is still in use.

echo "sm_partition rm GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" | nvidia-cuda-mps-control
Partition GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA in use. Terminate all clients before removing.

The rm command supports both “rm <device UUID>/<partition ID>” and “rm <device UUID> <partition ID>” formats.

echo "sm_partition rm GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65 Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA" | nvidia-cuda-mps-control
Partition GPU-74d43ed3-cdf7-e667-3644-bf5b4f46ed65/Dx4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA removed

Partition ID Determinism#

Partition IDs are determined by the partition size, the command sequence order, and the distribution of reserved SMs at the time of partitioning. This means that repeating the same command sequence from the same initial state on systems with identical GPU SKUs will reliably produce the same partition IDs, enabling reproducible configurations across different systems.

Resource Distribution#

When static partitioning mode is enabled, after all chunks are allocated, any remaining SMs are automatically distributed to the least recently created partition when the first client connects. This ensures optimal utilization of available SM resources.

API Behavior with Static Partitioning#

When a client is running on a static SM partition, CUDA API calls reflect the resources available to that partition:

  • cuDeviceGetAttribute with CU_DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT returns the total SM count of the partition.

  • cuOccupancyMaxActiveClusters returns the cluster occupancy of the partition on Hopper and newer GPUs.

Limitations of Static SM Partitioning#

Static SM partitioning has the following limitations and requirements:

  • Static partitioning mode is only supported on NVIDIA Ampere architecture and newer GPUs.

  • When static partitioning mode is enabled, all MPS clients must set the CUDA_MPS_SM_PARTITION environment variable before creating a CUDA context. Failure to do so will result in context creation failing with CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION.

  • Assigning multiple partitions from the same device to a single client is prohibited and will fail with CUDA_ERROR_INVALID_RESOURCE_CONFIGURATION.

  • Partitions cannot be removed while clients are actively using them.

  • Attempting to create partitions that exceed available SM resources will fail with an error message indicating the requested and available SM counts.

  • Static SM partitioning cannot be used in conjunction with MLOPart. The -mlopart option of start_server will be ignored if static partitioning is enabled.

  • When static partitioning mode is enabled, dynamic resource provisioning will be ignored; this includes, but is not limited to, the MPS active thread percentage and per-context device multiprocessor partitioning.