6. Limitations

The following are known issues with the current release.
  • A security vulnerability issue required profiling tools to disable their features for non-root and non-admin users. As a result, CUPTI cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and its solutions can be found on this web page.
    Note: Starting with CUDA 10.2, CUPTI allows tracing features for non-root and non-admin users on desktop platforms. However, events and metrics profiling is still restricted for non-root and non-admin users.
  • Profiling results might be inconsistent when auto boost is enabled. The profiler attempts to disable auto boost by default, but it might fail to do so under some conditions; in that case profiling will continue and the results will be inconsistent. The API cuptiGetAutoBoostState() can be used to query the auto boost state of the device; it returns the error CUPTI_ERROR_NOT_SUPPORTED on devices that do not support auto boost. Note that auto boost is supported only on certain Tesla devices with compute capability 3.0 and higher.
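    A minimal sketch of the auto boost query is shown below. It assumes a CUDA driver and CUPTI installation are available, uses device 0, and abbreviates error handling; the CUpti_ActivityAutoBoostState fields shown (enabled, pid) are as documented in the CUPTI headers.

    ```cpp
    #include <cstdio>
    #include <cuda.h>
    #include <cupti.h>

    int main() {
        // Create a context on device 0 so there is something to query.
        cuInit(0);
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);

        // Query the auto boost state for the context's device.
        CUpti_ActivityAutoBoostState state;
        CUptiResult res = cuptiGetAutoBoostState(ctx, &state);
        if (res == CUPTI_ERROR_NOT_SUPPORTED) {
            std::printf("auto boost is not supported on this device\n");
        } else if (res == CUPTI_SUCCESS) {
            std::printf("auto boost enabled: %u (set by pid %u)\n",
                        state.enabled, state.pid);
        }
        cuCtxDestroy(ctx);
        return 0;
    }
    ```

    If auto boost is enabled here, disabling it (for example with nvidia-smi) before profiling helps keep results consistent.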
  • CUPTI does not populate deprecated activity structures; instead, the newer version of each activity structure is filled with the information.
  • Because of the low resolution of the timer on Windows, the start and end timestamps can be the same for activities with a short execution duration on Windows.
  • An application that calls CUPTI APIs cannot be used with NVIDIA tools such as nvprof, NVIDIA Visual Profiler, Nsight Compute, Nsight Systems, NVIDIA Nsight Visual Studio Edition, cuda-gdb, and cuda-memcheck.
  • CUDA runtime and driver API callbacks for kernel launch are not issued when the stream is in capture mode.
  • PCIe and NVLink records are not captured when CUPTI is initialized lazily after CUDA initialization.
  • CUPTI fails to profile an OpenACC application when the OpenACC library linked with the application is missing definitions of one or more OpenACC API routines. This is indicated by the error code CUPTI_ERROR_OPENACC_UNDEFINED_ROUTINE.
  • OpenACC profiling might fail when the OpenACC library is linked statically into the user application. This happens due to missing definitions of the OpenACC API routines needed for OpenACC profiling, since the compiler may discard definitions of functions not used in the application. This issue can be mitigated by linking the OpenACC library dynamically.
  • Unified memory profiling is not supported on the ARM architecture.
  • Profiling a C++ application that overloads the new operator at global scope and calls any CUDA API, such as cudaMalloc() or cudaMallocManaged(), inside the overloaded new operator will result in a hang.
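    The sketch below shows the shape of the problematic pattern. To keep it runnable without a GPU, device_alloc_stand_in is a hypothetical placeholder (not a real CUDA API) backed by std::malloc; substituting a real cudaMalloc()/cudaMallocManaged() call at that point reproduces the code that hangs under CUPTI profiling.

    ```cpp
    #include <cstdio>
    #include <cstdlib>
    #include <new>

    // Hypothetical stand-in for cudaMalloc(). In a real application this
    // would be a CUDA API call intercepted by CUPTI; std::malloc is used
    // here only so the sketch runs without a GPU.
    static void* device_alloc_stand_in(std::size_t n) {
        return std::malloc(n);
    }

    // Global overload of operator new that calls an allocation API.
    // Replacing device_alloc_stand_in with cudaMalloc() or
    // cudaMallocManaged() is the pattern that hangs when profiled.
    void* operator new(std::size_t n) {
        void* p = device_alloc_stand_in(n);
        if (!p) throw std::bad_alloc();
        return p;
    }

    void operator delete(void* p) noexcept { std::free(p); }
    void operator delete(void* p, std::size_t) noexcept { std::free(p); }

    int main() {
        int* x = new int(42);   // routed through the overloaded operator
        std::printf("%d\n", *x);
        delete x;
        return 0;
    }
    ```

    The practical workaround is to keep CUDA allocation calls out of the replaced global operator new when the application is to be profiled.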
  • Devices with compute capability 6.0 and higher introduce a new feature, compute preemption, to give a fair chance to all compute contexts while running long tasks. With the compute preemption feature:
    • If multiple contexts are running in parallel, it is possible that long kernels will get preempted.
    • Some kernels may get preempted occasionally due to timeslice expiry for the context.
    If a kernel has been preempted, the time the kernel spends preempted is still counted toward the kernel duration.
    Compute preemption can affect events and metrics collection. The following are known issues with the current release:
    • Events and metrics collection for an MPS client can result in higher counts than expected on devices with compute capability 7.0 and higher, since an MPS client may get preempted due to the termination of another MPS client.
    • The events warps_launched and sm_cta_launched and the metric inst_per_warp might provide higher counts than expected on devices with compute capability 6.0 and higher. The metric unique_warps_launched can be used in place of warps_launched to get the correct count of actual warps launched, as it is not affected by compute preemption.

    To avoid compute preemption affecting profiler results, try to isolate the context being profiled:

    • Run the application on a secondary GPU where the display is not connected.
    • On Linux, if the application is running on the primary GPU where the display is connected, unload the display driver.
    • Run only one process that uses the GPU at a time.
  • Devices with compute capability 6.0 and higher support demand paging. When a kernel is scheduled for the first time, all pages allocated using cudaMallocManaged that are required for the kernel's execution are fetched into global memory as GPU page faults are generated. The profiler requires multiple passes to collect all the metrics needed for kernel analysis, and the kernel state must be saved and restored for each kernel replay pass. For devices with compute capability 6.0 and higher on platforms supporting Unified Memory, GPU page faults are generated and all pages are fetched into global memory in the first kernel iteration; from the second iteration onwards, GPU page faults do not occur. This significantly affects memory-related events and timing: the time taken from the trace includes the time required to fetch the pages, but most of the metrics profiled over multiple iterations do not include the time/cycles required to fetch the pages. This causes inconsistency in the profiler results.
  • When profiling an application that uses CUDA Dynamic Parallelism (CDP), there are several limitations to the profiling tools.
    • Starting with CUDA Toolkit 9.0, CUPTI doesn't support CUDA Dynamic Parallelism (CDP) kernel launch tracing for devices with compute capability 7.0 and higher.
    • CUPTI doesn't report CUDA API calls for device-launched kernels.
    • CUPTI doesn't report detailed event, metric, and source-level results for device-launched kernels. Event, metric, and source-level results collected for CPU-launched kernels will include event, metric, and source-level results for the entire call-tree of kernels launched from within that kernel.
  • Compilation of the samples autorange_profiling and userrange_profiling requires a host compiler that supports C++11 features. For some g++ versions, the flag -std=c++11 is required to turn on C++11 features.

Events and Metrics Profiling

The following are known issues related to Events and Metrics profiling:
  • The CUPTI event APIs from the header cupti_events.h and the metric APIs from the header cupti_metrics.h are not supported for devices with compute capability 7.5 and higher. These are replaced by the Profiling API and the Perfworks metrics API. Refer to the section Migration to the new Profiling API.
  • While collecting events in continuous mode, event reporting may be delayed, i.e., event values may be returned by a later call to the readEvent(s) API, and the event values for the last readEvent(s) call may be lost.
  • When profiling events, it is possible that the domain instance being profiled reports an event value of 0 due to the absence of workload on that instance, since CUPTI profiles only one instance of the domain by default. To profile all instances of the domain, set the event group attribute CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES through the API cuptiEventGroupSetAttribute().
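    A minimal sketch of setting this attribute is shown below. It assumes a CUpti_EventGroup has already been created (e.g., via cuptiEventGroupCreate()) and that CUPTI headers are available; the helper name enableAllDomainInstances is illustrative.

    ```cpp
    #include <cupti.h>

    // Sketch: request profiling of all instances of the event domain for
    // an existing event group. Returns the CUPTI status so the caller can
    // check for CUPTI_SUCCESS.
    CUptiResult enableAllDomainInstances(CUpti_EventGroup eventGroup) {
        uint32_t profileAll = 1;  // non-zero enables all-instance profiling
        return cuptiEventGroupSetAttribute(
            eventGroup,
            CUPTI_EVENT_GROUP_ATTR_PROFILE_ALL_DOMAIN_INSTANCES,
            sizeof(profileAll), &profileAll);
    }
    ```

    The attribute must be set before the event group is enabled for collection.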
  • Profiling results might be incorrect for CUDA applications compiled with an nvcc version older than 9.0 for devices with compute capability 6.0 and 6.1. The profiling session will continue, and CUPTI will flag this using the error code CUPTI_ERROR_CUDA_COMPILER_NOT_COMPATIBLE. It is advised to recompile the application code with nvcc version 9.0 or later. Ignore this warning if the code is already compiled with the recommended nvcc version.
  • Profiling is not supported for multidevice cooperative kernels, that is, kernels launched by using the API functions cudaLaunchCooperativeKernelMultiDevice or cuLaunchCooperativeKernelMultiDevice.
  • Profiling is not supported for CUDA kernel nodes launched by a CUDA Graph.
  • PC Sampling is not supported on Tegra platforms.
  • Events and metrics profiling is not supported on virtual GPUs (vGPU).
  • Profiling a kernel while other contexts are active on the same device (e.g. X server, or secondary CUDA or graphics application) can result in varying metric values for L2/FB (Device Memory) related metrics. Specifically, L2/FB traffic from non-profiled contexts cannot be excluded from the metric results. To completely avoid this issue, profile the application on a GPU without secondary contexts accessing the same device (e.g. no X server on Linux).
  • In the current release, profiling a kernel while any other GPU work is executing on the same MIG compute instance can result in varying metric values for all units. Care should be taken to serialize, or otherwise prevent concurrent CUDA launches within the target application to ensure those kernels do not influence each other. Be aware that GPU work issued through other APIs in the target process or workloads created by non-target processes running simultaneously in the same MIG compute instance will influence the collected metrics. Note that it is acceptable to run CUDA processes in other MIG compute instances as they will not influence the profiled MIG compute instance.
  • Events or metrics collection may significantly change the overall performance characteristics of the application. Refer to the section CUPTI Overhead for more details.
  • For some metrics, the required events can only be collected for a single CUDA context. For an application that uses multiple CUDA contexts, these metrics will only be collected for one of the contexts. The metrics that can be collected only for a single CUDA context are indicated in the metric reference tables.
  • Some metric values are calculated assuming a kernel is large enough to occupy all device multiprocessors with approximately the same amount of work. If a kernel launch does not have this characteristic, then those metric values may not be accurate.
  • Some events and metrics are not available on all devices. For a list of metrics, refer to the metric reference tables.
  • Profiler events and metrics do not work correctly on OS X 10.8.5 and OS X 10.9.3. Use OS X 10.9.2, or OS X 10.9.4 or later.
  • Enabling certain events can cause GPU kernels to run longer than the driver's watchdog time-out limit. In these cases the driver will terminate the GPU kernel, resulting in an application error, and profiling data will not be available. Please disable the driver watchdog time-out before profiling such long-running CUDA kernels.
    • On Linux, setting the X Config option Interactive to false is recommended.
    • For Windows, detailed information about TDR (Timeout Detection and Recovery) and how to disable it is available at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/timeout-detection-and-recovery
  • CUPTI can give an out-of-memory error during event and metric profiling; this can be due to a large number of instructions in the kernel.
  • For devices with compute capability 8.0, the NVLink topology information is available but metrics information is not available.