CUDA Samples

The reference guide for the CUDA Samples.

1. Release Notes

This section describes the release notes for the CUDA Samples only. For the release notes for the whole CUDA Toolkit, please see CUDA Toolkit Release Notes.

1.1. CUDA 11.6

  • All CUDA samples are now available only in the GitHub repository. They are no longer distributed with the CUDA Toolkit.
  • Added new folder structure for samples.
  • Added Visual Studio 2022 support to all the samples.

1.2. CUDA 11.5

  • Added 4_CUDA_Libraries/cuDLAHybridMode. Demonstrates usage of cuDLA in hybrid mode. (available only on GitHub repository)
  • Added 4_CUDA_Libraries/cuDLAStandaloneMode. Demonstrates usage of cuDLA in standalone mode. (available only on GitHub repository)
  • Added 4_CUDA_Libraries/cuDLAErrorReporting. Demonstrates DLA error detection via CUDA. (available only on GitHub repository)
  • Added 3_CUDA_Features/graphMemoryNodes. Demonstrates memory allocations and frees within CUDA graphs using Graph APIs and Stream Capture APIs. (available only on GitHub repository)
  • Added 3_CUDA_Features/graphMemoryFootprint. Demonstrates how graph memory nodes re-use virtual addresses and physical memory. (available only on GitHub repository)

1.3. CUDA 11.4 Update 1

  • Added support for VS Code on the Linux platform.

1.4. CUDA 11.4

  • Added 7_CUDALibraries/simpleCUBLAS_LU. Demonstrates batched matrix LU decomposition using cuBLAS API cublas<t>getrfBatched().
  • Updated 2_Graphics/simpleVulkan, 2_Graphics/simpleVulkanMMAP and 3_Imaging/vulkanImageCUDA. Demonstrates use of SPIR-V shaders.
  • Removed 7_CUDALibraries/boundSegmentsNPP.

1.5. CUDA 11.3

  • Added 0_Simple/streamOrderedAllocationIPC. Demonstrates IPC pools of stream ordered memory allocated using cudaMallocAsync and cudaMemPool family of APIs.
  • Updated 2_Graphics/simpleVulkan. Demonstrates use of timeline semaphore.
  • Updated 0_Simple/globalToShmemAsyncCopy with a partitioned cuda pipeline producer-consumer GEMM kernel.
  • Updated multiple samples to use pinned memory using cudaMallocHost().

1.6. CUDA 11.2

  • FreeImage is no longer distributed with the CUDA Samples. On Windows, see the Dependencies section for more details on how to set up FreeImage. On Linux, it is recommended to install FreeImage with your distribution's package manager.

1.7. CUDA 11.1

  • Added 2_Graphics/simpleVulkanMMAP. Demonstrates Vulkan-CUDA interop via the cuMemMap APIs, where a CUDA buffer is imported into Vulkan.
  • Added 7_CUDALibraries/watershedSegmentationNPP. Demonstrates how to use the NPP watershed segmentation function.
  • Added 7_CUDALibraries/batchedLabelMarkersAndLabelCompressionNPP. Demonstrates how to use the NPP label markers generation and label compression functions based on a Union Find (UF) algorithm including both single image and batched image versions.
  • Deprecated Visual Studio 2015 support for all Windows supported samples.
  • Dropped Visual Studio 2012, 2013 support from all the Windows supported samples.

1.8. CUDA 11.0

  • Added 0_Simple/globalToShmemAsyncCopy. Demonstrates asynchronous copy of data from global to shared memory using cuda pipeline. Also demonstrates arrive-wait barrier for synchronization.
  • Added 0_Simple/simpleAttributes. Demonstrates the stream attributes that affect L2 locality.
  • Added 0_Simple/dmmaTensorCoreGemm. Demonstrates double precision GEMM computation using the WMMA API for double precision employing the Tensor Cores. Also makes use of asynchronous copy from global to shared memory using cuda pipeline which leads to further performance gain.
  • Added 0_Simple/bf16TensorCoreGemm. Demonstrates __nv_bfloat16 (e8m7) GEMM computation using the WMMA API for __nv_bfloat16 employing the Tensor Cores. Also makes use of asynchronous copy from global to shared memory using cuda pipeline which leads to further performance gain.
  • Added 0_Simple/tf32TensorCoreGemm. Demonstrates tf32 (e8m10) GEMM computation using the WMMA API for tf32 employing the Tensor Cores. Also makes use of asynchronous copy from global to shared memory using cuda pipeline which leads to further performance gain.
  • Added 0_Simple/simpleAWBarrier. Demonstrates the arrive wait barriers.
  • Added warp aggregated atomic multi bucket increments kernel using labeled_partition cooperative groups in 6_Advanced/warpAggregatedAtomicsCG which can be used on compute capability 7.0 and above GPU architectures.
  • Added 0_Simple/binaryPartitionCG. Demonstrates binary_partition cooperative groups creation and usage in divergent path.
  • Added 6_Advanced/cudaCompressibleMemory. Demonstrates compressible memory allocation using cuMemMap API.
  • Removed 7_CUDALibraries/nvgraph_Pagerank, 7_CUDALibraries/nvgraph_SemiRingSpMV, 7_CUDALibraries/nvgraph_SpectralClustering, 7_CUDALibraries/nvgraph_SSSP as the NVGRAPH library is dropped from CUDA Toolkit 11.0.
  • Added two new reduction kernels in 6_Advanced/reduction: one demonstrates the __reduce_add_sync intrinsic supported on compute capability 8.0, and the other uses the cooperative_groups::reduce function, introduced in CUDA 11.0, which performs a thread_block_tile-level reduction.
  • Added Windows support to 6_Advanced/c++11_cuda.

1.9. CUDA 10.2

  • Added 6_Advanced/jacobiCudaGraphs. Demonstrates Instantiated CUDA Graph Update usage.
  • Added 0_Simple/memMapIPCDrv. Demonstrates Inter Process Communication using cuMemMap APIs with one process per GPU for computation.
  • Added 0_Simple/vectorAddMMAP. Demonstrates how cuMemMap API allows the user to specify the physical properties of their memory while retaining the contiguous nature of their access, thus not requiring a change in their program structure.
  • Added 0_Simple/simpleDrvRuntime. Demonstrates how CUDA Driver and Runtime APIs can work together to load cuda fatbinary of vector add kernel.
  • Added 0_Simple/cudaNvSci. Demonstrates CUDA-NvSciBuf/NvSciSync Interop.

1.10. CUDA 10.1 Update 2

  • Added 3_Imaging/vulkanImageCUDA. Demonstrates how to perform Vulkan Image-CUDA Interop.
  • Added 7_CUDALibraries/nvJPEG_encoder. Demonstrates encoding of jpeg images using NVJPEG Library.
  • Added Windows support to 7_CUDALibraries/nvJPEG.
  • Removed the DirectX SDK (June 2010 or newer) installation requirement. All the DirectX-CUDA samples now use DirectX from the Windows SDK shipped with Microsoft Visual Studio 2012 or higher.

1.11. CUDA 10.1 Update 1

  • Added 3_Imaging/NV12toBGRandResize. Demonstrates how to convert and resize NV12 frames to BGR planar frames using CUDA in batch.
  • Added Visual Studio 2019 support to all the samples.

1.12. CUDA 10.1

  • Added 0_Simple/immaTensorCoreGemm. Demonstrates integer GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API for integers employing the Tensor Cores.
  • Added 2_Graphics/simpleD3D12. Demonstrates Direct3D12 interoperability with CUDA.
  • Added 7_CUDALibraries/nvJPEG. Demonstrates single and batched decoding of jpeg images using NVJPEG Library.
  • Added 7_CUDALibraries/conjugateGradientCudaGraphs. Demonstrates conjugate gradient solver on GPU using CUBLAS/CUSPARSE library calls captured and called using CUDA Graph APIs.
  • Updated 0_Simple/simpleIPC to work on Windows OS as well with TCC enabled GPUs.

1.13. CUDA 10.0

  • Added 1_Utilities/UnifiedMemoryPerf. Demonstrates the performance comparison of Unified Memory and other types of memory, such as zero-copy buffers, pageable memory, and page-locked memory, on a single GPU.
  • Added 2_Graphics/simpleVulkan. Demonstrates the Vulkan-CUDA Interop. CUDA imports the Vulkan vertex buffer and operates on it to create a sine wave, and synchronizes with Vulkan through Vulkan semaphores imported by CUDA.
  • Added 0_Simple/simpleCudaGraphs. Demonstrates how to use CUDA Graphs through Graphs APIs and Stream Capture APIs.
  • Removed 3_Imaging/cudaDecodeGL, 3_Imaging/cudaDecodeD3D9 as the cuvid library is dropped from CUDA Toolkit 10.0.
  • Removed 6_Advanced/cdpLUDecomposition, 7_CUDALibraries/simpleDevLibCUBLAS as the CUBLAS Device library is dropped from CUDA Toolkit 10.0.

1.14. CUDA 9.2

  • Added 7_CUDALibraries/boundSegmentsNPP. Demonstrates nppiLabelMarkers to generate connected region segment labels.
  • Added 6_Advanced/conjugateGradientMultiDeviceCG. Demonstrates a conjugate gradient solver on multiple GPUs using Multi Device Cooperative Groups, also uses Unified Memory optimized using prefetching and usage hints.
  • Updated 0_Simple/fp16ScalarProduct to use fp16 native operators for half2 and other fp16 features. It also compares the results of native versus intrinsic fp16 operations.

1.15. CUDA 9.0

  • Added 7_CUDALibraries/nvgraph_SpectralClustering. Demonstrates Spectral Clustering using NVGRAPH Library.
  • Added 6_Advanced/warpAggregatedAtomicsCG. Demonstrates warp aggregated atomics using Cooperative Groups.
  • Added 6_Advanced/reductionMultiBlockCG. Demonstrates single pass reduction using Multi Block Cooperative Groups.
  • Added 6_Advanced/conjugateGradientMultiBlockCG. Demonstrates a conjugate gradient solver on GPU using Multi Block Cooperative Groups.
  • Added Cooperative Groups (CG) support to several samples. Notable ones are 6_Advanced/cdpQuadtree, 6_Advanced/cdpAdvancedQuicksort, 6_Advanced/threadFenceReduction, 3_Imaging/dxtc, 4_Finance/MonteCarloMultiGPU, and 0_Simple/matrixMul_nvrtc.
  • Added 0_Simple/simpleCooperativeGroups. Illustrates basic usage of Cooperative Groups within the thread block.
  • Added 0_Simple/cudaTensorCoreGemm. Demonstrates a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9, as well as the new Tensor Cores introduced in the Volta chip family.
  • Updated 0_Simple/simpleVoteIntrinsics to use newly added *_sync equivalent of the vote intrinsics _any, _all.
  • Updated 6_Advanced/shfl_scan to use newly added *_sync equivalent of the shfl intrinsics.

1.16. CUDA 8.0

  • Added 7_CUDALibraries/FilterBorderControlNPP. Demonstrates how any border version of an NPP filtering function can be used in the most common mode (with border control enabled), can be used to duplicate the results of the equivalent non-border version of the NPP function, and can be used to enable and disable border control on various source image edges depending on what portion of the source image is being used as input.
  • Added 7_CUDALibraries/cannyEdgeDetectorNPP. Demonstrates the recommended parameters to use with the nppiFilterCannyBorder_8u_C1R Canny Edge Detection image filter function. This function expects a single channel 8-bit grayscale input image. You can generate a grayscale image from a color image by first calling nppiColorToGray() or nppiRGBToGray(). The Canny Edge Detection function combines and improves on the techniques required to produce an edge detection image using multiple steps.
  • Added 7_CUDALibraries/cuSolverSp_LowlevelCholesky. Demonstrates Cholesky factorization using cuSolverSP's low level APIs.
  • Added 7_CUDALibraries/cuSolverSp_LowlevelQR. Demonstrates QR factorization using cuSolverSP's low level APIs.
  • Added 7_CUDALibraries/BiCGStab. Demonstrates the Bi-Conjugate Gradient Stabilized (BiCGStab) iterative method for nonsymmetric and symmetric positive definite linear systems using CUSPARSE and CUBLAS.
  • Added 7_CUDALibraries/nvgraph_Pagerank. Demonstrates Page Rank computation using nvGRAPH Library.
  • Added 7_CUDALibraries/nvgraph_SemiRingSpMV. Demonstrates Semi-Ring SpMV using nvGRAPH Library.
  • Added 7_CUDALibraries/nvgraph_SSSP. Demonstrates Single Source Shortest Path (SSSP) computation using nvGRAPH Library.
  • Added 7_CUDALibraries/simpleCUBLASXT. Demonstrates simple example to use CUBLAS-XT library.
  • Added 6_Advanced/c++11_cuda. Demonstrates C++11 feature support in CUDA.
  • Added 1_Utilities/topologyQuery. Demonstrates how to query the topology of a system with multiple GPUs.
  • Added 0_Simple/fp16ScalarProduct. Demonstrates scalar product calculation of two vectors of FP16 numbers.
  • Added 0_Simple/systemWideAtomics. Demonstrates system wide atomic instructions on migratable memory.
  • Removed 0_Simple/template_runtime. Its purpose is served by 0_Simple/template.

1.17. CUDA 7.5

  • Added 7_CUDALibraries/cuSolverDn_LinearSolver. Demonstrates how to use the CUSOLVER library for performing dense matrix factorization using cuSolverDN's LU, QR and Cholesky factorization functions.
  • Added 7_CUDALibraries/cuSolverRf. Demonstrates how to use cuSolverRF, a sparse re-factorization package of the CUSOLVER library.
  • Added 7_CUDALibraries/cuSolverSp_LinearSolver. Demonstrates how to use cuSolverSP, which provides a set of routines for sparse matrix factorization.
  • The 2_Graphics/simpleD3D9, 2_Graphics/simpleD3D9Texture, 3_Imaging/cudaDecodeD3D9, and 5_Simulations/fluidsD3D9 samples have been modified to use the Direct3D 9Ex API instead of the Direct3D 9 API.
  • The 7_CUDALibraries/grabcutNPP and 7_CUDALibraries/imageSegmentationNPP samples have been removed. These samples used the NPP graphcut APIs, which have been deprecated in CUDA 7.5.

1.18. CUDA 7.0

  • Removed support for Windows 32-bit builds.
  • The Makefile x86_64=1 and ARMv7=1 options have been deprecated. Please use TARGET_ARCH to set the targeted build architecture instead.
  • The Makefile GCC option has been deprecated. Please use HOST_COMPILER to set the host compiler instead.
  • The CUDA Samples are no longer shipped as prebuilt binaries on Windows. Please use the provided Visual Studio solution files to build the respective executables.
  • Added 0_Simple/clock_nvrtc. Demonstrates how to compile clock function kernel at runtime using libNVRTC to measure the performance of kernel accurately.
  • Added 0_Simple/inlinePTX_nvrtc. Demonstrates compilation of CUDA kernel having PTX embedded at runtime using libNVRTC.
  • Added 0_Simple/matrixMul_nvrtc. Demonstrates compilation of matrix multiplication CUDA kernel at runtime using libNVRTC.
  • Added 0_Simple/simpleAssert_nvrtc. Demonstrates compilation of CUDA kernel having assert() at runtime using libNVRTC.
  • Added 0_Simple/simpleAtomicIntrinsics_nvrtc. Demonstrates compilation of CUDA kernel performing atomic operations at runtime using libNVRTC.
  • Added 0_Simple/simpleTemplates_nvrtc. Demonstrates compilation of templatized dynamically allocated shared memory arrays CUDA kernel at runtime using libNVRTC.
  • Added 0_Simple/simpleVoteIntrinsics_nvrtc. Demonstrates compilation of CUDA kernel which uses vote intrinsics at runtime using libNVRTC.
  • Added 0_Simple/vectorAdd_nvrtc. Demonstrates compilation of CUDA kernel performing vector addition at runtime using libNVRTC.
  • Added 4_Finance/binomialOptions_nvrtc. Demonstrates runtime compilation using libNVRTC of CUDA kernel which evaluates fair call price for a given set of European options under binomial model.
  • Added 4_Finance/BlackScholes_nvrtc. Demonstrates runtime compilation using libNVRTC of CUDA kernel which evaluates fair call and put prices for a given set of European options by Black-Scholes formula.
  • Added 4_Finance/quasirandomGenerator_nvrtc. Demonstrates runtime compilation using libNVRTC of CUDA kernel which implements Niederreiter Quasirandom Sequence Generator and Inverse Cumulative Normal Distribution functions for the generation of Standard Normal Distributions.

1.19. CUDA 6.5

  • Added 7_CUDALibraries/cuHook. Demonstrates how to build and use an intercept library with CUDA.
  • Added 7_CUDALibraries/simpleCUFFT_callback. Demonstrates how to compute a 1D-convolution of a signal with a filter using a user-supplied CUFFT callback routine, rather than a separate kernel call.
  • Added 7_CUDALibraries/simpleCUFFT_MGPU. Demonstrates how to compute a 1D-convolution of a signal with a filter by transforming both into frequency domain, multiplying them together, and transforming the signal back to time domain on Multiple GPUs.
  • Added 7_CUDALibraries/simpleCUFFT_2d_MGPU. Demonstrates how to compute a 2D-convolution of a signal with a filter by transforming both into frequency domain, multiplying them together, and transforming the signal back to time domain on Multiple GPUs.
  • Removed 3_Imaging/cudaEncode. Support for the CUDA Video Encoder (NVCUVENC) has been removed.
  • Removed 4_Finance/ExcelCUDA2007. The topic will be covered in a blog post at Parallel Forall.
  • Removed 4_Finance/ExcelCUDA2010. The topic will be covered in a blog post at Parallel Forall.
  • The 4_Finance/binomialOptions sample is now restricted to running on GPUs with SM architecture 2.0 or greater.
  • The 4_Finance/quasirandomGenerator sample is now restricted to running on GPUs with SM architecture 2.0 or greater.
  • The 7_CUDALibraries/boxFilterNPP sample now demonstrates how to use the static NPP libraries on Linux and Mac.
  • The 7_CUDALibraries/conjugateGradient sample now demonstrates how to use the static CUBLAS and CUSPARSE libraries on Linux and Mac.
  • The 7_CUDALibraries/MersenneTwisterGP11213 sample now demonstrates how to use the static CURAND library on Linux and Mac.

1.20. CUDA 6.0

  • New featured samples that support a new CUDA 6.0 feature called UVM-Lite
  • Added 0_Simple/UnifiedMemoryStreams - new CUDA sample that demonstrates the use of OpenMP and CUDA streams with Unified Memory on a single GPU.
  • Added 1_Utilities/p2pBandwidthTestLatency - new CUDA sample that demonstrates how to measure latency between pairs of GPUs with P2P enabled and P2P disabled.
  • Added 6_Advanced/StreamPriorities - This sample demonstrates basic use of the new CUDA 6.0 feature stream priorities.
  • Added 7_CUDALibraries/ConjugateGradientUM - This sample implements a conjugate gradient solver on GPU using cuBLAS and cuSPARSE library, using Unified Memory.

1.21. CUDA 5.5

  • Linux makefiles have been updated to generate code for the ARMv7 architecture. Only the ARM hard-float floating point ABI is supported. Both native ARMv7 compilation and cross compilation from x86 are supported.
  • Performance improvements in CUDA toolkit for Kepler GPUs (SM 3.0 and SM 3.5)
  • Makefile projects have been updated to properly search the default paths for OpenGL, CUDA, MPI, and OpenMP libraries on all OS platforms (Mac, Linux x86, Linux ARM).
  • Linux and Mac project Makefiles now invoke NVCC for building and linking projects.
  • Added 0_Simple/cppOverload - new CUDA sample that demonstrates how to use C++ overloading with CUDA.
  • Added 6_Advanced/cdpBezierTessellation - new CUDA sample that demonstrates an advanced method of implementing Bezier Line Tessellation using CUDA Dynamic Parallelism. Requires compute capability 3.5 or higher.
  • Added 7_CUDALibraries/jpegNPP - new CUDA sample that demonstrates how to use NPP for JPEG compression on the GPU.
  • CUDA Samples now have better integration with Nsight Eclipse IDE.
  • 6_Advanced/ptxjit sample now includes a new API to demonstrate PTX linking at the driver level.

1.22. CUDA 5.0

  • New directory structure for CUDA samples. Samples are classified according to the following categories: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, and 7_CUDALibraries.
  • Added 0_Simple/simpleIPC - CUDA Runtime API sample is a very basic sample that demonstrates Inter Process Communication with one process per GPU for computation. Requires Compute Capability 2.0 or higher and a Linux Operating System.
  • Added 0_Simple/simpleSeparateCompilation - demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA kernel. This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function pointer to be called. Requires Compute Capability 2.0 or higher.
  • Added 2_Graphics/bindlessTexture - demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. Requires Compute Capability 3.0 or higher.
  • Added 3_Imaging/stereoDisparity - demonstrates how to compute a stereo disparity map using SIMD SAD (Sum of Absolute Difference) intrinsics. Requires Compute Capability 2.0 or higher.
  • Added 0_Simple/cdpSimpleQuicksort - demonstrates a simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
  • Added 0_Simple/cdpSimplePrint - demonstrates simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
  • Added 6_Advanced/cdpLUDecomposition - demonstrates LU Decomposition implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
  • Added 6_Advanced/cdpAdvancedQuicksort - demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
  • Added 6_Advanced/cdpQuadtree - demonstrates Quad Trees implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.
  • Added 7_CUDALibraries/simpleDevLibCUBLAS - implements simple cuBLAS function calls that invoke the GPU device API library running cuBLAS functions. cuBLAS device code functions take advantage of CUDA Dynamic Parallelism and require compute capability 3.5 or higher.

1.23. CUDA 4.2

  • Added segmentationTreeThrust - demonstrates a method to build image segmentation trees using Thrust. This algorithm is based on Boruvka's MST algorithm.

1.24. CUDA 4.1

  • Added MersenneTwisterGP11213 - implements Mersenne Twister GP11213, a pseudorandom number generator using the cuRAND library.
  • Added HSOpticalFlow - When working with image sequences or video, it is often useful to have information about object movement. Optical flow describes the apparent motion of objects in an image sequence. This sample implements the Horn-Schunck method for optical flow using CUDA.
  • Added volumeFiltering - demonstrates basic volume rendering and filtering using 3D textures.
  • Added simpleCubeMapTexture - demonstrates how to use texcubemap fetch instruction in a CUDA C program.
  • Added simpleAssert - demonstrates how to use GPU assert in a CUDA C program.
  • Added grabcutNPP - CUDA implementation of Rother et al. GrabCut approach using the 8 neighborhood NPP Graphcut primitive introduced in CUDA 4.1. (C. Rother, V. Kolmogorov, A. Blake. GrabCut: Interactive Foreground Extraction Using Iterated Graph Cuts. ACM Transactions on Graphics (SIGGRAPH'04), 2004).

2. Getting Started

The CUDA Samples are an educational resource provided to teach CUDA programming concepts. The CUDA Samples are not meant to be used for performance measurements.

For system requirements and installation instructions, please refer to the Linux Installation Guide and the Windows Installation Guide.

2.1. Getting CUDA Samples

Windows

On Windows, the CUDA Samples are installed using the CUDA Toolkit Windows Installer. By default, the CUDA Samples are installed in:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\
The installation location can be changed at installation time.

Linux

On Linux, to install the CUDA Samples, the CUDA toolkit must first be installed. See the Linux Installation Guide for more information on how to install the CUDA Toolkit.

Then the CUDA Samples can be installed by running the following command, where <target_path> is the location where to install the samples:
$ cuda-install-samples-11.6.sh <target_path>

2.2. Building Samples

Windows

The Windows samples are built using the Visual Studio IDE. Solution files (.sln) are provided for each supported version of Visual Studio, using the format:
*_vs<version>.sln - for Visual Studio <version>
Complete samples solution files exist at:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\
Each individual sample has its own set of solution files at:
C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\<sample_dir>\
To build/examine all the samples at once, the complete solution files should be used. To build/examine a single sample, the individual sample solution files should be used.

Linux

The Linux samples are built using makefiles. To use the makefiles, change the current directory to the sample directory you wish to build, and run make:
$ cd <sample_dir>
$ make
The samples makefiles can take advantage of certain options:
  • TARGET_ARCH=<arch> - cross-compile targeting a specific architecture. Allowed architectures are x86_64, armv7l, aarch64, sbsa, and ppc64le.

    By default, TARGET_ARCH is set to HOST_ARCH. On an x86_64 machine, not setting TARGET_ARCH is the equivalent of setting TARGET_ARCH=x86_64.

    $ make TARGET_ARCH=x86_64
    $ make TARGET_ARCH=armv7l
    $ make TARGET_ARCH=aarch64
    $ make TARGET_ARCH=sbsa
    $ make TARGET_ARCH=ppc64le
    See here for more details.
  • dbg=1 - build with debug symbols
    $ make dbg=1
  • SMS="A B ..." - override the SM architectures for which the sample will be built, where "A B ..." is a space-delimited list of SM architectures. For example, to generate SASS for SM 35 and SM 50, use SMS="35 50".
    $ make SMS="35 50"
  • HOST_COMPILER=<host_compiler> - override the default g++ host compiler. See the Linux Installation Guide for a list of supported host compilers.
    $ make HOST_COMPILER=g++

2.3. CUDA Cross-Platform Samples

CUDA Samples are now located in https://github.com/nvidia/cuda-samples, which includes instructions for obtaining, building, and running the samples.

2.4. Using CUDA Samples to Create Your Own CUDA Projects

2.4.1. Creating CUDA Projects for Windows

Creating a new CUDA Program using the CUDA Samples infrastructure is easy. We have provided a template project that you can copy and modify to suit your needs. Just follow these steps:

(<category> refers to one of the following folders: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, 7_CUDALibraries.)

  1. Copy the content of:
    C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\<category>\template
    to a directory of your own:
    C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\<category>\myproject
  2. Edit the filenames of the project to suit your needs.
  3. Edit the *.sln, *.vcproj and source files. Just search and replace all occurrences of template with myproject.
  4. Build the 64-bit, release or debug configurations using:
    • myproject_vs<version>.sln
  5. Run myproject.exe from the release or debug directories located in:
    C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.6\bin\win64\[release|debug]
  6. Now modify the code to perform the computation you require. See the CUDA Programming Guide for details of programming in CUDA.

2.4.2. Creating CUDA Projects for Linux

Note: The default installation folder <SAMPLES_INSTALL_PATH> is NVIDIA_CUDA_11.6_Samples and <category> is one of the following: 0_Simple, 1_Utilities, 2_Graphics, 3_Imaging, 4_Finance, 5_Simulations, 6_Advanced, 7_CUDALibraries.
Creating a new CUDA Program using the NVIDIA CUDA Samples infrastructure is easy. We have provided a template project that you can copy and modify to suit your needs. Just follow these steps:
  1. Copy the template project:
    cd <SAMPLES_INSTALL_PATH>/<category>
    cp -r template myproject
    cd myproject
    
  2. Edit the filenames of the project to suit your needs:
    mv template.cu myproject.cu
    mv template_cpu.cpp myproject_cpu.cpp
    
  3. Edit the Makefile and source files. Just search and replace all occurrences of template with myproject.
  4. Build the project in release mode:
    make
    To build the project in debug mode, use "make dbg=1":
    make dbg=1
  5. Run the program:
    ../../bin/x86_64/linux/release/myproject
  6. Now modify the code to perform the computation you require. See the CUDA Programming Guide for details of programming in CUDA.

3. Samples Reference

This document contains a complete listing of the code samples that are included with the NVIDIA CUDA Toolkit. It describes each code sample, lists the minimum GPU specification, and provides links to the source code and white papers if available.

The code samples are divided into the following categories:
Introduction Reference
Basic CUDA samples for beginners that illustrate key concepts of using CUDA and the CUDA runtime APIs.
Utilities Reference
Utility samples that demonstrate how to query device capabilities and measure GPU/CPU bandwidth.
Concepts and Techniques Reference
Samples that demonstrate CUDA related concepts and common problem solving techniques.
CUDA Features Reference
Samples that demonstrate CUDA Features.
CUDA Libraries Reference
Samples that demonstrate how to use CUDA platform libraries (NPP, NVJPEG, NVGRAPH, cuBLAS, cuFFT, cuSPARSE, cuSOLVER and cuRAND).
Domain Specific Reference
Samples that are specific to a domain (Graphics, Finance, Image Processing).
Performance Reference
Samples that demonstrate performance optimization.

3.1. Introduction Reference

asyncAPI

This sample illustrates the use of CUDA events for both GPU timing and overlapping CPU and GPU execution. Events are inserted into a stream of CUDA calls. Since CUDA stream calls are asynchronous, the CPU can perform computations while the GPU is executing (including DMA memory copies between the host and device). The CPU can query CUDA events to determine whether the GPU has completed its tasks.
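The pattern described above can be sketched as follows. This is a minimal illustration, not the sample's actual source: the kernel name addValue, the buffer size, and the CHECK macro are assumptions made for this example. It records events around asynchronous work and polls cudaEventQuery from the CPU while the GPU runs.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative helper (not from the sample): abort if a runtime call fails.
#define CHECK(call)                                                        \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                            \
        }                                                                  \
    } while (0)

// Hypothetical kernel: adds a constant to every element.
__global__ void addValue(int *data, int value, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += value;
}

int main() {
    const int n = 1 << 24;
    const size_t bytes = n * sizeof(int);

    int *hostData = nullptr, *devData = nullptr;
    CHECK(cudaMallocHost(&hostData, bytes));  // pinned host memory, so copies can run asynchronously
    CHECK(cudaMalloc(&devData, bytes));
    for (int i = 0; i < n; ++i) hostData[i] = 0;

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Queue copy-in, kernel, and copy-out on the default stream.
    // Each call returns to the CPU before the GPU work has finished.
    CHECK(cudaEventRecord(start, 0));
    CHECK(cudaMemcpyAsync(devData, hostData, bytes, cudaMemcpyHostToDevice, 0));
    addValue<<<(n + 255) / 256, 256>>>(devData, 26, n);
    CHECK(cudaGetLastError());
    CHECK(cudaMemcpyAsync(hostData, devData, bytes, cudaMemcpyDeviceToHost, 0));
    CHECK(cudaEventRecord(stop, 0));

    // The CPU keeps working (here: counting) until the stop event has completed.
    unsigned long long counter = 0;
    cudaError_t status;
    while ((status = cudaEventQuery(stop)) == cudaErrorNotReady) ++counter;
    CHECK(status);

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("GPU time: %.3f ms, CPU loop iterations while waiting: %llu\n", ms, counter);
    printf("hostData[0] = %d\n", hostData[0]);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(devData));
    CHECK(cudaFreeHost(hostData));
    return 0;
}
```

Pinned host memory is used because cudaMemcpyAsync with pageable memory may not overlap with host execution.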

c++11_cuda - C++11 CUDA

This sample demonstrates C++11 feature support in CUDA. It scans an input text file and prints the number of occurrences of the characters x, y, z, and w.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies: CPP11
Supported SM Architecture: SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaMalloc, cudaMemset, cudaFree, cudaMemcpy
Key Concepts: CPP11 CUDA
Supported OSes: Linux, Windows

clock - Clock

This example shows how to use the clock function to accurately measure the performance of a block of threads in a kernel.

Supported SM Architecture: SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaMalloc, cudaFree, cudaMemcpy
Key Concepts: Performance Strategies
Supported OSes: Linux, Windows
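As a rough illustration of per-block timing with the device clock, the sketch below has thread 0 of each block read the counter at block entry and again after all threads of the block have finished. It is not the sample's source: the kernel timedKernel and its placeholder work loop are assumptions, and it uses clock64(), the 64-bit variant of clock(), to sidestep counter wraparound.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlocks = 64;
constexpr int kThreads = 256;

// Hypothetical kernel: timers[b] holds the start tick of block b,
// timers[gridDim.x + b] holds its end tick.
__global__ void timedKernel(float *out, long long *timers) {
    if (threadIdx.x == 0) timers[blockIdx.x] = clock64();

    // Placeholder work to be measured.
    float v = static_cast<float>(threadIdx.x);
    for (int i = 0; i < 1000; ++i) v = v * 0.999f + 1.0f;
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;

    __syncthreads();  // all threads of the block are done before the end tick is read
    if (threadIdx.x == 0) timers[gridDim.x + blockIdx.x] = clock64();
}

int main() {
    float *dOut = nullptr;
    long long *dTimers = nullptr;
    if (cudaMalloc(&dOut, kBlocks * kThreads * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dTimers, 2 * kBlocks * sizeof(long long)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return EXIT_FAILURE;
    }

    timedKernel<<<kBlocks, kThreads>>>(dOut, dTimers);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<long long> timers(2 * kBlocks);
    cudaError_t err = cudaMemcpy(timers.data(), dTimers,
                                 timers.size() * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    // Start and end ticks of one block come from the same SM, so their difference is meaningful.
    double total = 0.0;
    for (int b = 0; b < kBlocks; ++b) total += double(timers[kBlocks + b] - timers[b]);
    printf("Average clock ticks per block: %.1f\n", total / kBlocks);

    cudaFree(dTimers);
    cudaFree(dOut);
    return 0;
}
```

The counter is maintained per multiprocessor, so only differences taken within the same block should be compared.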

clock_nvrtc - Clock libNVRTC

This example shows how to use the clock function with libNVRTC to accurately measure the performance of a block of threads in a kernel.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
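For readers unfamiliar with runtime compilation, the following sketch shows the general libNVRTC flow that the _nvrtc samples rely on: compile CUDA source held in a string to PTX, load it with the driver API, and launch it. The kernel scale, the helper check, and the build line are assumptions for illustration and are not taken from the samples. It would typically be built with something like nvcc example.cu -lnvrtc -lcuda.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda.h>
#include <nvrtc.h>

// Hypothetical kernel source, compiled at run time. extern "C" keeps the name unmangled.
static const char *kSource =
    "extern \"C\" __global__ void scale(float *data, float factor, int n) {\n"
    "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
    "    if (i < n) data[i] *= factor;\n"
    "}\n";

// Illustrative helper: stop on the first failed step.
static void check(bool ok, const char *what) {
    if (!ok) { fprintf(stderr, "%s failed\n", what); exit(EXIT_FAILURE); }
}

int main() {
    // 1. Compile the source string to PTX with NVRTC.
    nvrtcProgram prog;
    check(nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, nullptr, nullptr) == NVRTC_SUCCESS,
          "nvrtcCreateProgram");
    if (nvrtcCompileProgram(prog, 0, nullptr) != NVRTC_SUCCESS) {
        size_t logSize = 0;
        nvrtcGetProgramLogSize(prog, &logSize);
        std::vector<char> log(logSize + 1, '\0');
        nvrtcGetProgramLog(prog, log.data());
        fprintf(stderr, "Compilation failed:\n%s\n", log.data());
        return EXIT_FAILURE;
    }
    size_t ptxSize = 0;
    check(nvrtcGetPTXSize(prog, &ptxSize) == NVRTC_SUCCESS, "nvrtcGetPTXSize");
    std::vector<char> ptx(ptxSize);
    check(nvrtcGetPTX(prog, ptx.data()) == NVRTC_SUCCESS, "nvrtcGetPTX");
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX through the driver API, using the device's primary context.
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction fn;
    check(cuInit(0) == CUDA_SUCCESS, "cuInit");
    check(cuDeviceGet(&dev, 0) == CUDA_SUCCESS, "cuDeviceGet");
    check(cuDevicePrimaryCtxRetain(&ctx, dev) == CUDA_SUCCESS, "cuDevicePrimaryCtxRetain");
    check(cuCtxSetCurrent(ctx) == CUDA_SUCCESS, "cuCtxSetCurrent");
    check(cuModuleLoadData(&mod, ptx.data()) == CUDA_SUCCESS, "cuModuleLoadData");
    check(cuModuleGetFunction(&fn, mod, "scale") == CUDA_SUCCESS, "cuModuleGetFunction");

    // 3. Launch the kernel on some data.
    int n = 1024;
    float factor = 3.0f;
    std::vector<float> host(n, 2.0f);
    CUdeviceptr dData;
    check(cuMemAlloc(&dData, n * sizeof(float)) == CUDA_SUCCESS, "cuMemAlloc");
    check(cuMemcpyHtoD(dData, host.data(), n * sizeof(float)) == CUDA_SUCCESS, "cuMemcpyHtoD");
    void *args[] = {&dData, &factor, &n};
    check(cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr) ==
              CUDA_SUCCESS, "cuLaunchKernel");
    check(cuCtxSynchronize() == CUDA_SUCCESS, "cuCtxSynchronize");
    check(cuMemcpyDtoH(host.data(), dData, n * sizeof(float)) == CUDA_SUCCESS, "cuMemcpyDtoH");
    printf("host[0] = %.1f (2.0 scaled by 3.0)\n", host[0]);

    cuMemFree(dData);
    cuModuleUnload(mod);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```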

concurrentKernels - Concurrent Kernels

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on a GPU. It also illustrates how to introduce dependencies between CUDA streams with the cudaStreamWaitEvent function.
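A minimal sketch of this idea, under the assumption of two illustrative kernels (spin and finalize, neither taken from the sample): several kernels are launched into separate streams so they may run concurrently, and a final kernel is made to wait for all of them through events and cudaStreamWaitEvent.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-waits for roughly the given number of clock ticks.
__global__ void spin(long long ticks) {
    long long start = clock64();
    while (clock64() - start < ticks) { }
}

// Hypothetical kernel: marks that all upstream work has completed.
__global__ void finalize(int *flag) { *flag = 1; }

int main() {
    const int nStreams = 4;
    cudaStream_t streams[nStreams];
    cudaEvent_t done[nStreams];
    for (int i = 0; i < nStreams; ++i) {
        cudaStreamCreate(&streams[i]);
        // Timing is not needed for pure synchronization events.
        cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
    }

    int *dFlag = nullptr;
    cudaMalloc(&dFlag, sizeof(int));
    cudaMemset(dFlag, 0, sizeof(int));

    // One small kernel per stream: the GPU is free to run these concurrently.
    for (int i = 0; i < nStreams; ++i) {
        spin<<<1, 1, 0, streams[i]>>>(1000000LL);
        cudaEventRecord(done[i], streams[i]);
    }

    // Make the last stream wait for every kernel before running finalize.
    cudaStream_t last = streams[nStreams - 1];
    for (int i = 0; i < nStreams; ++i) cudaStreamWaitEvent(last, done[i], 0);
    finalize<<<1, 1, 0, last>>>(dFlag);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    int hFlag = 0;
    cudaMemcpy(&hFlag, dFlag, sizeof(int), cudaMemcpyDeviceToHost);
    printf("finalize ran after all streams: flag = %d\n", hFlag);

    for (int i = 0; i < nStreams; ++i) {
        cudaEventDestroy(done[i]);
        cudaStreamDestroy(streams[i]);
    }
    cudaFree(dFlag);
    return 0;
}
```

cudaStreamWaitEvent does not block the host. It only orders later work in the waiting stream after the recorded event.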

cppIntegration - C++ Integration

This example demonstrates how to integrate CUDA into an existing C++ application: the CUDA entry point on the host side is a single function called from C++ code, and only the file containing this function is compiled with nvcc. It also demonstrates that vector types can be used from C++ code.

Supported SM Architecture: SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaMalloc, cudaFree, cudaMemcpy
Key Concepts: CPP-CUDA Integration
Supported OSes: Linux, Windows

cudaOpenMP

This sample demonstrates how to use the OpenMP API to write an application for multiple GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

fp16ScalarProduct - FP16 Scalar Product

Calculates the scalar product of two vectors of FP16 numbers.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies: FP16
Supported SM Architecture: SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaFree, cudaMallocHost, cudaFreeHost, cudaMalloc, cudaMemcpy, cudaGetDeviceProperties
Key Concepts: CUDA Runtime API
Supported OSes: Linux, Windows

matrixMul - Matrix Multiplication (CUDA Runtime API Version)

This sample implements matrix multiplication and is exactly the same as Chapter 6 of the programming guide. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.
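
The sample's kernel is not reproduced here. As a rough host-side illustration of blocked (tiled) matrix multiplication, which is assumed to be the idea behind the programming guide's shared-memory version, the C++ sketch below multiplies row-major matrices one tile at a time. The names `matMulTiled` and `TILE` are hypothetical and do not come from the sample.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical tile width; a GPU kernel would typically tie it to the thread-block size.
constexpr std::size_t TILE = 16;

// Computes C (m x n) = A (m x k) * B (k x n); all matrices are row-major.
std::vector<float> matMulTiled(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t m, std::size_t k, std::size_t n) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i0 = 0; i0 < m; i0 += TILE)
        for (std::size_t j0 = 0; j0 < n; j0 += TILE)
            for (std::size_t p0 = 0; p0 < k; p0 += TILE)
                // Add the product of one tile of A and one tile of B to a tile of C.
                for (std::size_t i = i0; i < std::min(i0 + TILE, m); ++i)
                    for (std::size_t j = j0; j < std::min(j0 + TILE, n); ++j) {
                        float acc = C[i * n + j];
                        for (std::size_t p = p0; p < std::min(p0 + TILE, k); ++p)
                            acc += A[i * k + p] * B[p * n + j];
                        C[i * n + j] = acc;
                    }
    return C;
}
```

On a GPU, each tile of C would map to one thread block and the tiles of A and B would be staged in shared memory; the loop nest above only mirrors that access pattern on the CPU.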

matrixMul_nvrtc - Matrix Multiplication with libNVRTC

This sample implements matrix multiplication and is exactly the same as Chapter 6 of the programming guide. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to demonstrate high-performance matrix multiplication.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

matrixMulDrv - Matrix Multiplication (CUDA Driver API Version)

This sample implements matrix multiplication and uses the new CUDA 4.0 kernel launch Driver API. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.

matrixMulDynlinkJIT - Matrix Multiplication (CUDA Driver API version with Dynamic Linking Version)

This sample revisits matrix multiplication using the CUDA driver API. It demonstrates how to link to the CUDA driver at runtime and how to use JIT (just-in-time) compilation from PTX code. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.

mergeSort - Merge Sort

This sample implements a merge sort (also known as Batcher's sort), an algorithm belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), it may be the algorithm of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm

simpleAssert

This CUDA Runtime API sample is a very basic sample that demonstrates how to use the assert function in device code. Requires Compute Capability 2.0.

Supported SM Architecture: SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaDeviceSynchronize, cudaGetErrorString
Key Concepts: Assert
Supported OSes: Linux, Windows

simpleAssert_nvrtc - simpleAssert with libNVRTC

This CUDA Runtime API sample is a very basic sample that demonstrates how to use the assert function in device code. Requires Compute Capability 2.0.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleAtomicIntrinsics_nvrtc - Simple Atomic Intrinsics with libNVRTC

A simple demonstration of global memory atomic instructions. This sample makes use of NVRTC for runtime compilation.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleAttributes

This CUDA Runtime API sample is a very basic example that demonstrates how to use the stream attributes that affect L2 locality. The performance improvement from using an L2 access policy window can only be observed on Compute Capability 8.0 or higher.

simpleAWBarrier - Simple Arrive Wait Barrier

A simple demonstration of arrive wait barriers.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleCallback - Simple CUDA Callbacks

This sample implements multi-threaded heterogeneous computing workloads with the new CPU callbacks for CUDA streams and events introduced with CUDA 5.0.

simpleCooperativeGroups - Simple Cooperative Groups

This sample is a simple code that illustrates basic usage of cooperative groups within the thread block.

simpleCUDA2GL - CUDA and OpenGL Interop of Images

This sample shows how to copy a CUDA image back to OpenGL using the most efficient methods.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleDrvRuntime - Simple Driver-Runtime Interaction

A simple example which demonstrates how the CUDA Driver and Runtime APIs can work together to load the CUDA fatbinary of a vector add kernel and perform vector addition.

simpleHyperQ

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices which provide HyperQ (SM 3.5). Devices without HyperQ (SM 2.0 and SM 3.0) will run a maximum of two kernels concurrently.

simpleIPC

This CUDA Runtime API sample is a very basic sample that demonstrates Inter Process Communication with one process per GPU for computation. Requires Compute Capability 3.0 or higher and a Linux or Windows operating system.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleMPI

Simple example demonstrating how to use MPI in combination with CUDA.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleMultiCopy - Simple Multi Copy and Compute

GPUs with Compute Capability 1.1 support overlapping compute with one memcopy from the host system. For Quadro and Tesla GPUs with Compute Capability 2.0, a second overlapped copy operation in either direction at full speed is possible (PCI-e is symmetric). This sample illustrates the usage of CUDA streams to achieve overlapping of kernel execution with data copies to and from the device.

simpleMultiGPU - Simple Multi-GPU

This application demonstrates how to use the new CUDA 4.0 API for CUDA context management and multi-threaded access to run CUDA kernels on multiple GPUs.

simpleOccupancy

This sample demonstrates the basic usage of the CUDA occupancy calculator and occupancy-based launch configurator APIs by launching a kernel with the launch configurator and measuring the utilization difference against a manually configured launch.

simpleP2P - Simple Peer-to-Peer Transfers with Multi-GPU

This application demonstrates CUDA APIs that support Peer-To-Peer (P2P) copies, Peer-To-Peer (P2P) addressing, and Unified Virtual Memory Addressing (UVA) between multiple GPUs. In general, P2P is supported between two identical GPUs, with some exceptions, such as some Tesla and Quadro GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simplePrintf

This basic CUDA Runtime API sample demonstrates how to use the printf function in the device code.

simpleSeparateCompilation - Simple Static GPU Device Library

This sample demonstrates a CUDA 5.0 feature, the ability to create a GPU device static library and use it within another CUDA kernel. This example demonstrates how to pass in a GPU device function (from the GPU device static library) as a function pointer to be called. This sample requires devices with compute capability 2.0 or higher.

simpleStreams

This sample uses CUDA streams to overlap kernel executions with memory copies between the host and a GPU device. This sample uses a new CUDA 4.0 feature that supports pinning of generic host memory. Requires Compute Capability 2.0 or higher.

simpleTemplates - Simple Templates

This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.

simpleTemplates_nvrtc - Simple Templates with libNVRTC

This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleTexture3D - Simple Texture 3D

Simple example that demonstrates use of 3D Textures in CUDA.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

simpleVoteIntrinsics - Simple Vote Intrinsics

Simple program which demonstrates how to use the Vote (__any_sync, __all_sync) intrinsic instructions in a CUDA kernel.

simpleVoteIntrinsics_nvrtc - Simple Vote Intrinsics with libNVRTC

Simple program which demonstrates how to use the Vote (any, all) intrinsic instructions in a CUDA kernel with runtime compilation using NVRTC APIs. Requires Compute Capability 2.0 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

systemWideAtomics - System wide Atomics

A simple demonstration of system wide atomic instructions.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

template - Template

A trivial template project that can be used as a starting point to create new CUDA projects.

Supported SM Architecture: SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API: cudaMalloc, cudaFree, cudaMemcpy
Key Concepts: Device Memory Allocation
Supported OSes: Linux, Windows

UnifiedMemoryStreams - Unified Memory Streams

This sample demonstrates the use of OpenMP and streams with Unified Memory on a single GPU.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

vectorAdd - Vector Addition

This CUDA Runtime API sample is a very basic sample that implements element by element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking.

vectorAdd_nvrtc - Vector Addition with libNVRTC

This CUDA Driver API sample uses NVRTC for runtime compilation of a vector addition kernel. The vector addition kernel demonstrated is the same as the sample illustrating Chapter 3 of the programming guide.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

vectorAddDrv - Vector Addition Driver API

This Vector Addition sample is a basic sample that is implemented element by element. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking. This sample also uses the new CUDA 4.0 kernel launch Driver API.

vectorAddMMAP - Vector Addition cuMemMap

This sample replaces the device allocation in the vectorAddDrv sample with cuMemMap-ed allocations. It demonstrates that the cuMemMap API allows users to specify the physical properties of their memory while retaining the contiguous nature of their access, thus not requiring a change in their program structure.

3.2. Utilities Reference

bandwidthTest - Bandwidth Test

This is a simple test program to measure the memcopy bandwidth of the GPU and memcpy bandwidth across PCI-e. This test application is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory.

deviceQueryDrv - Device Query Driver API

This sample enumerates the properties of the CUDA devices present using CUDA Driver API calls.

topologyQuery - Topology Query

A simple example of how to query the topology of a system with multiple GPUs.

3.3. Concepts and Techniques Reference

boxFilter - Box Filter

Fast image box filter using CUDA with OpenGL rendering.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

convolutionSeparable - CUDA Separable Convolution

This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.
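
To make the idea of separability concrete, here is a minimal host-side C++ sketch rather than the sample's CUDA kernels: a 2D filter with a separable kernel is applied as a 1D pass along the rows followed by a 1D pass along the columns. The function names are hypothetical, the kernel is assumed to be symmetric with an odd number of taps (as a Gaussian is), and taps that fall outside the image are treated as zero.

```cpp
#include <vector>

// One 1D pass over a row-major w x h image.
// (dx, dy) selects the direction: (1, 0) filters along rows, (0, 1) along columns.
// Assumes a symmetric kernel with an odd number of taps; out-of-image taps count as zero.
std::vector<float> convolve1D(const std::vector<float>& img, int w, int h,
                              const std::vector<float>& kernel, int dx, int dy) {
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> out(img.size(), 0.0f);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            for (int t = -radius; t <= radius; ++t) {
                const int sx = x + t * dx;
                const int sy = y + t * dy;
                if (sx >= 0 && sx < w && sy >= 0 && sy < h)
                    sum += img[sy * w + sx] * kernel[t + radius];
            }
            out[y * w + x] = sum;
        }
    return out;
}

// Row pass followed by column pass.
std::vector<float> convolveSeparable(const std::vector<float>& img, int w, int h,
                                     const std::vector<float>& kernel) {
    return convolve1D(convolve1D(img, w, h, kernel, 1, 0), w, h, kernel, 0, 1);
}
```

With a kernel of 2r + 1 taps, the two passes cost about 2(2r + 1) multiply-adds per pixel instead of (2r + 1)^2 for the equivalent full 2D kernel.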

convolutionTexture - Texture-based Separable Convolution

Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.

cuHook - CUDA Interception Library

This sample demonstrates how to build and use an intercept library with CUDA. The library has to be loaded via LD_PRELOAD, e.g. LD_PRELOAD=<full_path>/libcuhook.so.1 ./cuHook

NOTE: The sample will be waived if the glibc version is 2.34 or later, because it uses the private glibc functions `__libc_dlsym()` and `__libc_dlopen_mode()`, which are not exposed in version 2.34.

dct8x8 - DCT8x8

This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries. As opposed to implementing DCT in a fragment shader, CUDA allows for an easier and more efficient implementation.
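
For reference, a DCT "by definition" on one 8 by 8 block can be sketched on the host as follows. This is an illustrative C++ version, not the sample's kernel; it assumes the orthonormal DCT-II scaling, and the names `Block8x8` and `dct8x8Naive` are hypothetical.

```cpp
#include <array>
#include <cmath>

using Block8x8 = std::array<std::array<double, 8>, 8>;

// Orthonormal 2D DCT-II of one 8x8 block, evaluated directly from the definition.
Block8x8 dct8x8Naive(const Block8x8& in) {
    const double pi = std::acos(-1.0);
    Block8x8 out{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int x = 0; x < 8; ++x)
                for (int y = 0; y < 8; ++y)
                    sum += in[x][y] * std::cos((2 * x + 1) * u * pi / 16.0)
                                    * std::cos((2 * y + 1) * v * pi / 16.0);
            // Scale factors that make the transform orthonormal.
            const double cu = (u == 0) ? std::sqrt(0.5) : 1.0;
            const double cv = (v == 0) ? std::sqrt(0.5) : 1.0;
            out[u][v] = 0.25 * cu * cv * sum;
        }
    return out;
}
```

Each of the 64 output coefficients is a sum over all 64 input pixels, which is why library implementations usually prefer a separable or factored form.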

EGLStream_CUDA_CrossGPU

Demonstrates CUDA and EGL Streams interop, where the consumer's EGL Stream is on one GPU and the producer's is on another, and the consumer and producer are different processes.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

EGLStream_CUDA_Interop - EGLStream CUDA Interop

Demonstrates data exchange between CUDA and EGL Streams.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

EGLSync_CUDA_Interop - EGLSync CUDA Event Interop

Demonstrates interoperability between CUDA Event and EGL Sync/EGL Image, which makes it possible to synchronize GL-EGL-CUDA operations on the GPU itself instead of blocking the CPU for synchronization.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

eigenvalues - Eigenvalues

The computation of all or a subset of all eigenvalues is an important problem in Linear Algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.
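
The bisection idea can be sketched on the host as follows. This is a minimal C++ illustration under stated assumptions, not the sample's parallel implementation: it counts the eigenvalues below a trial value with a Sturm sequence, brackets the spectrum with Gershgorin circles, and bisects once per eigenvalue. The names `countBelow` and `eigenvaluesBisection` are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Number of eigenvalues smaller than x, from the signs of a Sturm sequence.
// d holds the n diagonal entries, e the n - 1 off-diagonal entries.
std::size_t countBelow(const std::vector<double>& d,
                       const std::vector<double>& e, double x) {
    std::size_t count = 0;
    double q = 1.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double off = (i == 0) ? 0.0 : e[i - 1];
        q = d[i] - x - off * off / q;
        if (q == 0.0) q = -1e-30;  // nudge an exact zero so the next division is defined
        if (q < 0.0) ++count;
    }
    return count;
}

// All eigenvalues in ascending order, each located by bisection.
std::vector<double> eigenvaluesBisection(const std::vector<double>& d,
                                         const std::vector<double>& e,
                                         double tol = 1e-12) {
    const std::size_t n = d.size();
    if (n == 0) return {};
    // Gershgorin circles give an interval that contains every eigenvalue.
    double lo = d[0], hi = d[0];
    for (std::size_t i = 0; i < n; ++i) {
        const double r = (i > 0 ? std::fabs(e[i - 1]) : 0.0) +
                         (i + 1 < n ? std::fabs(e[i]) : 0.0);
        lo = std::min(lo, d[i] - r);
        hi = std::max(hi, d[i] + r);
    }
    std::vector<double> result(n);
    for (std::size_t k = 0; k < n; ++k) {
        double a = lo, b = hi;
        // The iteration cap guards against a tolerance below floating-point resolution.
        for (int it = 0; it < 200 && b - a > tol; ++it) {
            const double mid = 0.5 * (a + b);
            if (countBelow(d, e, mid) > k) b = mid; else a = mid;
        }
        result[k] = 0.5 * (a + b);
    }
    return result;
}
```

Because each interval can be refined independently of the others, the work lends itself to parallel execution, which is presumably what the CUDA sample exploits.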

FunctionPointers - Function Pointers

This sample illustrates how to use function pointers and implements the Sobel Edge Detection filter for 8-bit monochrome images.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

histogram - CUDA Histogram

This sample demonstrates efficient implementations of 64-bin and 256-bin histograms.
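
As a point of reference for what the two histograms compute, a sequential C++ version might look like the sketch below. It is not the sample's code: the function names are hypothetical, and the choice to derive the 64-bin index from the top six bits of each byte is an assumption.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// 256-bin histogram: one bin per possible byte value.
std::array<std::uint32_t, 256> histogram256(const std::vector<std::uint8_t>& data) {
    std::array<std::uint32_t, 256> bins{};
    for (std::uint8_t v : data) ++bins[v];
    return bins;
}

// 64-bin histogram: assumed here to use the top six bits of each byte as the bin index.
std::array<std::uint32_t, 64> histogram64(const std::vector<std::uint8_t>& data) {
    std::array<std::uint32_t, 64> bins{};
    for (std::uint8_t v : data) ++bins[v >> 2];
    return bins;
}
```

A GPU implementation has to deal with many threads incrementing the same bin at once, which this sequential loop does not need to consider.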

imageDenoising - Image denoising

This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on computation of both geometric and color distance between texels. While both techniques are implemented in the DirectX SDK using shaders, a massively sped-up variation of the latter technique, taking advantage of shared memory, is implemented in addition to the DirectX counterparts.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

inlinePTX - Using Inline PTX

A simple test application that demonstrates a new CUDA 4.0 ability to embed PTX in a CUDA kernel.

inlinePTX_nvrtc - Using Inline PTX with libNVRTC

A simple test application that demonstrates a new CUDA 4.0 ability to embed PTX in a CUDA kernel.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

interval - Interval Computing

Interval arithmetic operators example. Uses various C++ features (templates and recursion). The recursive mode requires Compute Capability 2.0 (SM 2.0).

MC_EstimatePiInlineP - Monte Carlo Estimation of Pi (inline PRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using inline PRNG). This sample also uses the NVIDIA CURAND library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
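
The underlying estimator is simple enough to sketch on the host: draw points uniformly in the unit square and count the fraction that lands inside the quarter circle of radius 1, which approaches pi/4. The C++ sketch below swaps in the standard library's `std::mt19937` for the sample's GPU random number generation, so it illustrates the method only; `estimatePi` is a hypothetical name.

```cpp
#include <cstddef>
#include <random>

// Monte Carlo estimate of pi from `samples` uniform points in the unit square.
// Uses std::mt19937 in place of CURAND; requires samples > 0.
double estimatePi(std::size_t samples, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < samples; ++i) {
        const double x = uni(gen);
        const double y = uni(gen);
        if (x * x + y * y <= 1.0) ++inside;  // point falls inside the quarter circle
    }
    return 4.0 * static_cast<double>(inside) / static_cast<double>(samples);
}
```

The statistical error shrinks roughly with the inverse square root of the sample count, so very large sample counts are needed for each extra digit of accuracy.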

MC_EstimatePiInlineQ - Monte Carlo Estimation of Pi (inline QRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using inline QRNG). This sample also uses the NVIDIA CURAND library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

MC_EstimatePiP - Monte Carlo Estimation of Pi (batch PRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using batch PRNG). This sample also uses the NVIDIA CURAND library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

MC_EstimatePiQ - Monte Carlo Estimation of Pi (batch QRNG)

This sample uses Monte Carlo simulation for Estimation of Pi (using batch QRNG). This sample also uses the NVIDIA CURAND library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

MC_SingleAsianOptionP - Monte Carlo Single Asian Option

This sample uses Monte Carlo to simulate Single Asian Options using the NVIDIA CURAND library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

particles - Particles

This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. Adding "-particles=<N>" to the command line allows users to set the number of particles for the simulation. This example implements a uniform grid data structure using either atomic operations or a fast radix sort from the Thrust library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

radixSortThrust - CUDA Radix Sort (Thrust Library)

This sample demonstrates a very fast and efficient parallel radix sort using the Thrust library. The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only.

reduction - CUDA Parallel Reduction

A parallel sum reduction that computes the sum of a large array of values. This sample demonstrates several important optimization strategies for data-parallel algorithms like reduction, using shared memory, __shfl_down_sync, __reduce_add_sync and cooperative_groups reduce.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
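
The access pattern behind a tree-based sum reduction can be shown with a short host-side C++ sketch. It is not one of the sample's kernels; `reduceTree` is a hypothetical name, and the inner loop stands in for the threads that would run concurrently on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Pairwise (tree) sum reduction; the argument is taken by value as a working copy.
float reduceTree(std::vector<float> data) {
    if (data.empty()) return 0.0f;
    // Pad with zeros up to the next power of two so every step halves cleanly.
    std::size_t n = 1;
    while (n < data.size()) n *= 2;
    data.resize(n, 0.0f);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
        for (std::size_t i = 0; i < stride; ++i)  // on a GPU: one thread per i
            data[i] += data[i + stride];
    return data[0];
}
```

Each pass halves the number of active elements, so n values are summed in about log2(n) parallel steps.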

reductionMultiBlockCG - Reduction using MultiBlock Cooperative Groups

This sample demonstrates single pass reduction using Multi Block Cooperative Groups. This sample requires devices with compute capability 6.0 or higher having compute preemption.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

scalarProd - Scalar Product

This sample calculates scalar products of a given set of input vector pairs.
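
As a small host-side illustration of the computation (not the sample's kernel), the C++ sketch below computes one scalar product per vector pair. The layout, with all vectors stored back to back in two flat arrays, and the name `scalarProducts` are assumptions.

```cpp
#include <cstddef>
#include <vector>

// Scalar product of each pair (a_v, b_v); vector v occupies
// elements [v * elementCount, (v + 1) * elementCount) of a and b.
std::vector<float> scalarProducts(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t vectorCount,
                                  std::size_t elementCount) {
    std::vector<float> result(vectorCount, 0.0f);
    for (std::size_t v = 0; v < vectorCount; ++v) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < elementCount; ++i)
            sum += a[v * elementCount + i] * b[v * elementCount + i];
        result[v] = sum;
    }
    return result;
}
```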

scan - CUDA Parallel Prefix Sum (Scan)

This example demonstrates an efficient CUDA implementation of parallel prefix sum, also known as "scan". Given an array of numbers, scan computes a new array in which each element is the sum of all the elements before it in the input array.
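
The definition above describes an exclusive scan. A sequential C++ reference, given here only to pin down the expected result and not as the sample's parallel algorithm, could look like this (`exclusiveScan` is a hypothetical name):

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] is the sum of in[0] .. in[i - 1], and out[0] is 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}
```

For the input 3, 1, 7, 0, 4, 1, 6, 3 the result is 0, 3, 4, 11, 11, 15, 16, 22.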

segmentationTreeThrust - CUDA Segmentation Tree Thrust Library

This sample demonstrates an approach to constructing image segmentation trees. The method is based on Boruvka's MST algorithm.

shfl_scan - CUDA Parallel Prefix Sum with Shuffle Intrinsics (SHFL_Scan)

This example demonstrates how to use the shuffle intrinsic __shfl_up_sync to perform a scan operation across a thread block.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
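
The shuffle-based scan pattern can be emulated on the host to show the data movement. In the C++ sketch below, an array of 32 values stands in for the registers of one warp, and reading `previous[lane - offset]` plays the role of a shuffle-up by `offset` lanes. It is an illustration of the pattern only; `warpInclusiveScan` and `kWarpSize` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;

// Inclusive scan across one emulated warp, doubling the shuffle distance each step.
std::array<int, kWarpSize> warpInclusiveScan(std::array<int, kWarpSize> lanes) {
    for (std::size_t offset = 1; offset < kWarpSize; offset *= 2) {
        // Snapshot of what every lane holds before this step, as a shuffle would see it.
        const std::array<int, kWarpSize> previous = lanes;
        for (std::size_t lane = offset; lane < kWarpSize; ++lane)
            lanes[lane] += previous[lane - offset];  // value "shuffled up" from lane - offset
    }
    return lanes;
}
```

After log2(32) = 5 steps, every lane holds the sum of its own value and all values in lower-numbered lanes.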

sortingNetworks - CUDA Sorting Networks

This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), these may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
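
To show what makes a sorting network different from a comparison sort with data-dependent control flow, here is a minimal host-side C++ sketch of bitonic sort. It sorts keys only, requires a power-of-two length, and is not the sample's kernel; `bitonicSort` is a hypothetical name.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// In-place ascending bitonic sort; the length must be a power of two.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t k = 2; k <= n; k *= 2)         // size of the sequences being merged
        for (std::size_t j = k / 2; j > 0; j /= 2)  // distance between compared elements
            for (std::size_t i = 0; i < n; ++i) {   // on a GPU: one thread per i
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    // Compare-exchange in the direction required by this sub-sequence.
                    if ((a[i] > a[partner]) == ascending)
                        std::swap(a[i], a[partner]);
                }
            }
}
```

The sequence of compare-exchange pairs depends only on the array length, never on the data, so all comparisons within one (k, j) step are independent and can run in parallel.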

threadFenceReduction

This sample shows how to perform a reduction operation on an array of values using the threadfence intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls as shown in the "reduction" CUDA Sample). Single-pass reduction requires global atomic instructions (Compute Capability 2.0 or later) and the __threadfence() intrinsic (CUDA 2.2 or later).

threadMigration - CUDA Context Thread Management

Simple program illustrating how to use the CUDA Context Management API, together with the new CUDA 4.0 parameter passing and CUDA launch API. CUDA contexts can be created separately and attached independently to different threads.

3.4. CUDA Features Reference

bf16TensorCoreGemm - bfloat16 Tensor Core GEMM

A CUDA sample demonstrating __nv_bfloat16 (e8m7) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 in Ampere chip family tensor cores for faster matrix operations. This sample also uses async copy provided by the CUDA pipeline interface for gmem to shmem async loads, which improves kernel performance and reduces register pressure.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

binaryPartitionCG - Binary Partition Cooperative Groups

This sample is a simple code that illustrates binary partition cooperative groups and reduce within the thread block.

bindlessTexture - Bindless Texture

This example demonstrates use of cudaSurfaceObject, cudaTextureObject, and MipMap support in CUDA. A GPU with Compute Capability SM 3.0 is required to run the sample.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cdpAdvancedQuicksort - Advanced Quicksort (CUDA Dynamic Parallelism)

This sample demonstrates an advanced quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cdpBezierTessellation - Bezier Line Tessellation (CUDA Dynamic Parallelism)

This sample demonstrates bezier tessellation of lines implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cdpQuadtree - Quad Tree (CUDA Dynamic Parallelism)

This sample demonstrates Quad Trees implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cdpSimplePrint - Simple Print (CUDA Dynamic Parallelism)

This sample demonstrates simple printf implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cdpSimpleQuicksort - Simple Quicksort (CUDA Dynamic Parallelism)

This sample demonstrates simple quicksort implemented using CUDA Dynamic Parallelism. This sample requires devices with compute capability 3.5 or higher.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cudaTensorCoreGemm - CUDA Tensor Core GEMM

CUDA sample demonstrating a GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced in CUDA 9. This sample demonstrates the use of the new CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations. In addition, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.

dmmaTensorCoreGemm - Double Precision Tensor Core GEMM

This CUDA sample demonstrates double precision GEMM computation using the double precision Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 for the Tensor Cores of the Ampere chip family, for faster matrix operations. The sample also uses the async copy provided by the CUDA pipeline interface for global memory (gmem) to shared memory (shmem) loads, which improves kernel performance and reduces register pressure. It further demonstrates how to use the cooperative groups async copy interface over a group to perform gmem to shmem async loads.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

globalToShmemAsyncCopy - Global Memory to Shared Memory Async Copy

This sample implements matrix multiplication that uses asynchronous copies of data from global to shared memory on devices with compute capability 8.0 or higher. It also demonstrates the arrive-wait barrier for synchronization.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

immaTensorCoreGemm - Tensor Core GEMM Integer MMA

CUDA sample demonstrating an integer GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API for integers introduced in CUDA 10. This sample demonstrates the use of the CUDA WMMA API employing the Tensor Cores introduced in the Volta chip family for faster matrix operations. In addition, it demonstrates the use of the new CUDA function attribute cudaFuncAttributeMaxDynamicSharedMemorySize, which allows the application to reserve more shared memory than is available by default.

memMapIPCDrv - Memmap IPC Driver API

This CUDA Driver API sample is a very basic sample that demonstrates Inter Process Communication using cuMemMap APIs with one process per GPU for computation. It requires compute capability 3.0 or higher and a Linux or Windows operating system.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

newdelete - NewDelete

This sample demonstrates dynamic global memory allocation through device C++ new and delete operators and virtual function declarations available with CUDA 4.0.

ptxjit - PTX Just-in-Time compilation

This sample uses the Driver API to just-in-time compile (JIT) a Kernel from PTX code. Additionally, this sample demonstrates the seamless interoperability capability of the CUDA Runtime and CUDA Driver API calls. For CUDA 5.5, this sample shows how to use cuLink* functions to link PTX assembly using the CUDA driver at runtime.

StreamPriorities - Stream Priorities

This sample demonstrates basic use of stream priorities.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

tf32TensorCoreGemm - tf32 Tensor Core GEMM

A CUDA sample demonstrating tf32 (e8m10) GEMM computation using the Warp Matrix Multiply and Accumulate (WMMA) API introduced with CUDA 11 for the Tensor Cores of the Ampere chip family, for faster matrix operations. This sample also uses the async copy provided by the CUDA pipeline interface for global memory (gmem) to shared memory (shmem) loads, which improves kernel performance and reduces register pressure.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
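
The e8m10 notation means 8 exponent bits and 10 mantissa bits, so a tf32 value keeps the range of a 32-bit float but only the top 10 of its 23 mantissa bits. The precision loss can be sketched in host-only C++ as follows. `truncate_to_tf32` is a hypothetical helper, not part of the sample or of any CUDA API, and plain truncation is an assumption made for brevity; a real conversion may round instead.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Drops the low 13 mantissa bits of an IEEE-754 float32, leaving e8m10 precision.
// Truncation is an assumption made for brevity; a real conversion may round.
float truncate_to_tf32(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFE000u;  // keep sign (1 bit), exponent (8 bits), top 10 mantissa bits
    float out;
    std::memcpy(&out, &bits, sizeof out);
    return out;
}
```

With this helper, an increment of 2^-10 relative to 1.0f survives the conversion, while an increment of 2^-11 is lost.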

warpAggregatedAtomicsCG - Warp Aggregated Atomics using Cooperative Groups

This sample demonstrates how to use Cooperative Groups (CG) to perform warp aggregated atomics on single and multiple counters, a useful technique for improving performance when many threads atomically add to one or more counters.
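
The idea behind warp aggregation can be sketched with a host-only C++ simulation of a single 32-lane warp. This is not the sample's code: `warp_aggregated_inc` is a hypothetical name, and the `active` bit mask stands in for the set of threads that reach the increment together (on the device, such a group would presumably be obtained from Cooperative Groups). Instead of one atomic operation per thread, one leader adds the group size to the counter and every thread derives its own slot from its rank within the group.

```cpp
#include <atomic>
#include <bitset>
#include <cassert>
#include <cstdint>
#include <vector>

// Simulates one 32-lane warp. `active` has bit i set if lane i wants a slot.
// Returns the slot index for each lane (-1 for inactive lanes).
std::vector<int> warp_aggregated_inc(std::atomic<int>& counter, std::uint32_t active) {
    std::vector<int> slot(32, -1);
    if (active == 0) return slot;

    // The leader performs one atomic add for the whole group.
    const int group_size = static_cast<int>(std::bitset<32>(active).count());
    const int base = counter.fetch_add(group_size);

    // Each active lane derives its own slot from its rank among active lanes.
    int rank = 0;
    for (int lane = 0; lane < 32; ++lane) {
        if (active & (1u << lane)) {
            slot[lane] = base + rank;
            ++rank;
        }
    }
    return slot;
}
```

However many lanes are active, the counter sees a single atomic update per warp, which is where the reduced contention comes from.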

3.5. CUDA Libraries Reference

batchCUBLAS

A CUDA Sample that demonstrates how to use batched CUBLAS API calls to improve overall performance.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

batchedLabelMarkersAndLabelCompressionNPP - Batched Label Markers And Label Compression NPP

An NPP CUDA Sample that demonstrates how to use the NPP label markers generation and label compression functions based on a Union Find (UF) algorithm including both single image and batched image versions.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- Box Filter with NPP

An NPP CUDA Sample that demonstrates how to use the NPP FilterBox function to perform a box filter.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cannyEdgeDetectorNPP - Canny Edge Detector NPP

An NPP CUDA Sample that demonstrates the recommended parameters to use with the nppiFilterCannyBorder_8u_C1R Canny Edge Detection image filter function. This function expects a single channel 8-bit grayscale input image. You can generate a grayscale image from a color image by first calling nppiColorToGray() or nppiRGBToGray(). The Canny Edge Detection function combines and improves on the techniques required to produce an edge detection image using multiple steps.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

conjugateGradient - ConjugateGradient

This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
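
For orientation, the conjugate gradient iteration itself can be sketched in host-only C++. This is a minimal sketch under stated assumptions, not the sample's code: the matrix is stored densely for brevity, whereas a CUSPARSE-based solver works on sparse storage, and `dot`, `matvec` and `conjugate_gradient` are hypothetical names. The matrix-vector product is the step a sparse library would provide, and the dot products and vector updates are the steps a BLAS library would provide.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;  // dense, row-major; an assumption for brevity

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

Vec matvec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) y[i] = dot(A[i], x);
    return y;
}

// Solves A x = b for a symmetric positive definite A, starting from x = 0.
Vec conjugate_gradient(const Mat& A, const Vec& b, double tol = 1e-10, int max_iter = 1000) {
    Vec x(b.size(), 0.0);
    Vec r = b;  // residual r = b - A x, with x = 0
    Vec p = r;  // initial search direction
    double rr = dot(r, r);

    for (int k = 0; k < max_iter && std::sqrt(rr) > tol; ++k) {
        const Vec Ap = matvec(A, p);
        const double alpha = rr / dot(p, Ap);  // step length along p
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;  // keeps the new direction conjugate
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
    return x;
}
```

For the 2 x 2 system with rows (4, 1) and (1, 3) and right-hand side (1, 2), the exact solution is (1/11, 7/11).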

conjugateGradientCudaGraphs - Conjugate Gradient using Cuda Graphs

This sample implements a conjugate gradient solver on the GPU using CUBLAS and CUSPARSE library calls that are captured and launched using the CUDA Graph APIs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

conjugateGradientMultiBlockCG - conjugateGradient using MultiBlock Cooperative Groups

This sample implements a conjugate gradient solver on the GPU using Multi Block Cooperative Groups. It also uses Unified Memory.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

conjugateGradientMultiDeviceCG - conjugateGradient using MultiDevice Cooperative Groups

This sample implements a conjugate gradient solver on multiple GPUs using Multi Device Cooperative Groups. It also uses Unified Memory, optimized using prefetching and usage hints.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

conjugateGradientPrecond - Preconditioned Conjugate Gradient

This sample implements a preconditioned conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

conjugateGradientUM - ConjugateGradientUM

This sample implements a conjugate gradient solver on the GPU using the CUBLAS and CUSPARSE libraries, with Unified Memory.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cudaNvSci - CUDA NvSciBuf/NvSciSync Interop

This sample demonstrates CUDA-NvSciBuf/NvSciSync Interop. Two CPU threads import the NvSciBuf and NvSciSync into CUDA to perform two image processing algorithms on a ppm image: image rotation in the 1st thread and rgba to grayscale conversion of the rotated image in the 2nd thread. Currently only supported on Ubuntu 18.04.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- NvMedia CUDA Interop

This sample demonstrates CUDA-NvMedia interop via NvSciBuf/NvSciSync APIs. Note that this sample only supports cross build from x86_64 to aarch64; aarch64 native build is not supported. For the detailed workflow of the sample, please check cudaNvSciNvMedia_Readme.pdf in the sample directory.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- cuDLA Hybrid Mode

This sample demonstrates cuDLA hybrid mode wherein DLA can be programmed using CUDA.

- cuDLA Standalone Mode

This sample demonstrates cuDLA standalone mode wherein DLA can be programmed without using CUDA.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies NVSCI
Supported SM Architecture SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
Key Concepts cuDLA, Data Parallel Algorithms, Image Processing
Supported OSes Linux

- cuSolverDn Linear Solver

A CUDA Sample that demonstrates cuSolverDN's LU, QR and Cholesky factorization.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cuSolverRf - cuSolverRf Refactorization

A CUDA Sample that demonstrates cuSolver's refactorization library - CUSOLVERRF.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- cuSolverSp Linear Solver

A CUDA Sample that demonstrates cuSolverSP's LU, QR and Cholesky factorization.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cuSolverSp_LowlevelCholesky - cuSolverSp LowlevelCholesky Solver

A CUDA Sample that demonstrates Cholesky factorization using cuSolverSP's low level APIs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

cuSolverSp_LowlevelQR - cuSolverSp Lowlevel QR Solver

A CUDA Sample that demonstrates QR factorization using cuSolverSP's low level APIs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

FilterBorderControlNPP - Filter Border Control NPP

This sample demonstrates how any border version of an NPP filtering function can be used in the most common mode, with border control enabled. These functions can be used to duplicate the results of the equivalent non-border versions of the NPP functions. They can also be used to enable and disable border control on individual source image edges, depending on what portion of the source image is being used as input.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

freeImageInteropNPP - FreeImage and NPP Interoperability

A simple CUDA Sample that demonstrates how to use the FreeImage library with NPP.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

histEqualizationNPP - Histogram Equalization with NPP

This CUDA Sample demonstrates how to use NPP for histogram equalization of image data.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.
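
As background, histogram equalization of an 8-bit grayscale image can be sketched in host-only C++. This is the textbook cumulative-distribution remapping, not the NPP implementation, and `equalize_histogram` is a hypothetical name; the exact scaling and rounding that NPP applies may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Equalizes an 8-bit grayscale image in place using the classic CDF remapping.
void equalize_histogram(std::vector<std::uint8_t>& pixels) {
    if (pixels.empty()) return;

    // 1. Histogram of the 256 gray levels.
    std::array<std::size_t, 256> hist{};
    for (std::uint8_t p : pixels) ++hist[p];

    // 2. Cumulative distribution, remembering its first nonzero value.
    std::array<std::size_t, 256> cdf{};
    std::size_t running = 0;
    std::size_t cdf_min = 0;
    for (int v = 0; v < 256; ++v) {
        running += hist[v];
        cdf[v] = running;
        if (cdf_min == 0 && running != 0) cdf_min = running;
    }

    // A constant image has nothing to spread out.
    const std::size_t total = pixels.size();
    if (total == cdf_min) return;

    // 3. Lookup table that stretches the CDF over the full 0..255 range.
    std::array<std::uint8_t, 256> lut{};
    for (int v = 0; v < 256; ++v) {
        if (cdf[v] < cdf_min) continue;  // level darker than any pixel; unused
        lut[v] = static_cast<std::uint8_t>(
            ((cdf[v] - cdf_min) * 255 + (total - cdf_min) / 2) / (total - cdf_min));
    }

    // 4. Remap every pixel through the table.
    for (std::uint8_t& p : pixels) p = lut[p];
}
```

The histogram and the final remapping are both data-parallel over pixels, which is what makes the operation a natural fit for a GPU library.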

lineOfSight - Line of Sight

This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the Thrust library.
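
The algorithm described above reduces to a prefix maximum over sight-line slopes, which can be sketched in host-only C++. This is a minimal sketch rather than the sample's code: `line_of_sight` is a hypothetical name, the height samples are assumed to be evenly spaced along the ray, and `std::partial_sum` with a max operator stands in for the parallel inclusive scan that a Thrust-based implementation would use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// heights[i] is the terrain height at distance (i + 1) * step from the observer.
// Returns true for every sample that is visible from the observation point.
std::vector<bool> line_of_sight(const std::vector<float>& heights,
                                float observer_height, float step = 1.0f) {
    const std::size_t n = heights.size();

    // 1. Slope of the sight line from the observer to each sample.
    std::vector<float> slope(n);
    for (std::size_t i = 0; i < n; ++i) {
        slope[i] = (heights[i] - observer_height) / (static_cast<float>(i + 1) * step);
    }

    // 2. Inclusive max-scan: steepest slope seen up to and including each sample.
    std::vector<float> max_slope(n);
    std::partial_sum(slope.begin(), slope.end(), max_slope.begin(),
                     [](float a, float b) { return std::max(a, b); });

    // 3. A sample is visible if nothing closer rises above its sight line.
    std::vector<bool> visible(n);
    for (std::size_t i = 0; i < n; ++i) visible[i] = slope[i] >= max_slope[i];
    return visible;
}
```

For heights 1, 3, 2, 5 seen from height 0, the first two samples are visible and the last two are hidden behind the second one.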

matrixMulCUBLAS - Matrix Multiplication (CUBLAS)

This sample implements matrix multiplication from Chapter 3 of the programming guide. To illustrate GPU performance for matrix multiply, this sample also shows how to use the new CUDA 4.0 interface for CUBLAS to achieve high performance for matrix multiplication.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- MersenneTwisterGP11213

This sample demonstrates the Mersenne Twister random number generator GP11213 in cuRAND.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- NVJPEG simple

A CUDA Sample that demonstrates single and batched decoding of jpeg images using NVJPEG Library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

- NVJPEG Encoder

A CUDA Sample that demonstrates single encoding of jpeg images using NVJPEG Library.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

oceanFFT - CUDA FFT Ocean Simulation

This sample simulates an Ocean height field using CUFFT Library and renders the result using OpenGL.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

randomFog - Random Fog

This sample illustrates pseudo- and quasi-random numbers produced by CURAND.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies X11, GL, CURAND
Supported SM Architecture SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API cudaMalloc, cudaFree, cudaMemcpy, cudaGetErrorString
Key Concepts 3D Graphics, CURAND Library
Supported OSes Linux, Windows

- Simple CUBLAS

Example of using the CUBLAS API to perform GEMM operations.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies CUBLAS
Supported SM Architecture SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API cudaMalloc, cudaFree
Key Concepts Image Processing, CUBLAS Library
Supported OSes Linux, Windows

simpleCUBLAS_LU - Simple CUBLAS LU

CUDA sample demonstrating cuBLAS API cublasDgetrfBatched() for lower-upper (LU) decomposition of a matrix.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies CUBLAS
Supported SM Architecture SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API cudaGetErrorEnum, cudaFree, cudaMalloc, cudaMemcpy
Key Concepts CUBLAS Library, LU decomposition
Supported OSes Linux, Windows
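
For readers unfamiliar with the factorization, LU decomposition with partial pivoting of a single matrix can be sketched in host-only C++. This is an illustration of the mathematics only and does not reproduce the cuBLAS interface: `lu_decompose` is a hypothetical name, the matrix is stored row-major here, and pivot indices are 0-based, whereas the conventions of the cuBLAS routine may differ. A batched routine applies the same factorization to many small matrices at once.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// In-place LU factorization with partial pivoting of an n x n row-major matrix.
// On return, `a` holds U on and above the diagonal and the multipliers of L
// (unit diagonal implied) below it; pivot[k] is the row swapped with row k.
// Returns false if a zero pivot is encountered (the matrix is singular).
bool lu_decompose(std::vector<double>& a, std::size_t n, std::vector<std::size_t>& pivot) {
    assert(a.size() == n * n);
    pivot.assign(n, 0);
    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude in column k.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i) {
            if (std::fabs(a[i * n + k]) > std::fabs(a[p * n + k])) p = i;
        }
        pivot[k] = p;
        if (a[p * n + k] == 0.0) return false;
        if (p != k) {
            for (std::size_t j = 0; j < n; ++j) std::swap(a[k * n + j], a[p * n + j]);
        }
        // Eliminate below the pivot, storing the multipliers in place.
        for (std::size_t i = k + 1; i < n; ++i) {
            a[i * n + k] /= a[k * n + k];
            for (std::size_t j = k + 1; j < n; ++j) {
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
            }
        }
    }
    return true;
}
```

For the matrix with rows (4, 3) and (6, 3), the rows are swapped first, giving U with rows (6, 3) and (0, 1) and the single L multiplier 2/3.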

simpleCUBLASXT - Simple CUBLAS XT

Example of using the CUBLAS-XT library, which performs GEMM operations over multiple GPUs.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies CUBLAS
Supported SM Architecture SM 3.5, SM 3.7, SM 5.0, SM 5.2, SM 5.3, SM 6.0, SM 6.1, SM 6.2, SM 7.0, SM 7.2, SM 7.5, SM 8.0, SM 8.6, SM 8.7
CUDA API cudaFree, cudaGetDeviceCount, cudaGetDeviceProperties
Key Concepts CUBLAS-XT Library
Supported OSes Linux, Windows

simpleCUFFT - Simple CUFFT

Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into frequency domain, multiplying them together, and transforming the signal back to time domain. cuFFT plans are created using simple and advanced API functions.

This sample depends on other applications or libraries to be present on the system to either build or run. If these dependencies are not available on the system, the sample will not be installed. If these dependencies are available, but not installed, the sample will waive itself at build time.

Dependencies CUFFT
Supported SM Architecture SM 3.5,