5.1. Compute Capabilities#

The general specifications and features of a compute device depend on its compute capability (see Compute Capability and Streaming Multiprocessor Versions).

Table 29, Table 30, and Table 31 show the features and technical specifications associated with each compute capability that is currently supported.

All NVIDIA GPU architectures use a little-endian representation.

5.1.1. Obtain the GPU Compute Capability#

The CUDA GPU Compute Capability page provides a comprehensive mapping from NVIDIA GPU models to their compute capability.

Alternatively, the nvidia-smi tool, provided with the NVIDIA Driver, can be used to get the compute capability of a GPU. For example, the following command will output the GPU names and compute capabilities available on the system:

nvidia-smi --query-gpu=name,compute_cap

At runtime, the compute capability can be obtained using the CUDA Runtime API cudaDeviceGetAttribute(), the CUDA Driver API cuDeviceGetAttribute(), or the NVML API nvmlDeviceGetCudaComputeCapability():

CUDA Runtime API:

#include <cuda_runtime_api.h>

int computeCapabilityMajor, computeCapabilityMinor;
// device_id is the zero-based index of the GPU to query.
cudaDeviceGetAttribute(&computeCapabilityMajor, cudaDevAttrComputeCapabilityMajor, device_id);
cudaDeviceGetAttribute(&computeCapabilityMinor, cudaDevAttrComputeCapabilityMinor, device_id);

CUDA Driver API:

#include <cuda.h>

int computeCapabilityMajor, computeCapabilityMinor;
cuDeviceGetAttribute(&computeCapabilityMajor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, device_id);
cuDeviceGetAttribute(&computeCapabilityMinor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, device_id);

NVML API (requires linking with -lnvidia-ml):

#include <nvml.h>

int computeCapabilityMajor, computeCapabilityMinor;
// nvmlDevice must first be obtained, for example with nvmlInit_v2() and
// nvmlDeviceGetHandleByIndex_v2().
nvmlDeviceGetCudaComputeCapability(nvmlDevice, &computeCapabilityMajor, &computeCapabilityMinor);

5.1.2. Feature Availability#

Most compute features introduced with a compute architecture are intended to be available on all subsequent architectures. This is indicated in Table 29 by a “Yes” for every compute capability after the one in which the feature was introduced.

5.1.2.1. Architecture-Specific Features#

Beginning with devices of Compute Capability 9.0, specialized compute features that are introduced with an architecture may not be guaranteed to be available on all subsequent compute capabilities. These features are called architecture-specific features and target acceleration of specialized operations, such as Tensor Core operations, which are not intended for all classes of compute capabilities or may significantly change in future generations. Code must be compiled with an architecture-specific compiler target (see Feature Set Compiler Targets) to enable architecture-specific features. Code compiled with an architecture-specific compiler target can only be run on the exact compute capability it was compiled for.
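For example, code that uses Compute Capability 9.0 architecture-specific features can be compiled with the a-suffixed target (a minimal sketch; the file and output names are illustrative):

nvcc -arch=sm_90a kernel.cu -o kernel   # runs only on Compute Capability 9.0 devices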

5.1.2.2. Family-Specific Features#

Beginning with devices of Compute Capability 10.0, some architecture-specific features are common to devices of more than one compute capability. Devices that share these features are part of the same family, and the features are called family-specific features. Family-specific features are guaranteed to be available on all devices in the same family. A family-specific compiler target (see Feature Set Compiler Targets) is required to enable family-specific features. Code compiled for a family-specific target can only run on GPUs that are members of that family.

5.1.2.3. Feature Set Compiler Targets#

There are three sets of compute features that the compiler can target:

Baseline Feature Set: The predominant set of compute features that are introduced with the intent to be available for subsequent compute architectures. These features and their availability are summarized in Table 29.

Architecture-Specific Feature Set: A small, highly specialized set of features, called architecture-specific, introduced to accelerate specialized operations; these are not guaranteed to be available, and might change significantly, on subsequent compute architectures. These features are summarized in the respective “Compute Capability #.#” subsections. The architecture-specific feature set is a superset of the family-specific feature set. Architecture-specific compiler targets were introduced with Compute Capability 9.0 devices and are selected with an a suffix on the compilation target, for example compute_100a or compute_120a.

Family-Specific Feature Set: Some architecture-specific features are common to GPUs of more than one compute capability. These features are summarized in the respective “Compute Capability #.#” subsections. With a few exceptions, later-generation devices that share the same major compute capability are in the same family. Table 28 indicates the compatibility of family-specific targets with device compute capabilities, including the exceptions. The family-specific feature set is a superset of the baseline feature set. Family-specific compiler targets were introduced with Compute Capability 10.0 devices and are selected with an f suffix on the compilation target, for example compute_100f or compute_120f.

All devices starting from Compute Capability 9.0 have a set of features that are architecture-specific. To use the complete set of these features on a specific GPU, the architecture-specific compiler target with the a suffix must be used. Additionally, starting from Compute Capability 10.0, there are sets of features that appear in multiple devices with different minor compute capabilities. These sets of features are called family-specific features, and the devices that share them are said to be part of the same family. The family-specific features are the subset of the architecture-specific features that is shared by all members of a GPU family. The family-specific compiler target with the f suffix allows the compiler to generate code that uses this common subset of architecture-specific features.

For example:

  • The compute_100 compilation target does not allow the use of architecture-specific features. This target will be compatible with all devices of compute capability 10.0 and later.

  • The compute_100f family-specific compilation target allows the use of the subset of architecture-specific features that are common across the GPU family. This target will only be compatible with devices that are part of the GPU family. In this example, it is compatible with devices of Compute Capability 10.0 and Compute Capability 10.3. The features available in the family-specific compute_100f target are a superset of the features available in the baseline compute_100 target.

  • The compute_100a architecture-specific compilation target allows the use of the complete set of architecture-specific features in Compute Capability 10.0 devices. This target will only be compatible with devices of Compute Capability 10.0. The features available in the compute_100a target form a superset of the features available in the compute_100f target. The corresponding nvcc invocations are sketched after this list.
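Expressed as nvcc compilation targets, the three cases above might look as follows (a sketch; the family-specific f targets assume a toolkit recent enough to support them, and the file and output names are illustrative):

nvcc -arch=compute_100  kernel.cu -o k_baseline  # baseline: CC 10.0 and later
nvcc -arch=compute_100f kernel.cu -o k_family    # family-specific: CC 10.0 and 10.3
nvcc -arch=compute_100a kernel.cu -o k_arch      # architecture-specific: CC 10.0 only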

Table 28 Family-Specific Compatibility#

| Compilation Target | Compatible with Compute Capability |
|---|---|
| compute_100f | 10.0, 10.3 |
| compute_103f | 10.3 [1] |
| compute_110f | 11.0 [1] |
| compute_120f | 12.0, 12.1 |
| compute_121f | 12.1 [1] |

5.1.3. Features and Technical Specifications#

Table 29 Feature Support per Compute Capability#

| Feature Support (unlisted features are supported for all compute capabilities) | 7.x | 8.x | 9.0 | 10.x | 11.0 | 12.x |
|---|---|---|---|---|---|---|
| Atomic functions operating on 128-bit integer values in shared and global memory (Atomic Functions) | No | No | Yes | Yes | Yes | Yes |
| Atomic addition operating on float2 and float4 floating point vectors in global memory (atomicAdd()) | No | No | Yes | Yes | Yes | Yes |
| Warp reduce functions (Warp Reduce Functions) | No | Yes | Yes | Yes | Yes | Yes |
| Bfloat16-precision floating-point operations | No | Yes | Yes | Yes | Yes | Yes |
| 128-bit-precision floating-point operations | No | No | No | Yes | Yes | Yes |
| Hardware-accelerated memcpy_async (Pipelines) | No | Yes | Yes | Yes | Yes | Yes |
| Hardware-accelerated Split Arrive/Wait Barrier (Asynchronous Barriers) | No | Yes | Yes | Yes | Yes | Yes |
| L2 Cache Residency Management (L2 Cache Control) | No | Yes | Yes | Yes | Yes | Yes |
| DPX Instructions for Accelerated Dynamic Programming (Dynamic Programming eXtension (DPX) Instructions) | Multiple Instr. | Multiple Instr. | Native | Native | Native | Multiple Instr. |
| Distributed Shared Memory | No | No | Yes | Yes | Yes | Yes |
| Thread Block Cluster (Thread Block Clusters) | No | No | Yes | Yes | Yes | Yes |
| Tensor Memory Accelerator (TMA) unit (Using the Tensor Memory Accelerator (TMA)) | No | No | Yes | Yes | Yes | Yes |
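As one concrete case from Table 29, warp reduce functions (available from Compute Capability 8.x) replace a multi-step shuffle reduction with a single intrinsic. A minimal sketch, assuming compilation for an sm_80-or-later target (for example, nvcc -arch=sm_80):

#include <cuda_runtime.h>
#include <stdio.h>

// Sums 32 values across a warp with a single hardware-accelerated intrinsic.
__global__ void warpSum(const unsigned *in, unsigned *out) {
    unsigned v = in[threadIdx.x];
    // All 32 lanes participate (full mask); each lane receives the warp-wide sum.
    unsigned sum = __reduce_add_sync(0xffffffffu, v);
    if (threadIdx.x == 0) *out = sum;
}

int main(void) {
    unsigned h_in[32], h_out = 0;
    for (int i = 0; i < 32; ++i) h_in[i] = i;  // expected sum: 496
    unsigned *d_in, *d_out;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("warp sum = %u\n", h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}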

Note that the KB and K units used in the following tables correspond to 1024 bytes (that is, one KiB) and 1024, respectively.

Table 30 Device and Streaming Multiprocessor (SM) Information per Compute Capability#

| Compute Capability | 7.5 | 8.0 | 8.6 | 8.7 | 8.9 | 9.0 | 10.0 | 10.3 | 11.0 | 12.x |
|---|---|---|---|---|---|---|---|---|---|---|
| Ratio of FP32 to FP64 Throughput [2] | 32:1 | 2:1 | 64:1 | 64:1 | 64:1 | 2:1 | 2:1 | 2:1 | 64:1 | 64:1 |
| Maximum number of resident grids per device (Concurrent Kernel Execution) | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 | 128 |
| Maximum dimensionality of a grid | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Maximum x-dimension of a grid | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 | 2^31 - 1 |
| Maximum y- or z-dimension of a grid | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 | 65535 |
| Maximum dimensionality of a thread block | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 | 3 |
| Maximum x- or y-dimension of a thread block | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| Maximum z-dimension of a thread block | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 | 64 |
| Maximum number of threads per block | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 | 1024 |
| Warp size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Maximum number of resident blocks per SM | 16 | 32 | 16 | 16 | 24 | 32 | 32 | 32 | 24 | 24 |
| Maximum number of resident warps per SM | 32 | 64 | 48 | 48 | 48 | 64 | 64 | 64 | 48 | 48 |
| Maximum number of resident threads per SM | 1024 | 2048 | 1536 | 1536 | 1536 | 2048 | 2048 | 2048 | 1536 | 1536 |
| Green contexts: minimum SM partition size for useFlags 0 | 2 | 4 | 4 | 4 | 4 | 8 | 8 | 8 | 8 | 8 |
| Green contexts: SM co-scheduled alignment per partition for useFlags 0 | 2 | 2 | 2 | 2 | 2 | 8 | 8 | 8 | 8 | 8 |

Table 31 Memory Information per Compute Capability#

| Compute Capability | 7.5 | 8.0 | 8.6 | 8.7 | 8.9 | 9.0 | 10.x | 11.0 | 12.x |
|---|---|---|---|---|---|---|---|---|---|
| Number of 32-bit registers per SM | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K |
| Maximum number of 32-bit registers per thread block | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K | 64 K |
| Maximum number of 32-bit registers per thread | 255 | 255 | 255 | 255 | 255 | 255 | 255 | 255 | 255 |
| Maximum amount of shared memory per SM | 64 KB | 164 KB | 100 KB | 164 KB | 100 KB | 228 KB | 228 KB | 100 KB | 100 KB |
| Maximum amount of shared memory per thread block [3] | 64 KB | 163 KB | 99 KB | 163 KB | 99 KB | 227 KB | 227 KB | 99 KB | 99 KB |
| Number of shared memory banks | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| Maximum amount of local memory per thread | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB | 512 KB |
| Constant memory size | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB | 64 KB |
| Cache working set per SM for constant memory | 8 KB | 8 KB | 8 KB | 8 KB | 8 KB | 8 KB | 8 KB | 8 KB | 8 KB |
| Cache working set per SM for texture memory | 32 or 64 KB | 28 KB ~ 192 KB | 28 KB ~ 128 KB | 28 KB ~ 192 KB | 28 KB ~ 128 KB | 28 KB ~ 256 KB | 28 KB ~ 256 KB | 28 KB ~ 128 KB | 28 KB ~ 128 KB |
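Many of the limits in Tables 30 and 31 can be queried at runtime with cudaDeviceGetAttribute(). A minimal sketch using standard CUDA Runtime attribute enums:

#include <cuda_runtime_api.h>
#include <stdio.h>

int main(void) {
    int device_id = 0;
    int threadsPerBlock, smemPerBlockOptin, regsPerSM;
    // Maximum number of threads per block (Table 30).
    cudaDeviceGetAttribute(&threadsPerBlock, cudaDevAttrMaxThreadsPerBlock, device_id);
    // Maximum opt-in shared memory per thread block, in bytes (Table 31).
    cudaDeviceGetAttribute(&smemPerBlockOptin, cudaDevAttrMaxSharedMemoryPerBlockOptin, device_id);
    // Number of 32-bit registers per SM (Table 31).
    cudaDeviceGetAttribute(&regsPerSM, cudaDevAttrMaxRegistersPerMultiprocessor, device_id);
    printf("threads/block: %d, opt-in shared memory/block: %d bytes, registers/SM: %d\n",
           threadsPerBlock, smemPerBlockOptin, regsPerSM);
    return 0;
}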

Table 32 Shared Memory Capacity per Compute Capability#

| Compute Capability | Unified Data Cache Size (KB) | SMEM Capacity Sizes (KB) |
|---|---|---|
| 7.5 | 96 | 32, 64 |
| 8.0 | 192 | 0, 8, 16, 32, 64, 100, 132, 164 |
| 8.6 | 128 | 0, 8, 16, 32, 64, 100 |
| 8.7 | 192 | 0, 8, 16, 32, 64, 100, 132, 164 |
| 8.9 | 128 | 0, 8, 16, 32, 64, 100 |
| 9.0 | 256 | 0, 8, 16, 32, 64, 100, 132, 164, 196, 228 |
| 10.x | 256 | 0, 8, 16, 32, 64, 100, 132, 164, 196, 228 |
| 11.0 | 256 | 0, 8, 16, 32, 64, 100, 132, 164, 196, 228 |
| 12.x | 128 | 0, 8, 16, 32, 64, 100 |
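Kernels that need more than the default limit of 48 KB of dynamic shared memory per block must opt in explicitly with cudaFuncSetAttribute(). A minimal sketch (the kernel is hypothetical, and the 99 KB size is illustrative of the per-block limits in Table 31 for, e.g., Compute Capability 8.6, 8.9, and 12.x):

#include <cuda_runtime.h>

__global__ void myKernel(const float *data) {
    extern __shared__ float smem[];  // dynamic shared memory
    smem[threadIdx.x] = data[threadIdx.x];
    // ... hypothetical work on smem ...
}

void launchWithLargeSmem(const float *data) {
    // Opt in to 99 KB of dynamic shared memory per block; without this call,
    // launches requesting more than 48 KB fail.
    size_t smemBytes = 99 * 1024;
    cudaFuncSetAttribute(myKernel, cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smemBytes);
    myKernel<<<1, 256, smemBytes>>>(data);
}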

Table 33 shows the input data types supported by Tensor Core acceleration. The Tensor Core feature set is exposed in the CUDA compilation toolchain through inline PTX. It is strongly recommended that applications use this feature set through CUDA-X libraries such as cuDNN, cuBLAS, and cuFFT, or through CUTLASS, a collection of CUDA C++ template abstractions and Python domain-specific languages (DSLs) for implementing high-performance matrix-matrix multiplication (GEMM) and related computations within CUDA.

Table 33 Input Data Types Supported by Tensor Core Acceleration per Compute Capability#

| Compute Capability | FP64 | TF32 | BF16 | FP16 | FP8 | FP6 | FP4 | INT8 | INT4 |
|---|---|---|---|---|---|---|---|---|---|
| 7.5 |  |  |  | Yes |  |  |  | Yes | Yes |
| 8.0 | Yes | Yes | Yes | Yes |  |  |  | Yes | Yes |
| 8.6 |  | Yes | Yes | Yes |  |  |  | Yes | Yes |
| 8.7 |  | Yes | Yes | Yes |  |  |  | Yes | Yes |
| 8.9 |  | Yes | Yes | Yes | Yes |  |  | Yes | Yes |
| 9.0 | Yes | Yes | Yes | Yes | Yes |  |  | Yes |  |
| 10.0 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |  |
| 10.3 |  | Yes | Yes | Yes | Yes | Yes | Yes | Yes |  |
| 11.0 |  | Yes | Yes | Yes | Yes | Yes | Yes | Yes |  |
| 12.x |  | Yes | Yes | Yes | Yes | Yes | Yes | Yes |  |
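As an illustration of the library route recommended above, the following sketch calls cuBLAS to run an FP16 GEMM with FP32 accumulation, which cuBLAS may dispatch to Tensor Cores on capable devices (error handling omitted; matrix sizes are illustrative; link with -lcublas):

#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

int main(void) {
    const int m = 256, n = 256, k = 256;
    __half *A, *B;
    float *C;
    cudaMalloc(&A, m * k * sizeof(__half));
    cudaMalloc(&B, k * n * sizeof(__half));
    cudaMalloc(&C, m * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // FP16 inputs, FP32 accumulation and output; cuBLAS selects a Tensor Core
    // kernel when the device and problem shape allow it.
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m, B, CUDA_R_16F, k,
                 &beta,  C, CUDA_R_32F, m,
                 CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}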