Contents
- 1. Introduction
- 2. Programming Model
- 3. PTX Machine Model
- 4. Syntax
-
5. State Spaces, Types, and Variables
- 5.1. State Spaces
- 5.2. Types
- 5.3. Texture Sampler and Surface Types
- 5.4. Variables
- 5.5. Tensors
- 6. Instruction Operands
- 7. Abstracting the ABI
- 8. Memory Consistency Model
-
9. Instruction Set
- 9.1. Format and Semantics of Instruction Descriptions
- 9.2. PTX Instructions
- 9.3. Predicated Execution
- 9.4. Type Information for Instructions and Operands
- 9.5. Divergence of Threads in Control Constructs
- 9.6. Semantics
-
9.7. Instructions
-
9.7.1. Integer Arithmetic Instructions
- 9.7.1.1. Integer Arithmetic Instructions: add
- 9.7.1.2. Integer Arithmetic Instructions: sub
- 9.7.1.3. Integer Arithmetic Instructions: mul
- 9.7.1.4. Integer Arithmetic Instructions: mad
- 9.7.1.5. Integer Arithmetic Instructions: mul24
- 9.7.1.6. Integer Arithmetic Instructions: mad24
- 9.7.1.7. Integer Arithmetic Instructions: sad
- 9.7.1.8. Integer Arithmetic Instructions: div
- 9.7.1.9. Integer Arithmetic Instructions: rem
- 9.7.1.10. Integer Arithmetic Instructions: abs
- 9.7.1.11. Integer Arithmetic Instructions: neg
- 9.7.1.12. Integer Arithmetic Instructions: min
- 9.7.1.13. Integer Arithmetic Instructions: max
- 9.7.1.14. Integer Arithmetic Instructions: popc
- 9.7.1.15. Integer Arithmetic Instructions: clz
- 9.7.1.16. Integer Arithmetic Instructions: bfind
- 9.7.1.17. Integer Arithmetic Instructions: fns
- 9.7.1.18. Integer Arithmetic Instructions: brev
- 9.7.1.19. Integer Arithmetic Instructions: bfe
- 9.7.1.20. Integer Arithmetic Instructions: bfi
- 9.7.1.21. Integer Arithmetic Instructions: szext
- 9.7.1.22. Integer Arithmetic Instructions: bmsk
- 9.7.1.23. Integer Arithmetic Instructions: dp4a
- 9.7.1.24. Integer Arithmetic Instructions: dp2a
-
9.7.2. Extended-Precision Integer Arithmetic Instructions
- 9.7.2.1. Extended-Precision Arithmetic Instructions: add.cc
- 9.7.2.2. Extended-Precision Arithmetic Instructions: addc
- 9.7.2.3. Extended-Precision Arithmetic Instructions: sub.cc
- 9.7.2.4. Extended-Precision Arithmetic Instructions: subc
- 9.7.2.5. Extended-Precision Arithmetic Instructions: mad.cc
- 9.7.2.6. Extended-Precision Arithmetic Instructions: madc
-
9.7.3. Floating-Point Instructions
- 9.7.3.1. Floating Point Instructions: testp
- 9.7.3.2. Floating Point Instructions: copysign
- 9.7.3.3. Floating Point Instructions: add
- 9.7.3.4. Floating Point Instructions: sub
- 9.7.3.5. Floating Point Instructions: mul
- 9.7.3.6. Floating Point Instructions: fma
- 9.7.3.7. Floating Point Instructions: mad
- 9.7.3.8. Floating Point Instructions: div
- 9.7.3.9. Floating Point Instructions: abs
- 9.7.3.10. Floating Point Instructions: neg
- 9.7.3.11. Floating Point Instructions: min
- 9.7.3.12. Floating Point Instructions: max
- 9.7.3.13. Floating Point Instructions: rcp
- 9.7.3.14. Floating Point Instructions: rcp.approx.ftz.f64
- 9.7.3.15. Floating Point Instructions: sqrt
- 9.7.3.16. Floating Point Instructions: rsqrt
- 9.7.3.17. Floating Point Instructions: rsqrt.approx.ftz.f64
- 9.7.3.18. Floating Point Instructions: sin
- 9.7.3.19. Floating Point Instructions: cos
- 9.7.3.20. Floating Point Instructions: lg2
- 9.7.3.21. Floating Point Instructions: ex2
- 9.7.3.22. Floating Point Instructions: tanh
-
9.7.4. Half Precision Floating-Point Instructions
- 9.7.4.1. Half Precision Floating Point Instructions: add
- 9.7.4.2. Half Precision Floating Point Instructions: sub
- 9.7.4.3. Half Precision Floating Point Instructions: mul
- 9.7.4.4. Half Precision Floating Point Instructions: fma
- 9.7.4.5. Half Precision Floating Point Instructions: neg
- 9.7.4.6. Half Precision Floating Point Instructions: abs
- 9.7.4.7. Half Precision Floating Point Instructions: min
- 9.7.4.8. Half Precision Floating Point Instructions: max
- 9.7.4.9. Half Precision Floating Point Instructions: tanh
- 9.7.4.10. Half Precision Floating Point Instructions: ex2
- 9.7.5. Comparison and Selection Instructions
- 9.7.6. Half Precision Comparison Instructions
-
9.7.7. Logic and Shift Instructions
- 9.7.7.1. Logic and Shift Instructions: and
- 9.7.7.2. Logic and Shift Instructions: or
- 9.7.7.3. Logic and Shift Instructions: xor
- 9.7.7.4. Logic and Shift Instructions: not
- 9.7.7.5. Logic and Shift Instructions: cnot
- 9.7.7.6. Logic and Shift Instructions: lop3
- 9.7.7.7. Logic and Shift Instructions: shf
- 9.7.7.8. Logic and Shift Instructions: shl
- 9.7.7.9. Logic and Shift Instructions: shr
-
9.7.8. Data Movement and Conversion Instructions
- 9.7.8.1. Cache Operators
- 9.7.8.2. Cache Eviction Priority Hints
- 9.7.8.3. Data Movement and Conversion Instructions: mov
- 9.7.8.4. Data Movement and Conversion Instructions: mov
- 9.7.8.5. Data Movement and Conversion Instructions: shfl (deprecated)
- 9.7.8.6. Data Movement and Conversion Instructions: shfl.sync
- 9.7.8.7. Data Movement and Conversion Instructions: prmt
- 9.7.8.8. Data Movement and Conversion Instructions: ld
- 9.7.8.9. Data Movement and Conversion Instructions: ld.global.nc
- 9.7.8.10. Data Movement and Conversion Instructions: ldu
- 9.7.8.11. Data Movement and Conversion Instructions: st
- 9.7.8.12. Data Movement and Conversion Instructions: st.async
- 9.7.8.13. Data Movement and Conversion Instructions: multimem.ld_reduce, multimem.st, multimem.red
- 9.7.8.14. Data Movement and Conversion Instructions: prefetch, prefetchu
- 9.7.8.15. Data Movement and Conversion Instructions: applypriority
- 9.7.8.16. Data Movement and Conversion Instructions: discard
- 9.7.8.17. Data Movement and Conversion Instructions: createpolicy
- 9.7.8.18. Data Movement and Conversion Instructions: isspacep
- 9.7.8.19. Data Movement and Conversion Instructions: cvta
- 9.7.8.20. Data Movement and Conversion Instructions: cvt
- 9.7.8.21. Data Movement and Conversion Instructions: cvt.pack
- 9.7.8.22. Data Movement and Conversion Instructions: mapa
- 9.7.8.23. Data Movement and Conversion Instructions: getctarank
-
9.7.8.24. Data Movement and Conversion Instructions: Asynchronous copy
- 9.7.8.24.1. Completion Mechanisms for Asynchronous Copy Operations
- 9.7.8.24.2. Async Proxy
- 9.7.8.24.3. Data Movement and Conversion Instructions: cp.async
- 9.7.8.24.4. Data Movement and Conversion Instructions: cp.async.commit_group
- 9.7.8.24.5. Data Movement and Conversion Instructions: cp.async.wait_group / cp.async.wait_all
- 9.7.8.24.6. Data Movement and Conversion Instructions: cp.async.bulk
- 9.7.8.24.7. Data Movement and Conversion Instructions: cp.reduce.async.bulk
- 9.7.8.24.8. Data Movement and Conversion Instructions: cp.async.bulk.prefetch
- 9.7.8.24.9. Data Movement and Conversion Instructions: cp.async.bulk.tensor
- 9.7.8.24.10. Data Movement and Conversion Instructions: cp.reduce.async.bulk.tensor
- 9.7.8.24.11. Data Movement and Conversion Instructions: cp.async.bulk.prefetch.tensor
- 9.7.8.24.12. Data Movement and Conversion Instructions: cp.async.bulk.commit_group
- 9.7.8.24.13. Data Movement and Conversion Instructions: cp.async.bulk.wait_group
- 9.7.8.25. Data Movement and Conversion Instructions: tensormap.replace
- 9.7.9. Texture Instructions
- 9.7.10. Surface Instructions
- 9.7.11. Control Flow Instructions
-
9.7.12. Parallel Synchronization and Communication Instructions
- 9.7.12.1. Parallel Synchronization and Communication Instructions: bar, barrier
- 9.7.12.2. Parallel Synchronization and Communication Instructions: bar.warp.sync
- 9.7.12.3. Parallel Synchronization and Communication Instructions: barrier.cluster
- 9.7.12.4. Parallel Synchronization and Communication Instructions: membar/fence
- 9.7.12.5. Parallel Synchronization and Communication Instructions: atom
- 9.7.12.6. Parallel Synchronization and Communication Instructions: red
- 9.7.12.7. Parallel Synchronization and Communication Instructions: red.async
- 9.7.12.8. Parallel Synchronization and Communication Instructions: vote (deprecated)
- 9.7.12.9. Parallel Synchronization and Communication Instructions: vote.sync
- 9.7.12.10. Parallel Synchronization and Communication Instructions: match.sync
- 9.7.12.11. Parallel Synchronization and Communication Instructions: activemask
- 9.7.12.12. Parallel Synchronization and Communication Instructions: redux.sync
- 9.7.12.13. Parallel Synchronization and Communication Instructions: griddepcontrol
- 9.7.12.14. Parallel Synchronization and Communication Instructions: elect.sync
-
9.7.12.15. Parallel Synchronization and Communication Instructions: mbarrier
- 9.7.12.15.1. Size and alignment of mbarrier object
- 9.7.12.15.2. Contents of the mbarrier object
- 9.7.12.15.3. Lifecycle of the mbarrier object
- 9.7.12.15.4. Phase of the mbarrier object
- 9.7.12.15.5. Tracking asynchronous operations by the mbarrier object
- 9.7.12.15.6. Phase Completion of the mbarrier object
- 9.7.12.15.7. Arrive-on operation on mbarrier object
- 9.7.12.15.8. mbarrier support with shared memory
- 9.7.12.15.9. Parallel Synchronization and Communication Instructions: mbarrier.init
- 9.7.12.15.10. Parallel Synchronization and Communication Instructions: mbarrier.inval
- 9.7.12.15.11. Parallel Synchronization and Communication Instructions: mbarrier.expect_tx
- 9.7.12.15.12. Parallel Synchronization and Communication Instructions: mbarrier.complete_tx
- 9.7.12.15.13. Parallel Synchronization and Communication Instructions: mbarrier.arrive
- 9.7.12.15.14. Parallel Synchronization and Communication Instructions: mbarrier.arrive_drop
- 9.7.12.15.15. Parallel Synchronization and Communication Instructions: cp.async.mbarrier.arrive
- 9.7.12.15.16. Parallel Synchronization and Communication Instructions: mbarrier.test_wait/mbarrier.try_wait
- 9.7.12.15.17. Parallel Synchronization and Communication Instructions: mbarrier.pending_count
- 9.7.12.15.18. Parallel Synchronization and Communication Instructions: tensormap.cp_fenceproxy
-
9.7.13. Warp Level Matrix Multiply-Accumulate Instructions
- 9.7.13.1. Matrix Shape
- 9.7.13.2. Matrix Data-types
- 9.7.13.3. Matrix multiply-accumulate operation using wmma instructions
-
9.7.13.4. Matrix multiply-accumulate operation using mma instruction
- 9.7.13.4.1. Matrix Fragments for mma.m8n8k4 with .f16 floating point type
- 9.7.13.4.2. Matrix Fragments for mma.m8n8k4 with .f64 floating point type
- 9.7.13.4.3. Matrix Fragments for mma.m8n8k16
- 9.7.13.4.4. Matrix Fragments for mma.m8n8k32
- 9.7.13.4.5. Matrix Fragments for mma.m8n8k128
- 9.7.13.4.6. Matrix Fragments for mma.m16n8k4
- 9.7.13.4.7. Matrix Fragments for mma.m16n8k8
- 9.7.13.4.8. Matrix Fragments for mma.m16n8k16 with floating point type
- 9.7.13.4.9. Matrix Fragments for mma.m16n8k16 with integer type
- 9.7.13.4.10. Matrix Fragments for mma.m16n8k32
- 9.7.13.4.11. Matrix Fragments for mma.m16n8k64
- 9.7.13.4.12. Matrix Fragments for mma.m16n8k128
- 9.7.13.4.13. Matrix Fragments for mma.m16n8k256
- 9.7.13.4.14. Multiply-and-Accumulate Instruction: mma
- 9.7.13.4.15. Warp-level matrix load instruction: ldmatrix
- 9.7.13.4.16. Warp-level matrix store instruction: stmatrix
- 9.7.13.4.17. Warp-level matrix transpose instruction: movmatrix
-
9.7.13.5. Matrix multiply-accumulate operation using mma.sp instruction with sparse matrix A
- 9.7.13.5.1. Sparse matrix storage
-
9.7.13.5.2. Matrix fragments for multiply-accumulate operation with sparse matrix A
- 9.7.13.5.2.1. Matrix Fragments for sparse mma.m16n8k16 with .f16 and .bf16 types
- 9.7.13.5.2.2. Matrix Fragments for sparse mma.m16n8k32 with .f16 and .bf16 types
- 9.7.13.5.2.3. Matrix Fragments for sparse mma.m16n8k16 with .tf32 floating point type
- 9.7.13.5.2.4. Matrix Fragments for sparse mma.m16n8k8 with .tf32 floating point type
- 9.7.13.5.2.5. Matrix Fragments for sparse mma.m16n8k32 with .u8/.s8 integer type
- 9.7.13.5.2.6. Matrix Fragments for sparse mma.m16n8k64 with .u8/.s8/.e4m3/.e5m2 type
- 9.7.13.5.2.7. Matrix Fragments for sparse mma.m16n8k64 with .u4/.s4 integer type
- 9.7.13.5.2.8. Matrix Fragments for sparse mma.m16n8k128 with .u4/.s4 integer type
- 9.7.13.5.3. Multiply-and-Accumulate Instruction: mma.sp/mma.sp::ordered_metadata
-
9.7.14. Asynchronous Warpgroup Level Matrix Multiply-Accumulate Instructions
- 9.7.14.1. Warpgroup
- 9.7.14.2. Matrix Shape
- 9.7.14.3. Matrix Data-types
- 9.7.14.4. Async Proxy
-
9.7.14.5. Asynchronous Warpgroup Level Matrix Multiply-Accumulate Operation using wgmma.mma_async instruction
-
9.7.14.5.1. Register Fragments and Shared Memory Matrix Layouts
- 9.7.14.5.1.1. Register Fragments
-
9.7.14.5.1.2. Shared Memory Matrix Layout
- 9.7.14.5.1.2.1. Shared Memory Layout for wgmma.mma_async.m64nNk16
- 9.7.14.5.1.2.2. Shared Memory Layout for wgmma.mma_async.m64nNk8
- 9.7.14.5.1.2.3. Shared Memory Layout for wgmma.mma_async.m64nNk32
- 9.7.14.5.1.2.4. Shared Memory Layout for wgmma.mma_async.m64nNk256
- 9.7.14.5.1.2.5. Strides
- 9.7.14.5.1.2.6. Swizzling Modes
- 9.7.14.5.1.2.7. Matrix Descriptor Format
- 9.7.14.5.2. Asynchronous Multiply-and-Accumulate Instruction: wgmma.mma_async
-
9.7.14.5.1. Register Fragments and Shared Memory Matrix Layouts
- 9.7.14.6. Asynchronous Warpgroup Level Multiply-and-Accumulate Operation using wgmma.mma_async.sp instruction
- 9.7.14.7. Asynchronous wgmma Proxy Operations
- 9.7.15. Stack Manipulation Instructions
- 9.7.16. Video Instructions
- 9.7.17. Miscellaneous Instructions
-
9.7.1. Integer Arithmetic Instructions
-
10. Special Registers
- 10.1. Special Registers: %tid
- 10.2. Special Registers: %ntid
- 10.3. Special Registers: %laneid
- 10.4. Special Registers: %warpid
- 10.5. Special Registers: %nwarpid
- 10.6. Special Registers: %ctaid
- 10.7. Special Registers: %nctaid
- 10.8. Special Registers: %smid
- 10.9. Special Registers: %nsmid
- 10.10. Special Registers: %gridid
- 10.11. Special Registers: %is_explicit_cluster
- 10.12. Special Registers: %clusterid
- 10.13. Special Registers: %nclusterid
- 10.14. Special Registers: %cluster_ctaid
- 10.15. Special Registers: %cluster_nctaid
- 10.16. Special Registers: %cluster_ctarank
- 10.17. Special Registers: %cluster_nctarank
- 10.18. Special Registers: %lanemask_eq
- 10.19. Special Registers: %lanemask_le
- 10.20. Special Registers: %lanemask_lt
- 10.21. Special Registers: %lanemask_ge
- 10.22. Special Registers: %lanemask_gt
- 10.23. Special Registers: %clock, %clock_hi
- 10.24. Special Registers: %clock64
- 10.25. Special Registers: %pm0..%pm7
- 10.26. Special Registers: %pm0_64..%pm7_64
- 10.27. Special Registers: %envreg<32>
- 10.28. Special Registers: %globaltimer, %globaltimer_lo, %globaltimer_hi
- 10.29. Special Registers: %reserved_smem_offset_begin, %reserved_smem_offset_end, %reserved_smem_offset_cap, %reserved_smem_offset_<2>
- 10.30. Special Registers: %total_smem_size
- 10.31. Special Registers: %aggr_smem_size
- 10.32. Special Registers: %dynamic_smem_size
- 10.33. Special Registers: %current_graph_exec
-
11. Directives
- 11.1. PTX Module Directives
- 11.2. Specifying Kernel Entry Points and Functions
- 11.3. Control Flow Directives
-
11.4. Performance-Tuning Directives
- 11.4.1. Performance-Tuning Directives: .maxnreg
- 11.4.2. Performance-Tuning Directives: .maxntid
- 11.4.3. Performance-Tuning Directives: .reqntid
- 11.4.4. Performance-Tuning Directives: .minnctapersm
- 11.4.5. Performance-Tuning Directives: .maxnctapersm (deprecated)
- 11.4.6. Performance-Tuning Directives: .noreturn
- 11.4.7. Performance-Tuning Directives: .pragma
- 11.5. Debugging Directives
- 11.6. Linking Directives
- 11.7. Cluster Dimension Directives
-
12. Release Notes
- 12.1. Changes in PTX ISA Version 8.5
- 12.2. Changes in PTX ISA Version 8.4
- 12.3. Changes in PTX ISA Version 8.3
- 12.4. Changes in PTX ISA Version 8.2
- 12.5. Changes in PTX ISA Version 8.1
- 12.6. Changes in PTX ISA Version 8.0
- 12.7. Changes in PTX ISA Version 7.8
- 12.8. Changes in PTX ISA Version 7.7
- 12.9. Changes in PTX ISA Version 7.6
- 12.10. Changes in PTX ISA Version 7.5
- 12.11. Changes in PTX ISA Version 7.4
- 12.12. Changes in PTX ISA Version 7.3
- 12.13. Changes in PTX ISA Version 7.2
- 12.14. Changes in PTX ISA Version 7.1
- 12.15. Changes in PTX ISA Version 7.0
- 12.16. Changes in PTX ISA Version 6.5
- 12.17. Changes in PTX ISA Version 6.4
- 12.18. Changes in PTX ISA Version 6.3
- 12.19. Changes in PTX ISA Version 6.2
- 12.20. Changes in PTX ISA Version 6.1
- 12.21. Changes in PTX ISA Version 6.0
- 12.22. Changes in PTX ISA Version 5.0
- 12.23. Changes in PTX ISA Version 4.3
- 12.24. Changes in PTX ISA Version 4.2
- 12.25. Changes in PTX ISA Version 4.1
- 12.26. Changes in PTX ISA Version 4.0
- 12.27. Changes in PTX ISA Version 3.2
- 12.28. Changes in PTX ISA Version 3.1
- 12.29. Changes in PTX ISA Version 3.0
- 12.30. Changes in PTX ISA Version 2.3
- 12.31. Changes in PTX ISA Version 2.2
- 12.32. Changes in PTX ISA Version 2.1
- 12.33. Changes in PTX ISA Version 2.0
- 14. Descriptions of .pragma Strings
- 15. Notices