PTX Compiler APIs

The User guide to PTX Compiler APIs.

1. Introduction

The PTX Compiler APIs are a set of APIs which can be used to compile a PTX program into GPU assembly code.

The APIs accept PTX programs in character string form and create handles to the compiler that can be used to obtain the GPU assembly code. The GPU assembly code string generated by the APIs can be loaded by cuModuleLoadData and cuModuleLoadDataEx, and linked with other modules by cuLinkAddData of the CUDA Driver API or nvJitLinkAddData API from nvjitlink.

The main use cases for these PTX Compiler APIs are:

With CUDA driver APIs, compilation and loading are tied together. PTX Compiler APIs de-couple the two operations. This allows applications to perform early compilation and caching of the GPU assembly code.
PTX Compiler APIs allow users to use runtime compilation for the latest PTX version that is supported as part of CUDA Toolkit release. This support may not be available in the PTX JIT compiler present in the CUDA Driver if the application is running with an older driver installed in the system. Refer to CUDA Compatibility for more details.
With PTX Compiler APIs, clients can implement a custom caching mechanism with the compiled GPU assembly. With CUDA driver, there is no control over caching of the JIT compilation results.
The clients get fine grain control and can specify the compiler options during compilation.

2. Getting Started

2.1. System Requirements

PTX Compiler library requires the following system configuration:

POSIX threads support for non-Windows platform.
GPU: Any GPU with CUDA Compute Capability 5.0 or higher.
CUDA Toolkit and Driver.

2.2. Installation

PTX Compiler library is part of the CUDA Toolkit release and the components are organized as follows in the CUDA toolkit installation directory:

On Windows:
- include\nvPTXCompiler.h
- lib\x64\nvptxcompiler_static.lib
- doc\pdf\PTX_Compiler_API_User_Guide.pdf
On Linux:
- include/nvPTXCompiler.h
- lib64/libnvptxcompiler_static.a
- doc/pdf/PTX_Compiler_API_User_Guide.pdf

3. Thread Safety

All PTX Compiler API functions are thread safe and may be invoked by multiple threads concurrently.

4. User Interface

This chapter presents the PTX Compiler APIs. Basic usage of the API is explained in Basic Usage.

PTX-compiler handle
Error codes
API Versioning
Compilation APIs

4.1. PTX-Compiler Handle

Typedefs

nvPTXCompilerHandle: nvPTXCompilerHandle represents a handle to the PTX Compiler.

4.1.1. Typedefs

typedef struct nvPTXCompiler *nvPTXCompilerHandle

nvPTXCompilerHandle represents a handle to the PTX Compiler.

To compile a PTX program string, an instance of nvPTXCompiler must be created and the handle to it must be obtained using the API nvPTXCompilerCreate(). Then the compilation can be done using the API nvPTXCompilerCompile().

4.2. Error codes

Enumerations

nvPTXCompileResult: The nvPTXCompiler APIs return the nvPTXCompileResult codes to indicate the call result.

4.2.1. Enumerations

enum nvPTXCompileResult

The nvPTXCompiler APIs return the nvPTXCompileResult codes to indicate the call result.

Values:

enumerator NVPTXCOMPILE_SUCCESS

enumerator NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

enumerator NVPTXCOMPILE_ERROR_INVALID_INPUT

enumerator NVPTXCOMPILE_ERROR_COMPILATION_FAILURE

enumerator NVPTXCOMPILE_ERROR_INTERNAL

enumerator NVPTXCOMPILE_ERROR_OUT_OF_MEMORY

enumerator NVPTXCOMPILE_ERROR_COMPILER_INVOCATION_INCOMPLETE

enumerator NVPTXCOMPILE_ERROR_UNSUPPORTED_PTX_VERSION

enumerator NVPTXCOMPILE_ERROR_UNSUPPORTED_DEVSIDE_SYNC

enumerator NVPTXCOMPILE_ERROR_CANCELLED

enumerator NVPTXCOMPILE_PARSE_ONLY_SUCCESS

4.3. API Versioning

The PTX compiler APIs are versioned so that any new features or API changes can be done by bumping up the API version.

Functions

nvPTXCompileResult nvPTXCompilerGetNumSupportedArchs(int *numArchs): Sets the output paramater numArchs with the number of architectures supported by PTX Compiler Library.
nvPTXCompileResult nvPTXCompilerGetNumSupportedVersions(int *numVersions): Sets the output paramater numVersions with the number of PTX ISA versions supported by PTX Compiler Library.
nvPTXCompileResult nvPTXCompilerGetSupportedArchs(int *supportedArchs): Populates the array passed via output paramater supportedArchs with architectures supported by PTX Compiler Library.
nvPTXCompileResult nvPTXCompilerGetSupportedVersions(int(*versions)[2]): Populates the 2D Array passed via output paramater versions with PTX ISA Versions major, minor pair supported by PTX Compiler Library.
nvPTXCompileResult nvPTXCompilerGetVersion(unsigned int *major, unsigned int *minor): Queries the current major and minor version of PTX Compiler APIs being used.

4.3.1. Functions

nvPTXCompileResult nvPTXCompilerGetNumSupportedArchs(int *numArchs)

Sets the output paramater numArchs with the number of architectures supported by PTX Compiler Library.

This can be used to pass an array to nvPTXCompilerGetSupportedArchs to get the supported architectures.

Parameters

numArchs – [out] Number of supported architectures

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INVALID_INPUT

nvPTXCompileResult nvPTXCompilerGetNumSupportedVersions(int *numVersions)

Sets the output paramater numVersions with the number of PTX ISA versions supported by PTX Compiler Library.

This can be used to pass a 2D Array to nvPTXCompilerGetSupportedVersions. Number of rows in 2D Array will be equal to value populated in numVersions.

Parameters

numVersions – [out] Number of supported PTX ISA versions

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INVALID_INPUT

nvPTXCompileResult nvPTXCompilerGetSupportedArchs(int *supportedArchs)

Populates the array passed via output paramater supportedArchs with architectures supported by PTX Compiler Library.

The array is sorted in the ascending order. The size of the array to be passed can be determined using nvPTXCompilerGetNumSupportedArchs.

Parameters

supportedArchs – [out] Sorted array of supported architectures.

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INVALID_INPUT

nvPTXCompileResult nvPTXCompilerGetSupportedVersions(int (*versions)[2])

Populates the 2D Array passed via output paramater versions with PTX ISA Versions major, minor pair supported by PTX Compiler Library.

The array is sorted in the ascending order. the row size of the 2D array can be determined using nvPTXCompilerGetNumSupportedVersions.

Parameters

versions – [out] Sorted 2D array with column size 2, having pair of major and minor PTX ISA version supported.

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INVALID_INPUT

nvPTXCompileResult nvPTXCompilerGetVersion(unsigned int *major, unsigned int *minor)

Queries the current major and minor version of PTX Compiler APIs being used.

Note

The version of PTX Compiler APIs follows the CUDA Toolkit versioning. The PTX ISA version supported by a PTX Compiler API version is listed here.

Parameters

major – [out] Major version of the PTX Compiler APIs
minor – [out] Minor version of the PTX Compiler APIs

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL

4.4. Compilation APIs

Functions

nvPTXCompileResult nvPTXCompilerCompile(nvPTXCompilerHandle compiler, int numCompileOptions, const char *const *compileOptions): Compile a PTX program with the given compiler options.
nvPTXCompileResult nvPTXCompilerCreate(nvPTXCompilerHandle *compiler, size_t ptxCodeLen, const char *ptxCode): Obtains the handle to an instance of the PTX compiler initialized with the given PTX program ptxCode .
nvPTXCompileResult nvPTXCompilerDestroy(nvPTXCompilerHandle *compiler): Destroys and cleans the already created PTX compiler.
nvPTXCompileResult nvPTXCompilerGetCompiledProgram(nvPTXCompilerHandle compiler, void *binaryImage): Obtains the image of the compiled program.
nvPTXCompileResult nvPTXCompilerGetCompiledProgramSize(nvPTXCompilerHandle compiler, size_t *binaryImageSize): Obtains the size of the image of the compiled program.
nvPTXCompileResult nvPTXCompilerGetErrorLog(nvPTXCompilerHandle compiler, char *errorLog): Query the error message that was seen previously for the handle.
nvPTXCompileResult nvPTXCompilerGetErrorLogSize(nvPTXCompilerHandle compiler, size_t *errorLogSize): Query the size of the error message that was seen previously for the handle.
nvPTXCompileResult nvPTXCompilerGetInfoLog(nvPTXCompilerHandle compiler, char *infoLog): Query the information message that was seen previously for the handle.
nvPTXCompileResult nvPTXCompilerGetInfoLogSize(nvPTXCompilerHandle compiler, size_t *infoLogSize): Query the size of the information message that was seen previously for the handle.
nvPTXCompileResult nvPTXCompilerSetFlowCallback(nvPTXCompilerHandle compiler, int(*callback)(void *, void *), void *payload): Register a callback function that the compiler will invoke at different phases of PTX Compilation during a call to nvPTXCompilerCompile() .

4.4.1. Functions

nvPTXCompileResult nvPTXCompilerCompile(nvPTXCompilerHandle compiler, int numCompileOptions, const char *const *compileOptions)

Compile a PTX program with the given compiler options.

Note

—gpu-name (-arch) is a mandatory option.

Parameters

compiler – [inout] A handle to PTX compiler initialized with the PTX program which is to be compiled. The compiled program can be accessed using the handle
numCompileOptions – [in] Length of the array compileOptions
compileOptions – [in] Compiler options with which compilation should be done. The compiler options string is a null terminated character array. A valid list of compiler options is at link.

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_OUT_OF_MEMORY
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE
NVPTXCOMPILE_ERROR_COMPILATION_FAILURE
NVPTXCOMPILE_ERROR_UNSUPPORTED_PTX_VERSION
NVPTXCOMPILE_ERROR_UNSUPPORTED_DEVSIDE_SYNC

nvPTXCompileResult nvPTXCompilerCreate(nvPTXCompilerHandle *compiler, size_t ptxCodeLen, const char *ptxCode)

Obtains the handle to an instance of the PTX compiler initialized with the given PTX program ptxCode.

Parameters

compiler – [out] Returns a handle to PTX compiler initialized with the PTX program ptxCode
ptxCodeLen – [in] Size of the PTX program ptxCode passed as string
ptxCode – [in] The PTX program which is to be compiled passed as string.

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_OUT_OF_MEMORY
NVPTXCOMPILE_ERROR_INTERNAL

nvPTXCompileResult nvPTXCompilerDestroy(nvPTXCompilerHandle *compiler)

Destroys and cleans the already created PTX compiler.

Parameters

compiler – [in] A handle to the PTX compiler which is to be destroyed

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_OUT_OF_MEMORY
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

nvPTXCompileResult nvPTXCompilerGetCompiledProgram(nvPTXCompilerHandle compiler, void *binaryImage)

Obtains the image of the compiled program.

Note

nvPTXCompilerCompile() API should be invoked for the handle before calling this API. Otherwise, NVPTXCOMPILE_ERROR_COMPILER_INVOCATION_INCOMPLETE is returned.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
binaryImage – [out] The image of the compiled program. Client should allocate memory for binaryImage

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE
NVPTXCOMPILE_ERROR_COMPILER_INVOCATION_INCOMPLETE

nvPTXCompileResult nvPTXCompilerGetCompiledProgramSize(nvPTXCompilerHandle compiler, size_t *binaryImageSize)

Obtains the size of the image of the compiled program.

Note

nvPTXCompilerCompile() API should be invoked for the handle before calling this API. Otherwise, NVPTXCOMPILE_ERROR_COMPILER_INVOCATION_INCOMPLETE is returned.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
binaryImageSize – [out] The size of the image of the compiled program

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE
NVPTXCOMPILE_ERROR_COMPILER_INVOCATION_INCOMPLETE

nvPTXCompileResult nvPTXCompilerGetErrorLog(nvPTXCompilerHandle compiler, char *errorLog)

Query the error message that was seen previously for the handle.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
errorLog – [out] The error log which was produced in previous call to nvPTXCompilerCompiler(). Clients should allocate memory for errorLog

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

nvPTXCompileResult nvPTXCompilerGetErrorLogSize(nvPTXCompilerHandle compiler, size_t *errorLogSize)

Query the size of the error message that was seen previously for the handle.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
errorLogSize – [out] The size of the error log in bytes which was produced in previous call to nvPTXCompilerCompiler().

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

nvPTXCompileResult nvPTXCompilerGetInfoLog(nvPTXCompilerHandle compiler, char *infoLog)

Query the information message that was seen previously for the handle.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
infoLog – [out] The information log which was produced in previous call to nvPTXCompilerCompiler(). Clients should allocate memory for infoLog

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

nvPTXCompileResult nvPTXCompilerGetInfoLogSize(nvPTXCompilerHandle compiler, size_t *infoLogSize)

Query the size of the information message that was seen previously for the handle.

Parameters

compiler – [in] A handle to PTX compiler on which nvPTXCompilerCompile() has been performed.
infoLogSize – [out] The size of the information log in bytes which was produced in previous call to nvPTXCompilerCompiler().

Returns

NVPTXCOMPILE_SUCCESS
NVPTXCOMPILE_ERROR_INTERNAL
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

nvPTXCompileResult nvPTXCompilerSetFlowCallback(nvPTXCompilerHandle compiler, int (*callback)(void*, void*), void *payload)

Register a callback function that the compiler will invoke at different phases of PTX Compilation during a call to nvPTXCompilerCompile().

The callback function decides to cancel the compilation by returning specific values.

Callback function must satisfy the following constraints

(1) Its signature should be

int callback(void* param1, void* param2);

When invoking the callback, the compiler will always pass payload to param1 so that the callback may make decisions based on payload . It’ll always pass NULL to param2 for now which is reserved for future extensions.

(2) It must return 1 to cancel compilation or 0 to continue. Other return values are reserved for future use.

(3) It must return consistent values. Once it returns 1 at one point, it must return 1 in all following invocations during the current nvPTXCompilerCompile call in progress.

(4) It must be thread-safe.

(5) It must not invoke any nvrtc/libnvvm/ptx APIs.

Parameters

compiler – [in] A handle to an initialized PTX compiler in which to introduce the callback.
callback – [in] Function pointer to the callback function.
payload – [in] payload to be passed as a parameter when invoking the callback.

Returns

NVPTXCOMPILE_SUCCESS NVPTXCOMPILE_ERROR_INVALID_INPUT
NVPTXCOMPILE_ERROR_INVALID_COMPILER_HANDLE

5. Compilation Options

This chapter describes options supported by nvPTXCompilerCompile() API.

Option names with two preceding dashes (--) are long option names and option names with one preceding dash (-) are short option names. Short option names can be used instead of long option names. When a compile option takes an argument, an assignment operator (=) is used to separate the compile option argument from the compile option name, e.g., "--gpu-name=sm_80". Alternatively, the compile option name and the argument can be specified in separate strings without an assignment operator, .e.g, "--gpu-name""sm_80".

--allow-expensive-optimizations (-allow-expensive-optimizations)

Enable (disable) to allow compiler to perform expensive optimizations using maximum available resources (memory and compile-time).

If unspecified, default behavior is to enable this feature for optimization level >= O2.

--compile-as-tools-patch (-astoolspatch)

Compile patch code for CUDA tools.

Shall not be used in conjunction with -c or -ewp.

Some PTX ISA features may not be usable in this compilation mode.

--compile-only (-c)

Generate relocatable object.

--def-load-cache (-dlcm)

Default cache modifier on global/generic load.

--def-store-cache (-dscm)

Default cache modifier on global/generic store.

--device-debug (-g)

Generate debug information for device code.

--device-function-maxrregcount N (-func-maxrregcount)

When compiling with -c option, specify the maximum number of registers that device functions can use.

This option is ignored for whole-program compilation and does not affect registers used by entry functions. For device functions, this option overrides the value specified by --maxrregcount option. If neither --device-function-maxrregcount nor --maxrregcount is specified, then no maximum is assumed.

Note

Under certain situations, static device functions can safely inherit a higher register count from the caller entry function. In such cases, ptx compiler may apply the higher count for compiling the static function.

Value less than the minimum registers required by ABI will be bumped up by the compiler to ABI minimum limit.

--disable-optimizer-constants (-disable-optimizer-consts)

Disable use of optimizer constant bank.

--disable-warnings (-w)

Inhibit all warning messages.

--dont-merge-basicblocks (-no-bb-merge)

Prevents basic block merging, at a slight perfomance cost.

Normally ptx compiler attempts to merge consecutive basic blocks as part of its optimization process. However, for debuggable code this is very confusing. This option prevents merging consecutive basic blocks.

--entry entry,... (-e)

Specify the entry functions for which code must be generated.

Entry function names for this option must be specified in the mangled name.

--extensible-whole-program (-ewp)

Generate extensible whole program device code, which allows some calls to not be resolved until linking with libcudadevrt.

--fmad (-fmad)

Enables (disables) the contraction of floating-point multiplies and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or DFMA)

Default value: true

--force-load-cache (-flcm)

Force specified cache modifier on global/generic load.

--force-store-cache (-fscm)

Force specified cache modifier on global/generic store.

--generate-line-info (-lineinfo)

Generate line-number information for device code.

--gpu-name gpuname (-arch)

Specify name of NVIDIA GPU to generate code for.

This option also takes virtual compute architectures, in which case code generation is suppressed. This can be used for parsing only.

PTX for .target sm_XY can be compiled to all GPU targets sm_MN, sm_MNa, sm_MNf where MN >= XY. PTX for .target sm_XYf can be compiled to GPU targets sm_XZ, sm_XZf, sm_XZa where Z >= Y and sm_XY and sm_XZ belong in same family. PTX with .target sm_XYa can only be compiled to GPU target sm_XYa.

Allowed Values

compute_75

compute_80

compute_86

compute_87

compute_88

compute_89

compute_90

compute_90a

compute_100

compute_100f

compute_100a

compute_103

compute_103f

compute_103a

compute_110

compute_110f

compute_110a

compute_120

compute_120f

compute_120a

compute_121

compute_121f

compute_121a

sm_75

sm_80

sm_86

sm_87

sm_88

sm_89

sm_90

sm_90a

sm_100

sm_100f

sm_100a

sm_103

sm_103f

sm_103a

sm_110

sm_110f

sm_110a

sm_120

sm_120f

sm_120a

sm_121

sm_121f

sm_121a

Default value: sm_75

--maxrregcount N (-maxrregcount)

Specify the maximum amount of registers that GPU functions can use.

Until a function-specific limit, a higher value will generally increase the performance of individual GPU threads that execute this function. However, because thread registers are allocated from a global register pool on each GPU, a higher value of this option will also reduce the maximum thread block size, thereby reducing the amount of thread parallelism. Hence, a good maxrregcount value is the result of a trade-off.

If this option is not specified, then no maximum is assumed. Value less than the minimum registers required by ABI will be bumped up by the compiler to ABI minimum limit. User program may not be able to make use of all registers as some registers are reserved by compiler.

--opt-level N (-O)

Specify optimization level.

Default value: 3.

--position-independent-code (-pic)

Generate position-independent code.

Default value:

For whole-program compilation: true.

Otherwise: false.

--preserve-relocs (-preserve-relocs)

This option will make ptx compiler to generate relocatable references for variables and preserve relocations generated for them in linked executable.

--return-at-end (-ret-end)

Prevents optimizing return instruction at end of program

Normally ptx compiler optimizes return at the end of program. However, for debuggable code this causes problems setting breakpoint at the end. This option prevents ptxas from optimizing this last return instruction.

--sanitize (-sanitize)

Generate instrumented code with specified compute-sanitizer tool.

Currently, only the value memcheck (memory check) is supported. Binary needs to be executed using compute-sanitizer at runtime, otherwise behavior is undefined. For more informations about compile time instrumentation and the compute sanitizer, consult the corresponding documentation.

--suppress-async-bulk-multicast-advisory-warning (-suppress-async-bulk-multicast-advisory-warning)

Suppress the warning on use of .multicast::cluster modifier on cp.async.bulk{.tensor} instruction with sm_90.

--suppress-stack-size-warning (-suppress-stack-size-warning)

Suppress the warning that otherwise is printed when stack size cannot be determined.

--verbose (-v)

Enable verbose mode which prints code generation statistics.

--warn-on-double-precision-use (-warn-double-usage)

Warning if double(s) are used in an instruction.

--warn-on-local-memory-usage (-warn-lmem-usage)

Warning if local memory is used.

--warn-on-spills (-warn-spills)

Warning if registers are spilled to local memory.

--warning-as-error (-Werror)

Make all warnings into errors.

--maxntid (-maxntid)

Specify the maximum number of threads that a thread block can have.

This option will be ignored if used along with -maxrregcount option. This option is also ignored for entry functions that have .maxntid directive specified.

--minnctapersm (-minnctapersm)

Specify the minimum number of CTAs to be mapped to an SM.

This option will be ignored if used along with -maxrregcount option. This option is also ignored for entry functions that have .minnctapersm directive specified.

--override-directive-values (-override-directive-values)

Override the PTX directives values by the corresponding option values.

This option is effective only for -minnctapersm, -maxntid and -maxregcount options.

--make-errors-visible-at-exit (-make-errors-visible-at-exit)

Generate required instructions at exit point to make memory faults and errors visible at exit.

--oFast-compile (-Ofc)

Specify level to prefer device code compilation speed.

Default value: 0.

--device-stack-protector (-device-stack-protector)

Enable or disable the generation of stack canaries in device code.

Stack canaries make it more difficult to exploit certain types of memory safety bugs involving stack-local variables. The compiler uses heuristics to assess the risk of such a bug in each function. Only those functions which are deemed high-risk make use of a stack canary.

--g-tensor-memory-access-check (-g-tmem-access-check)

Enable tensor memory access checks for tcgen05 operations.

This option is enabled by default with --device-debug (-g) option.

--gno-tensor-memory-access-check (-gno-tmem-access-check)

Disable tensor memory access checks for tcgen05 operations.

This option will override the --g-tensor-memory-access-check option if both are specified.

--split-compile (-split-compile)

Specify the maximum amount of concurrent threads to be utilized when running compiler optimizations.

If value specified is 1, option will be ignored. If value specified is 0, then the number of threads will be the number of CPUs on the underlying machine.

6. Basic Usage

This section of the document uses a simple example, Vector Addition, shown in Figure 1 to explain how to use PTX Compiler APIs to compile this PTX program. For brevity and readability, error checks on the API return values are not shown.

Figure 1. PTX source string for a simple vector addition

const char *ptxCode = "                                    \n \
   .version 7.0                                            \n \
   .target sm_50                                           \n \
   .address_size 64                                        \n \
   .visible .entry simpleVectorAdd(                        \n \
        .param .u64 simpleVectorAdd_param_0,               \n \
        .param .u64 simpleVectorAdd_param_1,               \n \
        .param .u64 simpleVectorAdd_param_2                \n \
   ) {                                                     \n \
        .reg .f32   %f<4>;                                 \n \
        .reg .b32   %r<5>;                                 \n \
        .reg .b64   %rd<11>;                               \n \
        ld.param.u64    %rd1, [simpleVectorAdd_param_0];   \n \
        ld.param.u64    %rd2, [simpleVectorAdd_param_1];   \n \
        ld.param.u64    %rd3, [simpleVectorAdd_param_2];   \n \
        cvta.to.global.u64      %rd4, %rd3;                \n \
        cvta.to.global.u64      %rd5, %rd2;                \n \
        cvta.to.global.u64      %rd6, %rd1;                \n \
        mov.u32         %r1, %ctaid.x;                     \n \
        mov.u32         %r2, %ntid.x;                      \n \
        mov.u32         %r3, %tid.x;                       \n \
        mad.lo.s32      %r4, %r2, %r1, %r3;                \n \
        mul.wide.u32    %rd7, %r4, 4;                      \n \
        add.s64         %rd8, %rd6, %rd7;                  \n \
        ld.global.f32   %f1, [%rd8];                       \n \
        add.s64         %rd9, %rd5, %rd7;                  \n \
        ld.global.f32   %f2, [%rd9];                       \n \
        add.f32         %f3, %f1, %f2;                     \n \
        add.s64         %rd10, %rd4, %rd7;                 \n \
        st.global.f32   [%rd10], %f3;                      \n \
        ret;                                               \n \
   } ";

The CUDA code corresponding to this PTX program would look like:

Figure 2. Equivalent CUDA source for the simple vector addition

extern "C"
 __global__ void simpleVectorAdd(float *x, float *y, float *out)
 {
     size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
     out[tid] = x[tid] + y[tid];
 }

With this PTX program as a string, we can create the compiler and obtain a handle to it as shown in Figure 3.

Figure 3. Compiler creation and initialization of a program

nvPTXCompilerHandle compiler;
nvPTXCompilerCreate(&compiler, (size_t)strlen(ptxCode), ptxCode);

Compilation can now be done by specifying the compile options as shown in Figure 4.

Figure 4. Compilation of the PTX program

const char* compile_options[] = { "--gpu-name=sm_80",
                                  "--verbose"
                                };

nvPTXCompilerCompile(compiler, 2, compile_options);

The compiled GPU assembly code can now be obtained. To obtain this we first allocate memory for it. And to allocate memory, we need to query the size of the image of the compiled GPU assembly code which is done as shown in Figure 5.

Figure 5. Query size of the compiled assembly image

nvPTXCompilerGetCompiledProgramSize(compiler, &elfSize);

The image of the compiled GPU assembly code can now be queried as shown in Figure 6. This image can then be executed on the GPU by passing this image to the CUDA Driver APIs.

Figure 6. Query the compiled assembly image

elf = (char*) malloc(elfSize);
nvPTXCompilerGetCompiledProgram(compiler, (void*)elf);

When the compiler is not needed anymore, it can be destroyed as shown in Figure 7.

Figure 7. Destroy the compiler

nvPTXCompilerDestroy(&compiler);

7. Example: Simple Vector Addition

Code (simpleVectorAddition.c)

#include <stdio.h>
#include <string.h>
#include "cuda.h"
#include "nvPTXCompiler.h"

#define NUM_THREADS 128
#define NUM_BLOCKS 32
#define SIZE NUM_THREADS * NUM_BLOCKS

#define CUDA_SAFE_CALL(x)                                               \
    do {                                                                \
        CUresult result = x;                                            \
        if (result != CUDA_SUCCESS) {                                   \
            const char *msg;                                            \
            cuGetErrorName(result, &msg);                               \
            printf("error: %s failed with error %s\n", #x, msg);        \
            exit(1);                                                    \
        }                                                               \
    } while(0)

#define NVPTXCOMPILER_SAFE_CALL(x)                                       \
    do {                                                                 \
        nvPTXCompileResult result = x;                                   \
        if (result != NVPTXCOMPILE_SUCCESS) {                            \
            printf("error: %s failed with error code %d\n", #x, result); \
            exit(1);                                                     \
        }                                                                \
    } while(0)


const char *ptxCode = "                                      \
   .version 7.0                                           \n \
   .target sm_50                                          \n \
   .address_size 64                                       \n \
   .visible .entry simpleVectorAdd(                       \n \
        .param .u64 simpleVectorAdd_param_0,              \n \
        .param .u64 simpleVectorAdd_param_1,              \n \
        .param .u64 simpleVectorAdd_param_2               \n \
   ) {                                                    \n \
        .reg .f32   %f<4>;                                \n \
        .reg .b32   %r<5>;                                \n \
        .reg .b64   %rd<11>;                              \n \
        ld.param.u64    %rd1, [simpleVectorAdd_param_0];  \n \
        ld.param.u64    %rd2, [simpleVectorAdd_param_1];  \n \
        ld.param.u64    %rd3, [simpleVectorAdd_param_2];  \n \
        cvta.to.global.u64      %rd4, %rd3;               \n \
        cvta.to.global.u64      %rd5, %rd2;               \n \
        cvta.to.global.u64      %rd6, %rd1;               \n \
        mov.u32         %r1, %ctaid.x;                    \n \
        mov.u32         %r2, %ntid.x;                     \n \
        mov.u32         %r3, %tid.x;                      \n \
        mad.lo.s32      %r4, %r2, %r1, %r3;               \n \
        mul.wide.u32    %rd7, %r4, 4;                     \n \
        add.s64         %rd8, %rd6, %rd7;                 \n \
        ld.global.f32   %f1, [%rd8];                      \n \
        add.s64         %rd9, %rd5, %rd7;                 \n \
        ld.global.f32   %f2, [%rd9];                      \n \
        add.f32         %f3, %f1, %f2;                    \n \
        add.s64         %rd10, %rd4, %rd7;                \n \
        st.global.f32   [%rd10], %f3;                     \n \
        ret;                                              \n \
   } ";

int elfLoadAndKernelLaunch(void* elf, size_t elfSize)
{
    CUdevice cuDevice;
    CUcontext context;
    CUmodule module;
    CUfunction kernel;
    CUdeviceptr dX, dY, dOut;
    size_t i;
    size_t bufferSize = SIZE * sizeof(float);
    float a;
    float hX[SIZE], hY[SIZE], hOut[SIZE];
    void* args[3];

    CUDA_SAFE_CALL(cuInit(0));
    CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));

    CUDA_SAFE_CALL(cuCtxCreate(&context, NULL, 0, cuDevice));
    CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, elf, 0, 0, 0));
    CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "simpleVectorAdd"));

    // Generate input for execution, and create output buffers.
    for (i = 0; i < SIZE; ++i) {
        hX[i] = (float)i;
        hY[i] = (float)i * 2;
    }
    CUDA_SAFE_CALL(cuMemAlloc(&dX,   bufferSize));
    CUDA_SAFE_CALL(cuMemAlloc(&dY,   bufferSize));
    CUDA_SAFE_CALL(cuMemAlloc(&dOut, bufferSize));

    CUDA_SAFE_CALL(cuMemcpyHtoD(dX, hX, bufferSize));
    CUDA_SAFE_CALL(cuMemcpyHtoD(dY, hY, bufferSize));

    args[0] = &dX;
    args[1] = &dY;
    args[2] = &dOut;

    CUDA_SAFE_CALL( cuLaunchKernel(kernel,
                                   NUM_BLOCKS,  1, 1, // grid dim
                                   NUM_THREADS, 1, 1, // block dim
                                   0, NULL, // shared mem and stream
                                   args, 0)); // arguments
    CUDA_SAFE_CALL(cuCtxSynchronize()); // Retrieve and print output.

    CUDA_SAFE_CALL(cuMemcpyDtoH(hOut, dOut, bufferSize));
    for (i = 0; i < SIZE; ++i) {
        printf("Result:[%ld]:%f\n", i, hOut[i]);
    }

    // Release resources.
    CUDA_SAFE_CALL(cuMemFree(dX));
    CUDA_SAFE_CALL(cuMemFree(dY));
    CUDA_SAFE_CALL(cuMemFree(dOut));
    CUDA_SAFE_CALL(cuModuleUnload(module));
    CUDA_SAFE_CALL(cuCtxDestroy(context));
    return 0;
}

int main(int _argc, char *_argv[])
{
    nvPTXCompilerHandle compiler = NULL;
    nvPTXCompileResult status;

    size_t elfSize, infoSize, errorSize;
    char *elf, *infoLog, *errorLog;
    unsigned int minorVer, majorVer;

    const char* compile_options[] = { "--gpu-name=sm_80",
                                      "--verbose"
                                    };

    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetVersion(&majorVer, &minorVer));
    printf("Current PTX Compiler API Version : %d.%d\n", majorVer, minorVer);

    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerCreate(&compiler,
                                                (size_t)strlen(ptxCode),  /* ptxCodeLen */
                                                ptxCode)                  /* ptxCode */
                            );

    status = nvPTXCompilerCompile(compiler,
                                  2,                 /* numCompileOptions */
                                  compile_options);  /* compileOptions */

    if (status != NVPTXCOMPILE_SUCCESS) {
        NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetErrorLogSize(compiler, &errorSize));

        if (errorSize != 0) {
            errorLog = (char*)malloc(errorSize+1);
            NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetErrorLog(compiler, errorLog));
            printf("Error log: %s\n", errorLog);
            free(errorLog);
        }
        exit(1);
    }

    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetCompiledProgramSize(compiler, &elfSize));

    elf = (char*) malloc(elfSize);
    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetCompiledProgram(compiler, (void*)elf));

    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetInfoLogSize(compiler, &infoSize));

    if (infoSize != 0) {
        infoLog = (char*)malloc(infoSize+1);
        NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerGetInfoLog(compiler, infoLog));
        printf("Info log: %s\n", infoLog);
        free(infoLog);
    }

    NVPTXCOMPILER_SAFE_CALL(nvPTXCompilerDestroy(&compiler));

    // Load the compiled GPU assembly code 'elf'
    elfLoadAndKernelLaunch(elf, elfSize);

    free(elf);
    return 0;
}

7.1. Build Instruction

Assuming the environment variable CUDA_PATH points to CUDA Toolkit installation directory, build this example as:

Windows:

cl.exe simpleVectorAddition.c /FesimpleVectorAddition ^
          /I "%CUDA_PATH%"\include ^
          "%CUDA_PATH%"\lib\x64\nvptxcompiler_static.lib
          "%CUDA_PATH%"\lib\x64\cuda.lib

OR

nvcc simpleVectorAddition.c  -ccbin <PATH_TO_cl.exe>
          -I $CUDA_PATH/include -L $CUDA_PATH/lib/x64/ -lcuda  nvptxcompiler_static.lib

Linux:

gcc simpleVectorAddition.c -o simpleVectorAddition \
             -I $CUDA_PATH/include \
             -L $CUDA_PATH/lib64 \
             libnvptxcompiler_static.a -lcuda -lm -lpthread \
             -Wl,-rpath,$CUDA_PATH/lib64

7.2. Notices

7.2.1. Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

7.2.2. OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

7.2.3. Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.