cuPQC SDK Usage#
This document provides a comprehensive overview of the cuPQC SDK, the NVIDIA® CUDA® Post-Quantum Cryptography Software Development Kit. The cuPQC SDK contains two libraries: cuPQC and cuHash. cuPQC is designed to leverage the parallel processing power of NVIDIA GPUs to offer high-performance, batched, and secure implementations of post-quantum cryptographic algorithms. Specifically, this library includes two of the NIST standardized Post-Quantum Cryptographic algorithms: ML-KEM and ML-DSA. cuHash is a library that provides a set of cryptographic hash functions that are optimized for NVIDIA GPUs. In detail, cuHash provides GPU implementations of SHA-2, SHA-3, and SHAKE. This document illustrates how to utilize each library within the SDK, starting with the cuPQC library, then the cuHash library.
Basic usage of the cuPQC library#
A first step in using cuPQC is defining the PQC operation we wish to perform. We specify our operations by adding together cuPQC operators to form a PQC descriptor. A well-defined cuPQC descriptor must include the algorithm, security category, cryptographic function, and block dimension.
As an example, we illustrate how to construct a cuPQC descriptor for ML-KEM-512’s key generation function.
//cuPQC header
#include <cupqc.hpp>
//cuPQC descriptor
//ML KEM 512 Key Generation with 128 threads
using MLKEM512Key = decltype(ML_KEM_512()
                             + Function<function::Keygen>()
                             + Block()
                             + BlockDim<128>());
Executing cuPQC#
These cuPQC descriptor types are instantiated into objects inside user kernels. Constructing such an object has no computational cost; it should be treated as a handle. The cuPQC descriptor object provides a compute method, execute(...), which performs the requested PQC function. The arguments of this method depend on the PQC function and algorithm; see the cuPQC methods API reference.
To demonstrate cuPQC execution, we will continue with our previous example and write a kernel for a batched key-generation operation – each thread block will compute one public and secret key:
//cuPQC header
#include <cupqc.hpp>
//cuPQC descriptor
// ML KEM 512 Key Generation with 128 threads
using MLKEM512Key = decltype(ML_KEM_512()
                             + Function<function::Keygen>()
                             + Block()
                             + BlockDim<128>());
__global__ void keygen_kernel(uint8_t* public_keys, uint8_t* secret_keys, uint8_t* workspace, uint8_t* randombytes)
{
    // These functions require a fixed amount of shared memory, for which the API provides a trait
    __shared__ uint8_t smem_ptr[MLKEM512Key::shared_memory_size];

    int block = blockIdx.x;
    auto public_key = public_keys + block * MLKEM512Key::public_key_size;
    auto secret_key = secret_keys + block * MLKEM512Key::secret_key_size;
    auto entropy    = randombytes + block * MLKEM512Key::entropy_size;
    auto work       = workspace   + block * MLKEM512Key::workspace_size;

    MLKEM512Key().execute(public_key, secret_key, entropy, work, smem_ptr);
}
In keygen_kernel, each block computes a single public_key and secret_key. Many cuPQC algorithms also require a global memory workspace, as well as a pre-filled buffer of cryptographic entropy. Techniques to allocate these buffers are shown in the next section.
Launching a cuPQC kernel#
To launch a kernel that uses the cuPQC API, we need to know the block size and the required amount of global memory workspace; both are fixed by the cuPQC descriptor constructed earlier. The host code uses API functions templated on the cuPQC descriptor to allocate the global memory workspace and the entropy buffer.
void ml_kem_keygen(std::vector<uint8_t>& public_keys, std::vector<uint8_t>& secret_keys, const unsigned int batch)
{
    /*
     * Set up for utilizing cuPQC ML-KEM Keygen.
     * Allocates device workspace for computing.
     */
    auto length_public_key = MLKEM512Key::public_key_size;
    auto length_secret_key = MLKEM512Key::secret_key_size;

    auto workspace   = make_workspace<MLKEM512Key>(batch);
    auto randombytes = get_entropy<MLKEM512Key>(batch);

    /*
     * Allocate device memory for public and secret keys.
     */
    uint8_t* d_public_key = nullptr;
    uint8_t* d_secret_key = nullptr;
    cudaMalloc(&d_public_key, length_public_key * batch); // These are uint8_t, so lengths are in bytes
    cudaMalloc(&d_secret_key, length_secret_key * batch);

    keygen_kernel<<<batch, MLKEM512Key::BlockDim>>>(d_public_key, d_secret_key, workspace, randombytes);

    /*
     * Transfer generated keys to the host for communication or storage.
     */
    cudaMemcpy(public_keys.data(), d_public_key, length_public_key * batch, cudaMemcpyDeviceToHost);
    cudaMemcpy(secret_keys.data(), d_secret_key, length_secret_key * batch, cudaMemcpyDeviceToHost);

    /*
     * Free device memory associated with the cryptographic process.
     */
    cudaFree(d_public_key);
    cudaFree(d_secret_key);
    destroy_workspace(workspace);
    release_entropy(randombytes);
}
Compilation#
To compile a program that includes cupqc.hpp and uses the cuPQC API, users need to pass the location of the cuPQC headers and link against the static library provided. Details on how to use cuPQC in your project can be found in the Quick Installation Guide. The cuPQC library is built using link-time optimization (LTO), so the appropriate flags are required when compiling for your specific architecture.
nvcc -std=c++17 -dlto -arch=sm_80 -L<path_to_static_lib> -lcupqc -I<path_to_cupqc_sdk_include> -I<path_to_commondx> <cupqc_intro_example.cu> -o cupqc_intro_example
For more information on the currently supported functionality see Requirements and Functionality.
Basic usage of the cuHash library#
A first step in using cuHash is defining the hash operation we wish to perform. We specify our operations by adding together cuHash operators to form a Hash descriptor. A well-defined cuHash descriptor must include the algorithm, security category, and an execution operator, either Thread or Warp.
As an example, we show how to construct a SHA-3-256 hashing object using the warp configuration; that is, we construct a SHA-3 hash utilizing a full warp.
//cuHash header
#include <cuhash.hpp>
//We use the cuPQC SDK namespace.
using namespace cupqc;
//cuHash descriptor
using SHA3_256_WARP = decltype(SHA3_256() + Warp());
Executing cuHash#
These cuHash descriptor types are instantiated into objects inside user kernels. The cuHash descriptor object provides a few relevant methods: reset(), update(...), finalize(), and digest(...). The reset() method resets the hash object to its initial state. The update(...) method takes a pointer to a buffer of data and its length, and hashes the data. The finalize() method finalizes the hash computation. The digest(...) method takes a pointer to an output buffer and its length, and writes the computed hash into that buffer. For more information on these methods, see the methods API reference.
To demonstrate cuHash execution, we will continue with our previous example and write a kernel for a single hash operation – each thread block computes a hash utilizing a warp:
//cuHash header
#include <cuhash.hpp>
//We use the cuPQC SDK namespace.
using namespace cupqc;
//cuHash descriptor
using SHA3_256_WARP = decltype(SHA3_256() + Warp());
__global__ void hash_sha3_kernel(uint8_t* digest, const uint8_t* msg, size_t inbuf_len)
{
    SHA3_256_WARP hash {};
    hash.reset();
    hash.update(msg, inbuf_len);
    hash.finalize();
    hash.digest(digest, SHA3_256_WARP::digest_size);
}
Launching a cuHash kernel#
Launching a kernel with the cuHash API is straightforward. This function takes a message buffer and outputs a digest buffer; the hashing is performed by each thread block. In this case, it makes the most sense to use blocks of 32 threads, as this is the size of a warp.
void hash_sha3(std::vector<uint8_t>& digest, std::vector<uint8_t>& msg)
{
    uint8_t* d_msg;
    uint8_t* d_digest;
    cudaMalloc(reinterpret_cast<void**>(&d_msg), msg.size());
    cudaMalloc(reinterpret_cast<void**>(&d_digest), digest.size());
    cudaMemcpy(d_msg, msg.data(), msg.size(), cudaMemcpyHostToDevice);

    hash_sha3_kernel<<<1, 32>>>(d_digest, d_msg, msg.size());

    cudaMemcpy(digest.data(), d_digest, digest.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_msg);
    cudaFree(d_digest);
}
Compilation#
To compile a program that includes cuhash.hpp and uses the cuHash API, users need to pass the location of the cuPQC SDK headers and link against the static library provided. Details on how to use cuHash in your project can be found in the Quick Installation Guide. The cuHash library is built with link-time optimization (LTO), so the appropriate flags are required when compiling for your specific architecture.
nvcc -std=c++17 -dlto -arch=sm_80 -L<path_to_static_lib> -lcuhash -I<path_to_cupqc_sdk_include> -I<path_to_commondx> <cuhash_intro_example.cu> -o cuhash_intro_example
For more information on the currently supported functionality, see Requirements and Functionality.