cuPQC-PK Usage

The first step in using the cuPQC-PK library is defining the post-quantum cryptographic operation you wish to perform. We specify operations by combining cuPQC-PK operators to form a cuPQC-PK descriptor. A well-defined cuPQC-PK descriptor must include the algorithm, security category, cryptographic function, and block dimension.

As an example, we’ll construct a descriptor for ML-KEM-512 key generation.

// Include the cuPQC-PK library header
#include <pk.hpp>

// cuPQC-PK descriptor
// Define ML-KEM-512 key generation with 128 threads per block
using MLKEM512Key = decltype(ML_KEM_512()
                           + Function<function::Keygen>()
                           + Block()
                           + BlockDim<128>());
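To make the operator-composition idea concrete, here is a minimal, self-contained C++17 sketch of how a descriptor type might be assembled with `operator+` and `decltype`. Everything in the `toy` namespace is a hypothetical stand-in written for illustration; it is not the cuPQC-PK implementation, and the real library's operators and traits may differ. The security category value 1 for the toy descriptor follows the FIPS 203 classification of ML-KEM-512.

```cpp
#include <type_traits>

namespace toy {

// Hypothetical stand-ins for illustration only; these are NOT cuPQC-PK types.
enum class Fn { None, Keygen, Encaps, Decaps };

// The descriptor is an empty type: its whole configuration lives in
// template parameters, so constructing it does no work at run time.
template <int Category = 0, Fn F = Fn::None, unsigned Threads = 0>
struct Descriptor {
    static constexpr int      security_category = Category;
    static constexpr Fn       function          = F;
    static constexpr unsigned block_dim         = Threads;
    // True once every required operator has been supplied.
    static constexpr bool is_complete =
        Category != 0 && F != Fn::None && Threads != 0;
};

// Each operator tag sets exactly one field of the descriptor.
template <int C>      struct SecurityCategory {};
template <Fn F>       struct Function {};
template <unsigned N> struct BlockDim {};

// operator+ returns a new descriptor type with one field replaced.
template <int C0, Fn F, unsigned N, int C>
constexpr auto operator+(Descriptor<C0, F, N>, SecurityCategory<C>) {
    return Descriptor<C, F, N>{};
}
template <int C, Fn F0, unsigned N, Fn F>
constexpr auto operator+(Descriptor<C, F0, N>, Function<F>) {
    return Descriptor<C, F, N>{};
}
template <int C, Fn F, unsigned N0, unsigned N>
constexpr auto operator+(Descriptor<C, F, N0>, BlockDim<N>) {
    return Descriptor<C, F, N>{};
}

// Only the resulting type matters, so the expression sits inside decltype.
using ToyKeygen = decltype(Descriptor<>()
                         + SecurityCategory<1>()
                         + Function<Fn::Keygen>()
                         + BlockDim<128>());

static_assert(ToyKeygen::is_complete, "descriptor is missing an operator");
static_assert(std::is_empty_v<ToyKeygen>, "descriptor carries no runtime state");

} // namespace toy
```

Because the configuration is encoded entirely in the type, an incomplete descriptor can be rejected at compile time, and sizes or launch parameters can be exposed as `static constexpr` traits.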

Executing Operations

Descriptors are instantiated as objects inside your CUDA kernels. Creating the descriptor object has no computational cost; it serves only as a configuration handle. The descriptor provides an execute(...) method that performs the requested cryptographic operation. The arguments to this method depend on the algorithm and function; see the device functions API reference for details.

To demonstrate cuPQC-PK execution, we continue with the previous example and write a kernel for batched key generation, in which each thread block computes one public/secret key pair:

// Include the cuPQC-PK library header
#include <pk.hpp>

// cuPQC-PK descriptor
// Define ML-KEM-512 key generation with 128 threads per block
using MLKEM512Key = decltype(ML_KEM_512()
                           + Function<function::Keygen>()
                           + Block()
                           + BlockDim<128>());

__global__ void keygen_kernel(uint8_t* public_keys, uint8_t* secret_keys, uint8_t* workspace, uint8_t* randombytes)
{
    // These functions require a fixed amount of shared memory, which the API exposes as a trait
    __shared__ uint8_t smem_ptr[MLKEM512Key::shared_memory_size];
    int block = blockIdx.x;
    auto public_key = public_keys + block * MLKEM512Key::public_key_size;
    auto secret_key = secret_keys + block * MLKEM512Key::secret_key_size;
    auto entropy    = randombytes + block * MLKEM512Key::entropy_size;
    auto work       = workspace   + block * MLKEM512Key::workspace_size;

    // Execute the key generation
    MLKEM512Key().execute(public_key, secret_key, entropy, work, smem_ptr);
}

In keygen_kernel, each block computes a single public_key and secret_key, writing them at an offset of the block index times the key size into the batch buffers. Many cuPQC-PK algorithms also require a global memory workspace and a pre-filled buffer of cryptographic entropy. Techniques for allocating these buffers are shown in the next section.
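The kernel's indexing implies a packed layout: the key for batch element i starts at byte i times the key size. The host-side sketch below shows how one key could be sliced back out of such a buffer after it has been copied to the host. The helper names (`batch_offset`, `extract_item`) are hypothetical, and the sizes are the ML-KEM-512 key lengths from FIPS 203, which are assumed here to equal `MLKEM512Key::public_key_size` and `MLKEM512Key::secret_key_size`; real code should read the descriptor traits rather than hard-code them.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// ML-KEM-512 key sizes in bytes, as specified in FIPS 203.
// Assumption: these match the descriptor traits used in the kernel.
constexpr std::size_t kPublicKeySize = 800;
constexpr std::size_t kSecretKeySize = 1632;

// Byte offset of element `index` in a packed batch buffer, mirroring
// `public_keys + block * public_key_size` in the kernel.
constexpr std::size_t batch_offset(std::size_t index, std::size_t item_size) {
    return index * item_size;
}

// Copy element `index` out of a packed host-side batch buffer.
// Throws std::out_of_range if the element does not fit in the buffer.
std::vector<std::uint8_t> extract_item(const std::vector<std::uint8_t>& packed,
                                       std::size_t index,
                                       std::size_t item_size) {
    const std::size_t begin = batch_offset(index, item_size);
    if (item_size == 0 || begin + item_size > packed.size()) {
        throw std::out_of_range("item lies outside the packed buffer");
    }
    const auto first = packed.begin() + static_cast<std::ptrdiff_t>(begin);
    return std::vector<std::uint8_t>(first,
                                     first + static_cast<std::ptrdiff_t>(item_size));
}
```

For example, `extract_item(public_keys, 3, kPublicKeySize)` would return the public key produced by thread block 3.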

Launching Kernels

To launch a kernel that uses the cuPQC-PK API, we need to know the block size and the required amount of global memory workspace. Both are fixed by the cuPQC-PK descriptor constructed earlier. On the host, API functions templated on the cuPQC-PK descriptor provide the global memory workspace and the entropy buffer.

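// The vectors are taken by reference so the caller receives the generated keys;
// each must already hold batch * key-size bytes.
void ml_kem_keygen(std::vector<uint8_t>& public_keys, std::vector<uint8_t>& secret_keys, const unsigned int batch)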
{
    /*
     * Set up cuPQC-PK ML-KEM key generation.
     * Obtains the device workspace and the entropy buffer.
     */
    auto length_public_key = MLKEM512Key::public_key_size;
    auto length_secret_key = MLKEM512Key::secret_key_size;

    auto workspace         = make_workspace<MLKEM512Key>(batch);
    auto randombytes       = get_entropy<MLKEM512Key>(batch);
    /*
     * Allocate device memory for public and secret keys
     */
    uint8_t* d_public_key = nullptr;
    uint8_t* d_secret_key = nullptr;

    cudaMalloc(&d_public_key, length_public_key * batch); // Elements are uint8_t, so these sizes are in bytes
    cudaMalloc(&d_secret_key, length_secret_key * batch);

    // Launch the kernel
    keygen_kernel<<<batch, MLKEM512Key::BlockDim>>>(d_public_key, d_secret_key, workspace, randombytes);

    /*
     * Transfer generated keys to the host for communication or storage
     */
    cudaMemcpy(public_keys.data(), d_public_key, length_public_key * batch, cudaMemcpyDeviceToHost);
    cudaMemcpy(secret_keys.data(), d_secret_key, length_secret_key * batch, cudaMemcpyDeviceToHost);

    /*
     * Delete device memory associated with the cryptographic process.
     */
    cudaFree(d_public_key);
    cudaFree(d_secret_key);
    destroy_workspace(workspace);
    release_entropy(randombytes);
}
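The host function copies the results into `public_keys.data()` and `secret_keys.data()`, so both vectors need to hold `batch` keys before the call. A minimal sketch of sizing them is shown below; `HostKeyBuffers` and `make_host_key_buffers` are hypothetical helpers, not part of cuPQC-PK, and the per-key sizes are passed in so they can come from the descriptor traits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side storage for a batch of key pairs (hypothetical helper type).
struct HostKeyBuffers {
    std::vector<std::uint8_t> public_keys;
    std::vector<std::uint8_t> secret_keys;
};

// Allocate zero-initialized host buffers large enough for `batch` key pairs.
HostKeyBuffers make_host_key_buffers(std::size_t batch,
                                     std::size_t public_key_size,
                                     std::size_t secret_key_size) {
    HostKeyBuffers buffers;
    buffers.public_keys.resize(batch * public_key_size);
    buffers.secret_keys.resize(batch * secret_key_size);
    return buffers;
}
```

A caller could then write `auto keys = make_host_key_buffers(batch, MLKEM512Key::public_key_size, MLKEM512Key::secret_key_size);` and pass `keys.public_keys` and `keys.secret_keys` to `ml_kem_keygen`.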

Compilation

To compile a program that includes pk.hpp and uses the cuPQC-PK API, pass the location of the cuPQC-PK headers to the compiler and link against the provided static library. Details on how to use cuPQC-PK in your project can be found in the Getting Started guide. The cuPQC-PK library is built with link-time optimization (LTO), so the corresponding flags (-dlto together with -arch in the command below) are required when compiling for your target architecture.

nvcc -std=c++17 -dlto -arch=sm_80 \
     -I<path_to_cupqc_sdk_include> -I<path_to_commondx> \
     -L<path_to_static_lib> -lcupqc-pk \
     cupqc_pk_intro_example.cu -o cupqc_pk_intro_example

For detailed installation and compilation instructions, see the Getting Started guide. For information on supported features, see SDK Features.