cuPQC Library
The first step in using cuPQC is to define the PQC operation we wish to perform. We specify an operation by adding cuPQC operators together to form a cuPQC descriptor. A well-defined cuPQC descriptor must include the algorithm, security category, cryptographic function, and block dimension.
As an example, we illustrate how to construct a cuPQC descriptor for ML-KEM-512’s key generation function.
// cuPQC header
#include <cupqc.hpp>

// cuPQC descriptor
// ML-KEM-512 key generation with 128 threads
using MLKEM512Key = decltype(ML_KEM_512()
                             + Function<function::Keygen>()
                             + Block()
                             + BlockDim<128>());
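To make the "adding operators" idea concrete, the following is a minimal mock of how such a descriptor type could be assembled at compile time. It is not cuPQC's implementation: everything in the hypothetical namespace `sketch` (the tag types, `Descriptor`, and the `block_dim` and `has_function` traits) is an assumption made for illustration, and only the spelling of the operators mirrors the example above.

```cpp
#include <type_traits>

// Hypothetical mock of the "operators added into a descriptor" pattern.
// Nothing in namespace sketch is part of cuPQC; the names only mirror the example above.
namespace sketch {

enum class function { Keygen, Encaps, Decaps };

// Tag types: each one records a single compile-time setting and holds no data.
struct AlgoTag512 {};
template <function F> struct FunctionTag {};
struct BlockTag {};
template <unsigned N> struct BlockDimTag {};

// Helpers that read a setting out of a single tag.
template <class T> struct block_dim_of { static constexpr unsigned value = 0; };
template <unsigned N> struct block_dim_of<BlockDimTag<N>> { static constexpr unsigned value = N; };

template <class T> struct is_function_tag : std::false_type {};
template <function F> struct is_function_tag<FunctionTag<F>> : std::true_type {};

// A descriptor is an empty type that remembers which tags were added.
template <class... Tags>
struct Descriptor {
    // Only a BlockDimTag contributes a non-zero term to this sum.
    static constexpr unsigned block_dim = (0u + ... + block_dim_of<Tags>::value);
    static constexpr bool has_function = (false || ... || is_function_tag<Tags>::value);
};

// operator+ concatenates the two tag lists and returns an empty object.
template <class... A, class... B>
constexpr Descriptor<A..., B...> operator+(Descriptor<A...>, Descriptor<B...>) { return {}; }

// Each operator is a one-tag descriptor, so operators and descriptors add uniformly.
using ML_KEM_512 = Descriptor<AlgoTag512>;
template <function F> using Function = Descriptor<FunctionTag<F>>;
using Block = Descriptor<BlockTag>;
template <unsigned N> using BlockDim = Descriptor<BlockDimTag<N>>;

} // namespace sketch

// Same shape as the cuPQC example, but built from the mock operators.
using SketchKeygen = decltype(sketch::ML_KEM_512()
                              + sketch::Function<sketch::function::Keygen>()
                              + sketch::Block()
                              + sketch::BlockDim<128>());

static_assert(SketchKeygen::block_dim == 128, "block dimension is recovered from the type");
static_assert(SketchKeygen::has_function, "a function operator was supplied");
static_assert(std::is_empty<SketchKeygen>::value, "the descriptor object stores no data");
```

In this mock, all configuration lives in the type, so the settings can be checked with `static_assert` before any kernel is launched and the descriptor object itself carries no state.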
Executing cuPQC
These cuPQC descriptor types are instantiated into objects inside user kernels. Forming such an object has no computational cost; the object should be treated as a handle.
The cuPQC descriptor object provides a compute method, execute(...), which performs the requested PQC function.
The arguments of this method depend on the PQC function and the PQC algorithm; see the cupqcmethods API reference.
To demonstrate cuPQC execution we will continue with our previous example and write a kernel for a batched key generation operation – each thread block will compute one public and secret key:
// cuPQC header
#include <cupqc.hpp>

// cuPQC descriptor
// ML-KEM-512 key generation with 128 threads
using MLKEM512Key = decltype(ML_KEM_512()
                             + Function<function::Keygen>()
                             + Block()
                             + BlockDim<128>());

__global__ void keygen_kernel(uint8_t* public_keys, uint8_t* secret_keys, uint8_t* workspace, uint8_t* randombytes)
{
    // These functions require a fixed amount of shared memory, for which the API provides a trait
    __shared__ uint8_t smem_ptr[MLKEM512Key::shared_memory_size];

    // Each thread block works on its own slice of every batched buffer
    int block = blockIdx.x;
    auto public_key = public_keys + block * MLKEM512Key::public_key_size;
    auto secret_key = secret_keys + block * MLKEM512Key::secret_key_size;
    auto entropy    = randombytes + block * MLKEM512Key::entropy_size;
    auto work       = workspace   + block * MLKEM512Key::workspace_size;

    MLKEM512Key().execute(public_key, secret_key, entropy, work, smem_ptr);
}
In keygen_kernel, each block computes a single public_key and secret_key. Many cuPQC algorithms require an additional global memory workspace, as well as a pre-filled buffer of cryptographic entropy. Techniques to allocate these buffers are shown in the next section.
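The per-block pointer arithmetic used in the kernel can be checked on the host. The sketch below is an illustration under stated assumptions: `block_offset` and `block_slice` are hypothetical helpers, not cuPQC functions, and the two constants are the ML-KEM-512 key lengths from FIPS 203, standing in for the `public_key_size` and `secret_key_size` traits that real code would read from the descriptor.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of the kernel's per-block indexing; not cuPQC code.
// Key lengths are the ML-KEM-512 values from FIPS 203. Real code would take them
// from the descriptor traits instead of hard-coding them.
constexpr std::size_t kPublicKeyBytes = 800;   // encapsulation key
constexpr std::size_t kSecretKeyBytes = 1632;  // decapsulation key

// Byte offset at which batch element `block` starts in a packed buffer.
constexpr std::size_t block_offset(std::size_t block, std::size_t item_bytes) {
    return block * item_bytes;
}

// Pointer to the slice owned by `block`, as the kernel derives it from blockIdx.x.
inline std::uint8_t* block_slice(std::uint8_t* base, std::size_t block, std::size_t item_bytes) {
    return base + block_offset(block, item_bytes);
}

// Consecutive blocks own adjacent, non-overlapping ranges of the buffer.
static_assert(block_offset(0, kPublicKeyBytes) == 0, "first key starts at the buffer base");
static_assert(block_offset(3, kPublicKeyBytes) == 2400, "keys are packed back to back");
static_assert(block_offset(1, kSecretKeyBytes) - block_offset(0, kSecretKeyBytes) == kSecretKeyBytes,
              "each block owns exactly one key-sized slice");
```

Using `std::size_t` for the offsets is a deliberate choice in this sketch: it avoids the signed 32-bit multiplication that an `int` block index would perform for very large batches.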
Launching a cuPQC kernel
To launch a kernel that uses the cuPQC API, we need to know the block size and the required amount of global memory workspace. Both are fixed by the cuPQC descriptor that we constructed earlier. The host code uses API functions templated on the cuPQC descriptor to provide the global memory workspace and the entropy buffer.
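// The vectors are taken by reference so that the keys copied back from the device reach the caller
void ml_kem_keygen(std::vector<uint8_t>& public_keys, std::vector<uint8_t>& secret_keys, const unsigned int batch)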
{
    /*
     * Set up for utilizing cuPQC ML-KEM Keygen.
     * Allocates the device workspace and entropy needed for the computation.
     */
    auto length_public_key = MLKEM512Key::public_key_size;
    auto length_secret_key = MLKEM512Key::secret_key_size;

    auto workspace   = make_workspace<MLKEM512Key>(batch);
    auto randombytes = get_entropy<MLKEM512Key>(batch);

    /*
     * Allocate device memory for public and secret keys
     */
    uint8_t* d_public_key = nullptr;
    uint8_t* d_secret_key = nullptr;

    cudaMalloc(&d_public_key, length_public_key * batch); // Elements are uint8_t, so these sizes are in bytes
    cudaMalloc(&d_secret_key, length_secret_key * batch);

    // One thread block per key pair
    keygen_kernel<<<batch, MLKEM512Key::BlockDim>>>(d_public_key, d_secret_key, workspace, randombytes);

    /*
     * Transfer generated keys to the host for communication or storage.
     * The host vectors must already hold at least length * batch bytes.
     */
    cudaMemcpy(public_keys.data(), d_public_key, length_public_key * batch, cudaMemcpyDeviceToHost);
    cudaMemcpy(secret_keys.data(), d_secret_key, length_secret_key * batch, cudaMemcpyDeviceToHost);

    /*
     * Delete device memory associated with the cryptographic process.
     */
    cudaFree(d_public_key);
    cudaFree(d_secret_key);
    destroy_workspace(workspace);
    release_entropy(randombytes);
}
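Because the host function copies the results straight into `public_keys.data()` and `secret_keys.data()`, the caller has to size both vectors before the call. The following is a minimal sketch of that preparation, assuming hypothetical helpers `make_host_key_buffers` and `extract_key` (neither is part of cuPQC) and using the FIPS 203 key lengths for ML-KEM-512 in place of the descriptor traits.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not part of cuPQC. The constants are the ML-KEM-512 key
// lengths from FIPS 203 and stand in for the descriptor's size traits.
constexpr std::size_t kPublicKeyBytes = 800;
constexpr std::size_t kSecretKeyBytes = 1632;

struct HostKeyBuffers {
    std::vector<std::uint8_t> public_keys;  // batch keys packed back to back
    std::vector<std::uint8_t> secret_keys;
};

// Pre-size the host vectors so a device-to-host copy of batch * key_size bytes fits.
inline HostKeyBuffers make_host_key_buffers(std::size_t batch) {
    HostKeyBuffers buffers;
    buffers.public_keys.resize(batch * kPublicKeyBytes);  // resize value-initializes to zero
    buffers.secret_keys.resize(batch * kSecretKeyBytes);
    return buffers;
}

// Copy key number `index` out of a packed result buffer.
inline std::vector<std::uint8_t> extract_key(const std::vector<std::uint8_t>& packed,
                                             std::size_t index, std::size_t key_bytes) {
    const auto first = packed.begin() + static_cast<std::ptrdiff_t>(index * key_bytes);
    return std::vector<std::uint8_t>(first, first + static_cast<std::ptrdiff_t>(key_bytes));
}
```

The two vectors would then be handed to a function such as ml_kem_keygen, which has to receive them by reference (or write through raw pointers) for the copied keys to be visible to the caller. Afterwards, `extract_key(buffers.public_keys, i, kPublicKeyBytes)` yields the public key produced by block `i`.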
Compilation
To compile a program that includes cupqc.hpp and uses the cuPQC API, users need to pass the location of the cuPQC header and link against the provided static library. Details on how to use cuPQC in your project can be found in the Quick Installation Guide. The cuPQC library is built using link-time optimization (LTO), so the appropriate flags are required when compiling for your specific architecture.
nvcc -std=c++17 -dlto -arch=sm_80 -L<path_to_static_lib> -lcupqc -I<path_to_cupqc_sdk_include> -I<path_to_commondx> <cupqc_intro_example.cu> -o cupqc_intro_example
For more information on the currently supported functionality, see Requirements and Functionality.