Random Number Generation Using cuRANDDx#

In this introduction, we will use cuRANDDx library to generate 10000 FP64 random numbers with standard normal distributions. This section is based on the Introduction Example shipped with cuRANDDx. Refer to the Examples section for other cuRANDDx samples.

Defining a Function Descriptor#

NVIDIA MathDX uses a list of operators to create a full description of the problem to solve. Therefore, the first step is to define a function descriptor by adding together cuRANDDx operators that capture the properties of the function. The correctness of this description is evaluated at compile time.

To define a descriptor to generate random numbers using a specific generator, we just need to write the following lines:

#include <curanddx.hpp>

using RNG = decltype(curanddx::Generator<curanddx::philox4_32>());

In order to encode the operation properties, cuRANDDx provides operators Generator and PhiloxRounds, which can be combined by using addition (+).

Note

There is no operator for the distribution type, nor the output data type. These information is provided in run time for more flexibility, allowing generating multiple distributions using the same generator and its underlying states.

To obtain a fully usable operation that executes the function in CUDA kernels, we need to provide at least two additional pieces of information:

The first one is the SM operator which indicates the targeted CUDA architecture on which we want to run the function. In this example we assume targeting A100 GPUs (SM<800>()).
Secondly, use the Thread Operator to show that the routine will be performed on the thread level in a CUDA kernel. At this point, cuRANDDx performs additional verifications to make sure provided description is valid and that it is possible to execute it on the requested architecture.

#include <curanddx.hpp>

using RNG = decltype(curanddx::Generator<curanddx::philox4_32>() +
                     curanddx::SM<800>() +
                     curanddx::Thread());

Executing the Functions#

Class RNG which describes the function can be instantiated into object (or objects). Forming the object has no computational cost, and should be seen as a handle. The function descriptor object provides execution functions that can perform the requested function.

#include <curanddx.hpp>

using RNG = decltype(curanddx::Generator<curanddx::philox4_32>() +
                     curanddx::SM<800>() +
                     curanddx::Thread());

__global__ void generate_kernel(double4*                        d_out,
                                const unsigned long long        seed,
                                const typename RNG::offset_type offset) {
  const int i = blockDim.x * blockIdx.x + threadIdx.x;

  // Initialize state with seed, subsequence, offset
  RNG rng(seed, ((offset + i) % 65536), ((offset + i) / 65536));

  // Specify the distribution and the method used, either box-muller or icdf
  curanddx::normal<double, curanddx::box_muller> dist(0, 1);


  // Generate the random numbers and copy the output to global memory
  d_out[i] = dist.generate4(rng);
}

As demonstrated in the kernel above, each thread follows the three steps:

Setting up an initial state in its register using the given seed, subsequence number, and offset within the subsequence. The above example intents to reproduce exactly the same random numbers as the ones using cuRAND host API with CURAND_ORDERING_PSEUDO_LEGACY ordering, which uses 65536 different subsequences, and each four values from one subsequence are followed by four values from next subsequence. Refer to Philox_4x32_10 ordering for detail.
Instantiating a probability distribution object, see RNG with distribution for detail.
Generating the random bits and transforming into random values from the specified distribution and output data type.

Launching the Kernel#

Writing a kernel which uses the defined RNG with thread-level execution operator is straightforward, because no shared memory is needed. For simplicity, the example below shows only the main component of the code. Please check the full Introduction Example, as well as others shipped with cuRANDDx, for more detailed examples.

#include <curanddx.hpp>

template<class RNG>
__global__ void generate_kernel(double4* d_out,
                                const unsigned long long seed,
                                const typename RNG::offset_type offset,
                                const size_t size) {
  const int i = blockDim.x * blockIdx.x + threadIdx.x;
  if (i >= size / 4)
      return;

  RNG rng(seed, ((offset + i) % 65536), ((offset + i) / 65536));

  curanddx::normal<double, curanddx::box_muller> dist(0, 1);

  d_out[i] = dist.generate4(rng);
}


int simple_philox_thread_api() {
  using RNG = decltype(curanddx::Generator<curanddx::philox4_32>() +
                       curanddx::PhiloxRounds<10>() +
                       curanddx::SM<800>() + curanddx::Thread());
  using DataType = double;

  // Allocate output memory
  DataType*    d_out;
  const size_t size = 10000;
  cudaMalloc((void**)&d_out, size * sizeof(DataType));

  unsigned long long seed   = 1234ULL;
  typename RNG::offset_type offset = 1ULL;

  // Invokes kernel
  const unsigned int block_dim = 256;
  const unsigned int grid_size = (size / 4 + block_dim - 1) / block_dim;
  generate_kernel<RNG><<<grid_size, block_dim, 0>>>((double4*)d_out, seed, offset, size);
  cudaPeekAtLastError();
  cudaDeviceSynchronize();

  std::vector<DataType> h_out(size);
  cudaMemcpy(h_out.data(), d_out, size * sizeof(DataType), cudaMemcpyDeviceToHost);

  // Execute with cuRAND host API for reference

  // Compare Results between cuRAND host API output and cuRANDDx output
}

Note that cuRANDDx generates data in register and does not require copying data back to global memory after executing an operation. This can offer performance advantage when fusing RNG functions with other computational operations.

Compilation#

For instructions on how to compile programs with cuRANDDx see Installation Guide.