Achieving High Performance#

Below is general and advanced advice that may help in achieving high performance with cuRANDDx.

General Advice#

  • The cuRANDDx library performs best when random numbers are generated in blocks that are as large as possible, so that the GPU is fully occupied.

  • The default pseudorandom generator, PCG, is fast at both setting up initial states and generating random bits.

  • In fused operations, use the generated numbers directly in registers. Avoid unnecessary reads from and writes to global memory.

  • For the curanddx::philox4_32 generator, each thread computes 4 random numbers at a time, so the Philox generator is used most efficiently when the total number of values generated is a multiple of 4 times the number of threads. Setting the round count to 7 gives better performance than the default value, 10, while still retaining full Crush-resistance (see reference).

  • Experiment with different block sizes to find the optimal performance.

  • Initializing the generator state generally requires more registers than generating random numbers, so for some generators it may be beneficial to experiment with splitting state initialization and random number generation into separate kernels.

  • State setup can be an expensive operation for some generators, such as curanddx::xorwow and curanddx::mrg32k3a. One way to speed up the setup is to use a different seed for each thread and a constant sequence number of 0. Be aware that, while faster to set up, this method provides fewer guarantees about the mathematical properties of the generated sequences.
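
The Philox sizing rule in the list above (generate a multiple of 4 times the number of threads) can be sketched as plain host-side arithmetic. This is only an illustration: the helper names `padded_count` and `calls_per_thread` and the constant `kValuesPerCall` are hypothetical and are not part of the cuRANDDx API, and the sketch assumes every launched thread takes part in generation.

```cpp
#include <cassert>
#include <cstddef>

// Philox4_32 produces 4 values per thread at a time (as stated in the text).
constexpr std::size_t kValuesPerCall = 4;

// Hypothetical helper: round `requested` up to the nearest multiple of
// 4 * total_threads, so that every thread only issues full 4-value calls.
inline std::size_t padded_count(std::size_t requested, std::size_t total_threads) {
    assert(total_threads > 0);  // a launch needs at least one thread
    const std::size_t granule = kValuesPerCall * total_threads;
    return ((requested + granule - 1) / granule) * granule;
}

// Hypothetical helper: number of 4-value calls each thread makes
// once the request has been padded.
inline std::size_t calls_per_thread(std::size_t requested, std::size_t total_threads) {
    return padded_count(requested, total_threads) / (kValuesPerCall * total_threads);
}
```

For instance, with 64 blocks of 256 threads (16,384 threads in total), a request for 1,000,000 values pads to 1,048,576 values, which is 16 calls per thread. Whether to pad the output buffer or to handle the tail separately is a design choice that depends on the application.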

Advanced Advice#

Further Reading#

References#