Benchmarking

Accuracy

Accuracy benchmarking for RFdiffusion involves generating long sequences. Even for a fixed scaffolding input, hundreds of configurations are possible at the same length, so there is no single correct output to compare against. We therefore currently test only whether the sequences generated in the NIM container under fixed random number conditions match those generated by the public version on GitHub.

The sequences generated by the GitHub version have undergone manual verification beforehand. This test assesses NIM’s ability to faithfully reproduce these sequences. It’s important to note that accuracy can vary depending on the GPU microarchitecture. When executed on the same architecture, the outputs match exactly. When evaluated across different GPU architectures, we observe a Root Mean Square Error (RMSE) of less than 0.64 Ångströms between atoms in the reference dataset and the generated dataset.
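As an illustration of the metric mentioned above, here is a minimal Python sketch of an RMSE calculation over atom positions. It assumes both structures list the same atoms in the same order and are already expressed in the same coordinate frame. The function name `rmse` and the sample coordinates are hypothetical, and the benchmark script's own procedure may differ.

```python
import math

def rmse(reference, generated):
    """RMSE in Angstroms between two equally ordered lists of (x, y, z) atom positions."""
    if len(reference) != len(generated) or not reference:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = 0.0
    for (rx, ry, rz), (gx, gy, gz) in zip(reference, generated):
        # Squared Euclidean distance between matching atoms
        squared += (rx - gx) ** 2 + (ry - gy) ** 2 + (rz - gz) ** 2
    return math.sqrt(squared / len(reference))

# Hypothetical coordinates for illustration only (not benchmark data)
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
generated = [(0.0, 0.0, 0.3), (1.5, 0.4, 0.0)]
value = rmse(reference, generated)
print(f"RMSE: {value:.3f} Å")
```

A value below 0.64 Å would fall within the cross-architecture deviation reported above.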

Performance

The run time of RFdiffusion depends on several factors, including the number of atoms in the inputs, the number and length of chains, and how many chains need to be generated.

To represent overall performance effectively, we report RFdiffusion’s main performance characteristic as the average number of amino acids generated per second per diffusion step. We calculate this by normalizing the measured run time (in milliseconds) by the length of the generated sequence (number of amino acids) and the number of diffusion steps. This provides an intuitive performance metric. For instance, to estimate the time needed to generate a protein binder of a specific length with a certain number of diffusion steps, multiply the protein length by the number of steps and divide the result by the throughput. This calculation gives an estimate of the time needed to generate one protein structure.
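The arithmetic can be sketched in Python as follows. This assumes the metric is read as (amino acids × diffusion steps) per second, which is the interpretation under which more steps lead to a longer run time. The function names and the sample measurement are hypothetical and are not taken from the benchmark script.

```python
def throughput(run_time_ms, num_residues, num_steps):
    """Amino acids generated per second per diffusion step, from a measured run time."""
    return num_residues * num_steps / (run_time_ms / 1000.0)

def estimate_seconds(aa_per_second_per_step, num_residues, num_steps):
    """Estimated time to generate one structure at a given throughput."""
    return num_residues * num_steps / aa_per_second_per_step

# Hypothetical measurement: 100 residues, 50 steps, 10,000 ms total run time
rate = throughput(10_000, 100, 50)
# Estimated time for a 200-residue design with the same number of steps
seconds = estimate_seconds(rate, 200, 50)
print(rate, seconds)
```

With these made-up numbers the throughput is 500 amino acids per second per step, and the 200-residue design is estimated at 20 seconds.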

Starting with the 2.0 release of the RFdiffusion NIM, the model is optimized using the NVIDIA Warp and NVIDIA TensorRT frameworks.

With these optimizations, the RFdiffusion NIM runs up to two times faster than the non-optimized version. Our current measurements show that performance varies across GPU architectures. On A100 GPUs (Ampere architecture), we observe up to 507 amino acids generated per second per step. L40 GPUs (Ada Lovelace architecture) reach up to 419 amino acids per second per step. H100 GPUs (Hopper architecture) show the highest performance, reaching up to 826 amino acids per second per step.
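Because the figures above are peak ("up to") values, they give only a best-case lower bound on run time. The Python sketch below applies them to a hypothetical job of 100 residues and 50 diffusion steps, reading the metric as (amino acids × steps) per second. The job size and the helper name `best_case_seconds` are assumptions made for illustration.

```python
# Peak throughput figures quoted above (amino acids per second per step)
PEAK_THROUGHPUT = {"A100": 507, "L40": 419, "H100": 826}

def best_case_seconds(gpu, num_residues, num_steps):
    """Lower-bound run time for one structure at the quoted peak throughput."""
    return num_residues * num_steps / PEAK_THROUGHPUT[gpu]

# Hypothetical job: 100 residues generated with 50 diffusion steps
for gpu in PEAK_THROUGHPUT:
    print(f"{gpu}: {best_case_seconds(gpu, 100, 50):.1f} s")
# A100: 9.9 s
# L40: 11.9 s
# H100: 6.1 s
```

Actual run times also depend on the input size and the number of chains, so measured values can be noticeably higher.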

Sample benchmarking scripts

This NIM comes with a simple benchmarking script that can measure both accuracy and performance. It is useful for verifying that the neural network produces the same results on a set of known proteins.

The script is already packaged in the NIM’s Docker image. You can view and study the benchmark using the following command:

docker run --entrypoint cat nvcr.io/nim/ipd/rfdiffusion:2 /opt/nim/benchmarking.py

To execute the benchmark, follow this sequence:

Make sure the NIM is running as described in the Quickstart Guide.
The benchmark script automatically downloads the test dataset. To save time and bandwidth, we recommend providing a local cache directory so that the script can reuse data it has already downloaded. Execute the following command to set up the cache directory.

export LOCAL_NIM_CACHE=~/.cache/nim

Execute the benchmark.

docker run -it --net host -v "$LOCAL_NIM_CACHE":/opt/nim/.cache --entrypoint "" \
    nvcr.io/nim/ipd/rfdiffusion:2 \
    /opt/nim/benchmarking.py --benchmark-type both