Performance in OpenFold3 NIM#
NIM Accuracy#
OpenFold3 is an all-atom biomolecular complex structure prediction model from the OpenFold Consortium and the AlQuraishi Laboratory. OpenFold3 is a PyTorch implementation of the JAX-based AlphaFold3 reported in *Accurate structure prediction of biomolecular interactions with AlphaFold 3*, and like AlphaFold3, OpenFold3 extends protein structure prediction to complete biomolecular complexes including proteins, DNA, RNA, and small-molecule ligands.
The OpenFold3 NIM’s accuracy should match that of the AlQuraishi Laboratory implementation of OpenFold3, when using equivalent parameters and inputs.
Note
Running on hardware that is not listed as supported in the prerequisites section may produce results that deviate from the expected accuracy.
The accuracy of the NIM is measured by structural quality metrics such as lDDT (local Distance Difference Test). These scores help assess the reliability of the predicted structures.
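For intuition, the sketch below (a hypothetical Python helper, not part of the NIM API) buckets lDDT-style scores into the qualitative confidence bands commonly used for AlphaFold-family models; the band boundaries are an assumption borrowed from the pLDDT convention.

```python
# Hypothetical helper: bucket lDDT-style confidence scores (0-100) into
# qualitative bands. Band boundaries follow the common pLDDT convention
# (an assumption for illustration, not an OpenFold3 NIM API).

def lddt_band(score: float) -> str:
    """Return a qualitative confidence label for a 0-100 lDDT-style score."""
    if score > 90:
        return "very high"
    if score > 70:
        return "confident"
    if score > 50:
        return "low"
    return "very low"

# Example: label a few per-residue scores.
for s in (95.2, 88.1, 63.4, 42.0):
    print(f"lDDT {s:5.1f} -> {lddt_band(s)}")
```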
Factors Affecting NIM Performance#
The performance of the OpenFold3 NIM is determined by several key factors:
Hardware Factors#
GPU type and memory: Different GPU architectures provide different performance levels
System RAM: Larger proteins and complexes require more system memory
Storage speed: Fast NVMe SSD storage improves model loading and caching performance
Input Complexity#
Sequence length: Runtime scales approximately linearly with total sequence length
Number of chains: Multi-chain complexes require more computation than single chains
MSA size: Larger MSAs can improve accuracy but increase memory usage and computation time
Ligands, DNA, RNA: Additional molecular components increase computational cost
Model Configuration#
Inference backend: TensorRT + cuEquivariance provides significant speedups over PyTorch
Diffusion samples: Multiple samples provide diversity but multiply computational cost
MSA depth: Deeper MSAs improve accuracy with modest runtime increase
Performance Characteristics#
Typical Runtimes#
For reference, approximate runtimes on high-end hardware (NVIDIA H100 80GB):
Structure Prediction (OpenFold3 v1.0.0 on H100 with TensorRT + cuEquivariance):
~200 residues: 11.8-14.5 seconds
~300-400 residues: 15.0-20.4 seconds
~500-600 residues: 21.5-25.7 seconds
~800-900 residues: 32.7 seconds
~1300-1500 residues: 46.5-67.9 seconds
~1700-1900 residues: 88.7-101.8 seconds
Memory usage: Varies with sequence length; typically 40-80 GB of GPU memory
Performance Notes

The total runtime for structure prediction depends on:
Total number of residues in the complex
Total number of atoms in the complex
Number of molecules and chains
Number of sequences in the MSAs
Number of diffusion samples requested
OpenFold3 v1.0.0 Performance Results (NVIDIA H100 80GB)#
Structure Prediction Performance#
The following table shows runtime performance for OpenFold3 NIM v1.0.0 on NVIDIA H100 GPUs across three different inference backends. These backends are described in Backend Selection below and in Backend Optimization Options.
| Test ID | Sequence Length | PyTorch + cuEquivariance (s) | Open source OF3 from OpenFold Consortium (baseline) (s) | OF3 NIM (TRT + cuEquivariance) (s) | Speedup (NIM vs. Baseline) |
|---|---|---|---|---|---|
| 8eil | 186 | 17.06 | 16.79 | 11.78 | 1.42x |
| 7r6r | 203 | 18.82 | 19.26 | 14.53 | 1.33x |
| 1a3n | 287 | 26.76 | 29.13 | 23.55 | 1.24x |
| 8c4d | 331 | 20.08 | 18.85 | 15.04 | 1.25x |
| 7qsj | 375 | 20.79 | 19.92 | 16.06 | 1.24x |
| 8cpk | 384 | 26.85 | 23.50 | 20.37 | 1.15x |
| 8are | 530 | 26.31 | 27.42 | 21.89 | 1.25x |
| 8owf | 575 | 27.54 | 28.86 | 22.90 | 1.26x |
| 8aw3 | 590 | 41.96 | 45.64 | 37.69 | 1.21x |
| 7tpu | 616 | 25.69 | 28.61 | 21.48 | 1.33x |
| 7ylz | 623 | 31.80 | 35.66 | 27.51 | 1.30x |
| 8gpp | 628 | 29.92 | 33.81 | 25.71 | 1.32x |
| 8clz | 684 | 30.68 | 35.35 | 26.41 | 1.34x |
| 8k7x | 858 | 37.39 | 44.76 | 32.75 | 1.37x |
| 8ibx | 1286 | 52.12 | 63.54 | 46.51 | 1.37x |
| 8gi1 | 1464 | 74.42 | 99.45 | 60.56 | 1.64x |
| 8sm6 | 1496 | 83.33 | 110.55 | 67.92 | 1.63x |
| 8pso | 1499 | 72.68 | 96.87 | 57.41 | 1.69x |
| 8jue | 1657 | 98.09 | 125.53 | 78.04 | 1.61x |
| 8bsh | 1762 | 110.98 | 144.03 | 88.68 | 1.62x |
| 5xgo | 1869 | 127.79 | 163.83 | 101.79 | 1.61x |
*All runtimes are in seconds for end-to-end structure prediction with a single diffusion sample.*
Performance Analysis#
Key Observations:
OpenFold3 NIM Optimization: Provides consistent 1.15x to 1.69x speedup over the open source OF3 baseline
cuEquivariance Acceleration: PyTorch + cuEquivariance shows speedups for larger proteins, demonstrating the value of cuEquivariance optimization
Scaling Behavior: Performance scales approximately linearly with sequence length (see the fitting sketch below)
Best Performance: The largest speedups (1.63x-1.69x) are observed for proteins in the 1400-1500 residue range
Small Proteins: Sequences under 400 residues still achieve meaningful speedups (1.15x-1.42x)
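To make the linear-scaling claim concrete, the sketch below (illustrative only) fits a least-squares line to the NIM runtimes from the table above using NumPy; the fitted coefficients are whatever this data yields, not published constants.

```python
# Illustrative check of the "approximately linear" scaling claim:
# fit runtime = a * length + b to the OF3 NIM (TRT + cuEquivariance)
# column of the benchmark table above.
import numpy as np

lengths = np.array([186, 203, 287, 331, 375, 384, 530, 575, 590, 616,
                    623, 628, 684, 858, 1286, 1464, 1496, 1499, 1657,
                    1762, 1869])
runtimes = np.array([11.78, 14.53, 23.55, 15.04, 16.06, 20.37, 21.89,
                     22.90, 37.69, 21.48, 27.51, 25.71, 26.41, 32.75,
                     46.51, 60.56, 67.92, 57.41, 78.04, 88.68, 101.79])

a, b = np.polyfit(lengths, runtimes, deg=1)  # slope (s/residue), intercept (s)
print(f"runtime ~ {a:.3f} s/residue * length + {b:.1f} s")

# Rough estimate for an unseen length, e.g. 1000 residues:
print(f"estimated runtime at 1000 residues: {a * 1000 + b:.1f} s")
```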
Backend Comparison:
Open source OF3 (PyTorch + DeepSpeed): Baseline implementation from the OpenFold Consortium, good for development and debugging
PyTorch + cuEquivariance: Improved performance, especially for larger proteins, while maintaining PyTorch flexibility
OpenFold3 NIM (TensorRT + cuEquivariance): Best performance across all sequence lengths, recommended for production deployments
Recommended Configuration:
For development/testing: Use open source OF3 (PyTorch + DeepSpeed) for easier debugging and flexibility
For production: Use OpenFold3 NIM (TensorRT + cuEquivariance, default) for optimal performance
For large proteins (>1500 residues): OpenFold3 NIM (TensorRT + cuEquivariance) provides the best speedups (1.6x-1.7x)
For sequence lengths outside the 4-2048 residue range: Use a PyTorch backend, since TensorRT has sequence-length limits (a small helper sketch follows below)
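The backend constraints above can be encoded in a small helper. The sketch below is a hypothetical convenience function, not part of the NIM; it picks a value for the `NIM_OPTIMIZED_BACKEND` environment variable using the 4-2048 residue TensorRT range documented above.

```python
# Hypothetical helper: choose a NIM_OPTIMIZED_BACKEND value from total
# sequence length, based on the TRT limits documented above (4-2048
# residues). The default TensorRT + cuEquivariance backend needs no
# environment variable at all. Requires Python 3.10+ for "str | None".

def choose_backend(total_residues: int, debugging: bool = False) -> str | None:
    if debugging:
        return "torch_baseline"  # PyTorch + DeepSpeed: easier to debug
    if 4 <= total_residues <= 2048:
        return None              # default TRT + cuEquivariance backend
    return "torch"               # PyTorch + cuEquivariance fallback

backend = choose_backend(total_residues=2500)
if backend is not None:
    print(f"export NIM_OPTIMIZED_BACKEND={backend}")
```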
Configuration#
The benchmarks above use the following configuration:
| Parameter | Setting |
|---|---|
| diffusion_samples | 1 |
| output_format | pdb |
| GPU | H100 80GB |
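As a rough illustration, a client request reproducing this configuration might look like the sketch below. The endpoint path `http://localhost:8000/v1/predict` and the `sequence` field name are assumptions for illustration only; consult the NIM API reference for the actual request schema.

```python
# Hypothetical client sketch reproducing the benchmark configuration above.
# Endpoint URL and the "sequence" field name are assumed, not documented here;
# "diffusion_samples" and "output_format" match the configuration table.
import requests

payload = {
    "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",  # example protein sequence
    "diffusion_samples": 1,   # single sample, as in the benchmarks above
    "output_format": "pdb",   # PDB output, as in the benchmarks above
}

response = requests.post("http://localhost:8000/v1/predict", json=payload)
response.raise_for_status()
print(response.json().keys())
```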
Performance Optimization Tips#
Backend Selection#
Default (TensorRT + cuEquivariance): Best for most use cases

```bash
# This is the default, no environment variable needed
```

PyTorch + cuEquivariance: For flexibility with good performance

```bash
export NIM_OPTIMIZED_BACKEND=torch
```

PyTorch + DeepSpeed: For debugging or sequences outside the TRT range

```bash
export NIM_OPTIMIZED_BACKEND=torch_baseline
```
General Optimization Tips#
GPU Selection: Use H100 or H200 GPUs for optimal performance. A100 GPUs are also supported.
TRT Sequence Length Limits: TensorRT mode (default) supports sequences between 4 and 2048 residues. For sequences outside this range, use PyTorch backend.
Sequence Length: Performance scales approximately linearly with sequence length.
Multiple Samples: Setting `diffusion_samples > 1` will increase runtime in an affine fashion, since input featurization time is independent of `diffusion_samples` (see the sketch after this list).
MSA Size: While larger MSAs can improve accuracy, they also increase memory usage and computation time. Consider filtering MSAs for very large proteins.
Batch Processing: For multiple independent predictions, process them sequentially or use multiple NIM instances.
Memory Management: Ensure adequate GPU memory for your target sequence lengths. Very long sequences (>1800 residues) may require H100/H200 80GB GPUs.
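As a rough mental model for the affine behavior noted above, total runtime is a fixed featurization cost plus a per-sample diffusion cost. The sketch below uses placeholder constants, not measured values.

```python
# Illustrative affine runtime model for multiple diffusion samples:
# total runtime = fixed featurization cost + n_samples * per-sample cost.
# Both constants below are placeholders for illustration only.

def estimated_runtime(n_samples: int,
                      featurization_s: float = 5.0,     # placeholder
                      per_sample_s: float = 10.0) -> float:  # placeholder
    return featurization_s + n_samples * per_sample_s

for n in (1, 2, 4):
    print(f"{n} sample(s): ~{estimated_runtime(n):.0f} s")
```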
Troubleshooting#
General Performance Issues#
Out of memory errors: Reduce MSA depth, decrease sequence length, or upgrade to GPUs with more memory
Slow performance: Ensure fast storage (NVMe SSD), sufficient CPU cores, and adequate system RAM
Poor quality predictions: Check input sequence quality, increase MSA depth if available, or adjust diffusion parameters
Backend-Specific Issues#
TensorRT errors with long sequences: Use the PyTorch backend for sequences longer than 2048 residues
TensorRT errors with short sequences: Use the PyTorch backend for sequences shorter than 4 residues
In either case, switch backends with:

```bash
export NIM_OPTIMIZED_BACKEND=torch
```
Inconsistent results between backends: This is expected; TensorRT uses optimizations that may produce slightly different numerical results while maintaining accuracy
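If you want to quantify how close two backends' outputs are, one option is a CA-atom RMSD comparison. The sketch below is deliberately simplistic: it assumes both PDB files (hypothetical names `prediction_trt.pdb` and `prediction_torch.pdb`) are already superposed and list residues in identical order; real comparisons should use a proper structure-alignment tool.

```python
# Minimal sketch: compare CA coordinates of two predictions of the same
# input (e.g., TRT vs. PyTorch backend) with RMSD. Assumes pre-aligned,
# identically ordered structures; not a substitute for a real alignment.
import numpy as np

def ca_coords(pdb_path: str) -> np.ndarray:
    coords = []
    with open(pdb_path) as fh:
        for line in fh:
            # Standard PDB fixed columns: atom name 13-16, x/y/z 31-54.
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                coords.append([float(line[30:38]),
                               float(line[38:46]),
                               float(line[46:54])])
    return np.array(coords)

a = ca_coords("prediction_trt.pdb")    # hypothetical file names
b = ca_coords("prediction_torch.pdb")
rmsd = np.sqrt(((a - b) ** 2).sum(axis=1).mean())
print(f"CA RMSD between backends: {rmsd:.3f} A")
```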
Note
For detailed performance tuning guidance specific to your deployment, refer to the documentation on configuration and optimization.