Performance in Boltz-2 NIM#

NIM Accuracy#

The Boltz-2 NIM is based on the state-of-the-art Boltz-2 architecture for biomolecular structure prediction. The NIM’s accuracy should match that of the reference implementation when using equivalent parameters and inputs.

Note

Running on hardware that is not listed as supported in the prerequisites section may produce results that deviate from the expected accuracy.

The accuracy of the NIM is measured by structural quality metrics such as LDDT (local distance difference test). These scores help assess the reliability of the predicted structures.

Factors Affecting NIM Performance#

The performance of the Boltz-2 NIM is determined by several key factors:

Hardware Factors#

  • Number and type of GPUs: More GPUs generally improve throughput for concurrent requests

  • GPU memory: Larger proteins and complexes require more GPU memory

  • Storage speed: Fast NVMe SSD storage improves model loading and caching performance

Input Complexity#

  • Sequence length: Runtime scales approximately quadratically with total sequence length

  • Number of chains: Multi-chain complexes require more computation than single chains

  • Ligands and constraints: Additional molecular components increase computational cost

Model Parameters#

  • Sampling steps: Higher values improve quality but significantly increase runtime

  • Recycling steps: More iterations improve accuracy with modest runtime increase

  • Diffusion samples: Multiple samples provide diversity but multiply computational cost
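The parameters above map directly onto fields in the request body. A minimal sketch of a payload (values are illustrative defaults and the short sequence is a placeholder; field names follow the request schema used in the test script later in this section):

```python
# Illustrative request payload; the three tuning parameters are the
# quality/runtime trade-off knobs described above.
payload = {
    "polymers": [
        {"id": "A", "molecule_type": "protein", "sequence": "MVLSPADKTNVKAAW"}
    ],
    "recycling_steps": 3,    # more iterations: better accuracy, modest extra cost
    "sampling_steps": 50,    # higher: better quality, significantly longer runtime
    "diffusion_samples": 1,  # >1: structural diversity at multiplied cost
    "output_format": "mmcif",
}
```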

Performance Characteristics#

Typical Runtimes#

For reference, approximate runtimes on high-end hardware (8x NVIDIA H100 80GB):

Structure Prediction (Boltz2 v1.3.0 on H100):

  • ~200 residues: 2.6-4.6 seconds (TensorRT) / 5.7-7.6 seconds (PyTorch)

  • ~500 residues: 9.0-9.3 seconds (TensorRT) / 13.1-13.5 seconds (PyTorch)

  • ~1000 residues: 29.2-35.2 seconds (TensorRT) / 35.5-50.4 seconds (PyTorch)

  • ~1500 residues: 32.1-41.3 seconds (TensorRT) / 51.3-68.5 seconds (PyTorch)

  • ~2000 residues: 85.8 seconds (TensorRT) / 123.1 seconds (PyTorch)

  • Confidence score: Typically > 0.7 for well-folded proteins

  • Memory usage: ~4-8 GB GPU memory per prediction

Binding Affinity Prediction (Boltz2 v1.3.0 on H100):

  • ~200 residues: 7.6-11.6 seconds (TensorRT) / 16.8-22.0 seconds (PyTorch)

  • ~500 residues: 13.6-16.6 seconds (TensorRT) / 25.2-26.4 seconds (PyTorch)

  • ~1000 residues: 35.2-50.8 seconds (TensorRT) / 50.4-79.3 seconds (PyTorch)

  • ~1500 residues: 35.7-50.8 seconds (TensorRT) / 61.6-80.9 seconds (PyTorch)

  • ~2000 residues: 89.2 seconds (TensorRT) / 130.7 seconds (PyTorch)

Boltz2 v1.3.0 Performance Results on H100#

Structure Prediction Performance#

The following table shows runtime performance for Boltz2 v1.3.0 on NVIDIA H100 GPUs for structure prediction across different sequence lengths:

| Sequence Length | PyTorch Runtime (s) | TensorRT Runtime (s) | Speedup |
|-----------------|---------------------|----------------------|---------|
| 186             | 5.67                | 2.60                 | 2.18x   |
| 331             | 7.58                | 4.61                 | 1.64x   |
| 375             | 8.56                | 5.25                 | 1.63x   |
| 384             | 11.42               | 7.95                 | 1.44x   |
| 530             | 13.07               | 9.00                 | 1.45x   |
| 575             | 13.53               | 9.28                 | 1.46x   |
| 616             | 13.47               | 9.00                 | 1.50x   |
| 623             | 15.55               | 10.92                | 1.42x   |
| 628             | 15.05               | 10.45                | 1.44x   |
| 684             | 16.59               | 11.77                | 1.41x   |
| 858             | 23.15               | 17.77                | 1.30x   |
| 1286            | 35.48               | 29.22                | 1.21x   |
| 1464            | 51.66               | 33.67                | 1.53x   |
| 1496            | 59.99               | 41.33                | 1.45x   |
| 1499            | 51.32               | 32.13                | 1.60x   |
| 1588            | 64.28               | 42.53                | 1.51x   |
| 1755            | 85.22               | 56.81                | 1.50x   |
| 2033            | 123.07              | 85.81                | 1.43x   |

Binding Affinity Prediction Performance#

The following table shows runtime performance for binding affinity prediction on NVIDIA H100 GPUs:

| Sequence Length | PyTorch Runtime (s) | TensorRT Runtime (s) | Speedup |
|-----------------|---------------------|----------------------|---------|
| 186             | 16.81               | 7.56                 | 2.22x   |
| 331             | 21.99               | 11.64                | 1.89x   |
| 375             | 23.18               | 10.10                | 2.29x   |
| 384             | 23.94               | 11.51                | 2.08x   |
| 530             | 25.21               | 13.98                | 1.80x   |
| 575             | 26.42               | 16.62                | 1.59x   |
| 616             | 25.61               | 13.64                | 1.88x   |
| 623             | 27.62               | 17.86                | 1.55x   |
| 628             | 29.14               | 17.12                | 1.70x   |
| 684             | 29.56               | 18.74                | 1.58x   |
| 858             | 36.50               | 23.68                | 1.54x   |
| 1286            | 50.41               | 35.24                | 1.43x   |
| 1464            | 68.45               | 41.88                | 1.63x   |
| 1496            | 79.26               | 50.75                | 1.56x   |
| 1499            | 61.59               | 35.66                | 1.73x   |
| 1588            | 80.94               | 49.85                | 1.62x   |
| 1755            | 93.29               | 60.88                | 1.53x   |
| 2033            | 130.67              | 89.20                | 1.47x   |

Performance Analysis#

Key Observations:

  • TensorRT Optimization: TensorRT consistently provides 1.2x to 2.3x speedup over PyTorch across all sequence lengths

  • Structure Prediction: Average speedup of ~1.5x with TensorRT, with higher speedups for shorter sequences

  • Affinity Prediction: Generally higher speedups (1.4x to 2.3x) compared to structure prediction, with peak performance at medium sequence lengths (300-400 residues)

  • Scaling Behavior: Performance scales approximately quadratically with sequence length for both PyTorch and TensorRT implementations

  • Memory Efficiency: TensorRT optimization also provides better memory utilization, enabling processing of larger proteins
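The scaling-behavior observation can be checked directly against the TensorRT numbers in the structure-prediction table. A quick sketch in pure Python, using two of the measured points (longer sequences are chosen because fixed per-request overhead makes very short sequences look sub-quadratic):

```python
import math

# TensorRT structure-prediction runtimes (seconds) from the table above.
points = {616: 9.00, 858: 17.77, 1286: 29.22, 2033: 85.81}

# Estimate the exponent b in t ≈ a * n**b from the log-log slope between
# the shortest and longest of these points.
n1, n2 = 616, 2033
b = math.log(points[n2] / points[n1]) / math.log(n2 / n1)
print(f"fitted scaling exponent: {b:.2f}")  # → fitted scaling exponent: 1.89
```

The fitted exponent is close to 2, consistent with the approximately quadratic scaling noted above.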

Recommended Configuration:

  • For development/testing: Use TensorRT with reduced sampling steps (25-50) for faster iteration

  • For production: Use TensorRT with default settings for optimal balance of speed and accuracy

  • For large proteins (>1500 residues): Consider domain-based approaches or increased GPU memory allocation

Performance Testing#

You can test basic performance and functionality with protein structure and binding affinity prediction:

import requests
import json
import time

def test_boltz2_structure_performance():
    """Test basic protein structure prediction performance."""
    url = "http://localhost:8000/biology/mit/boltz2/predict"
    
    # Test protein: Green Fluorescent Protein (GFP) - ~240 residues
    test_sequence = (
        "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFCYGD"
        "QIQEQYKGIPLDGDQVQAVNGHEFEIEGEGEGRPYEGTQTAQLNKFCDKLPVMHYKQFFDSGNYNTLS"
        "AKAGFPFKVPHTYNNSSFVVKQKPGMVFKFIHGKDPGLNGQTVFLMVGGISQNLSGSSNLGVGYTFVQ"
        "KTSVLLESEIKKRLRGFHTRGAVTQGLHQFVNLPTLVTQVLDGDMSQLLQVT"
    )
    
    data = {
        "polymers": [
            {
                "id": "A",
                "molecule_type": "protein",
                "sequence": test_sequence
            }
        ],
        "recycling_steps": 3,
        "sampling_steps": 50,
        "diffusion_samples": 1,
        "step_scale": 1.638,
        "output_format": "mmcif"
    }
    
    print("Starting Boltz-2 structure prediction performance test...")
    start_time = time.time()
    
    # Allow up to 10 minutes for long sequences before timing out
    response = requests.post(url, json=data, timeout=600)
    
    end_time = time.time()
    runtime = end_time - start_time
    
    if response.status_code == 200:
        result = response.json()
        confidence = result.get('confidence_scores', [0])[0]
        print(f"✓ Structure prediction successful")
        print(f"✓ Runtime: {runtime:.1f} seconds")
        print(f"✓ Confidence score: {confidence:.3f}")
        print(f"✓ Structure format: {result['structures'][0]['format']}")
        return True
    else:
        print(f"✗ Structure prediction failed: {response.status_code}")
        print(f"✗ Error: {response.text}")
        return False

def test_boltz2_affinity_performance():
    """Test binding affinity prediction performance."""
    url = "http://localhost:8000/biology/mit/boltz2/predict"
    
    # Smaller test protein for affinity testing - ~140 residues
    hemoglobin_alpha = (
        "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALT"
        "NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTV"
        "LTSKYR"
    )
    
    data = {
        "polymers": [
            {
                "id": "A",
                "molecule_type": "protein",
                "sequence": hemoglobin_alpha
            }
        ],
        "ligands": [
            {
                "id": "HEME",
                "smiles": "[Fe+2].C1=CC2=NC1=CC3=NC(=CC4=NC(=CC5=NC(=C2)C=C5)C=C4)C=C3",
                "predict_affinity": True
            }
        ],
        "recycling_steps": 3,
        "sampling_steps": 50,
        "sampling_steps_affinity": 100,  # Reduced for performance testing
        "diffusion_samples_affinity": 3,  # Reduced for performance testing
        "output_format": "mmcif"
    }
    
    print("Starting Boltz-2 binding affinity performance test...")
    start_time = time.time()
    
    # Allow up to 10 minutes for long sequences before timing out
    response = requests.post(url, json=data, timeout=600)
    
    end_time = time.time()
    runtime = end_time - start_time
    
    if response.status_code == 200:
        result = response.json()
        confidence = result.get('confidence_scores', [0])[0]
        print(f"✓ Affinity prediction successful")
        print(f"✓ Runtime: {runtime:.1f} seconds")
        print(f"✓ Confidence score: {confidence:.3f}")
        
        if result.get("affinities"):
            for ligand_id, affinity_data in result["affinities"].items():
                if affinity_data.get("affinity_pic50"):
                    # pIC50 is a dimensionless log-scale quantity
                    print(f"✓ Predicted pIC50 for {ligand_id}: {affinity_data['affinity_pic50'][0]:.2f}")
        return True
    else:
        print(f"✗ Affinity prediction failed: {response.status_code}")
        print(f"✗ Error: {response.text}")
        return False

if __name__ == "__main__":
    # Test both structure and affinity prediction
    structure_success = test_boltz2_structure_performance()
    print("\n" + "="*50 + "\n")
    affinity_success = test_boltz2_affinity_performance()
    
    if structure_success and affinity_success:
        print("\n✓ All performance tests passed!")
    else:
        print("\n✗ Some performance tests failed.")

Performance Optimization Tips#

  1. For development/testing: Use faster settings

    "recycling_steps": 2,
    "sampling_steps": 25
    
  2. For production quality: Use higher quality settings

    "recycling_steps": 5,
    "sampling_steps": 100
    
  3. For batch processing: Submit multiple concurrent requests with default settings

  4. For very large proteins (>1000 residues): Consider domain-based approaches or consult the literature for handling strategies
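Tip 3 can be sketched with a thread pool. This is a sketch assuming the same endpoint and request schema as the test script above; `build_payload` and `predict_batch` are illustrative helpers, not part of the NIM:

```python
import concurrent.futures

import requests

URL = "http://localhost:8000/biology/mit/boltz2/predict"

def build_payload(sequence):
    """Build a minimal structure-prediction request for one protein chain."""
    return {
        "polymers": [{"id": "A", "molecule_type": "protein", "sequence": sequence}],
        "recycling_steps": 3,
        "sampling_steps": 50,
        "output_format": "mmcif",
    }

def predict(sequence):
    """Submit one request and return the HTTP status code."""
    response = requests.post(URL, json=build_payload(sequence), timeout=600)
    return response.status_code

def predict_batch(sequences, max_workers=4):
    """Run several predictions concurrently with default settings."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(predict, sequences))

# Example (requires a running NIM):
# statuses = predict_batch(["MVLSPADKTNVKAAW", "MSKGEELFTGVVPIL"])
```

Keeping `max_workers` at or below the number of available GPUs avoids queuing requests that would otherwise contend for memory.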

Troubleshooting Performance Issues#

Common Issues and Solutions#

General Performance Issues#

  • Out of memory errors: Reduce sequence length, decrease sampling steps, or use fewer concurrent requests

  • Slow performance: Ensure fast storage (NVMe SSD), sufficient CPU cores (12+ per GPU), and adequate system RAM (48+ GB per GPU)

  • Poor quality predictions: Increase sampling steps, recycling steps, or check input sequence quality

Binding Affinity Specific Issues#

  • Affinity prediction timeouts: Reduce sampling_steps_affinity and diffusion_samples_affinity for faster results

  • Unrealistic affinity values: Enable affinity_mw_correction for metal-containing ligands and verify SMILES format

  • Memory errors with affinity: Binding affinity requires 2-3x more memory than structure prediction alone

  • Inconsistent affinity results: Increase diffusion_samples_affinity to 5-7 for more reliable estimates

  • Cannot predict affinity for multiple ligands: Only one ligand per request can have predict_affinity=True
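To work within the single-ligand affinity limit, flag only one ligand and include the rest without `predict_affinity`. A sketch (the SMILES shown are aspirin and caffeine, used purely as placeholders):

```python
# Only one ligand per request may set predict_affinity=True; the other
# ligands still participate in structure prediction, just without
# affinity output.
ligands = [
    {"id": "L1", "smiles": "CC(=O)Oc1ccccc1C(=O)O", "predict_affinity": True},
    {"id": "L2", "smiles": "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"},
]
flagged = [lig["id"] for lig in ligands if lig.get("predict_affinity")]
print(flagged)  # → ['L1']
```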

Note

For detailed performance tuning guidance specific to your deployment, refer to the Optimization section.