Performance#

Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BGR.

Benchmark Metric#

The primary benchmark metric is the average time per atom per optimization step:

\[P = T / \sum_{i=1}^{N} A_i S_i\]
  • P: Time per atom per optimization step

  • T: Total time to perform optimization of all structures in the dataset

  • N: Number of structures

  • A_i: Number of atoms in structure i

  • S_i: Number of steps required to optimize structure i

This metric normalizes performance across workloads where both the atom count and the number of optimization steps vary by structure.
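The metric can be computed directly from per-structure bookkeeping. The following is a minimal sketch with illustrative (not benchmark) numbers:

```python
# Computing P = T / sum(A_i * S_i) for a small example workload.
# The structure sizes, step counts, and total time below are illustrative only.

atoms_per_structure = [128, 256, 512]   # A_i: atom count of each structure
steps_per_structure = [40, 55, 70]      # S_i: optimization steps to converge
total_time_s = 1.25                     # T: wall-clock time for the whole dataset (seconds)

# Denominator: total atom-steps performed across the dataset.
atom_steps = sum(a * s for a, s in zip(atoms_per_structure, steps_per_structure))

# Convert to microseconds per atom per optimization step.
p_us = total_time_s / atom_steps * 1e6

print(f"P = {p_us:.2f} us/atom/step")
```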

Batch Size Estimation#

At startup, NIM for BGR automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.

Example Batch Size Estimation Logs
INFO     | BGR:cuda:0 |  Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Estimated batch size: 36876, max: 47136
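The per-atom figures in these log lines follow directly from the reported timings: elapsed milliseconds divided by atom count, scaled to microseconds. A quick check:

```python
# Reproducing the us/atom figures from the log lines above:
# (atom_count, elapsed_ms) pairs taken verbatim from the log output.
runs = [
    (40448, 396.18),
    (39516, 391.91),
    (41360, 418.03),
]

# elapsed_ms / atoms gives ms/atom; multiply by 1000 for us/atom.
per_atom_us = [elapsed_ms / atoms * 1000 for atoms, elapsed_ms in runs]

for (atoms, _), us in zip(runs, per_atom_us):
    print(f"Atom count {atoms}: {us:.2f} us/atom")
```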

Performance Results#

The following benchmarks use the MACE-MP-0b2-Large model and the OMat24 dataset.

GPU          | Time per Atom (μs/atom/step) | Estimated Batch Size (Atoms)
-------------|------------------------------|-----------------------------
RTX 6000 Ada | 7.34                         | ~40,000
B200         | 2.16                         | ~198,000
H100         | 3.02                         | ~82,000
A100         | 6.65                         | ~65,000
L40S         | 6.17                         | ~48,000

Note

Performance varies based on structure size, model type, and whether cell optimization is enabled. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your workload.
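The table values support back-of-envelope wall-time estimates: total time ≈ P × atoms × steps. The workload below (10,000 atoms, 100 steps) is illustrative, not a benchmark:

```python
# Estimating total optimization time from the P values in the table above.
# Workload size (10,000 atoms, 100 steps) is a hypothetical example.
p_us = {
    "RTX 6000 Ada": 7.34,
    "B200": 2.16,
    "H100": 3.02,
    "A100": 6.65,
    "L40S": 6.17,
}
atoms, steps = 10_000, 100

# time (s) = P (us/atom/step) * atoms * steps / 1e6
est_s = {gpu: p * atoms * steps / 1e6 for gpu, p in p_us.items()}

for gpu, t in est_s.items():
    print(f"{gpu}: ~{t:.1f} s")
```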

Model Characteristics#

Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.

Model     | Relative Speed | Memory Usage | Recommended Use Case
----------|----------------|--------------|--------------------------
MACE      | Fast           | Moderate     | General inorganic solids
TensorNet | Faster         | Low          | General inorganic solids
AIMNet2   | Faster         | Low          | Organic molecular systems

Scaling Considerations#

The following factors influence the overall throughput and resource utilization of NIM for BGR:

  • Multi-GPU scaling: NIM for BGR automatically distributes work across all available GPUs, with each GPU maintaining its own batch queue.

  • Structure size: Larger structures require more memory per structure, which reduces the effective batch size.

  • Cell optimization: Enabling cell optimization adds stress tensor calculations and slightly increases computational cost.

  • DFT-D3 corrections: Dispersion corrections add modest overhead to each evaluation.
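The interaction between structure size and batch size can be sketched with a simple greedy packer. This is an illustrative model only, not NIM's actual scheduler; the atom budget mirrors the estimated batch size from the startup logs:

```python
# Illustrative sketch (not NIM's actual scheduling logic): greedily group
# structures into batches whose total atom count stays under a per-GPU budget.
def pack_batches(structure_atom_counts, batch_budget):
    """Return lists of structure indices, each list one batch under the budget."""
    batches, current, total = [], [], 0
    for idx, atoms in enumerate(structure_atom_counts):
        # Flush the current batch when adding this structure would exceed
        # the budget; a single oversized structure still gets its own batch.
        if current and total + atoms > batch_budget:
            batches.append(current)
            current, total = [], 0
        current.append(idx)
        total += atoms
    if current:
        batches.append(current)
    return batches

# Hypothetical workload against the ~36,876-atom budget from the example logs.
print(pack_batches([8000, 12000, 30000, 5000, 26000], batch_budget=36_876))
```

Larger structures fill the budget faster, so fewer fit per batch, which is why effective batch size drops as structure size grows.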

Benchmark Environment#

The following parameters define the benchmark reference environment:

Parameter         | Value
------------------|-------------------------------------------------------
Container Version | ${__container_version}
Model             | MACE-MP-0b2-Large (default, auto-downloaded)
Dataset           | Representative crystal structures (varying atom counts)
Optimization      | FIRE2, opttol = 0.005 eV/Å
PBC               | Enabled