Performance#

Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BGR.

Benchmark Metric#

The primary benchmark metric is the average time per atom per optimization step:

\[P = T / \sum_{i=1}^{N} A_i S_i\]
  • P: Time per atom per optimization step

  • T: Total time to perform optimization of all structures in the dataset

  • N: Number of structures

  • A_i: Number of atoms in structure i

  • S_i: Number of steps required to optimize structure i

This metric normalizes performance across workloads where both the atom count and the number of optimization steps vary by structure.
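The metric can be computed directly from per-structure bookkeeping. The following is a minimal sketch with illustrative (not benchmark) numbers:

```python
# Computing P = T / sum(A_i * S_i) for a small example workload.
# The structure sizes, step counts, and total time below are illustrative only.

atoms_per_structure = [128, 256, 512]   # A_i: atom count of each structure
steps_per_structure = [40, 55, 70]      # S_i: optimization steps to converge
total_time_s = 1.25                     # T: wall-clock time for the whole dataset (seconds)

# Denominator: total atom-steps performed across the dataset.
atom_steps = sum(a * s for a, s in zip(atoms_per_structure, steps_per_structure))

# Convert to microseconds per atom per optimization step.
p_us = total_time_s / atom_steps * 1e6

print(f"P = {p_us:.2f} us/atom/step")
```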

Batch Size Estimation#

At startup, NIM for BGR automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.

Example Batch Size Estimation Logs
INFO     | BGR:cuda:0 |  Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)
INFO     | BGR:cuda:0 |  Estimated batch size: 36876, max: 47136
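The per-atom figures in these log lines follow directly from the reported timings: elapsed milliseconds divided by atom count, scaled to microseconds. A quick check:

```python
# Reproducing the us/atom figures from the log lines above:
# (atom_count, elapsed_ms) pairs taken verbatim from the log output.
runs = [
    (40448, 396.18),
    (39516, 391.91),
    (41360, 418.03),
]

# elapsed_ms / atoms gives ms/atom; multiply by 1000 for us/atom.
per_atom_us = [elapsed_ms / atoms * 1000 for atoms, elapsed_ms in runs]

for (atoms, _), us in zip(runs, per_atom_us):
    print(f"Atom count {atoms}: {us:.2f} us/atom")
```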

Performance Results#

The following benchmarks use the MACE-MP-0b2-Large model and the OMat24 dataset.

GPU          | Time per Atom (μs/atom/step) | Estimated Batch Size (Atoms)
-------------|------------------------------|-----------------------------
RTX 6000 Ada | 7.34                         | ~40,000
B200         | 2.16                         | ~198,000
H100         | 3.02                         | ~82,000
A100         | 6.65                         | ~65,000
L40S         | 6.17                         | ~48,000

Note

Performance varies based on structure size, model type, and whether cell optimization is enabled. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your workload.
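The table values support back-of-envelope wall-time estimates: total time ≈ P × atoms × steps. The workload below (10,000 atoms, 100 steps) is illustrative, not a benchmark:

```python
# Estimating total optimization time from the P values in the table above.
# Workload size (10,000 atoms, 100 steps) is a hypothetical example.
p_us = {
    "RTX 6000 Ada": 7.34,
    "B200": 2.16,
    "H100": 3.02,
    "A100": 6.65,
    "L40S": 6.17,
}
atoms, steps = 10_000, 100

# time (s) = P (us/atom/step) * atoms * steps / 1e6
est_s = {gpu: p * atoms * steps / 1e6 for gpu, p in p_us.items()}

for gpu, t in est_s.items():
    print(f"{gpu}: ~{t:.1f} s")
```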

Model Characteristics#

Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.

Model     | Relative Speed | Memory Usage | Recommended Use Case
----------|----------------|--------------|--------------------------
MACE      | Fast           | Moderate     | General inorganic solids
TensorNet | Faster         | Low          | General inorganic solids
AIMNet2   | Faster         | Low          | Organic molecular systems

Scaling Considerations#

The following factors influence the overall throughput and resource utilization of NIM for BGR:

  • Multi-GPU scaling: NIM for BGR automatically distributes work across all available GPUs, with each GPU maintaining its own batch queue.

  • Structure size: Larger structures require more memory per structure, which reduces the effective batch size.

  • Cell optimization: Enabling cell optimization adds stress tensor calculations and slightly increases computational cost.

  • DFT-D3 corrections: Dispersion corrections add modest overhead to each evaluation.
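The interaction between structure size and batch size can be sketched with a simple greedy packer. This is an illustrative model only, not NIM's actual scheduler; the atom budget mirrors the estimated batch size from the startup logs:

```python
# Illustrative sketch (not NIM's actual scheduling logic): greedily group
# structures into batches whose total atom count stays under a per-GPU budget.
def pack_batches(structure_atom_counts, batch_budget):
    """Return lists of structure indices, each list one batch under the budget."""
    batches, current, total = [], [], 0
    for idx, atoms in enumerate(structure_atom_counts):
        # Flush the current batch when adding this structure would exceed
        # the budget; a single oversized structure still gets its own batch.
        if current and total + atoms > batch_budget:
            batches.append(current)
            current, total = [], 0
        current.append(idx)
        total += atoms
    if current:
        batches.append(current)
    return batches

# Hypothetical workload against the ~36,876-atom budget from the example logs.
print(pack_batches([8000, 12000, 30000, 5000, 26000], batch_budget=36_876))
```

Larger structures fill the budget faster, so fewer fit per batch, which is why effective batch size drops as structure size grows.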

Benchmark Environment#

The following parameters define the benchmark reference environment:

Parameter         | Value
------------------|-------------------------------------------------------
Container Version | ${__container_version}
Model             | MACE-MP-0b2-Large (default, auto-downloaded)
Dataset           | Representative crystal structures (varying atom counts)
Optimization      | FIRE2, opttol = 0.005 eV/Å
PBC               | Enabled