Performance#
Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BGR.
Benchmark Metric#
The primary benchmark metric is the average time per atom per optimization step:

P = T / Σ_{i=1}^{N} (A_i × S_i)

where:

- P: Time per atom per optimization step
- T: Total time to perform optimization of all structures in the dataset
- N: Number of structures
- A_i: Number of atoms in structure i
- S_i: Number of steps required to optimize structure i

This metric normalizes performance across workloads where both the atom count and the number of optimization steps vary by structure.
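The metric above can be computed directly from per-structure timings. The following is a minimal sketch; the function name and the workload numbers are illustrative, not part of the NIM API:

```python
# Hypothetical helper: compute P = T / sum_i(A_i * S_i), in microseconds.
def time_per_atom_step(total_time_s, structures):
    """total_time_s: total wall-clock time in seconds.
    structures: list of (atom_count, optimization_steps) pairs."""
    atom_steps = sum(atoms * steps for atoms, steps in structures)
    return total_time_s / atom_steps * 1e6  # seconds -> microseconds

# Example workload: 3 structures with made-up atom counts and step counts.
structures = [(40448, 120), (39516, 95), (41360, 140)]
p = time_per_atom_step(107.3, structures)  # 107.3 s total wall time (assumed)
print(f"{p:.2f} us/atom/step")
```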
Batch Size Estimation#
At startup, NIM for BGR automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.
Example batch size estimation logs:

```
INFO | BGR:cuda:0 | Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)
INFO | BGR:cuda:0 | Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)
INFO | BGR:cuda:0 | Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)
INFO | BGR:cuda:0 | Estimated batch size: 36876, max: 47136
```
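The per-atom figures in those logs are simply elapsed time divided by atom count. A minimal sketch of that calculation, assuming the log values above (the function name is hypothetical, not NIM's internal code):

```python
# Illustrative: average the per-atom cost over several benchmark runs,
# as reported in the estimation logs.
def mean_us_per_atom(runs):
    """runs: list of (atom_count, elapsed_ms) measurements.
    Returns the average cost in microseconds per atom."""
    return sum(ms * 1000.0 / atoms for atoms, ms in runs) / len(runs)

# Values taken from the example logs above.
runs = [(40448, 396.18), (39516, 391.91), (41360, 418.03)]
print(f"{mean_us_per_atom(runs):.2f} us/atom")  # ~9.94 us/atom
```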
Performance Results#
The following benchmarks use the MACE-MP-0b2-Large model and the OMat24 dataset.
| GPU | Time per Atom (μs/atom/step) | Estimated Batch Size (atoms) |
|---|---|---|
| RTX 6000 Ada | 7.34 | ~40,000 |
| B200 | 2.16 | ~198,000 |
| H100 | 3.02 | ~82,000 |
| A100 | 6.65 | ~65,000 |
| L40S | 6.17 | ~48,000 |
Note
Performance varies based on structure size, model type, and whether cell optimization is enabled. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your workload.
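The table also supports quick back-of-the-envelope runtime estimates. In the sketch below, only the 3.02 μs/atom/step figure comes from the table; the workload numbers (structure count, average size, average steps) are made up for illustration:

```python
# Estimate total wall-clock time: T = P * N * A_avg * S_avg.
us_per_atom_step = 3.02       # H100 figure from the benchmark table
n_structures = 10_000         # assumed workload size
avg_atoms = 100               # assumed average atoms per structure
avg_steps = 200               # assumed average optimization steps

total_s = us_per_atom_step * n_structures * avg_atoms * avg_steps / 1e6
print(f"~{total_s / 60:.0f} minutes on one GPU")  # ~10 minutes
```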
Model Characteristics#
Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.
| Model | Relative Speed | Memory Usage | Recommended Use Case |
|---|---|---|---|
| MACE | Fast | Moderate | General inorganic solids |
| TensorNet | Faster | Low | General inorganic solids |
| AIMNet2 | Faster | Low | Organic molecular systems |
Scaling Considerations#
The following factors influence the overall throughput and resource utilization of NIM for BGR:
- **Multi-GPU scaling**: NIM for BGR automatically distributes work across all available GPUs, with each GPU maintaining its own batch queue.
- **Structure size**: Larger structures require more memory per structure, which reduces the effective batch size.
- **Cell optimization**: Enabling cell optimization adds stress tensor calculations and slightly increases computational cost.
- **DFT-D3 corrections**: Dispersion corrections add modest overhead to each evaluation.
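One way to picture the per-GPU batch queues is a greedy least-loaded dispatcher. This is a conceptual sketch of that idea, assumed for illustration only, not NIM's actual scheduling code:

```python
import heapq

# Hypothetical: assign structures to per-GPU queues, balancing total atoms.
def assign_structures(atom_counts, n_gpus):
    """Greedy least-loaded assignment. Returns one list of structure
    indices per GPU, so total atoms per queue stay roughly balanced."""
    heap = [(0, gpu, []) for gpu in range(n_gpus)]  # (load, gpu_id, indices)
    heapq.heapify(heap)
    for i, atoms in enumerate(atom_counts):
        load, gpu, items = heapq.heappop(heap)  # least-loaded queue
        items.append(i)
        heapq.heappush(heap, (load + atoms, gpu, items))
    queues = [None] * n_gpus
    for load, gpu, items in heap:
        queues[gpu] = items
    return queues
```

Because each GPU drains its own queue, a queue holding a few large structures and one holding many small structures finish in roughly the same time when total atom counts are balanced.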
Benchmark Environment#
The following parameters define the benchmark reference environment:
| Parameter | Value |
|---|---|
| Container Version | |
| Model | MACE-MP-0b2-Large (default, auto-downloaded) |
| Dataset | Representative crystal structures (varying atom counts) |
| Optimization | FIRE2, |
| PBC | Enabled |