Performance#
Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BMD.
Benchmark Metric#
The primary benchmark metric is the average wall-clock time per molecular dynamics (MD) step per atom:
$$P = \frac{T}{N \times S}$$
where:
P: Time per step per atom
T: Total wall-clock time
N: Number of atoms
S: Number of simulation steps
This metric normalizes performance across system sizes and facilitates throughput comparison.
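As a minimal sketch, the metric can be computed by dividing the total wall-clock time by the product of atom count and step count. The function name and the example run below are illustrative assumptions, not part of the NIM API.

```python
def time_per_step_per_atom_us(total_seconds: float, n_atoms: int, n_steps: int) -> float:
    """Return P = T / (N * S), converted to microseconds per atom per step."""
    if n_atoms <= 0 or n_steps <= 0:
        raise ValueError("n_atoms and n_steps must be positive")
    return total_seconds / (n_atoms * n_steps) * 1e6


# Hypothetical run: 10,000 atoms, 1,000 steps, 30.2 s of wall-clock time.
p = time_per_step_per_atom_us(30.2, 10_000, 1_000)
print(f"{p:.2f} us/atom/step")
```

Because the result is normalized by both atoms and steps, runs of different sizes and lengths can be compared directly.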
Batch Size Estimation#
At startup, NIM for BMD automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.
Example Batch Size Estimation Logs
INFO | BMD:cuda:0 | Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)
INFO | BMD:cuda:0 | Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)
INFO | BMD:cuda:0 | Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)
INFO | BMD:cuda:0 | Estimated batch size: 36876, max: 47136
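The per-atom figures in these log lines can be reproduced from the atom count and elapsed time. The sketch below assumes the log format is exactly as in the example above (other versions may log differently); the regular expression and helper name are illustrative.

```python
import re
from typing import Optional

# Pattern matched to the example log lines only; this is an assumption
# about the format, not a documented log schema.
LINE_RE = re.compile(r"Atom count (\d+) took ([\d.]+) ms")


def per_atom_us(line: str) -> Optional[float]:
    """Return microseconds per atom for a benchmark log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    atoms = int(match.group(1))
    elapsed_ms = float(match.group(2))
    return elapsed_ms * 1000.0 / atoms


logs = [
    "INFO | BMD:cuda:0 | Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)",
    "INFO | BMD:cuda:0 | Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)",
    "INFO | BMD:cuda:0 | Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)",
    "INFO | BMD:cuda:0 | Estimated batch size: 36876, max: 47136",
]

# Lines that are not benchmark measurements (such as the final estimate) are skipped.
values = [v for v in (per_atom_us(line) for line in logs) if v is not None]
print([round(v, 2) for v in values])
```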
Performance Results#
The following benchmarks use the MACE-MP-0b2-Large model for NVT simulations.
| GPU | Time per Atom (μs/atom/step) | Estimated Batch Size (Atoms) |
|---|---|---|
| RTX 6000 Ada | 7.34 | ~40,000 |
| B200 | 2.16 | ~198,000 |
| H100 | 3.02 | ~82,000 |
| A100 | 6.65 | ~65,000 |
| L40S | 6.17 | ~48,000 |
Note
Performance varies based on system size, model type, ensemble, and save interval. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your specific workload.
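For rough planning, the time-per-atom figures from the table above can be turned into a wall-clock estimate by multiplying by the atom count and step count. This sketch assumes the per-atom cost stays constant for your system, which, as the note says, will not hold exactly. The function name and the example workload are hypothetical.

```python
# us/atom/step values copied from the table above (MACE-MP-0b2-Large, NVT).
TIME_PER_ATOM_US = {
    "RTX 6000 Ada": 7.34,
    "B200": 2.16,
    "H100": 3.02,
    "A100": 6.65,
    "L40S": 6.17,
}


def estimate_wall_clock_seconds(gpu: str, n_atoms: int, n_steps: int) -> float:
    """Estimate T = P * N * S, assuming the benchmark P applies unchanged."""
    return TIME_PER_ATOM_US[gpu] * 1e-6 * n_atoms * n_steps


# Hypothetical workload: 10,000 atoms for 100,000 steps.
for gpu in TIME_PER_ATOM_US:
    seconds = estimate_wall_clock_seconds(gpu, 10_000, 100_000)
    print(f"{gpu}: {seconds / 60:.1f} min")
```

Treat the result as an order-of-magnitude guide and confirm it with a short run on representative structures.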
Model Characteristics#
Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.
| Model | Relative Speed | Memory Usage | Recommended Use Case |
|---|---|---|---|
| MACE | Fast | Moderate | General inorganic solids |
| TensorNet | Faster | Low | General inorganic solids |
| AIMNet2 | Faster | Low | Organic molecular systems |
Ensemble Characteristics#
Different ensembles introduce varying computational overhead.
| Ensemble | Relative Speed | Notes |
|---|---|---|
| NVE | Fastest | No thermostat or barostat overhead. |
| NVT | ~5% slower than NVE | Langevin thermostat adds random forces. |
| NPT | ~10% to 20% slower than NVT | Monte Carlo barostat attempts cell moves. |
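The relative figures above can be applied to an NVT benchmark value to get a rough range for the other ensembles. This sketch assumes "X% slower" means X% more time per step, which is one possible reading of the table. The function name is hypothetical.

```python
from typing import Tuple


def ensemble_time_range_us(nvt_us: float, ensemble: str) -> Tuple[float, float]:
    """Return an approximate (low, high) time per atom per step, scaled from an NVT value.

    Assumption: "X% slower" is interpreted as X% more time per step.
    """
    if ensemble == "NVT":
        return (nvt_us, nvt_us)
    if ensemble == "NVE":
        # NVT is ~5% slower than NVE, so NVE is the NVT time divided by 1.05.
        return (nvt_us / 1.05, nvt_us / 1.05)
    if ensemble == "NPT":
        # NPT is ~10% to 20% slower than NVT.
        return (nvt_us * 1.10, nvt_us * 1.20)
    raise ValueError(f"unknown ensemble: {ensemble}")


# Example: scale an NVT value of 3.02 us/atom/step to NPT.
low, high = ensemble_time_range_us(3.02, "NPT")
print(f"NPT: {low:.2f} to {high:.2f} us/atom/step")
```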
Scaling Considerations#
System size: Larger systems typically achieve better GPU utilization, which improves time-per-atom efficiency.
Multi-GPU: The NIM automatically distributes work across all available GPUs.
Trajectory output: High-frequency trajectory saving (a small `save_interval`) can cause an I/O bottleneck during fast simulations.
Barostat frequency: For NPT simulations, `barostat_every` controls the frequency of cell move attempts. Higher values reduce overhead.
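To see how these two parameters affect overhead, it can help to count how many trajectory writes or cell move attempts a run triggers. The sketch below assumes both `save_interval` and `barostat_every` are expressed in MD steps. The helper name and the step counts are hypothetical.

```python
def event_count(n_steps: int, every: int) -> int:
    """Number of events fired when one event occurs every `every` steps (assumed unit: MD steps)."""
    if every <= 0:
        raise ValueError("interval must be a positive number of steps")
    return n_steps // every


n_steps = 1_000_000  # hypothetical run length

# Saving every 10 steps triggers 100x more writes than saving every 1,000 steps.
print(event_count(n_steps, 10), event_count(n_steps, 1_000))

# Likewise, a larger barostat interval means fewer cell move attempts.
print(event_count(n_steps, 25), event_count(n_steps, 100))
```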
Memory Requirements#
Memory usage scales with the system size. The following approximate values apply to MACE models:
| System Size | Approximate GPU Memory |
|---|---|
| 1,000 atoms | ~2 GB |
| 10,000 atoms | ~12 GB |
| 50,000 atoms | ~50 GB |
| 100,000 atoms | ~100 GB |
Note
Memory requirements vary based on model architecture, cell size, neighbor list size, and system density.
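For system sizes between the listed values, a rough figure can be obtained by interpolating linearly between neighboring table rows. This is only a sketch: linear interpolation is an assumption, the table values are themselves approximate, and actual usage depends on the factors in the note above. The function name is hypothetical.

```python
# (atoms, approximate GPU memory in GB) copied from the MACE table above.
MEMORY_TABLE = [(1_000, 2.0), (10_000, 12.0), (50_000, 50.0), (100_000, 100.0)]


def estimate_memory_gb(n_atoms: int) -> float:
    """Piecewise-linear estimate of GPU memory; an assumption, not a guarantee."""
    if n_atoms <= MEMORY_TABLE[0][0]:
        # Below the smallest row, fall back to the smallest listed value.
        return MEMORY_TABLE[0][1]
    for (n0, m0), (n1, m1) in zip(MEMORY_TABLE, MEMORY_TABLE[1:]):
        if n_atoms <= n1:
            break
    # Beyond the largest row, the loop ends on the last segment,
    # so the estimate extrapolates along its slope.
    return m0 + (n_atoms - n0) * (m1 - m0) / (n1 - n0)


print(estimate_memory_gb(25_000))
```

Comparing the estimate with the memory available on the target GPU gives a quick indication of whether a system is likely to fit before a run is launched.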
Benchmark Environment#
| Parameter | Value |
|---|---|
| Container Version | |
| Model | MACE-MP-0b2-Large (default, auto-downloaded) |
| Ensemble | NVT, temperature=300K |
| PBC | Enabled |