Performance

Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BMD.

Benchmark Metric

The primary benchmark metric is the average time per molecular dynamics (MD) step per atom:

\[P = T / (N \times S)\]

P: Time per step per atom (reported in μs/atom/step)
T: Total wall-clock time
N: Number of atoms
S: Number of simulation steps

Because P normalizes by both the atom count and the step count, it allows throughput comparisons across systems of different sizes and simulations of different lengths.
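As a minimal sketch of the metric, the following Python computes P from a run's wall-clock time, atom count, and step count, then converts it to the μs/atom/step unit used in the tables. The run values in the example are illustrative inputs, not a measured benchmark, and the function names are hypothetical helpers, not part of the NIM.

```python
def time_per_atom_step(total_seconds: float, n_atoms: int, n_steps: int) -> float:
    """Return P = T / (N * S) in seconds per atom per step."""
    if n_atoms <= 0 or n_steps <= 0:
        raise ValueError("n_atoms and n_steps must be positive")
    return total_seconds / (n_atoms * n_steps)


def to_us(seconds: float) -> float:
    """Convert seconds to microseconds, the unit used in the tables."""
    return seconds * 1e6


# Illustrative run: 10,000 atoms, 1,000 steps, 30.2 s of wall-clock time.
p = time_per_atom_step(30.2, n_atoms=10_000, n_steps=1_000)
print(f"P = {to_us(p):.2f} us/atom/step")  # P = 3.02 us/atom/step
```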

Batch Size Estimation

At startup, NIM for BMD automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.
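This page does not describe how the NIM performs this estimation. One common approach, shown here as a hypothetical sketch and not as the NIM's implementation, is to double the batch until a trial run fails (for example, with an out-of-memory error), then binary-search between the last success and the first failure. The `fits` callback stands in for such a trial run and is assumed to be monotonic; `toy_fits` is a made-up memory model used only for illustration.

```python
from typing import Callable


def largest_fitting_batch(
    fits: Callable[[int], bool], start: int = 1_000, limit: int = 1_000_000
) -> int:
    """Return the largest batch size n (in atoms) with fits(n) True, capped at limit.

    Assumes fits is monotonic: if n atoms fit, every smaller batch fits too.
    Returns 0 if even `start` atoms do not fit.
    """
    if start > limit or not fits(start):
        return 0
    lo, hi = start, start * 2  # lo is known to fit
    # Phase 1: double the batch until a trial fails or the cap is passed.
    while hi <= limit and fits(hi):
        lo, hi = hi, hi * 2
    # Treat anything above the cap as "does not fit".
    hi = min(hi, limit + 1)
    # Phase 2: binary search; invariant: fits(lo) is True, hi does not fit.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo


# Toy stand-in for a trial run: assume 1 MB per atom and an 80 GB budget.
def toy_fits(n_atoms: int) -> bool:
    return n_atoms * 1_000_000 <= 80 * 10**9


print(largest_fitting_batch(toy_fits))  # 80000
```

Doubling first keeps the number of expensive trial runs logarithmic in the final batch size, which matters when each trial is a real MD step on the GPU.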

Performance Results

The following benchmarks use the MACE-MPA-0 model for NVT simulations.

GPU	Time per Atom per Step (μs/atom/step)	Estimated Batch Size (Atoms)
RTX 6000 Ada	7.34	~40,000
B200	2.16	~198,000
H100	3.02	~82,000
A100	6.65	~65,000
L40S	6.17	~48,000

Note

Performance varies based on system size, model type, ensemble, and save interval. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your specific workload.
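Rearranging the metric as T = P × N × S turns the results table into a rough planning tool. The sketch below assumes your workload behaves like the benchmark (MACE-MPA-0, NVT); as the note above says, real workloads can differ. The helper name is hypothetical.

```python
# Time per atom per step in microseconds, copied from the
# MACE-MPA-0 NVT benchmark table above.
P_US = {
    "RTX 6000 Ada": 7.34,
    "B200": 2.16,
    "H100": 3.02,
    "A100": 6.65,
    "L40S": 6.17,
}


def estimated_wall_seconds(gpu: str, n_atoms: int, n_steps: int) -> float:
    """Estimate T = P * N * S (the benchmark metric, rearranged)."""
    return P_US[gpu] * 1e-6 * n_atoms * n_steps


# Example: 100,000 steps of a 10,000-atom system on each GPU.
for gpu in P_US:
    hours = estimated_wall_seconds(gpu, n_atoms=10_000, n_steps=100_000) / 3600
    print(f"{gpu:>12}: ~{hours:.2f} h")
```

For the H100, this gives 3.02 × 10⁻⁶ s × 10,000 × 100,000 ≈ 3,020 s, or about 50 minutes. Small systems may not reach the table's per-atom rate, because larger systems use the GPU more efficiently.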

Model Characteristics

Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.

Model	Relative Speed	Memory Usage	Recommended Use Case
MACE	Fast	Moderate	General inorganic solids
TensorNet	Faster	Low	General inorganic solids
AIMNet2	Faster	Low	Organic molecular systems

Ensemble Characteristics

Different ensembles introduce varying computational overhead.

Ensemble	Relative Speed	Notes
NVE	Fastest	No thermostat or barostat overhead.
NVT	~5% slower than NVE	The Langevin thermostat adds friction and random forces.
NPT	~10% to 20% slower than NVT	The Monte Carlo barostat attempts cell moves, each requiring an extra energy evaluation.
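Because the ensemble overheads in the table are stated relative to different baselines, they compound. The sketch below converts them into wall-clock multipliers relative to NVE, using only the approximate percentages from the table; the function name is hypothetical.

```python
# Rough wall-clock multipliers from the ensemble table:
# NVT is ~5% slower than NVE; NPT is ~10% to 20% slower than NVT.
NVT_OVER_NVE = 1.05
NPT_OVER_NVT = (1.10, 1.20)


def cost_vs_nve(ensemble: str) -> tuple:
    """Return a (low, high) wall-clock multiplier relative to NVE."""
    if ensemble == "NVE":
        return (1.0, 1.0)
    if ensemble == "NVT":
        return (NVT_OVER_NVE, NVT_OVER_NVE)
    if ensemble == "NPT":
        # The two overheads compound multiplicatively.
        lo, hi = NPT_OVER_NVT
        return (NVT_OVER_NVE * lo, NVT_OVER_NVE * hi)
    raise ValueError(f"unknown ensemble: {ensemble}")


print(tuple(round(x, 3) for x in cost_vs_nve("NPT")))  # (1.155, 1.26)
```

Since the performance results were measured with NVT, an NPT run on an H100 would, by this rough estimate, land near 3.02 × (1.10 to 1.20) ≈ 3.3 to 3.6 μs/atom/step.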

Scaling Considerations

System size: Larger systems typically achieve better GPU utilization, which lowers the time per atom per step.
Multi-GPU: The NIM automatically distributes work across all available GPUs.
Trajectory output: Frequent trajectory saving (a small `save_interval`) can make I/O the bottleneck in fast simulations.
Barostat frequency: For NPT simulations, `barostat_every` sets the number of steps between cell-move attempts. Higher values mean fewer attempts and less overhead.
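To gauge the effect of `save_interval` and `barostat_every`, the sketch below counts trajectory frames and cell-move attempts over a run and estimates raw trajectory size. It assumes one frame every `save_interval` steps and one cell-move attempt every `barostat_every` steps, and it assumes each frame stores only positions as three 32-bit floats per atom (12 bytes); the NIM's actual output format may store more per frame. The function is a hypothetical helper.

```python
from typing import Optional


def output_estimates(
    n_atoms: int,
    n_steps: int,
    save_interval: int,
    barostat_every: Optional[int] = None,
    bytes_per_atom_per_frame: int = 12,  # assumption: x, y, z as float32
) -> dict:
    """Count saved frames and barostat attempts for a run (rough sketch)."""
    frames = n_steps // save_interval  # assumes one frame every save_interval steps
    estimate = {
        "frames": frames,
        "bytes": frames * n_atoms * bytes_per_atom_per_frame,
    }
    if barostat_every is not None:
        # Assumes one cell-move attempt every barostat_every steps (NPT only).
        estimate["cell_move_attempts"] = n_steps // barostat_every
    return estimate


# Same 50,000-atom, 100,000-step NPT run with two output settings.
print(output_estimates(50_000, 100_000, save_interval=10, barostat_every=25))
print(output_estimates(50_000, 100_000, save_interval=1_000, barostat_every=100))
```

Under these assumptions, `save_interval=10` writes 10,000 frames (about 6 GB of positions) and 4,000 cell-move attempts with `barostat_every=25`, whereas `save_interval=1000` writes 100 frames (about 60 MB) and `barostat_every=100` makes 1,000 attempts.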

Memory Requirements

Memory usage scales with the system size. The following approximate values apply to MACE models:

System Size	Approximate GPU Memory
1,000 atoms	~2 GB
10,000 atoms	~12 GB
50,000 atoms	~50 GB
100,000 atoms	~100 GB

Note

Memory requirements vary based on model architecture, cell size, neighbor list size, and system density.
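To estimate memory for sizes between the table rows, the sketch below linearly interpolates between adjacent rows of the MACE table and extrapolates linearly beyond them. This is an assumption layered on approximate values; as the note above says, model architecture, cell size, neighbor list size, and density all shift the real figure. The function name is hypothetical.

```python
from bisect import bisect_left

# (atoms, approximate GPU memory in GB) for MACE models, from the table above.
MACE_MEMORY_GB = [(1_000, 2.0), (10_000, 12.0), (50_000, 50.0), (100_000, 100.0)]


def estimate_memory_gb(n_atoms: int) -> float:
    """Piecewise-linear interpolation between table rows; linear extrapolation outside."""
    xs = [x for x, _ in MACE_MEMORY_GB]
    i = bisect_left(xs, n_atoms)
    i = min(max(i, 1), len(xs) - 1)  # clamp to a valid segment [i - 1, i]
    (x0, y0), (x1, y1) = MACE_MEMORY_GB[i - 1], MACE_MEMORY_GB[i]
    return y0 + (y1 - y0) * (n_atoms - x0) / (x1 - x0)


print(f"{estimate_memory_gb(30_000):.1f} GB")  # 31.0 GB
```

For example, a 30,000-atom system falls between the 10,000- and 50,000-atom rows and interpolates to about 31 GB. Extrapolated values outside the table range are less reliable than interpolated ones.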

Benchmark Environment

Parameter	Value
Container Version	`1.0.0`
Model	MACE-MPA-0 (default, auto-downloaded)
Ensemble	NVT, temperature 300 K
PBC	Enabled