Performance#

Review performance benchmarks, memory requirements, and scaling characteristics for NIM for BMD.

Benchmark Metric#

The primary benchmark metric is the average time per molecular dynamics (MD) step per atom:

\[P = T / (N \times S)\]
  • P: Time per step per atom

  • T: Total wall-clock time

  • N: Number of atoms

  • S: Number of simulation steps

This metric normalizes performance across system sizes, so throughput can be compared directly between systems and GPUs. Lower values indicate higher throughput.
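As a minimal sketch of the formula above, the following Python function computes P from a measured wall-clock time. The function name and the sample numbers are illustrative assumptions and are not part of NIM for BMD.

```python
def time_per_step_per_atom(total_seconds: float, n_atoms: int, n_steps: int) -> float:
    """Return P = T / (N * S) in seconds per step per atom."""
    if n_atoms <= 0 or n_steps <= 0:
        raise ValueError("n_atoms and n_steps must be positive")
    return total_seconds / (n_atoms * n_steps)


# Hypothetical run: 10,000 atoms, 1,000 steps, 30.2 s of wall-clock time.
p_seconds = time_per_step_per_atom(30.2, 10_000, 1_000)
p_microseconds = p_seconds * 1e6  # convert to the unit used in the benchmark tables
print(f"{p_microseconds:.2f} us/atom/step")
```

Dividing by both N and S is what allows a short run on a large system to be compared with a long run on a small one.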

Batch Size Estimation#

At startup, NIM for BMD automatically estimates the optimal batch size for each GPU by running benchmark calculations with representative structures.

Example Batch Size Estimation Logs
INFO     | BMD:cuda:0 |  Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)
INFO     | BMD:cuda:0 |  Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)
INFO     | BMD:cuda:0 |  Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)
INFO     | BMD:cuda:0 |  Estimated batch size: 36876, max: 47136
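If you want to collect these values programmatically, a sketch such as the following can parse them. It assumes the log lines look exactly like the example above, which might not hold for every release; the function name `parse_batch_logs` is hypothetical.

```python
import re

# Patterns mirror the example log lines shown above. The exact log format
# is an assumption and may change between releases.
TIMING_RE = re.compile(r"Atom count (\d+) took ([\d.]+) ms")
BATCH_RE = re.compile(r"Estimated batch size: (\d+), max: (\d+)")


def parse_batch_logs(lines):
    """Return (timings, estimate) parsed from batch size estimation logs.

    timings is a list of (atom_count, elapsed_ms, us_per_atom) tuples.
    estimate is an (estimated, maximum) tuple, or None if no such line exists.
    """
    timings, estimate = [], None
    for line in lines:
        match = TIMING_RE.search(line)
        if match:
            atoms, ms = int(match.group(1)), float(match.group(2))
            # Recompute microseconds per atom from the raw timing.
            timings.append((atoms, ms, ms * 1000.0 / atoms))
            continue
        match = BATCH_RE.search(line)
        if match:
            estimate = (int(match.group(1)), int(match.group(2)))
    return timings, estimate


sample = [
    "INFO     | BMD:cuda:0 |  Atom count 40448 took 396.18 ms, 9.79 μs/atom (2 runs)",
    "INFO     | BMD:cuda:0 |  Atom count 39516 took 391.91 ms, 9.92 μs/atom (2 runs)",
    "INFO     | BMD:cuda:0 |  Atom count 41360 took 418.03 ms, 10.11 μs/atom (2 runs)",
    "INFO     | BMD:cuda:0 |  Estimated batch size: 36876, max: 47136",
]
timings, estimate = parse_batch_logs(sample)
for atoms, ms, us in timings:
    print(f"{atoms} atoms: {us:.2f} us/atom")
print("estimated batch size, max:", estimate)
```

The recomputed per-atom values agree with the ones printed in the logs, which confirms that the reported figure is the elapsed time divided by the atom count.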

Performance Results#

The following benchmarks use the MACE-MP-0b2-Large model for NVT simulations.

| GPU | Time per Atom (μs/atom/step) | Estimated Batch Size (Atoms) |
|---|---|---|
| RTX 6000 Ada | 7.34 | ~40,000 |
| B200 | 2.16 | ~198,000 |
| H100 | 3.02 | ~82,000 |
| A100 | 6.65 | ~65,000 |
| L40S | 6.17 | ~48,000 |

Note

Performance varies based on system size, model type, ensemble, and save interval. Choose a representative dataset (for example, bulk crystals or isolated molecules) when estimating throughput for your specific workload.
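For a rough planning estimate, the benchmark metric can be inverted to T = P × N × S. The sketch below applies that to the published time-per-atom values. It assumes your workload behaves like the benchmark configuration, which the note above cautions is not guaranteed; the function name is hypothetical.

```python
# Time per atom per step in microseconds, copied from the results above.
TIME_PER_ATOM_US = {
    "RTX 6000 Ada": 7.34,
    "B200": 2.16,
    "H100": 3.02,
    "A100": 6.65,
    "L40S": 6.17,
}


def estimate_wall_time_seconds(gpu: str, n_atoms: int, n_steps: int) -> float:
    """Estimate T = P * N * S using the benchmarked P for the given GPU."""
    return TIME_PER_ATOM_US[gpu] * 1e-6 * n_atoms * n_steps


# Hypothetical workload: 50,000 atoms for 100,000 steps on an H100.
seconds = estimate_wall_time_seconds("H100", 50_000, 100_000)
print(f"about {seconds / 3600:.1f} hours")
```

Treat the result as an order-of-magnitude figure, and measure a short run of your own system before committing to a long simulation.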

Model Characteristics#

Different machine learning interatomic potential (MLIP) models exhibit varying computational characteristics. MACE is the default, bundled model. TensorNet and AIMNet2 generally offer faster inference and lower memory usage.

| Model | Relative Speed | Memory Usage | Recommended Use Case |
|---|---|---|---|
| MACE | Fast | Moderate | General inorganic solids |
| TensorNet | Faster | Low | General inorganic solids |
| AIMNet2 | Faster | Low | Organic molecular systems |

Ensemble Characteristics#

Different ensembles introduce varying computational overhead.

| Ensemble | Relative Speed | Notes |
|---|---|---|
| NVE | Fastest | No thermostat or barostat overhead. |
| NVT | ~5% slower than NVE | Langevin thermostat adds random forces. |
| NPT | ~10% to 20% slower than NVT | Monte Carlo barostat attempts cell moves. |
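The relative figures compound, because NPT is quoted against NVT rather than NVE. The following sketch works through the arithmetic, under the assumption that "x% slower" means the time per step grows by x%.

```python
def scale_time(base_time: float, percent_slower: float) -> float:
    """Scale a time per step, assuming 'x% slower' means x% more time."""
    return base_time * (1.0 + percent_slower / 100.0)


nve = 1.0                        # normalized NVE time per step
nvt = scale_time(nve, 5)         # ~5% slower than NVE
npt_low = scale_time(nvt, 10)    # lower bound: ~10% slower than NVT
npt_high = scale_time(nvt, 20)   # upper bound: ~20% slower than NVT

print(f"NVT: {nvt:.3f}x NVE")
print(f"NPT: {npt_low:.3f}x to {npt_high:.3f}x NVE")
```

Under this reading, NPT takes roughly 16% to 26% more time per step than NVE.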

Scaling Considerations#

  • System size: Larger systems typically achieve better GPU utilization, which improves time-per-atom efficiency.

  • Multi-GPU: The NIM automatically distributes work across all available GPUs.

  • Trajectory output: High-frequency trajectory saving (a small save_interval) can cause an I/O bottleneck during fast simulations.

  • Barostat frequency: For NPT simulations, barostat_every controls the frequency of cell move attempts. Higher values reduce overhead.
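To see how the last two settings affect a run, the sketch below counts saved frames and barostat attempts. It assumes that save_interval and barostat_every are both expressed as a number of steps between events, and that a frame stores only three float32 coordinates per atom. Neither assumption is confirmed here, so the byte count is a lower-bound illustration only.

```python
def events_during_run(n_steps: int, interval: int) -> int:
    """Number of events when one event occurs every `interval` steps (assumed)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return n_steps // interval


def trajectory_bytes(n_atoms: int, n_steps: int, save_interval: int,
                     bytes_per_atom: int = 12) -> int:
    """Rough trajectory size in bytes.

    The default of 12 bytes per atom assumes three float32 coordinates and
    no velocities, forces, or metadata (a hypothetical storage layout).
    """
    return events_during_run(n_steps, save_interval) * n_atoms * bytes_per_atom


# Hypothetical workload: 50,000 atoms for 100,000 steps.
for save_interval in (10, 1_000):
    size_gb = trajectory_bytes(50_000, 100_000, save_interval) / 1e9
    print(f"save_interval={save_interval}: about {size_gb:.2f} GB")

for barostat_every in (1, 25):
    attempts = events_during_run(100_000, barostat_every)
    print(f"barostat_every={barostat_every}: {attempts} cell move attempts")
```

Raising either interval reduces the number of events proportionally, which is why larger values lower I/O and barostat overhead.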

Memory Requirements#

Memory usage scales with the system size. The following approximate values apply to MACE models:

| System Size | Approximate GPU Memory |
|---|---|
| 1,000 atoms | ~2 GB |
| 10,000 atoms | ~12 GB |
| 50,000 atoms | ~50 GB |
| 100,000 atoms | ~100 GB |

Note

Memory requirements vary based on model architecture, cell size, neighbor list size, and system density.
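For system sizes between the listed values, linear interpolation gives a first guess. The sketch below is an assumption-laden estimate for MACE models only: it uses the approximate figures above and does not account for density, cell size, or neighbor list size. The function name is hypothetical.

```python
from bisect import bisect_left

# (atoms, approximate GPU memory in GB) for MACE models, from the values above.
MEMORY_TABLE = [(1_000, 2.0), (10_000, 12.0), (50_000, 50.0), (100_000, 100.0)]


def estimate_memory_gb(n_atoms: int) -> float:
    """Linearly interpolate GPU memory between the published data points."""
    sizes = [size for size, _ in MEMORY_TABLE]
    if not sizes[0] <= n_atoms <= sizes[-1]:
        # Refuse to extrapolate beyond the published range.
        raise ValueError("n_atoms is outside the range covered by the table")
    i = bisect_left(sizes, n_atoms)
    if i == 0:
        return MEMORY_TABLE[0][1]
    (x0, y0), (x1, y1) = MEMORY_TABLE[i - 1], MEMORY_TABLE[i]
    return y0 + (n_atoms - x0) / (x1 - x0) * (y1 - y0)


print(f"25,000 atoms: about {estimate_memory_gb(25_000):.1f} GB")
```

The function deliberately raises an error outside the 1,000 to 100,000 atom range instead of extrapolating, because no data is given there.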

Benchmark Environment#

| Parameter | Value |
|---|---|
| Container Version | 1.0.0 |
| Model | MACE-MP-0b2-Large (default, auto-downloaded) |
| Ensemble | NVT, temperature=300K |
| PBC | Enabled |