Important
NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.
Mamba2 and Hybrid Models
State Space Models (SSMs) have recently emerged as a promising alternative to transformers. SSMs offer advantages such as linear time complexity relative to sequence length and a constant cache size for inference. These features enable the processing of longer sequences and higher throughput. Despite these benefits, SSMs alone may fall short compared to transformers on tasks that demand strong copying or in-context learning capabilities.
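To make the constant-cache point concrete, the toy sketch below (NumPy only, not NeMo code; all dimensions and matrices are illustrative assumptions) steps a discretized linear state space recurrence through a sequence: each token update reads and writes a fixed-size state, so per-token cost stays constant and the full sequence is processed in linear time.

```python
import numpy as np

# Toy illustration of a linear state space recurrence (not NeMo code).
# The recurrent state h has a fixed size no matter how many tokens have been
# processed, which is why SSM inference needs only a constant-size "cache"
# and runs in time linear in the sequence length.
d_state, d_in = 16, 4                      # illustrative dimensions
A = 0.9 * np.eye(d_state)                  # simplified state transition (real SSMs use structured A)
B = 0.1 * np.random.randn(d_state, d_in)
C = 0.1 * np.random.randn(d_in, d_state)

h = np.zeros(d_state)                      # constant-size state, independent of sequence length
outputs = []
for x_t in np.random.randn(1000, d_in):    # 1,000-token sequence, one O(1) update per token
    h = A @ h + B @ x_t                    # state update
    outputs.append(C @ h)                  # per-token output
```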
To harness the strengths of both approaches, SSM-Hybrid models combine MLP, Transformer, and SSM blocks in their architecture. As highlighted in a study by NVIDIA, these hybrid models outperform traditional transformers of the same size while also achieving faster inference, thanks to the SSM blocks. Experimental results show that Mamba2-Hybrid models not only surpass transformer baselines in task performance but also benefit from greater computational efficiency.
The Mamba2 models discussed in the "Transformers are SSMs" paper are available in five sizes: 130 million, 370 million, 780 million, 1.3 billion, and 2.7 billion parameters. The Mamba2-Hybrid models, along with their Mamba2 baseline released by NVIDIA, are provided at an 8 billion parameter size.
NVIDIA NeMo Framework provides tools to perform fine-tuning on Mamba2 and Mamba2-Hybrid models to fit your use case.
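As a rough illustration of what such a fine-tuning launch might look like with the NeMo 2.0 Python API, the sketch below assumes the `nemo.collections.llm` entry points, an `NVIDIAMambaHybridConfig8B`-style configuration class, and the `SquadDataModule` example dataset; these names are assumptions, not taken from this page, and should be checked against the NeMo documentation that ships with your container.

```python
# Rough sketch (not verbatim NeMo code): launching a Mamba2-Hybrid fine-tuning
# run with the NeMo 2.0 Python API. The config class, data module, and finetune
# entry point named below are assumptions and may differ between releases;
# verify them against the NeMo documentation in your container.
from nemo import lightning as nl
from nemo.collections import llm

# Hypothetical Mamba2-Hybrid 8B configuration; check the exact class name in
# your NeMo version.
model = llm.GPTModel(config=llm.NVIDIAMambaHybridConfig8B())

# Example instruction-tuning data module; substitute your own dataset.
data = llm.SquadDataModule(seq_length=2048, micro_batch_size=1)

trainer = nl.Trainer(
    accelerator="gpu",
    devices=1,
    max_steps=100,
    strategy=nl.MegatronStrategy(),
    # BF16 is the supported mixed-precision mode per the feature table below.
    plugins=nl.MegatronMixedPrecision(precision="bf16-mixed"),
)

# In practice you would also point the run at a pretrained Mamba2 or
# Mamba2-Hybrid checkpoint (e.g. via the resume/restore arguments)
# before fine-tuning.
llm.finetune(model=model, data=data, trainer=trainer)
```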
| Feature | Status |
|---|---|
| Data parallelism | ✓ |
| Tensor parallelism | ✓ |
| Pipeline parallelism | ✗ |
| Interleaved Pipeline Parallelism Schedule | N/A |
| Sequence parallelism | ✗ |
| Selective activation checkpointing | ✓ |
| Gradient checkpointing | ✓ |
| Partial gradient checkpointing | ✓ |
| FP32/TF32 | ✓ |
| AMP/FP16 | ✗ |
| BF16 | ✓ |
| TransformerEngine/FP8 | ✗ |
| Multi-GPU | ✗ |
| Slurm | ✓ |
| Base Command Manager | ✓ |
| Base Command Platform | ✓ |
| Distributed data preprocessing | ✓ |
| NVfuser | ✗ |
| P-Tuning and Prompt Tuning | ✗ |
| Adapter learning | ✗ |
| Distributed Optimizer | ✓ |
| Distributed Checkpoint | ✗ |