Important
NeMo 2.0 is an experimental feature and currently released in the dev container only: nvcr.io/nvidia/nemo:dev. Please refer to NeMo 2.0 overview for information on getting started.
Mixtral
Released in December 2023, Mistral AI’s second marquee model, Mixtral-8x7B, is one of the first performant and open-source (Apache 2.0) Sparse Mixture of Experts Model (SMoE). The key distinguishing feature of Mixtral’s SMoE implementation, compared to Mistral 7B, is the inclusion of a router network that guides tokens through a set of two groups of parameters (experts) of a possible eight. This allows the model to perform better and be significantly larger without a corresponding significant increase in cost and latency. More specific details are available in the companion paper “Mixtral of Experts”.
Released in April 2024, Mistral AI’s second SMoE model, Mixtral-8x22B sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering unparalleled cost efficiency for its size “announcement page”.
In the following documentation pages we use the terms “mixtal” and “mixtral_8x22b” to refer to the Mixtral-8x7B and Mixtral-8x22B models respectively.
Feature |
Status |
---|---|
Data parallelism |
✓ |
Tensor parallelism |
✓ |
Pipeline parallelism |
✓ |
Interleaved Pipeline Parallelism Sched |
N/A |
Sequence parallelism |
✓ |
Selective activation checkpointing |
✓ |
Gradient checkpointing |
✓ |
Partial gradient checkpointing |
✓ |
FP32/TF32 |
✓ |
AMP/FP16 |
✗ |
BF16 |
✓ |
TransformerEngine/FP8 |
✗ |
Multi-GPU |
✓ |
Multi-Node |
✓ |
Inference |
N/A |
Slurm |
✓ |
Base Command Manager |
✓ |
Base Command Platform |
✓ |
Distributed data preprcessing |
✓ |
NVfuser |
✗ |
P-Tuning and Prompt Tuning |
✓ |
IA3 and Adapter learning |
✓ |
Distributed Optimizer |
✓ |
Distributed Checkpoint |
✓ |
Fully Shared Data Parallel |
N/A |