Important

NeMo 2.0 is an experimental feature and is currently released only in the dev container: nvcr.io/nvidia/nemo:dev. Please refer to the NeMo 2.0 overview for information on getting started.

Mixtral

Released in December 2023, Mistral AI’s second marquee model, Mixtral-8x7B, is one of the first performant, open-source (Apache 2.0) Sparse Mixture of Experts (SMoE) models. The key distinguishing feature of Mixtral’s SMoE implementation, compared to Mistral 7B, is a router network that directs each token through two of eight groups of parameters (experts). This allows the model to be significantly larger and perform better without a corresponding significant increase in cost and latency. More specific details are available in the companion paper “Mixtral of Experts”.
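The routing idea can be sketched in a few lines. The following is an illustrative, simplified sketch (not NeMo or Mistral code) of top-2 expert routing as described above: a router scores all eight experts per token, the two highest-scoring experts process the token, and their outputs are combined with renormalized router weights. All names and shapes here are assumptions for illustration.

```python
# Illustrative sketch of Mixtral-style top-2 SMoE routing (not NeMo code).
# All shapes and names are simplified assumptions.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def smoe_forward(token, router_w, experts):
    """Route one token through the top 2 of len(experts) experts."""
    logits = router_w @ token                 # one router score per expert
    top2 = np.argsort(logits)[-2:]            # indices of the two best experts
    weights = softmax(logits[top2])           # renormalize over the chosen pair
    # Weighted sum of only the two selected experts' outputs: the other six
    # experts' parameters are not used for this token.
    return sum(w * experts[i](token) for w, i in zip(weights, top2))

rng = np.random.default_rng(0)
d = 4                                         # toy hidden size
token = rng.normal(size=d)
router_w = rng.normal(size=(8, d))            # 8 experts, as in Mixtral
# Each "expert" is just a random linear map here, standing in for an MLP.
mats = [rng.normal(size=(d, d)) for _ in range(8)]
experts = [lambda x, m=m: m @ x for m in mats]

out = smoe_forward(token, router_w, experts)
print(out.shape)  # (4,)
```

Only two experts' parameters are active per token, which is why total parameter count can grow (eight experts) while per-token compute stays close to a two-expert dense model.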

Released in April 2024, Mistral AI’s second SMoE model, Mixtral-8x22B, sets a new standard for performance and efficiency within the AI community. It is a sparse Mixture-of-Experts (SMoE) model that uses only 39B active parameters out of 141B, offering notable cost efficiency for its size, according to its announcement page.

In the following documentation pages we use the terms “mixtral” and “mixtral_8x22b” to refer to the Mixtral-8x7B and Mixtral-8x22B models, respectively.

NeMo 2.0 Pretraining Recipes

We provide recipes for pretraining Mixtral models in three sizes: 8x3B, 8x7B, and 8x22B, using NeMo 2.0 and NeMo-Run. Each recipe configures a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in the mixtral_8x3b, mixtral_8x7b, and mixtral_8x22b files.

Note

The pretraining recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your own custom dataset.

We provide an example below of how to invoke the default recipe and override the data argument:

from nemo.collections import llm

pretrain = llm.mixtral_8x3b.pretrain_recipe(
    name="mixtral_8x3b_pretraining",
    ckpt_dir="/path/to/checkpoints",
    num_nodes=2,
    num_gpus_per_node=8,
)

# Replace with a function that builds your custom data module
dataloader = a_function_that_configures_your_custom_dataset(
    gbs=gbs,  # global batch size
    mbs=mbs,  # micro batch size
    seq_length=pretrain.model.config.seq_length,
)
pretrain.data = dataloader

Note

The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
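As a rough standard-library analogy (not NeMo-Run code), run.Partial behaves much like functools.partial: it captures a target callable together with its arguments as a configuration object that is only executed later, and whose fields can be overridden beforehand. The class and function names below are hypothetical, purely to illustrate the deferred-execution pattern.

```python
# Hedged analogy using only the standard library: a run.Partial-like object
# holds a callable plus its arguments and runs them later, allowing fields to
# be overridden first. Names here are illustrative, not NeMo-Run APIs.
from dataclasses import dataclass, field

@dataclass
class PartialConfig:
    fn: callable
    kwargs: dict = field(default_factory=dict)

    def __call__(self):
        # Execution is deferred until the config object is called.
        return self.fn(**self.kwargs)

def pretrain_fn(name, num_nodes):
    return f"pretraining {name} on {num_nodes} node(s)"

cfg = PartialConfig(pretrain_fn, {"name": "mixtral_8x3b", "num_nodes": 2})
cfg.kwargs["num_nodes"] = 4          # override a field before execution
print(cfg())  # pretraining mixtral_8x3b on 4 node(s)
```

This is the same pattern the recipes use: the overrides in the earlier example (e.g. replacing pretrain.data) mutate the configuration, and nothing runs until the config is handed to an executor.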

Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which just runs the pretraining locally in a separate process. You can use it as follows:

import nemo_run as run

run.run(pretrain, executor=run.LocalExecutor())

Alternatively, you can run it directly in the same Python process as follows:

run.run(pretrain, direct=True)

A comprehensive list of pretraining recipes that we currently support or plan to support soon is provided below for reference:

Recipe              Status
------------------  ------
Mixtral 8x3B        Yes
Mixtral 8x3B FP8    N/A
Mixtral 8x3B 16k    Yes
Mixtral 8x3B 64k    Yes
Mixtral 8x7B        Yes
Mixtral 8x7B FP8    N/A
Mixtral 8x7B 16k    Yes
Mixtral 8x7B 64k    Yes
Mixtral 8x22B       Yes
Mixtral 8x22B FP8   N/A
Mixtral 8x22B 16k   N/A
Mixtral 8x22B 64k   N/A