DeepSeek V3#
Released in December 2024, DeepSeek V3 advances the Mixture-of-Experts architecture with cutting-edge innovations such as auxiliary-loss-free load balancing and node-limited routing. The model also features multi-token prediction and FP8 training. DeepSeek V3 has 37B active parameters out of 671B total parameters.
We provide pre-defined recipes for pretraining and finetuning DeepSeek V3 models using NeMo 2.0 and NeMo-Run. These recipes configure a run.Partial for one of the nemo.collections.llm API functions introduced in NeMo 2.0. The recipes are hosted in deepseek_v3.
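For example (a minimal sketch), the pretraining recipe is a run.Partial for llm.pretrain, so its nested fields can be inspected and overridden before execution; the override shown below is purely illustrative:

from nemo.collections import llm
import nemo_run as run

# The recipe is a run.Partial wrapping llm.pretrain; its fields (model, data,
# trainer, optim, ...) are configuration objects that can be overridden in place.
recipe = llm.deepseek_v3.pretrain_recipe(name="deepseek_v3", dir="/path/to/checkpoints")
assert isinstance(recipe, run.Partial)
recipe.trainer.max_steps = 100  # illustrative override of a nested field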
NeMo 2.0 Pretraining Recipes#
Note
The pretraining recipes use the MockDataModule for the data argument. You are expected to replace the MockDataModule with your custom dataset.
By default, DeepSeek V3 pretraining uses the Multi-Token Prediction (MTP) module. If you don’t want to use MTP, set use_mtp=False.
We provide an example below of how to invoke the default recipe and override the data argument:
from nemo.collections import llm

recipe = llm.deepseek_v3.pretrain_recipe(
    name="deepseek_v3_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=128,
    num_gpus_per_node=8,
    use_mtp=True,
)

# # To override the data argument
# dataloader = a_function_that_configures_your_custom_dataset(
#     gbs=gbs,
#     mbs=mbs,
#     seq_length=recipe.model.config.seq_length,
# )
# recipe.data = dataloader
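As a concrete illustration of the commented override above, here is a hedged sketch that swaps in llm.PreTrainingDataModule; it assumes your corpus has already been preprocessed into Megatron .bin/.idx files, and the data path and batch sizes are placeholders to adapt to your setup:

import nemo_run as run
from nemo.collections import llm

# Hedged sketch: replace MockDataModule with a pre-training data module.
# The data path prefix and batch sizes below are placeholders.
recipe.data = run.Config(
    llm.PreTrainingDataModule,
    paths=["/path/to/tokenized/corpus_text_document"],  # prefix of preprocessed .bin/.idx files
    seq_length=recipe.model.config.seq_length,
    global_batch_size=2048,  # placeholder; match your training setup
    micro_batch_size=1,
)
# Depending on your setup, you may also need to pass a tokenizer and a train/val/test split.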
Note that the full DeepSeek-V3 pretraining recipe requires 128 DGX H100 nodes. To facilitate debugging, we provide a proxy recipe which modifies the architecture to have only 2 layers. This proxy model will not converge, but it can be run on 1 node for testing purposes.
from nemo.collections import llm

recipe = llm.deepseek_v3.pretrain_recipe(
    name="deepseek_v3_2layer_pretraining",
    dir="/path/to/checkpoints",
    num_nodes=1,
    num_gpus_per_node=8,
    use_mtp=True,
)

# Arguments for the modified 2-layer DeepSeek V3 model.
recipe.model.config.num_layers = 2
recipe.model.config.moe_layer_freq = [0, 1]  # per-layer pattern: first layer dense, second layer MoE
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.expert_model_parallel_size = 8
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None
recipe.trainer.strategy.expert_tensor_parallel_size = 1
NeMo 2.0 Finetuning Recipes#
Note
The finetuning recipes use the SquadDataModule for the data argument. You can replace the SquadDataModule with your custom dataset.
Importing the HF model and converting it to NeMo 2.0 format is slightly more involved for DeepSeek V3 than for other models, because HF does not officially support DeepSeek's FP8 quantization scheme. Follow the steps below to create a BF16 HF checkpoint before converting to NeMo 2.0.
# clone DeepSeek V3 weights from HF (This can take hours)
git lfs install
git clone https://huggingface.co/deepseek-ai/DeepSeek-V3-Base
# clone DeepSeek-V3 code
git clone https://github.com/deepseek-ai/DeepSeek-V3.git
# make a modification for the latest version of transformers
cd DeepSeek-V3/inference
sed -i '88{s/new_safetensor_file/new_safetensor_file, metadata={"format": "pt"}/}' fp8_cast_bf16.py
# convert weights
python fp8_cast_bf16.py --input-fp8-hf-path ../../DeepSeek-V3-Base --output-bf16-hf-path ../../DeepSeek-V3-Base-BF16
# copy other files
cd ../..
cp DeepSeek-V3-Base/{tokenizer_config.json,tokenizer.json,modeling_deepseek.py,configuration_deepseek.py} DeepSeek-V3-Base-BF16/
# copy config.json and remove `quantization_config`:
jq 'del(.quantization_config)' DeepSeek-V3-Base/config.json > DeepSeek-V3-Base-BF16/config.json
Once the BF16 checkpoint is ready, import it and convert it to NeMo 2.0 format with llm.import_ckpt:

from nemo.collections import llm

llm.import_ckpt(
    model=llm.DeepSeekModel(llm.DeepSeekV3Config()),
    source="hf:///absolute/path/to/DeepSeek-V3-Base-BF16",
    output_path="<NEMO_HOME>/models/deepseek-ai/DeepSeek-V3-Base",
)
By default, the non-instruct version of the model is loaded. To load a different model, set finetune.resume.restore_config.path=nemo://<hf model id> or finetune.resume.restore_config.path=<local model path>.
We provide an example below of how to invoke the default recipe and override the data argument:
from nemo.collections import llm

recipe = llm.deepseek_v3.finetune_recipe(
    name="deepseek_v3_finetuning",
    dir="/path/to/checkpoints",  # log dir
    num_nodes=6,
    num_gpus_per_node=8,
    peft_scheme='lora',  # 'lora', 'dora', 'none'
    # resume_path="absolute/path/to/DeepSeek-V3-Base-BF16",  # only needed if your checkpoint is not "<NEMO_HOME>/models/deepseek-ai/DeepSeek-V3-Base"
)

# # To override the data argument
# dataloader = a_function_that_configures_your_custom_dataset(
#     gbs=gbs,
#     mbs=mbs,
#     seq_length=recipe.model.config.seq_length,
# )
# recipe.data = dataloader
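Analogously to the pretraining case, here is a hedged sketch of overriding the data argument for finetuning with llm.FineTuningDataModule; the choice of data module, dataset layout, and batch sizes are assumptions to adapt to your own data:

import nemo_run as run
from nemo.collections import llm

# Hedged sketch: replace SquadDataModule with your own finetuning data module.
# dataset_root is a placeholder and is expected to contain your prepared data splits.
recipe.data = run.Config(
    llm.FineTuningDataModule,
    dataset_root="/path/to/your/finetuning/dataset",
    seq_length=recipe.model.config.seq_length,
    global_batch_size=128,  # placeholder; match your finetuning setup
    micro_batch_size=1,
)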
By default, the finetuning recipe runs LoRA finetuning, with LoRA applied to all linear layers in MLA (and to none in the MoE layers). To finetune the entire model without LoRA, set peft_scheme='none' when building the recipe.
Note
The configuration in the recipes is done using the NeMo-Run run.Config and run.Partial configuration objects. Please review the NeMo-Run documentation to learn more about its configuration and execution system.
Once you have your final configuration ready, you can execute it on any of the NeMo-Run supported executors. The simplest is the local executor, which just runs the pretraining locally in a separate process. You can use it as follows:
import nemo_run as run
run.run(recipe, executor=run.LocalExecutor())
Alternatively, you can run it directly in the same Python process as follows:
run.run(recipe, direct=True)
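Because the full recipes span multiple nodes (128 for pretraining, 6 for finetuning), you will typically launch them through a cluster-backed executor rather than the local one. Below is a hedged sketch using NeMo-Run's SlurmExecutor; the account, partition, tunnel details, and container image are placeholders, and your cluster may require additional settings (packager, container mounts, environment variables):

import nemo_run as run

# Hedged sketch: all cluster-specific values below are placeholders.
executor = run.SlurmExecutor(
    account="your_account",
    partition="your_partition",
    nodes=128,                # match num_nodes in the recipe
    ntasks_per_node=8,        # match num_gpus_per_node in the recipe
    gpus_per_node=8,
    time="04:00:00",
    tunnel=run.SSHTunnel(
        user="your_user",
        host="your-cluster-login-node",
        job_dir="/path/to/remote/job/dir",
    ),
)
executor.container_image = "nvcr.io/nvidia/nemo:dev"  # example container; use your own

run.run(recipe, executor=executor)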