DeepSeek V2#
DeepSeek-V2 is a Mixture-of-Experts (MoE) language model that uses Multi-head Latent Attention (MLA) for efficient inference and the DeepSeekMoE architecture for economical training. The model achieves strong performance while activating only a small fraction of its total parameters for each token. More information is available in the paper “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model”.
DeepSeek V2 models are supported via the Bridge system with auto-detected configuration and weight mapping.
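The efficiency claim for MLA comes from what is cached during inference: instead of storing full per-head keys and values, MLA caches a small low-rank latent per token and reconstructs keys and values from it on the fly. The snippet below is a simplified, self-contained PyTorch sketch of that idea only; it omits DeepSeek-V2's decoupled rotary-embedding path, query compression, and all Megatron kernel details, and every class and parameter name in it is illustrative rather than part of the Bridge API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedMLA(nn.Module):
    """Toy MLA block: only a small per-token latent is cached, not full K/V."""

    def __init__(self, hidden_size=1024, num_heads=8, head_dim=128, kv_latent_dim=64):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim, bias=False)
        # Down-projection to a shared low-rank KV latent -- this is what gets cached.
        self.kv_down = nn.Linear(hidden_size, kv_latent_dim, bias=False)
        # Up-projections rebuild per-head keys and values from the cached latent.
        self.k_up = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.v_up = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.out_proj = nn.Linear(num_heads * head_dim, hidden_size, bias=False)

    def forward(self, hidden_states, kv_latent_cache=None):
        bsz, seqlen, _ = hidden_states.shape
        latent = self.kv_down(hidden_states)              # [b, s, kv_latent_dim]
        if kv_latent_cache is not None:                   # single-token decode step
            latent = torch.cat([kv_latent_cache, latent], dim=1)
        q = self.q_proj(hidden_states).view(bsz, seqlen, self.num_heads, self.head_dim)
        k = self.k_up(latent).view(bsz, -1, self.num_heads, self.head_dim)
        v = self.v_up(latent).view(bsz, -1, self.num_heads, self.head_dim)
        attn = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
            is_causal=kv_latent_cache is None,
        )
        out = attn.transpose(1, 2).reshape(bsz, seqlen, -1)
        # Only `latent` is carried between steps: kv_latent_dim floats per token
        # instead of 2 * num_heads * head_dim for a conventional KV cache.
        return self.out_proj(out), latent


mla = SimplifiedMLA()
_, cache = mla(torch.randn(1, 5, 1024))                          # prefill 5 tokens
_, cache = mla(torch.randn(1, 1, 1024), kv_latent_cache=cache)   # one decode step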
Available Models#
Megatron Bridge supports the following DeepSeek V2 model variants:
DeepSeek-V2: 236B parameters (21B activated per token)
DeepSeek-V2-Lite: 16B parameters (2.4B activated per token)
Both models support pretraining with expert parallelism for efficient MoE training.
Model Architecture Features#
Multi-head Latent Attention (MLA): Novel attention mechanism that reduces KV cache requirements
DeepSeekMoE: Efficient MoE architecture with routed and shared experts (see the sketch after this list)
Expert Parallelism: Distributes experts across GPUs for scalable training
RoPE Embeddings: Rotary position embeddings for position encoding
128K Context Length: Native support for long sequences (DeepSeek-V2)
Pre-normalization: RMSNorm before each transformer sub-layer
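A minimal sketch of the routed-plus-shared expert idea behind DeepSeekMoE, assuming plain PyTorch and toy sizes: every token always passes through the shared experts and is additionally dispatched to its top-k routed experts. The names here are illustrative only; the real layer adds fine-grained expert segmentation and load-balancing terms, and under expert parallelism the routed experts are distributed across GPUs rather than kept local as they are below.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedDeepSeekMoE(nn.Module):
    """Toy routed + shared expert layer (dense dispatch, single GPU)."""

    def __init__(self, hidden=512, ffn=1024, n_routed=8, n_shared=2, top_k=2):
        super().__init__()

        def expert():
            return nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))

        self.routed = nn.ModuleList(expert() for _ in range(n_routed))
        self.shared = nn.ModuleList(expert() for _ in range(n_shared))
        self.router = nn.Linear(hidden, n_routed, bias=False)
        self.top_k = top_k

    def forward(self, x):  # x: [num_tokens, hidden]
        # Shared experts see every token unconditionally.
        shared_out = sum(e(x) for e in self.shared)
        # Routed experts: each token goes to its top-k experts, weighted by the
        # router scores. With expert parallelism the routed experts live on
        # different GPUs and tokens are exchanged via all-to-all communication;
        # here everything stays local for clarity.
        scores = F.softmax(self.router(x), dim=-1)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        routed_out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for expert_id, expert in enumerate(self.routed):
                mask = topk_idx[:, slot] == expert_id
                if mask.any():
                    routed_out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return shared_out + routed_out


layer = SimplifiedDeepSeekMoE()
print(layer(torch.randn(16, 512)).shape)  # torch.Size([16, 512])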
Conversion with 🤗 Hugging Face#
Load HF → Megatron#
from megatron.bridge import AutoBridge
# Example: DeepSeek-V2-Lite
bridge = AutoBridge.from_hf_pretrained("deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
provider = bridge.to_megatron_provider()
# Optionally configure parallelism before instantiating the model
provider.tensor_model_parallel_size = 1
provider.pipeline_model_parallel_size = 1
provider.expert_model_parallel_size = 8
model = provider.provide_distributed_model(wrap_with_ddp=False)
Import Checkpoint from HF#
python examples/conversion/convert_checkpoints.py import \
--hf-model deepseek-ai/DeepSeek-V2-Lite \
--megatron-path /checkpoints/deepseek_v2_lite_megatron \
--trust-remote-code
Export Megatron → HF#
from megatron.bridge import AutoBridge
# Load the bridge from HF model ID
bridge = AutoBridge.from_hf_pretrained("deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
# Export a trained Megatron checkpoint to HF format
bridge.export_ckpt(
megatron_path="/results/deepseek_v2_lite/checkpoints/iter_0000500",
hf_path="/exports/deepseek_v2_lite_hf",
)
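As a quick sanity check after export, the output directory can typically be loaded back with Hugging Face Transformers. The snippet below is a sketch that reuses the example path above; if tokenizer files are not written alongside the weights, load the tokenizer from the original deepseek-ai/DeepSeek-V2-Lite model ID instead.
from transformers import AutoModelForCausalLM, AutoTokenizer

export_dir = "/exports/deepseek_v2_lite_hf"
tokenizer = AutoTokenizer.from_pretrained(export_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(export_dir, trust_remote_code=True, torch_dtype="auto")

inputs = tokenizer("What is artificial intelligence?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))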
Run Inference on Converted Checkpoint#
python examples/conversion/hf_to_megatron_generate_text.py \
--hf_model_path deepseek-ai/DeepSeek-V2-Lite \
--megatron_model_path /checkpoints/deepseek_v2_lite_megatron \
--prompt "What is artificial intelligence?" \
--max_new_tokens 100 \
--ep 8 \
--trust-remote-code
For more details, see examples/conversion/hf_to_megatron_generate_text.py
Recipes#
See: bridge.recipes.deepseek.deepseek_v2
Available Recipes#
Pretrain recipes:
deepseek_v2_lite_pretrain_config: Pre-training for DeepSeek-V2-Lite (16B parameters, 2.4B activated per token)
deepseek_v2_pretrain_config: Pre-training for DeepSeek-V2 (236B parameters, 21B activated per token)
Parallelism Configurations#
| Model | TP | PP | EP | Total GPUs | Use Case |
|---|---|---|---|---|---|
| DeepSeek-V2-Lite | 1 | 1 | 8 | 8 | Pre-training (single node) |
| DeepSeek-V2 | 1 | 4 | 32 | 128 | Pre-training (16 nodes) |
Key Features:
Expert Parallelism: EP=8 (V2-Lite) or EP=32 (V2) for efficient MoE training; see the GPU-count sketch after this list
Selective Recomputation: Enabled by default for memory optimization
Sequence Length: Default 4096; DeepSeek-V2 supports sequences up to 128K tokens
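The "Total GPUs" column can be sanity-checked with simple arithmetic. In Megatron-Core the world size is tensor-parallel x pipeline-parallel x data-parallel, and expert parallelism partitions the routed experts across ranks drawn from the data-parallel dimension, so EP does not multiply the GPU count but must divide evenly into the data-parallel size. The data-parallel sizes used below (8 for V2-Lite, 32 for V2) are implied by the table rather than being separate recipe knobs.
def total_gpus(tp: int, pp: int, dp: int) -> int:
    # World size for a Megatron job: tensor * pipeline * data parallelism.
    return tp * pp * dp

# DeepSeek-V2-Lite recipe: TP=1, PP=1, DP=8, with EP=8 laid over the DP ranks.
assert total_gpus(tp=1, pp=1, dp=8) == 8
# DeepSeek-V2 recipe: TP=1, PP=4, DP=32, with EP=32 laid over the DP ranks.
assert total_gpus(tp=1, pp=4, dp=32) == 128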
Pre-training Examples#
DeepSeek-V2-Lite (16B)#
from megatron.bridge.recipes.deepseek import deepseek_v2_lite_pretrain_config
config = deepseek_v2_lite_pretrain_config(
name="deepseek_v2_lite_pretrain",
data_paths=["/path/to/dataset.nvjsonl"],
dir="/results/deepseek_v2_lite",
train_iters=500_000,
global_batch_size=512,
seq_length=4096,
# Uses TP=1, PP=1, EP=8 (8 GPUs) automatically
)
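The config object returned by the recipe is then handed to Megatron Bridge's pretraining entry point and launched with torchrun across the required GPUs. The sketch below follows the pattern of the repository's pretraining examples; the megatron.bridge.training module paths and the pretrain/forward_step names are assumptions that may differ between releases, so verify them against the examples shipped with your installed version.
# Hypothetical launch script -- module paths below are assumptions; see the
# pretraining examples in the Megatron Bridge repository for the exact entry point.
from megatron.bridge.recipes.deepseek import deepseek_v2_lite_pretrain_config
from megatron.bridge.training.gpt_step import forward_step  # assumed path
from megatron.bridge.training.pretrain import pretrain      # assumed path

if __name__ == "__main__":
    config = deepseek_v2_lite_pretrain_config(
        name="deepseek_v2_lite_pretrain",
        data_paths=["/path/to/dataset.nvjsonl"],
        dir="/results/deepseek_v2_lite",
    )
    pretrain(config, forward_step)

# Launch (single node, 8 GPUs to satisfy EP=8):
#   torchrun --nproc-per-node=8 pretrain_deepseek_v2_lite.py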
DeepSeek-V2 (236B)#
from megatron.bridge.recipes.deepseek import deepseek_v2_pretrain_config
config = deepseek_v2_pretrain_config(
name="deepseek_v2_pretrain",
data_paths=["/path/to/dataset.nvjsonl"],
dir="/results/deepseek_v2",
train_iters=500_000,
global_batch_size=512,
seq_length=4096,
# Uses TP=1, PP=4, EP=32 (128 GPUs) automatically
)
Finetuning Recipes#
Finetuning recipes for DeepSeek V2 models are not currently available.
Hugging Face Model Cards & References#
Hugging Face Model Cards#
DeepSeek-V2: https://huggingface.co/deepseek-ai/DeepSeek-V2
DeepSeek-V2-Lite: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite
DeepSeek-V2-Chat: https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
DeepSeek-V2-Lite-Chat: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat
Technical Papers#
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model: arXiv:2405.04434
Additional Resources#
GitHub Repository: https://github.com/deepseek-ai/DeepSeek-V2