DeepSeek-V3
DeepSeek-V3 is a large-scale Mixture-of-Experts (MoE) model with 671B total parameters, of which 37B are activated per token. It features Multi-head Latent Attention (MLA), an auxiliary-loss-free load-balancing strategy, and Multi-Token Prediction (MTP). DeepSeek-V3.2 is an updated release with an updated architecture.
Moonlight by Moonshot AI also uses this architecture with 16B total / 3B activated parameters.
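To make the "total versus activated" distinction concrete, the following minimal sketch computes the share of parameters that participate in each token's forward pass, using only the parameter counts quoted above. The helper name `active_fraction` is hypothetical and is not part of any library.

```python
def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of parameters activated per token (both arguments in billions)."""
    if total_b <= 0 or not 0 <= active_b <= total_b:
        raise ValueError("expected 0 <= active_b <= total_b and total_b > 0")
    return active_b / total_b


# Parameter counts (total, activated) in billions, as quoted in this page.
MODELS = {
    "DeepSeek-V3": (671, 37),
    "Moonlight-16B-A3B": (16, 3),
}

for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")
```

For these figures the share is roughly 5.5% for DeepSeek-V3 and 18.75% for Moonlight. Per-token compute scales with the activated count, while memory requirements still scale with the total.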
Available Models
- DeepSeek-V3: 671B total, 37B activated
- DeepSeek-V3.2 (DeepseekV32ForCausalLM): updated architecture
- Moonlight-16B-A3B (Moonshot AI): 16B total, 3B activated
Architectures
- DeepseekV3ForCausalLM
- DeepseekV32ForCausalLM
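Hugging Face checkpoints conventionally declare their model class in the `architectures` field of `config.json`. As a minimal sketch, assuming a checkpoint follows that convention, the check below tests whether a parsed config names one of the architectures listed above. The helper `is_supported` and the example config are hypothetical and are not part of NeMo AutoModel.

```python
import json

# Architecture names listed in this section.
SUPPORTED_ARCHITECTURES = {"DeepseekV3ForCausalLM", "DeepseekV32ForCausalLM"}


def is_supported(config: dict) -> bool:
    """Return True if any architecture declared in the config is in the supported set."""
    declared = config.get("architectures") or []
    return any(name in SUPPORTED_ARCHITECTURES for name in declared)


# Hypothetical config.json contents, trimmed to the single field inspected here.
example_config = json.loads('{"architectures": ["DeepseekV3ForCausalLM"]}')
print(is_supported(example_config))
```

In practice the dictionary would come from `json.load` on a checkpoint's own `config.json`.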
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
This recipe was validated on 32 nodes × 8 GPUs (256 H100s). See the Launcher Guide for multi-node setup.
3. Run the recipe from inside the repo:
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide and LLM Fine-Tuning Guide.
Fine-Tuning
See the LLM Fine-Tuning Guide and the Large MoE Fine-Tuning Guide.