nemo_automodel.components.models.deepseek_v4.config
Module Contents
Classes
API
Bases: PretrainedConfig
Configuration class for DeepSeek V4.
DeepSeek V4 differs from V3/V3.2 in several key ways:
- Attention: GQA (num_key_value_heads=1) with Q-LoRA and grouped O-LoRA instead of MLA.
- No dense MLP layers: all transformer blocks use MoE FFN.
- Per-layer sliding/compressed attention via compress_ratios.
- The first num_hash_layers layers use hash-clustering (HC) attention for dynamic token grouping.
- Learnable attention sink token for sliding-window layers.
- New MoE gate scoring: sqrtsoftplus with noaux_tc routing.
- Next-n prediction layers for multi-token prediction (MTP).
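The gate scoring named above can be illustrated with a small pure-Python sketch. It assumes that "sqrtsoftplus" means sqrt(softplus(x)) and that noaux_tc routing adds a per-expert bias only when selecting the top-k experts, while the routing weights come from the unbiased scores. Both are inferred from the names, not taken from this module, and the helpers `sqrt_softplus` and `route` are hypothetical.

```python
import math


def sqrt_softplus(x: float) -> float:
    """Assumed scoring function: sqrt(softplus(x)), with a stable softplus."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)


def route(logits, bias, top_k):
    """Select top_k experts by biased score; weight them by unbiased score.

    Hypothetical helper: the bias affects only which experts are chosen
    (assumed noaux_tc behavior), not the weights they receive.
    """
    scores = [sqrt_softplus(v) for v in logits]
    ranked = sorted(
        range(len(scores)),
        key=lambda i: scores[i] + bias[i],
        reverse=True,
    )
    chosen = ranked[:top_k]
    # Normalize the unbiased scores of the chosen experts to sum to 1.
    total = sum(scores[i] for i in chosen) or 1.0
    return {i: scores[i] / total for i in chosen}


# Expert 2 has the lowest logit but a large selection bias, so it is chosen.
weights = route([0.0, 2.0, -1.0, 1.0], [0.0, 0.0, 5.0, 0.0], top_k=2)
```

In this sketch the bias changes which experts are picked without inflating their weights, so expert 2 is selected but still receives a smaller weight than expert 1.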
compress_ratios
keys_to_ignore_at_inference
model_type
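A minimal sketch of how the per-layer fields might relate to each other. The class `LayerLayoutSketch` is a hypothetical stand-in, not the configuration class documented here. The field `num_hidden_layers`, the example values, and the one-entry-per-layer reading of compress_ratios are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LayerLayoutSketch:
    """Hypothetical stand-in for the per-layer fields of the config."""

    num_hidden_layers: int = 4   # assumed field name, not listed above
    num_hash_layers: int = 1     # first N layers use hash-clustering attention
    num_key_value_heads: int = 1
    # Assumed: one entry per layer; the values here are placeholders.
    compress_ratios: list = field(default_factory=lambda: [1, 4, 1, 4])

    def __post_init__(self):
        if len(self.compress_ratios) != self.num_hidden_layers:
            raise ValueError("compress_ratios needs one entry per layer")
        if not 0 <= self.num_hash_layers <= self.num_hidden_layers:
            raise ValueError("num_hash_layers must be within the layer count")

    def describe(self):
        """Return one record per layer with its ratio and hash-layer flag."""
        return [
            {
                "layer": i,
                "compress_ratio": ratio,
                "hash_layer": i < self.num_hash_layers,
            }
            for i, ratio in enumerate(self.compress_ratios)
        ]


layout = LayerLayoutSketch()
rows = layout.describe()
```

Validating the list length at construction time surfaces a mismatched compress_ratios early, before any layer is built.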