nemo_automodel.components.models.deepseek_v4.config


Module Contents

Classes

Name                Description
DeepseekV4Config    Configuration class for DeepSeek V4.

API

class nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config(
vocab_size: int = 129280,
hidden_size: int = 4096,
moe_intermediate_size: int = 2048,
num_hidden_layers: int = 43,
num_attention_heads: int = 64,
num_key_value_heads: int = 1,
head_dim: int = 512,
qk_rope_head_dim: int = 64,
q_lora_rank: int = 1024,
o_lora_rank: int = 1024,
o_groups: int = 8,
n_routed_experts: int = 256,
n_shared_experts: int = 1,
num_experts_per_tok: int = 6,
routed_scaling_factor: float = 1.5,
norm_topk_prob: bool = True,
scoring_func: str = 'sqrtsoftplus',
topk_method: str = 'noaux_tc',
hidden_act: str = 'silu',
swiglu_limit: float = 10.0,
max_position_embeddings: int = 1048576,
rope_theta: float = 10000.0,
rope_scaling: dict | None = None,
compress_rope_theta: float = 160000.0,
compress_ratios: list | None = None,
sliding_window: int = 128,
num_hash_layers: int = 3,
hc_eps: float = 1e-06,
hc_mult: int = 4,
hc_sinkhorn_iters: int = 20,
index_head_dim: int = 128,
index_n_heads: int = 64,
index_topk: int = 512,
num_nextn_predict_layers: int = 1,
rms_norm_eps: float = 1e-06,
attention_bias: bool = False,
attention_dropout: float = 0.0,
use_cache: bool = True,
pad_token_id: int | None = None,
bos_token_id: int = 0,
eos_token_id: int = 1,
pretraining_tp: int = 1,
tie_word_embeddings: bool = False,
initializer_range: float = 0.02,
torch_dtype: str = 'bfloat16',
**kwargs
)

Bases: PretrainedConfig

Configuration class for DeepSeek V4.

DeepSeek V4 differs from V3/V3.2 in several key ways:

  • Attention: GQA (num_key_value_heads=1) with Q-LoRA and grouped O-LoRA instead of MLA.
  • No dense MLP layers: all transformer blocks use MoE FFN.
  • Per-layer sliding/compressed attention via compress_ratios.
  • First num_hash_layers use hash-clustering (HC) attention for dynamic token grouping.
  • Learnable attention sink token for sliding-window layers.
  • New MoE gate scoring: sqrtsoftplus with noaux_tc routing.
  • Next-n prediction (MTP) layers for multi-token prediction.
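
As a rough illustration of what the defaults in the signature imply, the sketch below copies a handful of them into a stand-alone dataclass and derives a few quantities. The class name `V4DefaultsSketch` and its properties are hypothetical and not part of nemo_automodel. The active-expert count assumes that shared experts run for every token in addition to the routed top-k, which this page does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class V4DefaultsSketch:
    """Hypothetical stand-in holding a few defaults copied from the signature."""

    hidden_size: int = 4096
    num_attention_heads: int = 64
    num_key_value_heads: int = 1
    head_dim: int = 512
    n_routed_experts: int = 256
    n_shared_experts: int = 1
    num_experts_per_tok: int = 6

    @property
    def query_heads_per_kv_head(self) -> int:
        # Usual GQA grouping: number of query heads sharing one KV head.
        return self.num_attention_heads // self.num_key_value_heads

    @property
    def total_query_width(self) -> int:
        # Width of all query heads concatenated together.
        return self.num_attention_heads * self.head_dim

    @property
    def active_experts_per_token(self) -> int:
        # Assumption: shared experts are always active alongside the routed top-k.
        return self.num_experts_per_tok + self.n_shared_experts


cfg = V4DefaultsSketch()
print(cfg.query_heads_per_kv_head)   # 64
print(cfg.total_query_width)         # 32768
print(cfg.active_experts_per_token)  # 7
```

With num_key_value_heads=1, all 64 query heads share a single key/value head, and the concatenated query width (32768) is much larger than hidden_size (4096).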
compress_ratios = compress_ratios or []
keys_to_ignore_at_inference = ['past_key_values']
model_type = 'deepseek_v4'
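
The `compress_ratios or []` assignment above is the usual Python idiom for avoiding a shared mutable default argument. A minimal sketch of that idiom follows, using a hypothetical class `RatiosSketch` that is not part of nemo_automodel. The ratio values are placeholders and say nothing about what real checkpoints use.

```python
from typing import Optional


class RatiosSketch:
    """Hypothetical minimal class mirroring the `compress_ratios or []` pattern."""

    def __init__(self, compress_ratios: Optional[list] = None):
        # None (or any other falsy value) is replaced by a fresh empty list,
        # so instances never share one default list object.
        self.compress_ratios = compress_ratios or []


a = RatiosSketch()
b = RatiosSketch()
a.compress_ratios.append(4)  # placeholder value, for illustration only

print(a.compress_ratios)  # [4]
print(b.compress_ratios)  # []

c = RatiosSketch([1, 4])  # an explicit list is stored as given
print(c.compress_ratios)  # [1, 4]
```

One consequence of this idiom is that passing an explicit empty list behaves the same as passing None, since both are falsy.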