# nemo_automodel.components.models.deepseek_v4.config

## Module Contents

### Classes

| Name                                                                                        | Description                          |
| ------------------------------------------------------------------------------------------- | ------------------------------------ |
| [`DeepseekV4Config`](#nemo_automodel-components-models-deepseek_v4-config-DeepseekV4Config) | Configuration class for DeepSeek V4. |

### API

```python
class nemo_automodel.components.models.deepseek_v4.config.DeepseekV4Config(
    vocab_size: int = 129280,
    hidden_size: int = 4096,
    moe_intermediate_size: int = 2048,
    num_hidden_layers: int = 43,
    num_attention_heads: int = 64,
    num_key_value_heads: int = 1,
    head_dim: int = 512,
    qk_rope_head_dim: int = 64,
    q_lora_rank: int = 1024,
    o_lora_rank: int = 1024,
    o_groups: int = 8,
    n_routed_experts: int = 256,
    n_shared_experts: int = 1,
    num_experts_per_tok: int = 6,
    routed_scaling_factor: float = 1.5,
    norm_topk_prob: bool = True,
    scoring_func: str = 'sqrtsoftplus',
    topk_method: str = 'noaux_tc',
    hidden_act: str = 'silu',
    swiglu_limit: float = 10.0,
    max_position_embeddings: int = 1048576,
    rope_theta: float = 10000.0,
    rope_scaling: dict | None = None,
    compress_rope_theta: float = 160000.0,
    compress_ratios: list | None = None,
    sliding_window: int = 128,
    num_hash_layers: int = 3,
    hc_eps: float = 1e-06,
    hc_mult: int = 4,
    hc_sinkhorn_iters: int = 20,
    index_head_dim: int = 128,
    index_n_heads: int = 64,
    index_topk: int = 512,
    num_nextn_predict_layers: int = 1,
    rms_norm_eps: float = 1e-06,
    attention_bias: bool = False,
    attention_dropout: float = 0.0,
    use_cache: bool = True,
    pad_token_id: int | None = None,
    bos_token_id: int = 0,
    eos_token_id: int = 1,
    pretraining_tp: int = 1,
    tie_word_embeddings: bool = False,
    initializer_range: float = 0.02,
    torch_dtype: str = 'bfloat16',
    **kwargs
)
```

**Bases:** `PretrainedConfig`

Configuration class for DeepSeek V4.

DeepSeek V4 differs from V3/V3.2 in several key ways:

* Attention: GQA (`num_key_value_heads=1`) with Q-LoRA and grouped O-LoRA instead of MLA.
* No dense MLP layers: every transformer block uses an MoE FFN.
* Per-layer sliding-window or compressed attention, configured via `compress_ratios`.
* The first `num_hash_layers` layers use hash-clustering (HC) attention for dynamic token grouping.
* A learnable attention sink token for sliding-window layers.
* New MoE gate scoring: `sqrtsoftplus` with `noaux_tc` routing.
* Next-n prediction layers for multi-token prediction (MTP), set by `num_nextn_predict_layers`.
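
To make the gate-related fields more concrete, here is a minimal sketch of how `scoring_func='sqrtsoftplus'`, `topk_method='noaux_tc'`, `num_experts_per_tok`, `norm_topk_prob`, and `routed_scaling_factor` might interact for a single token. It rests on two assumptions not confirmed by this page: that `sqrtsoftplus` means `sqrt(softplus(x))`, and that `noaux_tc` follows the auxiliary-loss-free scheme from DeepSeek V3, where a per-expert bias influences only which experts are selected and not their weights. The helper names `sqrt_softplus` and `route_token` are hypothetical and are not part of `nemo_automodel`.

```python
import math
import random


def sqrt_softplus(x: float) -> float:
    """Assumed meaning of 'sqrtsoftplus': sqrt(softplus(x)), computed stably."""
    softplus = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return math.sqrt(softplus)


def route_token(logits, bias, top_k=6, norm_topk_prob=True,
                routed_scaling_factor=1.5):
    """Hypothetical single-token router (illustration only).

    Defaults mirror num_experts_per_tok, norm_topk_prob, and
    routed_scaling_factor from the DeepseekV4Config signature.
    """
    scores = [sqrt_softplus(v) for v in logits]
    # Assumption: the bias shifts expert selection but not the mixing weights.
    ranked = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    weights = [scores[i] for i in chosen]
    if norm_topk_prob:
        total = sum(weights) + 1e-20  # guard against division by zero
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Usage: random gate logits for n_routed_experts=256 and a zero bias.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]
bias = [0.0] * 256
experts, weights = route_token(logits, bias)
```

In this sketch, with `norm_topk_prob=True` the selected weights sum to `routed_scaling_factor`, so the routed experts contribute a fixed total mass regardless of the raw scores. The actual gate in the model code may differ, for example in how the bias is learned or whether experts are grouped before selection.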