`nemo_automodel.components.models.ling_v2.config`

Configuration for BailingMoeV2 (Ling 2.0 family: Ling-mini, Ling-flash, Ling-1T).

Mirrors the `BailingMoeV2Config` defined in `configuration_bailing_moe_v2.py`, which ships with the official Hugging Face checkpoints. The class is registered with `AutoConfig` so that `AutoConfig.from_pretrained(...)` resolves without `trust_remote_code`.

Module Contents

Classes

BailingMoeV2Config

Configuration class for the BailingMoeV2 model (Ling 2.0).

API

class nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config(
    vocab_size: int = 157184,
    hidden_size: int = 2048,
    intermediate_size: int = 5120,
    num_hidden_layers: int = 20,
    num_attention_heads: int = 16,
    num_key_value_heads: int = 4,
    hidden_act: str = 'silu',
    use_qkv_bias: bool = False,
    use_bias: bool = False,
    rms_norm_eps: float = 1e-06,
    tie_word_embeddings: bool = False,
    embedding_dropout: float = 0.0,
    attention_dropout: float = 0.0,
    output_dropout: float = 0.0,
    initializer_range: float = 0.02,
    max_position_embeddings: int = 32768,
    rope_theta: float = 600000.0,
    use_cache: bool = True,
    max_window_layers: int = 20,
    rope_scaling: dict | None = None,
    pad_token_id: int = 156892,
    eos_token_id: int = 156892,
    num_experts: int = 256,
    num_shared_experts: int = 1,
    num_experts_per_tok: int = 8,
    n_group: int = 8,
    topk_group: int = 4,
    moe_intermediate_size: int = 512,
    first_k_dense_replace: int = 1,
    head_dim: int = 128,
    output_router_logits: bool = False,
    use_qk_norm: bool = True,
    partial_rotary_factor: float = 1.0,
    num_nextn_predict_layers: int = 0,
    mtp_loss_scaling_factor: float = 0,
    moe_router_enable_expert_bias: bool = True,
    routed_scaling_factor: float = 1.0,
    norm_topk_prob: bool = True,
    score_function: str = 'sigmoid',
    rotary_dim: int | None = None,
    **kwargs,
)

Bases: transformers.configuration_utils.PretrainedConfig

Configuration class for the BailingMoeV2 model (Ling 2.0).

The defaults reflect the Ling-mini-2.0 (16B-A1.4B) variant. Larger variants (Ling-flash-2.0 100B-A6B and Ling-1T 1T-A50B) override sizing knobs but share the same architecture: GQA attention with per-head QK-RMSNorm, partial RoPE, sigmoid-routed grouped MoE with shared experts, and first_k_dense_replace dense MLP layers at the start.
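To make the default sizing concrete, the sketch below copies a subset of the defaults from the signature into a plain dataclass and derives a few quantities from them. `LingMiniDefaults` and `derived_shapes` are hypothetical names used only for illustration and are not part of `nemo_automodel`. Treating the rotary width as `head_dim * partial_rotary_factor` is an assumption about how the model consumes these fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LingMiniDefaults:
    """Subset of the BailingMoeV2Config defaults (hypothetical stand-in class)."""

    hidden_size: int = 2048
    num_hidden_layers: int = 20
    num_attention_heads: int = 16
    num_key_value_heads: int = 4
    head_dim: int = 128
    partial_rotary_factor: float = 1.0
    num_experts: int = 256
    num_shared_experts: int = 1
    num_experts_per_tok: int = 8
    n_group: int = 8
    first_k_dense_replace: int = 1


def derived_shapes(cfg: LingMiniDefaults) -> dict:
    """Return quantities implied by the defaults (illustrative helper, not library API)."""
    return {
        # Query projection width: one head_dim slice per attention head.
        "q_width": cfg.num_attention_heads * cfg.head_dim,
        # Key (and value) projection width under grouped-query attention.
        "kv_width": cfg.num_key_value_heads * cfg.head_dim,
        # Number of query heads that share one key/value head.
        "queries_per_kv_head": cfg.num_attention_heads // cfg.num_key_value_heads,
        # Assumption: rotary dimensions = head_dim * partial_rotary_factor.
        "rotary_dims": int(cfg.head_dim * cfg.partial_rotary_factor),
        # Layers after the leading dense MLP layers use MoE blocks.
        "moe_layers": cfg.num_hidden_layers - cfg.first_k_dense_replace,
        # Routed experts are split evenly across n_group groups.
        "experts_per_group": cfg.num_experts // cfg.n_group,
        # Routed plus shared experts evaluated per token in one MoE layer.
        "experts_run_per_token": cfg.num_experts_per_tok + cfg.num_shared_experts,
    }


shapes = derived_shapes(LingMiniDefaults())
for name, value in shapes.items():
    print(f"{name}: {value}")
```

With these defaults, 19 of the 20 layers are MoE layers, and each token is routed to 8 of 256 experts in addition to the single shared expert.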

Initialization

model_type: 'bailing_moe'

keys_to_ignore_at_inference: ['past_key_values']
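The routing fields (`score_function`, `n_group`, `topk_group`, `num_experts_per_tok`, `norm_topk_prob`, `routed_scaling_factor`) suggest a group-limited top-k router. The following pure-Python sketch shows one plausible reading for a single token. It assumes DeepSeek-V3-style group scoring (the sum of the two best expert scores in each group), and it leaves out the expert bias that `moe_router_enable_expert_bias` enables. The function `route_token` is hypothetical; the actual router in `nemo_automodel` may differ.

```python
import math


def route_token(logits, n_group, topk_group, top_k,
                norm_topk_prob=True, routed_scaling_factor=1.0):
    """Select top_k experts for one token with group-limited sigmoid routing.

    Illustrative only: group scoring by the sum of the two best experts
    per group is an assumption, and the expert bias term is omitted.
    """
    num_experts = len(logits)
    if num_experts % n_group != 0:
        raise ValueError("num_experts must be divisible by n_group")
    group_size = num_experts // n_group

    # score_function='sigmoid': each expert is scored independently.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Score every group by the sum of its two highest expert scores.
    group_scores = []
    for g in range(n_group):
        members = sorted(scores[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))

    # Keep only the topk_group best groups; other experts are ineligible.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in sorted(kept)
                  for e in range(g * group_size, (g + 1) * group_size)]

    # Pick the top_k experts among the surviving candidates.
    chosen = sorted(candidates, key=lambda e: scores[e], reverse=True)[:top_k]
    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Toy example: 8 experts in 4 groups of 2; keep 2 groups, pick 2 experts.
toy_logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.0, 3.0, -3.0]
chosen, weights = route_token(toy_logits, n_group=4, topk_group=2, top_k=2)
print(chosen, weights)
```

In the toy input, expert 6 has the highest individual score, but its group ranks third and is dropped, so the token goes to experts 0 and 1. This is the main behavioral difference between group-limited routing and plain top-k routing.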
