nemo_automodel.components.models.ling_v2.config

Configuration for BailingMoeV2 (Ling 2.0 family: Ling-mini, Ling-flash, Ling-1T).

Mirrors the BailingMoeV2Config defined in configuration_bailing_moe_v2.py, which ships with the official Hugging Face checkpoints. The class is registered with AutoConfig so that AutoConfig.from_pretrained(...) resolves it without trust_remote_code.

Module Contents

Classes

BailingMoeV2Config

Configuration class for the BailingMoeV2 model (Ling 2.0).

API

class nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config(
vocab_size: int = 157184,
hidden_size: int = 2048,
intermediate_size: int = 5120,
num_hidden_layers: int = 20,
num_attention_heads: int = 16,
num_key_value_heads: int = 4,
hidden_act: str = 'silu',
use_qkv_bias: bool = False,
use_bias: bool = False,
rms_norm_eps: float = 1e-06,
tie_word_embeddings: bool = False,
embedding_dropout: float = 0.0,
attention_dropout: float = 0.0,
output_dropout: float = 0.0,
initializer_range: float = 0.02,
max_position_embeddings: int = 32768,
rope_theta: float = 600000.0,
use_cache: bool = True,
max_window_layers: int = 20,
rope_scaling: dict | None = None,
pad_token_id: int = 156892,
eos_token_id: int = 156892,
num_experts: int = 256,
num_shared_experts: int = 1,
num_experts_per_tok: int = 8,
n_group: int = 8,
topk_group: int = 4,
moe_intermediate_size: int = 512,
first_k_dense_replace: int = 1,
head_dim: int = 128,
output_router_logits: bool = False,
use_qk_norm: bool = True,
partial_rotary_factor: float = 1.0,
num_nextn_predict_layers: int = 0,
mtp_loss_scaling_factor: float = 0,
moe_router_enable_expert_bias: bool = True,
routed_scaling_factor: float = 1.0,
norm_topk_prob: bool = True,
score_function: str = 'sigmoid',
rotary_dim: int | None = None,
**kwargs,
)
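As a rough illustration of how the defaults in the signature fit together, the sketch below derives a few sizes from them. The dictionary `defaults` and the helper `derived_sizes` are hypothetical names used only here. The sketch assumes that the query and key/value projection widths are head count times head_dim, that experts are split into n_group equal groups, and that first_k_dense_replace counts the leading dense layers. It does not import or inspect the actual class.

```python
# Default values copied from the BailingMoeV2Config signature.
defaults = {
    "hidden_size": 2048,
    "num_hidden_layers": 20,
    "num_attention_heads": 16,
    "num_key_value_heads": 4,
    "head_dim": 128,
    "num_experts": 256,
    "num_experts_per_tok": 8,
    "n_group": 8,
    "topk_group": 4,
    "first_k_dense_replace": 1,
}


def derived_sizes(cfg):
    """Return sizes implied by cfg under the assumptions stated in the lead-in."""
    if cfg["num_attention_heads"] % cfg["num_key_value_heads"] != 0:
        raise ValueError("num_attention_heads must be a multiple of num_key_value_heads")
    if cfg["num_experts"] % cfg["n_group"] != 0:
        raise ValueError("num_experts must be a multiple of n_group")

    experts_per_group = cfg["num_experts"] // cfg["n_group"]
    return {
        # Assumed projection widths: number of heads times head_dim.
        "q_proj_width": cfg["num_attention_heads"] * cfg["head_dim"],
        "kv_proj_width": cfg["num_key_value_heads"] * cfg["head_dim"],
        # GQA: how many query heads share one key/value head.
        "queries_per_kv_head": cfg["num_attention_heads"] // cfg["num_key_value_heads"],
        # Grouped routing: experts per group, and experts left after group pruning.
        "experts_per_group": experts_per_group,
        "candidate_experts": cfg["topk_group"] * experts_per_group,
        # Assumed layer layout: leading dense layers, then MoE layers.
        "dense_layers": cfg["first_k_dense_replace"],
        "moe_layers": cfg["num_hidden_layers"] - cfg["first_k_dense_replace"],
    }


sizes = derived_sizes(defaults)
for name, value in sizes.items():
    print(f"{name}: {value}")
```

Under these assumptions the defaults give a query width of 2048 (equal to hidden_size), a key/value width of 512, 4 query heads per key/value head, 32 experts per group, and 1 dense layer followed by 19 MoE layers.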

Bases: transformers.configuration_utils.PretrainedConfig

Configuration class for the BailingMoeV2 model (Ling 2.0).

The defaults reflect the Ling-mini-2.0 (16B-A1.4B) variant. The larger variants, Ling-flash-2.0 (100B-A6B) and Ling-1T (1T-A50B), override the sizing parameters but share the same architecture: GQA attention with per-head QK-RMSNorm, partial RoPE, sigmoid-routed grouped MoE with shared experts, and dense MLPs in place of MoE for the first first_k_dense_replace layers.
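To make "sigmoid-routed grouped MoE" more concrete, here is a minimal pure-Python sketch of group-limited routing for a single token. It is not the model's router. The function name `route_token` is hypothetical, and two details are assumptions borrowed from the common group-limited routing convention rather than taken from this page: each group is scored by the sum of its two highest biased scores, and the expert bias affects only which experts are selected, not their output weights.

```python
import math


def route_token(logits, expert_bias, n_group, topk_group, top_k,
                routed_scaling_factor=1.0, norm_topk_prob=True):
    """Pick top_k experts for one token with group-limited sigmoid routing (sketch)."""
    num_experts = len(logits)
    if num_experts % n_group != 0:
        raise ValueError("number of experts must be a multiple of n_group")
    group_size = num_experts // n_group

    # score_function='sigmoid': an independent score per expert.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Assumption: the bias is added only for selection purposes.
    biased = [s + b for s, b in zip(scores, expert_bias)]

    # Assumption: a group's score is the sum of its two best biased scores.
    group_scores = []
    for g in range(n_group):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))

    # Keep the topk_group best groups; experts elsewhere are no longer candidates.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in kept for e in range(g * group_size, (g + 1) * group_size)]

    # Choose top_k experts among the candidates by biased score.
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Output weights use the unbiased sigmoid scores.
    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Toy example: 8 experts in 4 groups of 2; keep 2 groups, pick 2 experts.
logits = [2.0, 1.0, -1.0, 0.0, 3.0, -2.0, 0.5, 0.5]
chosen, weights = route_token(logits, [0.0] * 8, n_group=4, topk_group=2, top_k=2)
print(chosen, weights)  # chosen == [0, 1]
```

In the toy example, expert 4 has the highest individual score but is not selected, because its group ranks below groups 0 and 3 once group scores are compared. With the defaults in the signature, the same procedure would keep 4 of 8 groups and then pick 8 experts from the remaining candidates.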

Initialization

model_type

'bailing_moe'

keys_to_ignore_at_inference

['past_key_values']