nemo_automodel.components.models.ling_v2.config#
Configuration for BailingMoeV2 (Ling 2.0 family: Ling-mini, Ling-flash, Ling-1T).
Mirrors the BailingMoeV2Config shipped in the official HuggingFace checkpoints’
configuration_bailing_moe_v2.py. Registered against AutoConfig so that
AutoConfig.from_pretrained(...) resolves without trust_remote_code.
Module Contents#
Classes#
| BailingMoeV2Config | Configuration class for the BailingMoeV2 model (Ling 2.0). |
API#
- class nemo_automodel.components.models.ling_v2.config.BailingMoeV2Config(
- vocab_size: int = 157184,
- hidden_size: int = 2048,
- intermediate_size: int = 5120,
- num_hidden_layers: int = 20,
- num_attention_heads: int = 16,
- num_key_value_heads: int = 4,
- hidden_act: str = 'silu',
- use_qkv_bias: bool = False,
- use_bias: bool = False,
- rms_norm_eps: float = 1e-06,
- tie_word_embeddings: bool = False,
- embedding_dropout: float = 0.0,
- attention_dropout: float = 0.0,
- output_dropout: float = 0.0,
- initializer_range: float = 0.02,
- max_position_embeddings: int = 32768,
- rope_theta: float = 600000.0,
- use_cache: bool = True,
- max_window_layers: int = 20,
- rope_scaling: dict | None = None,
- pad_token_id: int = 156892,
- eos_token_id: int = 156892,
- num_experts: int = 256,
- num_shared_experts: int = 1,
- num_experts_per_tok: int = 8,
- n_group: int = 8,
- topk_group: int = 4,
- moe_intermediate_size: int = 512,
- first_k_dense_replace: int = 1,
- head_dim: int = 128,
- output_router_logits: bool = False,
- use_qk_norm: bool = True,
- partial_rotary_factor: float = 1.0,
- num_nextn_predict_layers: int = 0,
- mtp_loss_scaling_factor: float = 0,
- moe_router_enable_expert_bias: bool = True,
- routed_scaling_factor: float = 1.0,
- norm_topk_prob: bool = True,
- score_function: str = 'sigmoid',
- rotary_dim: int | None = None,
- **kwargs,
- )
Bases: transformers.configuration_utils.PretrainedConfig
Configuration class for the BailingMoeV2 model (Ling 2.0).
The defaults reflect the Ling-mini-2.0 (16B-A1.4B) variant. Larger variants (Ling-flash-2.0 100B-A6B and Ling-1T 1T-A50B) override sizing knobs but share the same architecture: GQA attention with per-head QK-RMSNorm, partial RoPE, sigmoid-routed grouped MoE with shared experts, and first_k_dense_replace dense MLP layers at the start.
Initialization
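To make the default sizing concrete, the sketch below derives a few shapes and counts from the signature defaults. It is an illustration only: `LingMiniDefaults` and `derived_sizes` are hypothetical names that are not part of `nemo_automodel`, and the readings of `n_group`/`topk_group` (group-limited routing) and of shared experts (run for every token in addition to the routed ones) are assumptions based on the architecture summary above.

```python
from dataclasses import dataclass


@dataclass
class LingMiniDefaults:
    """Hypothetical stand-in holding values copied from the signature defaults."""

    num_hidden_layers: int = 20
    num_attention_heads: int = 16
    num_key_value_heads: int = 4
    head_dim: int = 128
    num_experts: int = 256
    num_shared_experts: int = 1
    num_experts_per_tok: int = 8
    n_group: int = 8
    topk_group: int = 4
    first_k_dense_replace: int = 1


def derived_sizes(cfg: LingMiniDefaults) -> dict:
    """Compute shapes and counts implied by the config (under the stated assumptions)."""
    experts_per_group = cfg.num_experts // cfg.n_group
    return {
        # GQA: query and key/value projections have different output widths.
        "q_proj_out": cfg.num_attention_heads * cfg.head_dim,
        "kv_proj_out": cfg.num_key_value_heads * cfg.head_dim,
        "queries_per_kv_head": cfg.num_attention_heads // cfg.num_key_value_heads,
        # The first `first_k_dense_replace` layers use a dense MLP; the rest are MoE.
        "dense_layers": cfg.first_k_dense_replace,
        "moe_layers": cfg.num_hidden_layers - cfg.first_k_dense_replace,
        # Assumption: experts are split evenly into `n_group` groups and only
        # `topk_group` groups remain eligible for each token.
        "experts_per_group": experts_per_group,
        "eligible_experts_per_token": cfg.topk_group * experts_per_group,
        # Assumption: shared experts always run alongside the routed ones.
        "experts_run_per_token": cfg.num_experts_per_tok + cfg.num_shared_experts,
    }


sizes = derived_sizes(LingMiniDefaults())
for name, value in sizes.items():
    print(f"{name}: {value}")
```

With these defaults the query projection width (16 × 128) happens to equal `hidden_size`, while the key/value projection is four times narrower. The larger variants override these values, so the same arithmetic would give different numbers for them.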
- model_type#
'bailing_moe'
- keys_to_ignore_at_inference#
['past_key_values']
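The class description mentions a sigmoid-routed grouped MoE, controlled by `score_function`, `n_group`, `topk_group`, `num_experts_per_tok`, `norm_topk_prob`, `routed_scaling_factor`, and `moe_router_enable_expert_bias`. The pure-Python sketch below shows one common formulation of group-limited routing for a single token. It is not taken from the BailingMoeV2 implementation: the function name `route_token`, the group score (sum of the two best scores in each group), and the use of the bias for selection only are assumptions borrowed from similar routers.

```python
import math


def route_token(logits, n_group, topk_group, top_k,
                norm_topk_prob=True, routed_scaling_factor=1.0,
                expert_bias=None):
    """Pick `top_k` experts for one token using group-limited sigmoid routing.

    Hypothetical helper for illustration; details are assumptions, not the
    library's actual router.
    """
    num_experts = len(logits)
    per_group = num_experts // n_group

    # score_function='sigmoid': each expert is scored independently.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Assumption: an optional per-expert bias affects selection only,
    # not the final combine weights.
    bias = expert_bias if expert_bias is not None else [0.0] * num_experts
    biased = [s + b for s, b in zip(scores, bias)]

    # Assumption: a group's score is the sum of its two best biased scores.
    group_scores = []
    for g in range(n_group):
        members = biased[g * per_group:(g + 1) * per_group]
        group_scores.append(sum(sorted(members, reverse=True)[:2]))

    # Keep only the `topk_group` best groups.
    kept = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]
    candidates = [e for g in kept for e in range(g * per_group, (g + 1) * per_group)]

    # Choose the `top_k` best experts among the surviving groups.
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Combine weights use the unbiased scores, optionally normalized and scaled.
    weights = [scores[e] for e in chosen]
    if norm_topk_prob:
        total = sum(weights)
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]
    return chosen, weights


# Toy example: 8 experts in 4 groups of 2; keep 2 groups, pick 2 experts.
logits = [2.0, 1.0, -1.0, -2.0, 0.5, 0.4, 3.0, -3.0]
chosen, weights = route_token(logits, n_group=4, topk_group=2, top_k=2)
print(chosen, weights)
```

In this toy input, expert 6 has the highest individual score, but its group ranks third and is filtered out, so the token is routed to experts 0 and 1. This is the main behavioral difference between group-limited routing and a plain top-k over all experts.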