Ling 2.0


Ling 2.0 is the Mixture-of-Experts (MoE) LLM family from inclusionAI (Ant Group). It is published on Hugging Face with model_type bailing_moe and the architecture class BailingMoeV2ForCausalLM. The family ranges from a 16 B mini model to a 1 T flagship, and every size uses the same architecture.

Task: Text Generation (MoE)
Architecture: BailingMoeV2ForCausalLM
Parameters: 16 B – 1 T total
HF Org: inclusionAI

Available Models

  • Ling-mini-2.0: 16 B total / ~1.4 B activated per token (20 layers, 256 experts, 8 activated).
  • Ling-flash-2.0: 100 B total / ~6 B activated per token (32 layers, 256 experts, 8 activated).
  • Ling-1T: 1 T total / ~50 B activated per token (80 layers, first_k_dense_replace=4).
  • Ling-mini-base-2.0 / Ling-flash-base-2.0: base (pre-instruct) variants.
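The sparsity of each model follows directly from the totals listed above. The short sketch below only divides the activated count by the total count; the dictionary and function names are illustrative, and the activated figures are the approximate ones quoted in the list.

```python
# Total and approximate activated parameter counts, in billions,
# copied from the model list above.
MODELS = {
    "Ling-mini-2.0": (16, 1.4),
    "Ling-flash-2.0": (100, 6),
    "Ling-1T": (1000, 50),
}


def activated_fraction(total_b: float, activated_b: float) -> float:
    """Return the fraction of all parameters that is used for one token."""
    return activated_b / total_b


for name, (total_b, activated_b) in MODELS.items():
    fraction = activated_fraction(total_b, activated_b)
    print(f"{name}: {fraction:.2%} of parameters active per token")
```

With these figures, each token touches well under a tenth of the weights in every size.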

All variants share the same architecture: GQA + per-head QK-RMSNorm + half RoPE (partial_rotary_factor=0.5) + sigmoid-routed grouped MoE with one shared expert and a per-expert correction bias (aux-loss-free routing).
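To illustrate what half RoPE means, here is a minimal pure-Python sketch that rotates only the first half of one attention head's dimensions and passes the rest through unchanged. Only partial_rotary_factor=0.5 comes from this page. The function name, the base theta of 10000, the choice of the leading dimensions as the rotated ones, and the pairing of element i with element i + rotary_dim/2 are assumptions made for illustration, not a description of the actual BailingMoeV2 implementation.

```python
import math


def apply_partial_rope(vec, position, partial_rotary_factor=0.5, theta=10000.0):
    """Rotate the leading part of one head vector; leave the rest untouched.

    Assumptions (not taken from the model code): the rotated slice is the
    leading one, pairs are (i, i + rotary_dim // 2), and theta is 10000.
    """
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    half = rotary_dim // 2
    out = list(vec)  # dimensions at index >= rotary_dim are copied as-is
    for i in range(half):
        # Each pair gets its own frequency, decreasing with the pair index.
        angle = position * theta ** (-2.0 * i / rotary_dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * cos_a - x2 * sin_a
        out[i + half] = x2 * cos_a + x1 * sin_a
    return out


query = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rotated = apply_partial_rope(query, position=3)
print(rotated[4:] == query[4:])  # the second half carries no positional rotation
```

Because each pair is rotated by a plain 2D rotation, the vector norm is preserved, and position 0 leaves the vector unchanged.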

Architecture

  • BailingMoeV2ForCausalLM (HF model_type: "bailing_moe")
  • GQA attention; use_qk_norm: true
  • Half RoPE (partial_rotary_factor=0.5)
  • DeepSeek-V3-style routing: sigmoid scoring, per-expert bias, grouped top-k (n_group=8, topk_group=4)
  • 1 shared expert at moe_intermediate_size
  • Dense MLP instead of MoE in the first first_k_dense_replace layer(s) of the stack
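The routing bullet above can be sketched for a single token as follows. This is a simplified illustration, not the model's code: it assumes the DeepSeek-V3 conventions that a group is scored by the sum of its two highest biased scores, that the bias influences only which experts are selected, and that the mixing weights are the unbiased sigmoid scores renormalized to sum to 1. The function name route_token is hypothetical, the defaults n_group=8 and topk_group=4 come from the list above, and top_k=8 matches the 8 activated experts listed for the mini and flash models. The shared expert and any scaling factor are left out.

```python
import math
import random


def route_token(logits, bias, n_group=8, topk_group=4, top_k=8):
    """Pick top_k experts for one token and return {expert_index: weight}."""
    n_experts = len(logits)
    group_size = n_experts // n_group

    # Sigmoid scoring (instead of a softmax over all experts).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # The per-expert correction bias is added for selection only.
    biased = [s + b for s, b in zip(scores, bias)]

    # Score each group by the sum of its two best biased scores (assumed).
    group_scores = []
    for g in range(n_group):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept_groups = sorted(range(n_group), key=lambda g: group_scores[g], reverse=True)[:topk_group]

    # Top-k experts, restricted to the groups that were kept.
    candidates = [e for g in kept_groups
                  for e in range(g * group_size, (g + 1) * group_size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Mixing weights use the unbiased scores, renormalized over the chosen experts.
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]
weights = route_token(logits, bias=[0.0] * 256)
print(len(weights), len({e // 32 for e in weights}))  # experts chosen, groups touched
```

Keeping the bias out of the mixing weights is what lets it steer load balancing without an auxiliary loss term: raising an expert's bias makes it more likely to be selected but does not change how much its output counts once selected.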

Example HF Models

Example Recipes

| Recipe | Description | Min HW |
|---|---|---|
| ling_mini_2_0_squad.yaml | LoRA SFT: Ling-mini-2.0 on SQuAD | 2× H100 80GB |
| ling_mini_2_0_hellaswag.yaml | LoRA SFT: Ling-mini-2.0 on HellaSwag | 2× H100 80GB |
| ling_mini_2_0_sft.yaml | Full SFT: Ling-mini-2.0 on HellaSwag, FSDP2 + EP=8 | 8× H100 80GB |
| ling_flash_2_0_lora.yaml | LoRA SFT: Ling-flash-2.0 on HellaSwag | 8× H100 80GB |
| ling_flash_2_0_sft.yaml | Full SFT: Ling-flash-2.0 on HellaSwag, FSDP2 + EP=32 | 32× H100 80GB (4 nodes) |
| ling_1t_lora_pp.yaml | LoRA SFT: Ling-1T on HellaSwag, FSDP2 + PP=8 + EP=8 | 64× H100 80GB (8 nodes) |
| ling_1t_sft.yaml | Full SFT: Ling-1T on HellaSwag, FSDP2 + PP=4 + EP=64 | 256× H100 80GB (32 nodes) |
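The GPU counts and parallel sizes in the recipe table are mutually consistent under one simple model of the rank layout. The sketch below assumes that pipeline parallelism first splits the GPUs into PP equal stages and that the expert-parallel size must then divide the number of GPUs in one stage. That layout rule is an assumption made for this check, not something the table states, and PP=1 is assumed for recipes that do not list a PP value. The names RECIPES and check_layout are illustrative.

```python
# GPU count, pipeline-parallel size and expert-parallel size per recipe,
# read from the recipe table. pp=1 is assumed where the table lists no PP.
RECIPES = {
    "ling_mini_2_0_sft.yaml": {"gpus": 8, "pp": 1, "ep": 8},
    "ling_flash_2_0_sft.yaml": {"gpus": 32, "pp": 1, "ep": 32},
    "ling_1t_lora_pp.yaml": {"gpus": 64, "pp": 8, "ep": 8},
    "ling_1t_sft.yaml": {"gpus": 256, "pp": 4, "ep": 64},
}


def check_layout(gpus: int, pp: int, ep: int) -> int:
    """Return how many expert-parallel groups fit in one pipeline stage.

    Raises ValueError if the sizes do not divide evenly under the assumed
    layout (PP splits the world, EP splits one stage).
    """
    if gpus % pp != 0:
        raise ValueError(f"{gpus} GPUs cannot be split into {pp} pipeline stages")
    gpus_per_stage = gpus // pp
    if gpus_per_stage % ep != 0:
        raise ValueError(f"EP={ep} does not divide {gpus_per_stage} GPUs per stage")
    return gpus_per_stage // ep


for name, cfg in RECIPES.items():
    print(name, "expert-parallel groups per stage:", check_layout(**cfg))
```

Under this model every recipe in the table uses exactly one expert-parallel group per pipeline stage, so the minimum hardware column is the smallest world size that satisfies both settings.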

Try with NeMo AutoModel

1. Install (full instructions).

2. Run LoRA fine-tuning:

$ automodel examples/llm_finetune/ling/ling_mini_2_0_squad.yaml --nproc-per-node 1

With the LoRA defaults in the example, Ling-mini-2.0 in bf16 fits on a single 80 GB H100 or A100. For the larger variants, set distributed.ep_size > 1 to enable multi-GPU expert parallelism.
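A rough back-of-the-envelope estimate shows why the mini model is the only size that can sit on one card. The sketch counts the bf16 weights alone at 2 bytes per parameter; activations, LoRA adapter weights, optimizer state and framework overhead are deliberately ignored, so the real footprint is higher. The function name and the 80 GB threshold variable are illustrative.

```python
GPU_MEMORY_GB = 80  # one H100 / A100 80 GB card


def weight_memory_gb(total_params_billion: float, bytes_per_param: int = 2) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); bf16 uses 2 bytes per parameter."""
    # (params_billion * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return total_params_billion * bytes_per_param


for name, params_b in [("Ling-mini-2.0", 16), ("Ling-flash-2.0", 100), ("Ling-1T", 1000)]:
    needed = weight_memory_gb(params_b)
    verdict = "fits on" if needed < GPU_MEMORY_GB else "exceeds"
    print(f"{name}: ~{needed:.0f} GB of bf16 weights, {verdict} one {GPU_MEMORY_GB} GB GPU")
```

At roughly 32 GB of weights, Ling-mini-2.0 leaves headroom on an 80 GB card, while the 100 B and 1 T models need their weights sharded across many GPUs before training can start.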