Ling 2.0

Ling 2.0 is the Mixture-of-Experts (MoE) LLM family from inclusionAI (Ant Group). It uses the bailing_moe Hugging Face model type (architecture class BailingMoeV2ForCausalLM). The line ranges from a 16 B mini model to a 1 T flagship, and every size shares the same architecture.


Task	Text Generation (MoE)
Architecture	`BailingMoeV2ForCausalLM`
Parameters	16 B – 1 T total
HF Org	inclusionAI

Available Models

Ling-mini-2.0: 16 B total / ~1.4 B activated per token (20 layers, 256 experts, 8 activated).
Ling-flash-2.0: 100 B total / ~6 B activated per token (32 layers, 256 experts, 8 activated).
Ling-1T: 1 T total / ~50 B activated per token (80 layers, first_k_dense_replace=4).
Ling-mini-base-2.0 / Ling-flash-base-2.0: base (pre-instruct) variants.
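
To make the sparsity of these models concrete, the sketch below computes the fraction of parameters activated per token from the figures listed above. The dictionary layout and the helper name active_fraction are illustrative only. The activated counts are approximate in the list, so the results are approximate too.

```python
# Figures copied from the model list above, in billions of parameters.
MODELS = {
    "Ling-mini-2.0": {"total_b": 16, "active_b": 1.4},
    "Ling-flash-2.0": {"total_b": 100, "active_b": 6},
    "Ling-1T": {"total_b": 1000, "active_b": 50},
}


def active_fraction(total_b, active_b):
    """Return the share of total parameters used for one token."""
    return active_b / total_b


fractions = {name: active_fraction(**m) for name, m in MODELS.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.2%} of parameters active per token")

# Routed-expert sparsity for the mini and flash variants: 8 of 256 experts.
routed_fraction = 8 / 256
print(f"routed experts active per token: {routed_fraction:.3%}")
```

The parameter fraction is larger than the routed-expert fraction because attention, embeddings, the shared expert, and any dense layers run for every token.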

All variants share the same architecture: grouped-query attention (GQA), per-head QK-RMSNorm, half RoPE (partial_rotary_factor=0.5), and a sigmoid-routed grouped MoE with one shared expert and a per-expert correction bias (aux-loss-free routing).
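
As an illustration of what partial_rotary_factor=0.5 means, here is a minimal pure-Python sketch that rotates only half of a head's dimensions and passes the rest through unchanged. It assumes the rotated dimensions are the leading ones, the "rotate-half" pairing of index i with i + rotary_dim/2, and a frequency base of 10000; these are common conventions rather than details confirmed by this page. The function name apply_partial_rope is hypothetical.

```python
import math


def apply_partial_rope(vec, position, partial_rotary_factor=0.5, base=10000.0):
    """Rotate the leading part of one head vector; leave the tail untouched."""
    head_dim = len(vec)
    rotary_dim = int(head_dim * partial_rotary_factor)
    half = rotary_dim // 2
    out = list(vec)
    for i in range(half):
        # Assumed frequency schedule: base ** (-2i / rotary_dim).
        angle = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


q = [1.0] * 8                      # toy head with head_dim = 8
rotated = apply_partial_rope(q, position=3)
print(rotated[:4])                 # position-dependent half
print(rotated[4:])                 # pass-through half, equal to q[4:]
```

With head_dim = 8, only the first four dimensions carry positional information; the last four are left exactly as they were.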

Architecture

BailingMoeV2ForCausalLM (HF model_type: "bailing_moe")
GQA attention; use_qk_norm: true
Half RoPE (partial_rotary_factor=0.5)
DeepSeek-V3-style routing: sigmoid scoring, per-expert bias, grouped top-k (n_group=8, topk_group=4)
One shared expert with intermediate size moe_intermediate_size
Dense MLP instead of MoE in the first first_k_dense_replace layer(s) of the stack
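
The routing item above can be sketched for a single token as follows. This is a simplified illustration, not the model's actual code: it assumes each group is scored by the sum of its two best biased scores (as in DeepSeek-V3), that the bias affects only which experts are selected, and that the final weights are the unbiased sigmoid scores renormalized to sum to 1. Any routed scaling factor is omitted. The name route_token is hypothetical.

```python
import math
import random


def route_token(logits, bias, n_group=8, topk_group=4, top_k=8):
    """Return {expert_index: weight} for one token."""
    n_experts = len(logits)
    group_size = n_experts // n_group
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]   # sigmoid scoring
    biased = [s + b for s, b in zip(scores, bias)]          # used for selection only

    # Score each group by the sum of its two best biased scores (assumption).
    group_scores = []
    for g in range(n_group):
        members = biased[g * group_size:(g + 1) * group_size]
        group_scores.append(sum(sorted(members, reverse=True)[:2]))

    # Keep the best topk_group groups, then pick top_k experts inside them.
    kept = set(sorted(range(n_group), key=lambda g: group_scores[g],
                      reverse=True)[:topk_group])
    candidates = [e for e in range(n_experts) if e // group_size in kept]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Weights come from the unbiased scores, renormalized (assumption).
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(256)]   # toy router outputs
weights = route_token(logits, bias=[0.0] * 256)
print(sorted(weights))

# Raising one expert's bias pulls it into the selection without any aux loss.
boosted = route_token(logits, bias=[10.0 if e == 5 else 0.0 for e in range(256)])
print(5 in boosted)
```

In aux-loss-free routing the bias is adjusted during training to balance expert load; that update rule is not shown here.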

Example HF Models

Model	HF ID
Ling-mini-2.0	`inclusionAI/Ling-mini-2.0`
Ling-flash-2.0	`inclusionAI/Ling-flash-2.0`
Ling-1T	`inclusionAI/Ling-1T`

Example Recipes

Recipe	Description	Min HW
ling_mini_2_0_squad.yaml	LoRA SFT — Ling-mini-2.0 on SQuAD	2× H100 80GB
ling_mini_2_0_hellaswag.yaml	LoRA SFT — Ling-mini-2.0 on HellaSwag	2× H100 80GB
ling_mini_2_0_sft.yaml	Full SFT — Ling-mini-2.0 on HellaSwag, FSDP2 + EP=8	8× H100 80GB
ling_flash_2_0_lora.yaml	LoRA SFT — Ling-flash-2.0 on HellaSwag	8× H100 80GB
ling_flash_2_0_sft.yaml	Full SFT — Ling-flash-2.0 on HellaSwag, FSDP2 + EP=32	32× H100 80GB (4 nodes)
ling_1t_lora_pp.yaml	LoRA SFT — Ling-1T on HellaSwag, FSDP2 + PP=8 + EP=8	64× H100 80GB (8 nodes)
ling_1t_sft.yaml	Full SFT — Ling-1T on HellaSwag, FSDP2 + PP=4 + EP=64	256× H100 80GB (32 nodes)
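
The GPU counts in the table follow a simple pattern that can be checked with a few lines of Python. The sketch assumes 8 GPUs per node (consistent with the node counts shown) and PP=1 for recipes that do not list a pipeline size; both are assumptions, and the helper name layout is hypothetical. It describes the numbers in this table only, not a general rule enforced by NeMo AutoModel.

```python
GPUS_PER_NODE = 8  # assumption, matches "32 GPUs (4 nodes)" in the table

# Values copied from the recipe table; pp=1 is assumed where PP is not listed.
RECIPES = {
    "ling_mini_2_0_sft.yaml": {"gpus": 8, "pp": 1, "ep": 8},
    "ling_flash_2_0_sft.yaml": {"gpus": 32, "pp": 1, "ep": 32},
    "ling_1t_lora_pp.yaml": {"gpus": 64, "pp": 8, "ep": 8},
    "ling_1t_sft.yaml": {"gpus": 256, "pp": 4, "ep": 64},
}


def layout(gpus, pp, ep):
    """Split a GPU count into pipeline stages and expert-parallel groups."""
    if gpus % pp != 0:
        raise ValueError("GPU count must be divisible by the pipeline size")
    ranks_per_stage = gpus // pp
    if ranks_per_stage % ep != 0:
        raise ValueError("ranks per stage must be divisible by the EP size")
    return {
        "nodes": -(-gpus // GPUS_PER_NODE),        # ceiling division
        "ranks_per_stage": ranks_per_stage,
        "ep_groups_per_stage": ranks_per_stage // ep,
    }


layouts = {name: layout(**cfg) for name, cfg in RECIPES.items()}
for name, info in layouts.items():
    print(name, info)
```

In every row, the expert-parallel size equals the number of ranks in one pipeline stage, so each stage holds exactly one expert-parallel group.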

Try with NeMo AutoModel

1. Install NeMo AutoModel (see the full installation instructions).

2. Run LoRA fine-tuning:

$ automodel examples/llm_finetune/ling/ling_mini_2_0_squad.yaml --nproc-per-node 1

Ling-mini-2.0 fits on a single 80 GB H100 or A100 in bf16 with the LoRA defaults in the example. For the larger variants, set distributed.ep_size > 1 to enable multi-GPU expert parallelism.