Ling 2.0
Ling 2.0 is the Mixture-of-Experts LLM family from inclusionAI (Ant Group). It uses the bailing_moe Hugging Face model type, implemented by the BailingMoeV2ForCausalLM architecture class. The family ranges from a 16 B mini model to a 1 T flagship, and every size shares the same architecture.
Available Models
- Ling-mini-2.0: 16 B total / ~1.4 B activated per token (20 layers, 256 experts, 8 activated).
- Ling-flash-2.0: 100 B total / ~6 B activated per token (32 layers, 256 experts, 8 activated).
- Ling-1T: 1 T total / ~50 B activated per token (80 layers, first_k_dense_replace=4).
- Ling-mini-base-2.0 / Ling-flash-base-2.0: base (pre-instruct) variants.
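To put the total and activated sizes in the list above side by side, the following small sketch computes the share of parameters used per token. The figures are the approximate ones quoted in the list; the helper name `activated_fraction` is made up for this illustration.

```python
# Total and approximate activated parameter counts, in billions, as quoted in the list above.
MODELS = {
    "Ling-mini-2.0": (16, 1.4),
    "Ling-flash-2.0": (100, 6),
    "Ling-1T": (1000, 50),
}


def activated_fraction(total_b, active_b):
    """Share of the model's parameters that participate in one token's forward pass."""
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = activated_fraction(total_b, active_b)
    print(f"{name}: {share:.1%} of parameters active per token")

# Routed experts used per token for the mini and flash configs: 8 of 256.
ROUTED_EXPERT_FRACTION = 8 / 256
```

The activated share is larger than the routed-expert share because attention, embeddings, the shared expert, and any dense layers run for every token.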
All variants share the same architecture: GQA + per-head QK-RMSNorm + half RoPE (partial_rotary_factor=0.5) + sigmoid-routed grouped MoE with one shared expert and a per-expert correction bias (aux-loss-free routing).
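As an illustration of what half RoPE means, here is a minimal pure-Python sketch that applies rotary position embedding to only the leading fraction of one attention head vector and passes the remaining dimensions through unchanged. It is not the model's actual implementation: the function name `half_rope`, the `base=10000.0` frequency base, and the pairing of dimension `i` with dimension `i + half` (the common "rotate half" layout) are assumptions made for this example.

```python
import math


def half_rope(vec, position, partial_rotary_factor=0.5, base=10000.0):
    """Rotate only the leading fraction of one head vector; leave the rest untouched."""
    head_dim = len(vec)
    rot_dim = int(head_dim * partial_rotary_factor)  # e.g. 64 of 128 dimensions
    half = rot_dim // 2
    out = list(vec)  # dimensions >= rot_dim are copied through unchanged
    for i in range(half):
        # Assumed layout: dimension i is paired with dimension i + half.
        theta = position * base ** (-2.0 * i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out


# Hypothetical 8-dimensional head: dims 0..3 are rotated, dims 4..7 carry no position signal.
example = half_rope([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], position=3)
```

Because each pair is rotated by a plain 2D rotation, the norm of the rotated half is preserved, and position 0 leaves the vector unchanged.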
Architecture
- BailingMoeV2ForCausalLM (HF model_type: "bailing_moe")
- GQA attention; use_qk_norm: true
- Half RoPE (partial_rotary_factor=0.5)
- DeepSeek-V3-style routing: sigmoid scoring, per-expert bias, grouped top-k (n_group=8, topk_group=4)
- 1 shared expert at moe_intermediate_size
- first_k_dense_replace dense MLP layer(s) at the start of the stack
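The routing described above can be sketched for a single token in pure Python. This is a simplified illustration under stated assumptions, not the model's code: the function name `route_token` is hypothetical, the rule that scores each group by the sum of its two best biased scores is taken from the DeepSeek-V3 scheme, and renormalizing the gate weights over the chosen experts is likewise assumed. Any additional scaling factor on the gate weights is omitted.

```python
import math
import random


def route_token(logits, bias, n_group=8, topk_group=4, top_k=8):
    """Pick top_k experts for one token; return (expert_ids, normalized_gate_weights)."""
    n_experts = len(logits)
    group_size = n_experts // n_group
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid scoring
    biased = [s + b for s, b in zip(scores, bias)]          # bias affects selection only

    def group_score(g):
        # Assumed (DeepSeek-V3 rule): a group's score is the sum of its two best biased scores.
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        return sum(members[:2])

    # Keep the best topk_group groups, then pick top_k experts among their members.
    kept_groups = sorted(range(n_group), key=group_score, reverse=True)[:topk_group]
    candidates = [e for g in kept_groups
                  for e in range(g * group_size, (g + 1) * group_size)]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:top_k]

    # Gate weights use the unbiased sigmoid scores, renormalized over the chosen experts.
    total = sum(scores[e] for e in chosen)
    return chosen, [scores[e] / total for e in chosen]


# Usage with random router logits for 256 experts and a zero correction bias.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]
experts, weights = route_token(logits, [0.0] * 256)
```

Keeping the bias out of the gate weights is what lets it steer load balance without an auxiliary loss: raising an under-used expert's bias makes it more likely to be selected, while the weight applied to its output still reflects only the router's sigmoid score.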
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install (full instructions).
2. Run LoRA fine-tuning:
A single 80 GB H100 / A100 fits Ling-mini-2.0 in bf16 with the LoRA defaults in the example. Set distributed.ep_size > 1 for multi-GPU expert parallelism on the larger variants.
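A rough back-of-the-envelope check of the sizing guidance above: bf16 stores each parameter in 2 bytes, so weight memory alone can be estimated from the total parameter count. The sketch below is only an estimate, since it ignores activations, LoRA adapter weights, and optimizer state. The helper names are made up for this illustration, and the requirement that the expert count divide evenly by `ep_size` is an assumption (a common constraint for expert parallelism), not something stated here.

```python
BYTES_PER_BF16 = 2


def bf16_weight_gb(total_params_billion):
    """Weight memory in GB (10^9 bytes): 1e9 params * 2 bytes / 1e9 bytes per GB."""
    return total_params_billion * BYTES_PER_BF16


def experts_per_rank(num_experts, ep_size):
    """Experts held by each rank, assuming experts are split evenly across ep_size ranks."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size (assumed constraint)")
    return num_experts // ep_size


for name, total_b in [("Ling-mini-2.0", 16), ("Ling-flash-2.0", 100), ("Ling-1T", 1000)]:
    print(f"{name}: about {bf16_weight_gb(total_b)} GB of bf16 weights")

# With the 256 experts of the mini and flash configs and a hypothetical ep_size of 8:
print(experts_per_rank(256, 8), "experts per rank")
```

At about 32 GB of weights, Ling-mini-2.0 leaves headroom on an 80 GB device, while the roughly 200 GB of Ling-flash-2.0 weights cannot fit on one, which is where sharding experts across GPUs comes in.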