GLM-4.5V#
GLM-4.5V is a powerful vision-language model built on the GLM-4.5 Air architecture. It combines a 106B parameter sparse MoE language model with a vision encoder for robust multimodal understanding of images and videos.
GLM-4.5V supports multimodal tasks including image captioning, visual question answering, OCR, video understanding, and general vision-language reasoning. The model leverages Multi-Resolution Rotary Position Embedding (MRoPE) for enhanced spatial understanding.
GLM family models are supported via the Bridge system with auto-detected configuration and weight mapping.
Important
Update transformers to version 4.57.1 or higher to use the GLM-4.5V model.
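For example, the required version can be installed or upgraded with pip (a minimal setup fragment; adjust to your environment manager):

```shell
# Upgrade transformers to a release that includes GLM-4.5V support
pip install -U "transformers>=4.57.1"
```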
Available Models#
Vision-Language Models#
GLM-4.5V (zai-org/GLM-4.5V): 106B parameter vision-language model (based on GLM-4.5 Air)
46 decoder layers, 4096 hidden size
96 attention heads, 8 query groups (GQA)
128 MoE experts with shared experts
~12B active parameters per token
Sequence length: 131,072 tokens
Recommended: 32 nodes, 256 GPUs (LoRA/DoRA) or 64 nodes, 512 GPUs (Full SFT)
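The attention configuration above implies a substantial KV-cache saving from grouped query attention. A back-of-the-envelope sketch, using only the head counts stated in the spec (96 attention heads, 8 query groups):

```python
# KV-cache reduction from GQA, using the head counts from the spec above.
num_attention_heads = 96  # query heads
num_kv_heads = 8          # "8 query groups": key/value heads shared across queries

# Full multi-head attention would cache one K/V pair per query head;
# GQA caches one per group, shrinking the KV cache proportionally.
kv_cache_reduction = num_attention_heads // num_kv_heads
print(kv_cache_reduction)  # 12x smaller KV cache than full multi-head attention
```

At a 131,072-token context, a 12x smaller KV cache is a significant share of the memory budget, which is part of why GQA is used here.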
Model Architecture Features#
GLM-4.5V combines efficient sparse MoE language modeling with multimodal capabilities:
Language Model Features:
Sparse MoE Architecture: 128 routed experts with shared experts for efficient parameter usage
Grouped Query Attention (GQA): Memory-efficient attention with 8 query groups
SiLU Gated Linear Unit: Gated linear units with SiLU activation for improved performance
RMSNorm: Layer normalization without mean centering for faster computation
Multi-Resolution RoPE (MRoPE): Enhanced position embeddings with sections [8, 12, 12] for improved spatial understanding
Extended Context: Supports up to 131,072 tokens
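The MRoPE scheme above can be sketched as follows. Only the section sizes [8, 12, 12] come from the model description; the helper name and the idea of assigning frequency pairs to temporal/height/width position streams are illustrative assumptions based on how multimodal RoPE variants are commonly implemented:

```python
import numpy as np

# Sections from the model description: frequency pairs assigned to the
# (temporal, height, width) position streams, in that order.
MROPE_SECTIONS = [8, 12, 12]

def mrope_position_ids(t: int, h: int, w: int) -> np.ndarray:
    """Per-frequency-pair position ids for one token at grid location (t, h, w).

    Hypothetical helper: each rotary section rotates by a different coordinate,
    so the embedding encodes time, row, and column jointly.
    """
    coords = [t, h, w]
    return np.concatenate([
        np.full(size, coords[i], dtype=np.int64)
        for i, size in enumerate(MROPE_SECTIONS)
    ])

# A token in frame 0, row 3, column 5 of the visual grid:
ids = mrope_position_ids(t=0, h=3, w=5)
# First 8 entries carry the temporal index, next 12 the row, last 12 the column.
```

Text tokens can use the same index for all three streams, which reduces this scheme to ordinary 1D RoPE for pure-text positions.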
Vision-Language Features:
Vision Encoder: Pre-trained vision encoder for robust visual understanding
Multimodal Integration: Seamless integration of visual and textual information
Image and Video Support: Handles both static images and video inputs
Flexible Image Handling: Supports variable resolution images and multiple images per conversation
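Multiple images per conversation are typically expressed as interleaved content entries in a chat message. A minimal sketch in the Hugging Face image-text-to-text message convention; the exact template GLM-4.5V's processor expects may differ, and the image URLs are placeholders:

```python
# Hedged sketch of a multi-image user turn (URLs are placeholders).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/image1.png"},
            {"type": "image", "url": "https://example.com/image2.png"},
            {"type": "text", "text": "What differs between these two images?"},
        ],
    }
]

# A processor's apply_chat_template(messages, ...) would then interleave
# image placeholder tokens with the text before the forward pass.
```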
Examples#
For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the GLM-4.5V Examples.
Hugging Face Model Cards#
GLM-4.5V: https://huggingface.co/zai-org/GLM-4.5V