Ministral 3#
Mistral AI’s Ministral 3 is a family of edge-optimized vision-language models designed for deployment across various hardware configurations. The Ministral 3 architecture combines a powerful language model with a vision encoder for multimodal understanding.
Ministral 3 models support multimodal tasks including image captioning, visual question answering, OCR, and general vision-language understanding. Despite their compact size, these models deliver strong performance for on-device and edge deployment scenarios.
Ministral family models are supported via the Bridge system with auto-detected configuration and weight mapping.
Important
Please upgrade to transformers v5 and upgrade mistral-common in order to use the Ministral 3 models.
Available Models#
Vision-Language Models#
Ministral 3 3B (
mistralai/Ministral-3-3B-Base-2512): 3.4B parameter vision-language model26 layers, 3072 hidden size
32 attention heads, 8 query groups (GQA)
Vision encoder: ~0.4B parameters
Recommended: 1 node, 8 GPUs
Ministral 3 8B (
mistralai/Ministral-3-8B-Base-2512): 8.4B parameter vision-language model34 layers, 4096 hidden size
32 attention heads, 8 query groups (GQA)
Vision encoder: ~0.4B parameters
Recommended: 1 node, 8 GPUs
Ministral 3 14B (
mistralai/Ministral-3-14B-Base-2512): ~14B parameter vision-language model40 layers, 5120 hidden size
32 attention heads, 8 query groups (GQA)
Vision encoder: ~0.4B parameters
Recommended: 1 node, 8 GPUs
All models support extended context lengths up to 256K tokens using YaRN RoPE scaling.
Model Architecture Features#
Ministral 3 combines efficient language modeling with multimodal capabilities:
Language Model Features:
YaRN RoPE Scaling: Advanced rope scaling for extended context lengths (up to 256K tokens)
Grouped Query Attention (GQA): Memory-efficient attention mechanism with 8 query groups
SwiGLU Activation: Gated linear units with SiLU activation for improved performance
RMSNorm: Layer normalization without mean centering for faster computation
Llama 4 Attention Scaling: Position-dependent attention scaling for improved long-context handling
Vision-Language Features:
Vision Encoder: Pre-trained vision encoder for robust visual understanding
Multimodal Projector: Projects vision features to language model space
Flexible Image Handling: Supports variable resolution images and multiple images per conversation
Examples#
For checkpoint conversion, inference, finetuning recipes, and step-by-step training guides, see the Ministral 3 Examples.
Hugging Face Model Cards#
Ministral 3 3B Base: https://huggingface.co/mistralai/Ministral-3-3B-Base-2512
Ministral 3 3B Instruct: https://huggingface.co/mistralai/Ministral-3-3B-Instruct-2512
Ministral 3 8B Base: https://huggingface.co/mistralai/Ministral-3-8B-Base-2512
Ministral 3 8B Instruct: https://huggingface.co/mistralai/Ministral-3-8B-Instruct-2512
Ministral 3 14B Base: https://huggingface.co/mistralai/Ministral-3-14B-Base-2512
Ministral 3 14B Instruct: https://huggingface.co/mistralai/Ministral-3-14B-Instruct-2512