Mistral Medium 3.5
Mistral Medium 3.5 is Mistral AI’s
flagship 128B dense model that merges instruction-following, reasoning,
and coding into a single checkpoint with a configurable reasoning mode.
It unifies the lineage of Mistral Medium 3.1, Magistral Medium, and
Devstral 2 into one model, and ships natively in FP8 (per-tensor
weight_scale_inv), so the full model fits inside a single H200 node or
2 × H100 nodes. That is a notable footprint advantage over comparably
capable Mixture-of-Experts (MoE) systems.
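To make the per-tensor weight_scale_inv idea concrete, here is a minimal pure-Python sketch of per-tensor FP8 quantization and dequantization. It assumes the E4M3 format (largest finite value 448) and the common convention that the dequantized weight is the stored FP8 value multiplied by weight_scale_inv; neither detail is confirmed by this page. The helper names are hypothetical and are not part of any library.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed format)


def round_to_e4m3(x: float) -> float:
    """Emulate rounding to the nearest E4M3-representable value."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate to the FP8 range
    _, e = math.frexp(abs(x))       # abs(x) = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)       # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)    # 3 mantissa bits: 8 steps per binade
    return round(x / step) * step


def quantize_per_tensor(weights):
    """Return (fp8_values, weight_scale_inv) using one scale for the whole tensor."""
    amax = max(abs(w) for w in weights)
    weight_scale_inv = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    fp8_values = [round_to_e4m3(w / weight_scale_inv) for w in weights]
    return fp8_values, weight_scale_inv


def dequantize_per_tensor(fp8_values, weight_scale_inv):
    """Assumed convention: dequantized weight = fp8 value * weight_scale_inv."""
    return [v * weight_scale_inv for v in fp8_values]


weights = [0.5, -1.25, 0.03, 2.0]
fp8_values, weight_scale_inv = quantize_per_tensor(weights)
restored = dequantize_per_tensor(fp8_values, weight_scale_inv)
for original, value in zip(weights, restored):
    print(f"{original:+.4f} -> {value:+.4f}")
```

Because a single scale covers the whole tensor, the largest weight maps onto the top of the FP8 range and every other element inherits FP8's coarse relative precision (at most about 6% per element in the normal range).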
Architecture
Mistral Medium 3.5 is a dense transformer — no MoE routing — built on
the same text backbone as
mistralai/Devstral-2-123B-Instruct-2512:
88 Ministral-3 decoder layers (hidden 12288, 96 attention heads,
8 KV heads, GQA) with the standard llama-style RoPE + RMSNorm + SwiGLU
MLP layout. The multimodal variant adds a Pixtral vision tower and
multi-modal projector on top, making it an
AutoModelForImageTextToText checkpoint.
Compared with MoE models of similar capability, the dense layout trades sparse-activation throughput for a substantially smaller deployment footprint — relevant when you want to fine-tune or serve the model on a single node.
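The attention shape listed above also determines the KV-cache cost of long contexts. The following back-of-the-envelope sketch estimates that cost with grouped-query attention (8 KV heads) versus a hypothetical full multi-head layout (96 KV heads). It assumes head_dim = hidden / heads = 128, a BF16 cache (2 bytes per value), and that "256k" means 262,144 tokens; the real configuration may differ on any of these.

```python
# Values taken from the architecture description above.
NUM_LAYERS = 88
HIDDEN_SIZE = 12288
NUM_ATTENTION_HEADS = 96
NUM_KV_HEADS = 8

# Assumption: head_dim is hidden / heads; the real config may set it explicitly.
HEAD_DIM = HIDDEN_SIZE // NUM_ATTENTION_HEADS  # 128


def kv_cache_bytes(num_tokens, num_kv_heads=NUM_KV_HEADS, bytes_per_value=2):
    """KV-cache size for one sequence: K and V tensors for every layer."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * bytes_per_value * num_tokens


context_tokens = 256 * 1024  # assumption: "256k" = 262,144 tokens
gqa_bytes = kv_cache_bytes(context_tokens)
mha_bytes = kv_cache_bytes(context_tokens, num_kv_heads=NUM_ATTENTION_HEADS)

print(f"per token (GQA): {kv_cache_bytes(1)} bytes")
print(f"full context, GQA (8 KV heads):  {gqa_bytes / 2**30:.0f} GiB")
print(f"full context, MHA (96 KV heads): {mha_bytes / 2**30:.0f} GiB")
```

Under these assumptions a single full-length sequence needs roughly 88 GiB of cache with GQA, a twelfth of what 96 KV heads would require.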
Key Strengths
- Compactness. The dense 128B model fits on fewer GPUs than comparable MoE models: a single H200 node or 2 × H100 nodes for inference.
- Configurable reasoning mode. One checkpoint covers chat, agentic, and reasoning workloads; the reasoning mode is toggled at inference time.
- Strong agentic performance. Competitive on tool-use and decision-making benchmarks; suitable as a base for connector-driven agent workflows.
- Long context. 256k-token window for document parsing and research-assistant use cases.
Trade-offs disclosed in the model card: weaker non-agentic benchmark performance and more verbose outputs than some closed-source competitors.
Use Cases
- Agentic workflows with connectors
- Cloud and local async coding
- Document parsing (multimodal — text + image)
- Research assistants
- General chat
- Base model for downstream fine-tuning
Available Models
- Mistral-Medium-3.5 128B
Class
- HF: Mistral3ForConditionalGeneration
- NeMo AutoModel custom: Mistral3FP8VLMForConditionalGeneration (source)
The custom class extends HF’s Mistral3ForConditionalGeneration and
attaches a state-dict adapter built with
Mistral3FP8StateDictAdapter.for_vlm_full(), so the FP8 checkpoint is
dequantized per shard inside the standard DCP load. The full BF16 model
is never materialized on a single rank, which allows TP+PP training to
fit on H100-80GB.
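A rough calculation shows why per-shard dequantization matters. The sketch below uses the nominal 128B parameter count and the TP=8, PP=8 layout that this page reports for the validated recipe. It assumes weights split evenly across the TP × PP ranks, and it ignores gradients, optimizer state, activations, and any tensors that are replicated rather than sharded. Treat the numbers as an order-of-magnitude illustration, not a measured footprint.

```python
NUM_PARAMS = 128 * 10**9        # nominal "128B" parameter count
BYTES_BF16 = 2
H100_MEMORY_BYTES = 80 * 10**9  # H100-80GB


def weight_bytes_per_rank(num_params, bytes_per_param, tp, pp):
    """Weight bytes held by one rank, assuming an even TP x PP split."""
    return num_params * bytes_per_param / (tp * pp)


full_bf16_bytes = NUM_PARAMS * BYTES_BF16
per_rank_bf16_bytes = weight_bytes_per_rank(NUM_PARAMS, BYTES_BF16, tp=8, pp=8)

print(f"full BF16 weights:     {full_bf16_bytes / 1e9:.0f} GB")
print(f"per-rank BF16 weights: {per_rank_bf16_bytes / 1e9:.0f} GB")
print(f"full model fits on one H100: {full_bf16_bytes <= H100_MEMORY_BYTES}")
```

The full BF16 weights (about 256 GB) exceed a single 80 GB GPU several times over, while each of the 64 model-parallel ranks only needs about 4 GB of weights, which is why the checkpoint has to be dequantized shard by shard rather than materialized whole.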
Example HF Models
Example Recipes
Try with NeMo AutoModel
1. Install (full instructions):
2. Clone the repo to get the example recipes:
This recipe was validated on 8 nodes × 8 GPUs (64 H100s) with TP=8 PP=8 DP=1. See the Launcher Guide for multi-node setup. Inference / single-node fine-tune fits in 1 × H200 or 2 × H100 nodes thanks to the dense + FP8 layout.
3. Run the recipe via Slurm (see the fine-tuning guide for a complete launch script):
Run with Docker
1. Pull the container and mount a checkpoint directory:
2. Navigate to the AutoModel directory (where the recipes are):
3. Run the recipe:
See the Installation Guide and the Mistral Medium 3.5 Fine-Tuning Guide.
Fine-Tuning
See the Mistral Medium 3.5 Fine-Tuning Guide.
Hugging Face Model Cards
- mistralai
- Related architecture:
mistralai/Devstral-2-123B-Instruct-2512