Mistral Medium 3.5#
Mistral Medium 3.5 is Mistral AI’s
flagship 128B dense model that merges instruction-following, reasoning,
and coding into a single checkpoint with a configurable reasoning mode.
It unifies the lineage of Mistral Medium 3.1, Magistral Medium, and
Devstral 2 into one model, and ships natively in FP8 (per-tensor
weight_scale_inv) so the full model fits inside an H200 node or 2 ×
H100 nodes — a notable footprint advantage over comparably-capable
Mixture-of-Experts (MoE) systems.
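The footprint claim above can be sanity-checked with quick arithmetic. This is a rough sketch: the ~5% overhead factor for scales/norms/buffers is an assumption for illustration, not a figure from the model card.

```python
# Rough weight-memory estimate for a 128B-parameter dense model stored in FP8.
# FP8 uses 1 byte per parameter; scales, norms, and buffers add a little extra.

PARAMS = 128e9          # parameters (dense, from the model card)
FP8_BYTES = 1           # bytes per parameter in FP8
OVERHEAD = 1.05         # assumed ~5% overhead (illustrative)

weights_gb = PARAMS * FP8_BYTES * OVERHEAD / 1e9
print(f"FP8 weights: ~{weights_gb:.0f} GB")

# An 8-GPU H200 node offers 8 x 141 GB = 1128 GB of HBM, and
# 2 x H100 nodes offer 16 x 80 GB = 1280 GB, so the weights fit
# with headroom left for KV cache and activations.
assert weights_gb < 8 * 141
assert weights_gb < 16 * 80
```

A BF16 copy of the same weights would need roughly twice the memory, which is why the native FP8 checkpoint is what makes the single-node fit possible.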
| | |
|---|---|
| Task | Image-Text-to-Text |
| Architecture | Dense transformer |
| Parameters | 128B (dense, FP8 on disk) |
| Context Window | 256k tokens |
| Languages | 40+ (English, French, Spanish, German, Russian, Chinese, Japanese, Italian, Portuguese, Arabic, Hindi, Korean, plus Indic / Nordic / Eastern European tail) |
| License | Modified MIT (open-weights, ≤ $20M annual revenue threshold) |
| HF Org | |
Architecture#
Mistral Medium 3.5 is a dense transformer — no MoE routing — built on
the same text backbone as
mistralai/Devstral-2-123B-Instruct-2512:
88 Ministral-3 decoder layers (hidden 12288, 96 attention heads,
8 KV heads, GQA) with the standard llama-style RoPE + RMSNorm + SwiGLU
MLP layout. The multimodal variant adds a Pixtral vision tower and
multi-modal projector on top, making it an
AutoModelForImageTextToText checkpoint.
Compared with MoE models of similar capability, the dense layout trades sparse-activation throughput for a substantially smaller deployment footprint — relevant when you want to fine-tune or serve the model on a single node.
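To make the GQA numbers above concrete, here is a quick KV-cache sizing sketch using the quoted dimensions (88 layers, hidden 12288, 96 query heads, 8 KV heads). A BF16 cache is assumed; a serving stack may keep the cache in FP8 instead.

```python
# KV-cache size per token = 2 (K and V) x layers x kv_heads x head_dim x bytes
LAYERS, HIDDEN, Q_HEADS, KV_HEADS = 88, 12288, 96, 8
HEAD_DIM = HIDDEN // Q_HEADS        # 128
BYTES = 2                           # BF16 (assumed)

def kv_bytes_per_token(n_kv_heads):
    return 2 * LAYERS * n_kv_heads * HEAD_DIM * BYTES

gqa = kv_bytes_per_token(KV_HEADS)  # with 8 KV heads (GQA)
mha = kv_bytes_per_token(Q_HEADS)   # hypothetical full multi-head cache

ctx = 256 * 1024                    # the 256k-token window
print(f"GQA KV cache @256k: {gqa * ctx / 2**30:.0f} GiB")
print(f"Full-MHA cache would be {mha // gqa}x larger")
```

With GQA the full-context cache lands at 88 GiB under these assumptions; with one KV head per query head it would be 12x that, which is what makes the 256k window practical on the node sizes quoted above.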
Key Strengths#
- **Compactness.** Dense 128B fits in fewer GPUs than the comparable MoE class — a single H200 node or 2 × H100 nodes for inference.
- **Configurable reasoning mode.** One checkpoint covers chat, agentic, and reasoning workloads; the reasoning mode is toggled at inference time.
- **Strong agentic performance.** Competitive on tool-use and decision-making benchmarks; suitable as a base for connector-driven agent workflows.
- **Long context.** 256k-token window for document parsing and research-assistant use cases.
- **Trade-offs** (disclosed in the model card): weaker non-agentic benchmark performance and more verbose outputs than some closed-source competitors.
Use Cases#
- Agentic workflows with connectors
- Cloud and local async coding
- Document parsing (multimodal — text + image)
- Research assistants
- General chat
- Base model for downstream fine-tuning
Available Models#
- Mistral-Medium-3.5 128B
Class#
- HF: `Mistral3ForConditionalGeneration`
- NeMo AutoModel custom: `Mistral3FP8VLMForConditionalGeneration` (source)
The custom class extends HF’s Mistral3ForConditionalGeneration and
attaches a Mistral3FP8StateDictAdapter.for_vlm_full() so the FP8
checkpoint dequantizes per-shard inside the standard DCP load — the
full BF16 model is never materialized on a single rank, allowing TP+PP
training to fit on H100-80GB.
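As an illustration of the per-tensor `weight_scale_inv` scheme the checkpoint uses, here is a toy quantize/dequantize roundtrip in plain Python. The real adapter operates on torch float8 tensors shard-by-shard; this sketch only mimics the scaling math and range clipping, not FP8 mantissa rounding.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in float8_e4m3

def quantize_per_tensor(weights):
    """Map a whole tensor onto the FP8 range with a single scale (per-tensor)."""
    amax = max(abs(w) for w in weights)
    scale = FP8_E4M3_MAX / amax
    weight_scale_inv = 1.0 / scale       # this is what the checkpoint stores
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w * scale)) for w in weights]
    return q, weight_scale_inv

def dequantize_per_tensor(q, weight_scale_inv):
    """Recover the weights (up to FP8 precision): W ~= W_q * weight_scale_inv."""
    return [v * weight_scale_inv for v in q]

w = [1.0, -2.0, 0.5]
q, s_inv = quantize_per_tensor(w)
print(dequantize_per_tensor(q, s_inv))
```

Because dequantization is a single multiply per tensor, it can be applied shard-by-shard during the DCP load, which is why no rank ever needs the full BF16 model in memory.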
Example HF Models#
Model |
HF ID |
|---|---|
Mistral Medium 3.5 128B |
Example Recipes#
| Recipe | Dataset | Description |
|---|---|---|
| | MedPix-VQA | SFT — Mistral Medium 3.5 128B on MedPix, 8 nodes × 8 GPUs (TP=8 PP=8) |
Try with NeMo AutoModel#
1. Install (full instructions):

```bash
pip install nemo-automodel
```
2. Clone the repo to get the example recipes:

```bash
git clone https://github.com/NVIDIA-NeMo/Automodel.git
cd Automodel
```
Note
This recipe was validated on 8 nodes × 8 GPUs (64 H100s) with TP=8 PP=8 DP=1. See the Launcher Guide for multi-node setup. Inference / single-node fine-tune fits in 1 × H200 or 2 × H100 nodes thanks to the dense + FP8 layout.
3. Run the recipe via Slurm (see the fine-tuning guide for a complete launch script):

```bash
sbatch your_slurm_script.sub
```
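For intuition on why the 64-GPU TP=8 × PP=8 layout fits on H100-80GB, a rough per-rank memory estimate follows. This is a sketch: the Adam-style optimizer-state sizes are assumptions, and activation and KV memory are ignored.

```python
# Per-rank parameter shard under TP=8 x PP=8 (DP=1), 64 ranks total.
PARAMS = 128e9
RANKS = 8 * 8
per_rank = PARAMS / RANKS             # 2e9 params per rank

bf16_weights = per_rank * 2 / 1e9     # dequantized BF16 shard
grads = per_rank * 2 / 1e9            # BF16 gradients
# Assumed Adam-style state: FP32 master weights + two FP32 moments = 12 B/param
optim_state = per_rank * 12 / 1e9

total = bf16_weights + grads + optim_state
print(f"~{total:.0f} GB/rank before activations")
```

Under these assumptions that is roughly 32 GB per rank, leaving the rest of an 80 GB H100 for activations and workspace.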
Run with Docker#
1. Pull the container and mount a checkpoint directory:

```bash
docker run --gpus all -it --rm \
  --shm-size=8g \
  -v $(pwd)/checkpoints:/opt/Automodel/checkpoints \
  nvcr.io/nvidia/nemo-automodel:26.02.00
```
2. Navigate to the AutoModel directory (where the recipes are):

```bash
cd /opt/Automodel
```
3. Run the recipe:

```bash
automodel --nproc-per-node=8 examples/vlm_finetune/mistral3p5/mistral3p5_128b_medpix.yaml
```
See the Installation Guide and the Mistral Medium 3.5 Fine-Tuning Guide.
Fine-Tuning#
See the Mistral Medium 3.5 Fine-Tuning Guide.
Hugging Face Model Cards#
Related architecture:

- mistralai/Devstral-2-123B-Instruct-2512