Fine-Tune Gemma 4 31B on CORD-v2 Receipts — End-to-End Guide
A step-by-step guide for fine-tuning Gemma 4 31B to extract structured receipt data from scanned images using NeMo AutoModel.
What is Gemma 4 31B?
Gemma 4 31B is a dense vision-language model with a 60-layer transformer decoder, SigLIP vision encoder, and support for multimodal inputs (images, audio, text).
Key architectural details:
- Mixed attention: sliding window (512 tokens) + full attention (every 6th layer)
- 32 attention heads, 16 KV heads (GQA)
- Hidden dim 5376, vocab size 262,144
- bfloat16, final logit softcapping at 30.0
- Thinking-channel support (<|channel>thought\n<channel|> prefix)
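To make the "final logit softcapping at 30.0" item concrete, here is a minimal sketch of the usual tanh-based softcap. It assumes Gemma 4 uses the same formula as earlier Gemma releases (cap * tanh(logit / cap)); the function name `softcap` is illustrative and not taken from the model code.

```python
import math


def softcap(logit: float, cap: float = 30.0) -> float:
    """Squash a raw logit smoothly into the range [-cap, cap].

    Assumption: the cap is applied as cap * tanh(logit / cap).
    """
    return cap * math.tanh(logit / cap)


# Small logits pass through almost unchanged; large ones saturate near the cap.
for raw in (0.5, 30.0, 300.0):
    print(f"{raw:>6} -> {softcap(raw):.3f}")
```

The practical effect is that no single logit can dominate the softmax by an unbounded margin, which tends to keep the loss numerically stable in bfloat16.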
The Task
We fine-tune Gemma 4 31B on the CORD-v2 (Consolidated Receipt Dataset) to extract structured fields from scanned receipts:
The base model produces free-form descriptions. After fine-tuning, it outputs structured XML-like token sequences matching the receipt fields.
Guide Overview
Hardware Requirements
- 8x A100 80 GB (or 8x H100) GPUs are required to train the 31B model with FSDP2 + activation checkpointing
- Estimated training time: ~45 min on 8x H100 (800 training samples, 500 steps)
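As a rough illustration of why a single GPU is not enough, the back-of-envelope estimate below adds up model-state memory and divides it across 8 GPUs. Every number in it is an assumption: exactly 31e9 parameters, full fine-tuning with bf16 weights and gradients, and two fp32 Adam moments per parameter. Activations, the vision encoder's buffers, and framework overhead are ignored, so real usage will be higher.

```python
# Assumed setup; none of these values come from the training config.
PARAMS = 31e9
NUM_GPUS = 8
BYTES_PER_PARAM = {
    "bf16 weights": 2,
    "bf16 gradients": 2,
    "fp32 Adam moments (two per parameter)": 8,
}

total_gb = PARAMS * sum(BYTES_PER_PARAM.values()) / 1e9
per_gpu_gb = total_gb / NUM_GPUS  # FSDP shards model state evenly across ranks

print(f"model state, all GPUs: {total_gb:.1f} GB")
print(f"model state, per GPU:  {per_gpu_gb:.1f} GB")
```

Under these assumptions, the sharded model state alone takes more than half of an 80 GB card, which is why activation checkpointing is used to keep activation memory small.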
Step 0 — Environment Setup
This guide runs inside the NeMo AutoModel Docker container:
Note: Gemma 4 requires a transformers version that includes the model implementation. Make sure a compatible transformers release is installed before you continue.
Explore the CORD-v2 Dataset
CORD-v2 is a Consolidated Receipt Dataset for Post-OCR Parsing containing scanned receipts with structured ground-truth JSON labels.
Expected output:
Target Format: JSON-to-Token Conversion
NeMo AutoModel converts structured JSON into an XML-like token sequence using
the json2token() function. This is the format the model is trained to produce:
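For intuition, the following is a minimal sketch of this kind of conversion, not NeMo AutoModel's actual json2token() implementation. It assumes Donut-style conventions: each key becomes a `<s_key>...</s_key>` pair and list items are joined with `<sep/>`. The function name `json_to_tokens`, the key order, and the sample receipt are all hypothetical.

```python
def json_to_tokens(obj) -> str:
    """Flatten nested JSON into an XML-like token string (illustrative only)."""
    if isinstance(obj, dict):
        # Each key wraps its (recursively converted) value in open/close tags.
        return "".join(
            f"<s_{key}>{json_to_tokens(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        # List items are separated by a dedicated separator token.
        return "<sep/>".join(json_to_tokens(item) for item in obj)
    return str(obj)


# Hypothetical receipt, not taken from CORD-v2.
receipt = {
    "menu": [
        {"nm": "Latte", "price": "4.50"},
        {"nm": "Bagel", "price": "3.00"},
    ],
    "total": {"total_price": "7.50"},
}
print(json_to_tokens(receipt))
```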
Expected output:
Evaluate the Base Model (Before Fine-Tuning)
Load the pretrained Gemma 4 31B model and run it on receipt images. The base model will produce free-form descriptions instead of structured token sequences.
Expected base model output (receipt image):
Example base model prediction (free-form, not structured):
The base model produces readable descriptions but not the structured token format we need. Fine-tuning teaches it to output <s_menu><s_nm>...</s_nm>... sequences.
Configure Training
YAML Config
You can save the YAML below as gemma4_31b_cord_v2.yaml to train on the CORD-v2 dataset.
Why gemma4_prefix_collate_fn?
Gemma 4 31B instruction-tuned models always emit a thinking-channel prefix
(<|channel>thought\n<channel|>) before the actual response. When this prefix
is absent from the training sequences, the model predicts <|channel> at the
first response position while the label holds the first answer token, which
inflates the initial loss to ~9. The gemma4_prefix_collate_fn injects this prefix
(masked as -100 in the labels, so the model is not penalized for it) and brings
the initial loss down to ~3.
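The idea can be sketched on plain Python lists. This is not the gemma4_prefix_collate_fn source: the helper name `inject_prefix`, the way the answer start is located, and the token IDs are all made up for illustration. The only convention assumed is the common one that label positions set to -100 are ignored by the loss.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def inject_prefix(input_ids, labels, prefix_ids):
    """Insert prefix_ids right before the first supervised token.

    The prefix is visible to the model as input but masked in the labels,
    so it contributes context without contributing to the loss.
    """
    # The first label that is not masked marks the start of the answer.
    start = next(
        (i for i, label in enumerate(labels) if label != IGNORE_INDEX),
        len(labels),
    )
    new_ids = input_ids[:start] + prefix_ids + input_ids[start:]
    new_labels = labels[:start] + [IGNORE_INDEX] * len(prefix_ids) + labels[start:]
    return new_ids, new_labels


# Made-up IDs: three prompt tokens, two answer tokens, three prefix tokens.
ids, labels = inject_prefix(
    input_ids=[1, 2, 3, 10, 11],
    labels=[IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 10, 11],
    prefix_ids=[900, 901, 902],
)
print(ids)
print(labels)
```

With the prefix in place, the first supervised position follows the same context the model sees at inference time, so its prediction and the label no longer disagree by construction.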
Launch Fine-Tuning
What to Watch
- Loss drops rapidly from ~0.73 to ~0.04 in the first 50 steps, then stabilizes around 0.005
- Validation loss reaches ~0.018 by step 199 (best checkpoint)
- Training takes ~15 min on 8x H100 (300 steps, 800 training samples)
Training Log
Checkpoints Saved
Tip: The LOWEST_VAL symlink points to the checkpoint with the best validation loss. Use this checkpoint for inference evaluation.
Evaluate the Fine-Tuned Model
Export and Load a Consolidated Checkpoint with HF AutoModelForMultimodalLM
Because the config uses save_consolidated: final, the final checkpoint includes
model/consolidated/. Earlier checkpoints store sharded safetensors plus a
generated helper; run the helper for an earlier checkpoint you want to evaluate:
The helper writes an HF-compatible model/consolidated/ directory. Use HF’s
AutoModelForMultimodalLM for inference (generation), and load the processor
from the base model path.
Fine-Tuned Output (Test Sample 1 — Perfect NED=0.0)
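The NED score in the heading above can be read as a normalized edit distance, where 0.0 means the prediction matches the reference exactly. The sketch below assumes one common definition (Levenshtein distance divided by the length of the longer string); the evaluation code used for this guide may normalize differently, and both function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, target: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means an exact match."""
    longest = max(len(pred), len(target))
    return 0.0 if longest == 0 else levenshtein(pred, target) / longest


print(normalized_edit_distance("<s_nm>Latte</s_nm>", "<s_nm>Latte</s_nm>"))
```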
Parse the Structured Output
You can convert the token sequence back to a structured dict:
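One way to do this is sketched below. It is a simplified parser written for this guide's `<s_key>...</s_key>` and `<sep/>` conventions, not the helper shipped with NeMo AutoModel; the names `split_top_level` and `tokens_to_obj` are hypothetical. It assumes well-formed output in which a key is never nested inside a field of the same name.

```python
import re

TAG = re.compile(r"<(/?)s_(\w+)>|<sep/>")
FIELD = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def split_top_level(seq: str) -> list:
    """Split on <sep/> tokens that are not nested inside any <s_*> field."""
    parts, depth, start = [], 0, 0
    for match in TAG.finditer(seq):
        if match.group(0) == "<sep/>":
            if depth == 0:
                parts.append(seq[start:match.start()])
                start = match.end()
        elif match.group(1):  # closing tag such as </s_nm>
            depth -= 1
        else:                 # opening tag such as <s_nm>
            depth += 1
    parts.append(seq[start:])
    return parts


def tokens_to_obj(seq: str):
    """Turn a token sequence into nested dicts, lists, and strings."""
    parts = split_top_level(seq)
    if len(parts) > 1:
        return [tokens_to_obj(part) for part in parts]
    fields = FIELD.findall(seq)
    if not fields:
        return seq.strip()  # leaf value
    return {key: tokens_to_obj(value) for key, value in fields}


# Hypothetical sequence, not an actual model output.
sequence = (
    "<s_menu><s_nm>Latte</s_nm><s_price>4.50</s_price><sep/>"
    "<s_nm>Bagel</s_nm><s_price>3.00</s_price></s_menu>"
    "<s_total><s_total_price>7.50</s_total_price></s_total>"
)
print(tokens_to_obj(sequence))
```

Note that a field holding a single item comes back as a dict rather than a one-element list, so downstream code may need to normalize that case. Malformed generations (unclosed tags, for example) are not handled here.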
Example parsed output (test sample 4):