Fine-Tuning Gemma 4 31B on CORD-v2 Receipts — End-to-End Guide#
A step-by-step guide for fine-tuning Gemma 4 31B to extract structured receipt data from scanned images using NeMo Automodel.
What is Gemma 4 31B?#
Gemma 4 31B is a dense vision-language model with a 60-layer transformer decoder, a SigLIP vision encoder, and support for multimodal inputs (images, audio, and text).
Key architectural details:

- Mixed attention: sliding window (512 tokens) + full attention (every 6th layer)
- 32 attention heads, 16 KV heads (GQA)
- Hidden dim 5376, vocab size 262,144
- bfloat16, final logit softcapping at 30.0
- Thinking-channel support (`<|channel>thought\n<channel|>` prefix)
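For intuition, logit softcapping squashes the final logits smoothly into a bounded range instead of hard-clipping them. A minimal sketch (the function name is illustrative, not the model's actual code):

```python
import math

def softcap(logit: float, cap: float = 30.0) -> float:
    """Smoothly bound a logit to [-cap, cap] via tanh scaling."""
    return cap * math.tanh(logit / cap)
```

Near zero the function is approximately the identity (`softcap(1.0)` ≈ 0.9996), while extreme logits saturate near ±30, which keeps the output distribution well-behaved.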
The Task#
We fine-tune Gemma 4 31B on the CORD-v2 (Consolidated Receipt Dataset) to extract structured fields from scanned receipts:
| Field | Example |
|---|---|
| menu | Item names, quantities, unit prices, sub-totals |
| sub_total | Subtotal details (subtotal price, discount, tax, etc.) |
| total | Total price, cash price, change price, etc. |
| void_menu | Voided items (if any) |
The base model produces free-form descriptions. After fine-tuning, it outputs structured XML-like token sequences matching the receipt fields.
Guide Overview#
| Step | Description |
|---|---|
| Step 0 | Environment setup |
| Step 1 | Explore the CORD-v2 dataset |
| Step 2 | Evaluate the base model (before fine-tuning) |
| Step 3 | Training configuration |
| Step 4 | Launch fine-tuning |
| Step 5 | Evaluate the fine-tuned model |
| Step 6 | Compare results |
Hardware Requirements#
- 8x A100 80 GB (or 8x H100) GPUs required for the 31B model with FSDP2 + activation checkpointing
- Estimated training time: ~45 min on 8x H100 (800 training samples, 500 steps)
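For rough intuition on why FSDP2 sharding makes a 31B model fit on 80 GB GPUs, here is back-of-the-envelope model-state arithmetic (assumptions: bf16 weights and gradients, fp32 AdamW moments; activations and buffers are not counted):

```python
# Rough per-GPU model-state estimate for a 31B-parameter model under FSDP2.
# Assumptions: bf16 weights (2 B) and gradients (2 B), fp32 AdamW moments (4 B + 4 B).
params = 31e9
bytes_per_param = 2 + 2 + 4 + 4            # weights + grads + Adam m + Adam v
total_gb = params * bytes_per_param / 1e9  # ~372 GB of model states in total
per_gpu_gb = total_gb / 8                  # FSDP2 shards model states across 8 GPUs
print(f"{per_gpu_gb:.1f} GB/GPU")          # ~46.5 GB/GPU before activations
```

This leaves headroom on an 80 GB GPU for activations, which activation checkpointing keeps small.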
Step 0 — Environment Setup#
This guide runs inside the NeMo Automodel Docker container:
```bash
docker run -it --rm --gpus all --ipc=host --network host \
  -v $(pwd):/workspace \
  nvcr.io/nvidia/nemo-automodel:26.02

# Inside the container:
huggingface-cli login   # needed for gated model access
cd /opt/Automodel
```
Note: Gemma 4 requires a transformers version that includes the model implementation. Make sure a compatible transformers release is installed.
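A quick way to sanity-check this is to compare the installed transformers version against whatever minimum the Gemma 4 model card lists. The helper below is an illustrative sketch (the version strings in the usage note are placeholders, not documented minimums):

```python
def version_at_least(installed: str, required: str) -> bool:
    """Compare dotted version strings numerically, e.g. '4.51.0' >= '4.50.0'."""
    def parse(v: str):
        # Keep only leading numeric components ('4.51.0.dev0' -> (4, 51, 0)).
        return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return parse(installed) >= parse(required)
```

In practice you would call it with `transformers.__version__` and the minimum version listed on the model card.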
Step 1 — Explore the CORD-v2 Dataset#
CORD-v2 is a Consolidated Receipt Dataset for Post-OCR Parsing containing scanned receipts with structured ground-truth JSON labels.
```python
import json
from datasets import load_dataset

dataset = load_dataset("naver-clova-ix/cord-v2")
print(f"Train      : {len(dataset['train'])} samples")
print(f"Validation : {len(dataset['validation'])} samples")
print(f"Test       : {len(dataset['test'])} samples")

# Inspect a sample
ex = dataset["train"][0]
gt = json.loads(ex["ground_truth"])["gt_parse"]
print(f"\nGround-truth keys: {list(gt.keys())}")
for key in gt:
    if isinstance(gt[key], list):
        print(f"\n  {key} ({len(gt[key])} items):")
        for item in gt[key][:2]:
            print(f"    {item}")
    else:
        print(f"\n  {key}: {gt[key]}")
```
Expected output:
```
Train      : 800 samples
Validation : 100 samples
Test       : 100 samples

Ground-truth keys: ['menu', 'sub_total', 'total', 'void_menu']

  menu (7 items):
    {'nm': 'ABRA KADABRA FLAME GRILLED', 'num': '1', 'unitprice': '39,000', 'cnt': '1', 'price': '39,000'}
    {'nm': 'Lemon Tea', 'num': '1', 'unitprice': '7,000', 'cnt': '1', 'price': '7,000'}

  sub_total: {'subtotal_price': '87,000', 'discount_price': '0', 'tax_price': '7,909'}

  total: {'total_price': '87,000', 'cashprice': '100,000', 'changeprice': '13,000'}

  void_menu: []
```
Target format: JSON-to-token conversion#
NeMo Automodel converts structured JSON into an XML-like token sequence using
the json2token() function. This is the format the model is trained to produce:
```python
from nemo_automodel.components.datasets.vlm.utils import json2token

token_seq = json2token(gt, sort_json_key=True)
print(f"Token sequence (first 300 chars):\n  {token_seq[:300]}...")
print(f"\nTotal length: {len(token_seq)} chars")
```
Expected output:
```
Token sequence (first 300 chars):
  <s_menu><s_cnt>1</s_cnt><s_nm>ABRA KADABRA FLAME GRILLED</s_nm><s_num>1</s_num>
  <s_price>39,000</s_price><s_unitprice>39,000</s_unitprice><sep/><s_cnt>1</s_cnt>
  <s_nm>Lemon Tea</s_nm><s_num>1</s_num><s_price>7,000</s_price><s_unitprice>7,000
  </s_unitprice><sep/>...

Total length: 827 chars
```
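Under the hood this is essentially a recursive dict-to-tag conversion. A minimal sketch that reproduces the format above (an illustrative reimplementation, not the library code):

```python
def json2token_sketch(obj, sort_json_key: bool = True) -> str:
    """Recursively render a JSON object as <s_key>...</s_key> tags,
    joining list items with <sep/>."""
    if isinstance(obj, dict):
        keys = sorted(obj) if sort_json_key else obj
        return "".join(
            f"<s_{k}>{json2token_sketch(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        return "<sep/>".join(json2token_sketch(v, sort_json_key) for v in obj)
    return str(obj)
```

With `sort_json_key=True` the keys come out alphabetically, which matches the `cnt`/`nm`/`num`/`price`/`unitprice` ordering in the sample above.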
Step 2 — Evaluate the Base Model (Before Fine-Tuning)#
Load the pretrained Gemma 4 31B model and run it on receipt images. The base model will produce free-form descriptions instead of structured token sequences.
```python
import os
import json
import torch
from transformers import AutoProcessor
from nemo_automodel import NeMoAutoModelForImageTextToText
from nemo_automodel.components.datasets.vlm.utils import json2token
from datasets import load_dataset


# --- Helpers ---
def compute_ned(pred: str, target: str) -> float:
    """Normalized Edit Distance (0 = perfect match, 1 = completely different)."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 0.0
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            tmp = dp[j]
            dp[j] = prev if pred[i - 1] == target[j - 1] else 1 + min(dp[j], dp[j - 1], prev)
            prev = tmp
    return dp[n] / max(m, n)


def run_gemma4_inference(model, processor, pil_image, prompt="Describe this image.",
                         max_new_tokens=1024):
    """Run Gemma 4 inference on a single image."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": pil_image},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated_text = processor.decode(outputs[0], skip_special_tokens=True)
    prompt_length = len(processor.decode(inputs["input_ids"][0], skip_special_tokens=True))
    return generated_text[prompt_length:].strip()


def evaluate_receipts(model, processor, test_dataset, n_samples=20):
    """Evaluate model on receipt test set, return avg NED and per-sample results."""
    model.eval()
    results = []
    n = min(n_samples, len(test_dataset))
    for i in range(n):
        ex = test_dataset[i]
        gt = json.loads(ex["ground_truth"])["gt_parse"]
        target = json2token(gt, sort_json_key=True)
        pred = run_gemma4_inference(model, processor, ex["image"])
        ned = compute_ned(pred, target)
        results.append({"idx": i, "ned": ned, "pred": pred, "target": target, "gt": gt})
        print(f"  Sample {i:2d}: NED = {ned:.4f}")
    avg_ned = sum(r["ned"] for r in results) / len(results)
    print(f"\n  Average NED: {avg_ned:.4f}")
    return avg_ned, results


# --- Load base model ---
MODEL_PATH = "google/gemma-4-31B-it"
processor = AutoProcessor.from_pretrained(MODEL_PATH)
base_model = NeMoAutoModelForImageTextToText.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    use_liger_kernel=True,
    attn_implementation="flash_attention_2",
    text_config={"use_cache": False},
).eval().to("cuda")
print(f"Parameters: {sum(p.numel() for p in base_model.parameters()):,}")

# --- Evaluate ---
dataset = load_dataset("naver-clova-ix/cord-v2")
print("\nEvaluating base model on receipt test set:")
base_avg_ned, base_results = evaluate_receipts(base_model, processor, dataset["test"])
```
Expected base model output (receipt image):
```
  Sample  0: NED = 0.8734
  Sample  1: NED = 0.9012
  ...

  Average NED: 0.8850
```
Example base model prediction (free-form, not structured):
```
The image shows a receipt from a restaurant. The total amount is 87,000 with items
including ABRA KADABRA FLAME GRILLED for 39,000 and Lemon Tea for 7,000...
```
The base model produces readable descriptions but not the structured token format we need. Fine-tuning teaches it to output `<s_menu><s_nm>...</s_nm>...` sequences.
Step 3 — Training Configuration#
YAML config#
Save the YAML below as gemma4_31b_cord_v2.yaml to fine-tune on the CORD-v2 dataset.
```yaml
step_scheduler:
  global_batch_size: 8
  local_batch_size: 1
  ckpt_every_steps: 100
  val_every_steps: 100
  max_steps: 500

dist_env:
  backend: nccl
  timeout_minutes: 60

model:
  _target_: nemo_automodel.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: google/gemma-4-31B-it
  torch_dtype: torch.bfloat16
  use_liger_kernel: true
  use_sdpa_patching: false
  attn_implementation: flash_attention_2
  text_config:
    use_cache: false

checkpoint:
  enabled: true
  checkpoint_dir: vlm_checkpoints/gemma4_31b_cord_v2/
  model_save_format: safetensors
  save_consolidated: true

distributed:
  strategy: fsdp2
  activation_checkpointing: true

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
  path_or_dataset: naver-clova-ix/cord-v2
  split: train

dataloader:
  collate_fn:
    _target_: nemo_automodel.components.datasets.vlm.collate_fns.gemma4_prefix_collate_fn

validation_dataset:
  _target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
  path_or_dataset: naver-clova-ix/cord-v2
  split: validation

optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-5
  weight_decay: 0.01
  betas: [0.9, 0.95]

lr_scheduler:
  lr_decay_style: cosine

freeze_config:
  freeze_embeddings: true
  freeze_vision_tower: true
  freeze_audio_tower: true
  freeze_language_model: false
```
Why gemma4_prefix_collate_fn?#
Gemma 4 31B instruction-tuned models always emit a thinking-channel prefix
(<|channel>thought\n<channel|>) before the actual response. When this prefix
is absent from training sequences, the model predicts <|channel> but the label
says answer text, inflating initial loss to ~9. The gemma4_prefix_collate_fn
injects this prefix (masked as -100 in labels so the model is not penalized for it)
and brings initial loss down to ~3.
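The masking idea can be illustrated in isolation. A minimal sketch of labeling with the prefix masked out (`IGNORE_INDEX` and the helper are illustrative, not the collate function's real signature):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the cross-entropy loss

def mask_prefix_labels(input_ids: list[int], prefix_len: int) -> list[int]:
    """Copy input_ids as labels, masking the first prefix_len (injected) tokens."""
    labels = list(input_ids)
    for i in range(min(prefix_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels
```

The model still sees the prefix tokens as input, but incurs no loss on them, so the channel tokens stop fighting the answer tokens at the start of training.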
Step 4 — Launch Fine-Tuning#
```bash
torchrun --nproc-per-node=8 \
  examples/vlm_finetune/finetune.py \
  -c gemma4_31b_cord_v2.yaml \
  2>&1 | tee logs/train_gemma4_31b_cord_v2.log
```
What to watch#
- Loss drops rapidly from ~0.73 to ~0.04 in the first 50 steps, then stabilizes around 0.005
- Validation loss reaches ~0.018 by step 199 (best checkpoint)
- Training takes ~15 min on 8x H100 (300 steps, 800 training samples)
Training log#
```
step 0   | loss 0.7350 | grad_norm 35.65 | lr 1.18e-06 | mem 60.90 GiB | tps/gpu 45
step 10  | loss 0.5489 | grad_norm 26.19 | lr 2.98e-06 | mem 40.36 GiB | tps/gpu 425
step 20  | loss 0.1455 | grad_norm 10.53 | lr 4.78e-06 | mem 40.42 GiB | tps/gpu 438
step 50  | loss 0.0406 | grad_norm 27.16 | lr 1.00e-05 | mem 40.34 GiB | tps/gpu 377
step 100 | loss 0.0148 | grad_norm 7.02  | lr 9.70e-06 | mem 40.36 GiB | tps/gpu 449
step 200 | loss 0.0065 | grad_norm 2.28  | lr 7.52e-06 | mem 40.44 GiB | tps/gpu 441
step 300 | loss 0.0041 | grad_norm 2.10  | lr 3.16e-06 | mem 40.53 GiB | tps/gpu 448
```

Validation:

```
step 99  | val_loss 0.0225
step 199 | val_loss 0.0183   <-- LOWEST_VAL (best checkpoint)
step 299 | val_loss 0.0192
```
Checkpoints saved#
```
vlm_checkpoints/gemma4_31b_cord_v2/
  epoch_0_step_99/
  epoch_0_step_199/
  epoch_0_step_299/
    model/
      consolidated/                 <-- HF-compatible checkpoint for inference
        config.json
        model.safetensors.index.json
        model-00001-of-00013.safetensors
        ...
    optim/
    rng/
    dataloader/
  LATEST -> epoch_0_step_299
  LOWEST_VAL -> epoch_0_step_199
  training.jsonl                    <-- per-step metrics
  validation.jsonl                  <-- per-validation metrics
```
Tip: The LOWEST_VAL symlink points to the checkpoint with the best validation loss. Use it for inference evaluation.
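Since validation.jsonl stores per-validation metrics, you can also recover the best step programmatically. A sketch assuming each line carries `step` and `val_loss` fields (the exact JSONL schema is an assumption):

```python
import json

def best_val_step(jsonl_lines):
    """Return (step, val_loss) for the record with the lowest validation loss."""
    best = None
    for line in jsonl_lines:
        rec = json.loads(line)
        if best is None or rec["val_loss"] < best[1]:
            best = (rec["step"], rec["val_loss"])
    return best
```

Fed the three validation records shown above, this would return `(199, 0.0183)`, matching the LOWEST_VAL symlink.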
Step 5 — Evaluate the Fine-Tuned Model#
Load consolidated checkpoint with HF AutoModelForMultimodalLM#
Because we set save_consolidated: true in the config, each checkpoint contains
an HF-compatible model/consolidated/ directory. Use HF’s AutoModelForMultimodalLM
for inference (generation), and load the processor from the base model path.
```python
import json
import os
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForMultimodalLM
from nemo_automodel.components.datasets.vlm.utils import json2token

# Paths
BASE_MODEL = "google/gemma-4-31B-it"
CKPT_DIR = "vlm_checkpoints/gemma4_31b_cord_v2"
best_ckpt = os.path.realpath(os.path.join(CKPT_DIR, "LOWEST_VAL"))
consolidated = os.path.join(best_ckpt, "model", "consolidated")

# Load processor from base model, model from fine-tuned checkpoint
processor = AutoProcessor.from_pretrained(BASE_MODEL)
model = AutoModelForMultimodalLM.from_pretrained(
    consolidated,
    dtype=torch.bfloat16,
    device_map="auto",
).eval()

# Evaluate on test set (reusing evaluate_receipts from Step 2)
dataset = load_dataset("naver-clova-ix/cord-v2")
print("Evaluating fine-tuned model:")
ft_avg_ned, ft_results = evaluate_receipts(model, processor, dataset["test"])
```
Fine-tuned output (test sample 1 – perfect NED=0.0)#
```
<s_total><s_total_price>91000</s_total_price><s_cashprice>91000</s_cashprice>
</s_total><s_menu><s_price>17500</s_price><s_nm>J.STB PROMO</s_nm><sep/>
<s_price>46000</s_price><s_nm>Y.B.BAT</s_nm><sep/><s_price>27500</s_price>
<s_nm>Y.BASO PROM</s_nm></s_menu>
```
Parsing the structured output#
You can convert the token sequence back to a structured dict:
```python
import json
import re

def token2json(token_seq):
    """Convert a token sequence back to a JSON-like dict."""
    result = {}
    pattern = r"<s_(\w+)>(.*?)</s_\1>"
    matches = re.findall(pattern, token_seq, re.DOTALL)
    for key, value in matches:
        if "<sep/>" in value:
            items = value.split("<sep/>")
            result[key] = [token2json(item) if "<s_" in item else item for item in items]
        elif "<s_" in value:
            result[key] = token2json(value)
        else:
            result[key] = value
    return result

# Parse the prediction from the first test sample
parsed = token2json(ft_results[0]["pred"])
print(json.dumps(parsed, indent=2))
```
Example parsed output (test sample 4):
```json
{
  "total": {"total_price": "174,600", "changeprice": "25,400", "cashprice": "200,000"},
  "sub_total": {"subtotal_price": "194,000", "discount_price": "19,400"},
  "menu": [
    {"price": "82,000", "nm": "ICE BLACKCOFFE"},
    {"price": "44,000", "nm": "C.Capuccino (L)"},
    {"price": "30,000", "nm": "C.WHITE COFFE"},
    {"price": "38,000", "nm": "C.Capuccino (L)"}
  ]
}
```
Step 6 — Results Comparison#
Metrics (20 test samples)#
| Metric | Fine-Tuned (epoch_0_step_199) |
|---|---|
| Average NED | 0.0601 |
| Field-Level Accuracy | 92.6% |
| Perfect matches (NED=0.0) | 10/20 (50%) |
| Near-perfect (NED<0.05) | 14/20 (70%) |

For reference, the base model scored an average NED of 0.8850 on the same 20 test samples (Step 2).
Field-level extraction accuracy (actual)#
| Field | Correct / Total | Accuracy |
|---|---|---|
| total_price | 18 / 19 | 94.7% |
| subtotal_price | 13 / 14 | 92.9% |
| tax_price | 7 / 8 | 87.5% |
| cashprice | 13 / 15 | 86.7% |
| changeprice | 12 / 12 | 100.0% |
| OVERALL | 63 / 68 | 92.6% |
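Numbers like these can be computed by comparing the parsed predictions against ground truth per field path. A minimal sketch (the helper name and the pairing of dicts are illustrative; it covers flat sections like `total` and `sub_total`, not the nested `menu` list):

```python
def field_accuracy(pairs, path):
    """Count matches for a 'section.field' path over (pred, gt) dict pairs.

    Only receipts whose ground truth contains the field count toward the total.
    """
    section, field = path.split(".")
    correct = total = 0
    for pred, gt in pairs:
        if section in gt and field in gt[section]:
            total += 1
            if pred.get(section, {}).get(field) == gt[section][field]:
                correct += 1
    return correct, total
```

For example, pairing each parsed prediction with its ground-truth dict and calling `field_accuracy(pairs, "total.total_price")` yields the correct/total counts behind the table's accuracy column.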