Gemma 4 31B | NVIDIA NeMo AutoModel

A step-by-step guide for fine-tuning Gemma 4 31B to extract structured receipt data from scanned images using NeMo AutoModel.

What is Gemma 4 31B?

Gemma 4 31B is a dense vision-language model with a 60-layer transformer decoder, a SigLIP vision encoder, and support for multimodal inputs (images, audio, text).

Key architectural details:

Mixed attention: sliding window (512 tokens) + full attention (every 6th layer)
32 attention heads, 16 KV heads (GQA)
Hidden dim 5376, vocab size 262,144
bfloat16, final logit softcapping at 30.0
Thinking-channel support (<|channel>thought\n<channel|> prefix)
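
The layer layout and logit softcapping listed above can be sketched in a few lines of plain Python. This is only an illustration: it assumes "every 6th layer" means layers 6, 12, ... counted from 1, and that softcapping uses the tanh form known from earlier Gemma models. Neither detail is confirmed by this guide.

```python
import math

NUM_LAYERS = 60            # decoder layers
FULL_ATTENTION_EVERY = 6   # every 6th layer uses full attention
NUM_HEADS, NUM_KV_HEADS = 32, 16
SOFTCAP = 30.0             # final logit softcapping value


def layer_types(num_layers=NUM_LAYERS, every=FULL_ATTENTION_EVERY):
    # Assumption: layers are counted from 1, so layers 6, 12, ... are "full".
    return ["full" if (i + 1) % every == 0 else "sliding" for i in range(num_layers)]


def softcap(logit, cap=SOFTCAP):
    # Assumed tanh form: squashes any logit smoothly into (-cap, cap).
    return cap * math.tanh(logit / cap)


types = layer_types()
print(types.count("full"), types.count("sliding"))  # 10 50
print(NUM_HEADS // NUM_KV_HEADS)                    # 2 query heads share each KV head
print(softcap(100.0) < SOFTCAP)                     # True
```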

The Task

We fine-tune Gemma 4 31B on CORD-v2 (the Consolidated Receipt Dataset) to extract structured fields from scanned receipts:

Field	Example
`menu`	Item names, quantities, unit prices, sub-totals
`sub_total`	Subtotal details (subtotal price, discount, tax, etc.)
`total`	Total price, cash price, change price, etc.
`void_menu`	Voided items (if any)

The base model produces free-form descriptions. After fine-tuning, it outputs structured XML-like token sequences matching the receipt fields.

Guide Overview

Step	Description
Step 0	Environment setup
Step 1	Explore the CORD-v2 dataset
Step 2	Evaluate the base model (before fine-tuning)
Step 3	Training configuration
Step 4	Launch fine-tuning
Step 5	Evaluate the fine-tuned model
Step 6	Compare results

Hardware Requirements

8x A100 80 GB (or 8x H100) GPUs required for 31B with FSDP2 + activation checkpointing
Estimated training time: ~45 min on 8x H100 (800 training samples, 500 steps)

Step 0 — Environment Setup

This guide runs inside the NeMo AutoModel Docker container:

docker run -it --rm --gpus all --ipc=host --network host \
    -v $(pwd):/workspace \
    nvcr.io/nvidia/nemo-automodel:26.06.00

# Inside the container:
huggingface-cli login          # needed for gated model access
cd /opt/Automodel

Note: Gemma 4 requires a transformers release that includes the Gemma 4 model implementation. Check that the version installed in the container provides it, and upgrade transformers if it does not.
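
A quick way to see which transformers version is installed is to query the package metadata. The helper name `installed_version` is made up for this sketch, and the guide does not state a minimum version, so the snippet only reports what is present.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("transformers:", installed_version("transformers"))
```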

Explore the CORD-v2 Dataset

CORD-v2 is a Consolidated Receipt Dataset for Post-OCR Parsing containing scanned receipts with structured ground-truth JSON labels.

import json
from datasets import load_dataset

dataset = load_dataset("naver-clova-ix/cord-v2")

print(f"Train      : {len(dataset['train'])} samples")
print(f"Validation : {len(dataset['validation'])} samples")
print(f"Test       : {len(dataset['test'])} samples")

# Inspect a sample
ex = dataset["train"][0]
gt = json.loads(ex["ground_truth"])["gt_parse"]
print(f"\nGround-truth keys: {list(gt.keys())}")

for key in gt:
    if isinstance(gt[key], list):
        print(f"\n  {key} ({len(gt[key])} items):")
        for item in gt[key][:2]:
            print(f"    {item}")
    else:
        print(f"\n  {key}: {gt[key]}")

Expected output:

Train      : 800 samples
Validation : 100 samples
Test       : 100 samples
Ground-truth keys: ['menu', 'sub_total', 'total', 'void_menu']
  menu (7 items):
    {'nm': 'ABRA KADABRA FLAME GRILLED', 'num': '1', 'unitprice': '39,000', 'cnt': '1', 'price': '39,000'}
    {'nm': 'Lemon Tea', 'num': '1', 'unitprice': '7,000', 'cnt': '1', 'price': '7,000'}
  sub_total: {'subtotal_price': '87,000', 'discount_price': '0', 'tax_price': '7,909'}
  total: {'total_price': '87,000', 'cashprice': '100,000', 'changeprice': '13,000'}
  void_menu: []

Target Format: JSON-to-Token Conversion

NeMo AutoModel converts structured JSON into an XML-like token sequence using the json2token() function. This is the format the model is trained to produce:

from nemo_automodel.components.datasets.vlm.utils import json2token

token_seq = json2token(gt, sort_json_key=True)
print(f"Token sequence (first 300 chars):\n  {token_seq[:300]}...")
print(f"\nTotal length: {len(token_seq)} chars")

Expected output:

Token sequence (first 300 chars):
  <s_menu><s_cnt>1</s_cnt><s_nm>ABRA KADABRA FLAME GRILLED</s_nm><s_num>1</s_num>
  <s_price>39,000</s_price><s_unitprice>39,000</s_unitprice><sep/><s_cnt>1</s_cnt>
  <s_nm>Lemon Tea</s_nm><s_num>1</s_num><s_price>7,000</s_price><s_unitprice>7,000
  </s_unitprice><sep/>...
Total length: 827 chars
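
To make the conversion concrete, here is a small stand-alone sketch of a JSON-to-token transform that yields the same kind of sequence. It is an illustrative re-implementation, not the library's json2token(): the name `json_to_token_sketch`, the alphabetical key order, and the sample receipt are assumptions, and the real function may order keys or handle special tokens differently.

```python
def json_to_token_sketch(obj, sort_keys=True):
    """Illustrative only; NOT the NeMo AutoModel json2token() implementation."""
    if isinstance(obj, dict):
        keys = sorted(obj) if sort_keys else list(obj)
        # Each key becomes an <s_key>...</s_key> pair wrapping its converted value.
        return "".join(
            f"<s_{k}>{json_to_token_sketch(obj[k], sort_keys)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        # List items are joined with a separator token.
        return "<sep/>".join(json_to_token_sketch(item, sort_keys) for item in obj)
    return str(obj)


# Made-up sample in the shape of a CORD-v2 ground-truth record.
sample = {
    "menu": [
        {"nm": "Lemon Tea", "price": "7,000"},
        {"nm": "Cola", "price": "5,000"},
    ],
    "total": {"total_price": "12,000"},
}
print(json_to_token_sketch(sample))
```

Dictionaries nest as tag pairs and lists flatten into `<sep/>`-separated runs, which is why a multi-item menu appears as one long `<s_menu>` element.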

Evaluate the Base Model (Before Fine-Tuning)

Load the pretrained Gemma 4 31B model and run it on receipt images. The base model will produce free-form descriptions instead of structured token sequences.

import json
import torch
from transformers import AutoProcessor
from nemo_automodel import NeMoAutoModelForImageTextToText
from nemo_automodel.components.datasets.vlm.utils import json2token
from datasets import load_dataset

# --- Helpers ---

def compute_ned(pred: str, target: str) -> float:
    """Normalized Edit Distance (0 = perfect match, 1 = completely different)."""
    m, n = len(pred), len(target)
    if max(m, n) == 0:
        return 0.0
    # Single-row Levenshtein DP: dp[j] holds the distance for the previous row.
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            tmp = dp[j]
            dp[j] = prev if pred[i - 1] == target[j - 1] else 1 + min(dp[j], dp[j - 1], prev)
            prev = tmp
    return dp[n] / max(m, n)


def run_gemma4_inference(model, processor, pil_image, prompt="Describe this image.",
                         max_new_tokens=1024):
    """Run Gemma 4 inference on a single image."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": pil_image},
                {"type": "text", "text": prompt},
            ],
        },
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)

    # Decode only the newly generated tokens. Slicing by token count is more
    # robust than slicing the decoded string by the decoded prompt length.
    prompt_len = inputs["input_ids"].shape[1]
    return processor.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()


def evaluate_receipts(model, processor, test_dataset, n_samples=20):
    """Evaluate model on receipt test set, return avg NED and per-sample results."""
    model.eval()
    results = []
    n = min(n_samples, len(test_dataset))
    for i in range(n):
        ex = test_dataset[i]
        gt = json.loads(ex["ground_truth"])["gt_parse"]
        target = json2token(gt, sort_json_key=True)
        pred = run_gemma4_inference(model, processor, ex["image"])
        ned = compute_ned(pred, target)
        results.append({"idx": i, "ned": ned, "pred": pred, "target": target, "gt": gt})
        print(f"  Sample {i:2d}: NED = {ned:.4f}")
    avg_ned = sum(r["ned"] for r in results) / len(results)
    print(f"\n  Average NED: {avg_ned:.4f}")
    return avg_ned, results

# --- Load base model ---

MODEL_PATH = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(MODEL_PATH)
base_model = NeMoAutoModelForImageTextToText.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    use_liger_kernel=True,
    attn_implementation="flash_attention_2",
    text_config={"use_cache": False},
).eval().to("cuda")

print(f"Parameters: {sum(p.numel() for p in base_model.parameters()):,}")

# --- Evaluate ---

dataset = load_dataset("naver-clova-ix/cord-v2")
print("\nEvaluating base model on receipt test set:")
base_avg_ned, base_results = evaluate_receipts(base_model, processor, dataset["test"])

Expected base model output (receipt image):

  Sample  0: NED = 0.8734
  Sample  1: NED = 0.9012
  ...
  Average NED: 0.8850

Example base model prediction (free-form, not structured):

The image shows a receipt from a restaurant. The total amount is 87,000 with items
including ABRA KADABRA FLAME GRILLED for 39,000 and Lemon Tea for 7,000...

The base model produces readable descriptions but not the structured token format we need. Fine-tuning teaches it to output <s_menu><s_nm>...</s_nm>... sequences.
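
For intuition about the NED values reported above, the metric can be tried on tiny strings. The sketch below re-implements the same idea (Levenshtein distance divided by the longer string's length) with a two-row table so that it runs on its own. The function names and sample strings are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, and substitutions cost 1."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            row.append(min(prev_row[j] + 1, row[j - 1] + 1, prev_row[j - 1] + cost))
        prev_row = row
    return prev_row[-1]


def ned(pred: str, target: str) -> float:
    longest = max(len(pred), len(target))
    return 0.0 if longest == 0 else levenshtein(pred, target) / longest


target = "<s_nm>Lemon Tea</s_nm>"                        # 22 characters
print(ned(target, target))                               # 0.0 (exact match)
print(round(ned("<s_nm>Lemon Tee</s_nm>", target), 4))   # 0.0455 (1 edit / 22 chars)
print(ned("", target))                                   # 1.0 (nothing recovered)
```

A free-form description shares few characters with the tagged target sequence, which is why the base model scores close to 1.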

Configure Training

YAML Config

You can save the YAML below as gemma4_31b_cord_v2.yaml to train on the CORD-v2 dataset.

step_scheduler:
  global_batch_size: 8
  local_batch_size: 1
  ckpt_every_steps: 100
  val_every_steps: 100
  max_steps: 500

dist_env:
  backend: nccl
  timeout_minutes: 60

model:
  _target_: nemo_automodel.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: google/gemma-4-31B-it
  torch_dtype: torch.bfloat16
  use_liger_kernel: true
  use_sdpa_patching: false
  attn_implementation: flash_attention_2
  text_config:
    use_cache: false

checkpoint:
  enabled: true
  checkpoint_dir: vlm_checkpoints/gemma4_31b_cord_v2/
  model_save_format: safetensors
  save_consolidated: final # Recommended: export consolidated HF weights only for the final checkpoint.

distributed:
  strategy: fsdp2
  activation_checkpointing: true

loss_fn:
  _target_: nemo_automodel.components.loss.masked_ce.MaskedCrossEntropy

dataset:
  _target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
  path_or_dataset: naver-clova-ix/cord-v2
  split: train

dataloader:
  collate_fn:
    _target_: nemo_automodel.components.datasets.vlm.collate_fns.gemma4_prefix_collate_fn

validation_dataset:
  _target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
  path_or_dataset: naver-clova-ix/cord-v2
  split: validation

optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-5
  weight_decay: 0.01
  betas: [0.9, 0.95]

lr_scheduler:
  lr_decay_style: cosine

freeze_config:
  freeze_embeddings: true
  freeze_vision_tower: true
  freeze_audio_tower: true
  freeze_language_model: false
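
The batch settings in this config determine how many gradient-accumulation steps and epochs a run performs. The arithmetic below is a sketch that assumes the usual relationship global batch = local batch x number of GPUs x accumulation steps, with 8 GPUs and the 800-sample train split. The trainer's exact accounting is not confirmed here.

```python
import math

global_batch_size = 8    # step_scheduler.global_batch_size
local_batch_size = 1     # step_scheduler.local_batch_size
max_steps = 500          # step_scheduler.max_steps
world_size = 8           # assumption: one process per GPU on an 8-GPU node
train_samples = 800      # CORD-v2 train split

# Micro-batches each GPU accumulates before one optimizer step.
grad_accum_steps = global_batch_size // (local_batch_size * world_size)
# Optimizer steps needed to see every training sample once.
steps_per_epoch = math.ceil(train_samples / global_batch_size)
epochs = max_steps / steps_per_epoch

print(grad_accum_steps)  # 1
print(steps_per_epoch)   # 100
print(epochs)            # 5.0
```

Under these assumptions, validating and checkpointing every 100 steps corresponds to once per pass over the training split.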

Why `gemma4_prefix_collate_fn`?

Gemma 4 31B instruction-tuned models always emit a thinking-channel prefix (<|channel>thought\n<channel|>) before the actual response. When this prefix is absent from the training sequences, the model predicts <|channel> at the first response position while the label holds the first answer token, which inflates the initial loss to ~9. The gemma4_prefix_collate_fn injects the prefix into each sequence and masks it with -100 in the labels, so the model is not penalized for it. This brings the initial loss down to ~3.
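
The masking idea can be sketched with plain Python lists. This is not the actual gemma4_prefix_collate_fn: the function name `build_example` and all token ids are made up, and the real collate function also handles batching, padding, and image inputs.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(prompt_ids, prefix_ids, answer_ids):
    """Concatenate prompt + thinking prefix + answer; supervise only the answer."""
    input_ids = list(prompt_ids) + list(prefix_ids) + list(answer_ids)
    masked = len(prompt_ids) + len(prefix_ids)
    labels = [IGNORE_INDEX] * masked + list(answer_ids)
    return input_ids, labels


# Made-up token ids, for illustration only.
prompt_ids = [2, 105, 2364]      # user turn (image + instruction)
prefix_ids = [900, 901, 902]     # stands in for the thinking-channel prefix
answer_ids = [17, 18, 19, 1]     # structured receipt tokens + end-of-turn

input_ids, labels = build_example(prompt_ids, prefix_ids, answer_ids)
print(input_ids)  # [2, 105, 2364, 900, 901, 902, 17, 18, 19, 1]
print(labels)     # [-100, -100, -100, -100, -100, -100, 17, 18, 19, 1]
```

The prefix stays in the input, so the answer tokens are conditioned on it, but it contributes nothing to the loss.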

Launch Fine-Tuning

mkdir -p logs   # tee fails if the log directory does not exist
torchrun --nproc-per-node=8 \
    examples/vlm_finetune/finetune.py \
    -c gemma4_31b_cord_v2.yaml \
    2>&1 | tee logs/train_gemma4_31b_cord_v2.log

What to Watch

Loss drops rapidly from ~0.73 to ~0.04 in the first 50 steps, then stabilizes around 0.005
Validation loss reaches ~0.018 by step 199 (best checkpoint)
The run logged below takes ~15 min on 8x H100 (300 steps, 800 training samples)

Training Log

step    0 | loss 0.7350 | grad_norm  35.65 | lr 1.18e-06 | mem 60.90 GiB | tps/gpu  45
step   10 | loss 0.5489 | grad_norm  26.19 | lr 2.98e-06 | mem 40.36 GiB | tps/gpu 425
step   20 | loss 0.1455 | grad_norm  10.53 | lr 4.78e-06 | mem 40.42 GiB | tps/gpu 438
step   50 | loss 0.0406 | grad_norm  27.16 | lr 1.00e-05 | mem 40.34 GiB | tps/gpu 377
step  100 | loss 0.0148 | grad_norm   7.02 | lr 9.70e-06 | mem 40.36 GiB | tps/gpu 449
step  200 | loss 0.0065 | grad_norm   2.28 | lr 7.52e-06 | mem 40.44 GiB | tps/gpu 441
step  300 | loss 0.0041 | grad_norm   2.10 | lr 3.16e-06 | mem 40.53 GiB | tps/gpu 448
Validation:
  step  99 | val_loss 0.0225
  step 199 | val_loss 0.0183  <-- LOWEST_VAL (best checkpoint)
  step 299 | val_loss 0.0192
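
Log lines in this shape are easy to parse for plotting or for picking the best validation step. The sketch below runs on a few lines copied from the excerpt above. The regular expressions assume exactly this layout, and a real log may contain extra fields or prefixes.

```python
import re

# Lines copied from the log excerpt above.
LOG = """\
step    0 | loss 0.7350 | grad_norm  35.65 | lr 1.18e-06 | mem 60.90 GiB | tps/gpu  45
step   50 | loss 0.0406 | grad_norm  27.16 | lr 1.00e-05 | mem 40.34 GiB | tps/gpu 377
step  300 | loss 0.0041 | grad_norm   2.10 | lr 3.16e-06 | mem 40.53 GiB | tps/gpu 448
Validation:
  step  99 | val_loss 0.0225
  step 199 | val_loss 0.0183  <-- LOWEST_VAL (best checkpoint)
  step 299 | val_loss 0.0192
"""

TRAIN_RE = re.compile(r"step\s+(\d+)\s*\|\s*loss\s+([\d.]+)\s*\|\s*grad_norm\s+([\d.]+)")
VAL_RE = re.compile(r"step\s+(\d+)\s*\|\s*val_loss\s+([\d.]+)")


def parse_log(text):
    """Split a log into training rows and validation rows."""
    train, val = [], []
    for line in text.splitlines():
        m = TRAIN_RE.search(line)
        if m:
            train.append({"step": int(m.group(1)), "loss": float(m.group(2)),
                          "grad_norm": float(m.group(3))})
            continue
        m = VAL_RE.search(line)
        if m:
            val.append({"step": int(m.group(1)), "val_loss": float(m.group(2))})
    return train, val


train, val = parse_log(LOG)
best = min(val, key=lambda row: row["val_loss"])
print(len(train), len(val))  # 3 3
print(best["step"])          # 199
```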

Checkpoints Saved

vlm_checkpoints/gemma4_31b_cord_v2/
  epoch_0_step_99/
  epoch_0_step_199/
  epoch_0_step_299/
    model/
      consolidated/          <-- HF-compatible checkpoint for inference
        config.json
        model.safetensors.index.json
        model-00001-of-00013.safetensors
        ...
    optim/
    rng/
    dataloader/
  LATEST -> epoch_0_step_299
  LOWEST_VAL -> epoch_0_step_199
  training.jsonl             <-- per-step metrics
  validation.jsonl           <-- per-validation metrics

Tip: The LOWEST_VAL symlink points to the checkpoint with the lowest validation loss. Use it for inference evaluation.
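
Before loading a checkpoint for inference, it can help to resolve the symlink and check whether consolidated weights are present. The helper below is a sketch based on the directory layout shown above. `resolve_consolidated` is a made-up name, not part of NeMo AutoModel.

```python
import os


def resolve_consolidated(ckpt_dir, link_name="LOWEST_VAL"):
    """Return (checkpoint_path, consolidated_path), with None if no consolidated dir exists."""
    ckpt = os.path.realpath(os.path.join(ckpt_dir, link_name))
    consolidated = os.path.join(ckpt, "model", "consolidated")
    return ckpt, (consolidated if os.path.isdir(consolidated) else None)


ckpt, consolidated = resolve_consolidated("vlm_checkpoints/gemma4_31b_cord_v2")
print("Best checkpoint:", ckpt)
print("Consolidated weights:", consolidated or "not found (consolidate this checkpoint first)")
```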

Evaluate the Fine-Tuned Model

Export and Load a Consolidated Checkpoint with HF AutoModelForMultimodalLM

Because the config uses save_consolidated: final, the final checkpoint includes model/consolidated/. Earlier checkpoints store sharded safetensors plus a generated helper; run the helper for an earlier checkpoint you want to evaluate:

bash vlm_checkpoints/gemma4_31b_cord_v2/<checkpoint>/model/consolidate.sh

The helper writes an HF-compatible model/consolidated/ directory. Use HF’s AutoModelForMultimodalLM for inference (generation), and load the processor from the base model path.

import json
import os
import torch
from datasets import load_dataset
from transformers import AutoProcessor, AutoModelForMultimodalLM
from nemo_automodel.components.datasets.vlm.utils import json2token

# Paths
BASE_MODEL = "google/gemma-4-31B-it"
CKPT_DIR = "vlm_checkpoints/gemma4_31b_cord_v2"
best_ckpt = os.path.realpath(os.path.join(CKPT_DIR, "LOWEST_VAL"))
# If LOWEST_VAL is not the final checkpoint, run its consolidate.sh helper first
# so that model/consolidated/ exists.
consolidated = os.path.join(best_ckpt, "model", "consolidated")

# Load processor from base model, model from fine-tuned checkpoint
processor = AutoProcessor.from_pretrained(BASE_MODEL)
model = AutoModelForMultimodalLM.from_pretrained(
    consolidated,
    dtype=torch.bfloat16,
    device_map="auto",
).eval()

# Evaluate on test set (evaluate_receipts is defined in the base-model evaluation step)
dataset = load_dataset("naver-clova-ix/cord-v2")
print("Evaluating fine-tuned model:")
ft_avg_ned, ft_results = evaluate_receipts(model, processor, dataset["test"])

Fine-Tuned Output (Test Sample 1 — Perfect NED=0.0)

<s_total><s_total_price>91000</s_total_price><s_cashprice>91000</s_cashprice>
</s_total><s_menu><s_price>17500</s_price><s_nm>J.STB PROMO</s_nm><sep/>
<s_price>46000</s_price><s_nm>Y.B.BAT</s_nm><sep/><s_price>27500</s_price>
<s_nm>Y.BASO PROM</s_nm></s_menu>

Parse the Structured Output

You can convert the token sequence back to a structured dict:

import json
import re

def token2json(token_seq):
    """Convert a token sequence back to a JSON-like dict."""
    result = {}
    # Match <s_key>...</s_key> pairs; the backreference \1 ties each closing tag to its opener.
    pattern = r"<s_(\w+)>(.*?)</s_\1>"
    matches = re.findall(pattern, token_seq, re.DOTALL)
    for key, value in matches:
        if "<sep/>" in value:
            # Multiple items (e.g. menu entries) are separated by <sep/>.
            items = value.split("<sep/>")
            result[key] = [token2json(item) if "<s_" in item else item for item in items]
        elif "<s_" in value:
            result[key] = token2json(value)
        else:
            result[key] = value
    return result

# Prediction for test sample 4 from the fine-tuned evaluation above
prediction = ft_results[4]["pred"]
parsed = token2json(prediction)
print(json.dumps(parsed, indent=2))

Example parsed output (test sample 4):

{
  "total": {"total_price": "174,600", "changeprice": "25,400", "cashprice": "200,000"},
  "sub_total": {"subtotal_price": "194,000", "discount_price": "19,400"},
  "menu": [
    {"price": "82,000", "nm": "ICE BLACKCOFFE"},
    {"price": "44,000", "nm": "C.Capuccino (L)"},
    {"price": "30,000", "nm": "C.WHITE COFFE"},
    {"price": "38,000", "nm": "C.Capuccino (L)"}
  ]
}

Compare Results

Metrics (20 Test Samples)

Metric	Fine-Tuned (epoch_1_step_199)
Average NED	0.0601
Field-Level Accuracy	92.6%
Perfect matches (NED=0.0)	10/20 (50%)
Near-perfect (NED<0.05)	14/20 (70%)

Field-Level Extraction Accuracy (Actual)

Field                 Correct / Total  Accuracy
--------------------------------------------------
total_price                18 /    19     94.7%
subtotal_price             13 /    14     92.9%
tax_price                   7 /     8     87.5%
cashprice                  13 /    15     86.7%
changeprice                12 /    12    100.0%
OVERALL                    63 /    68     92.6%
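
The guide does not say how the field-level numbers were computed. One plausible reading is exact string match per field, counted only for receipts whose ground truth contains that field, which would explain why the totals differ per field. The sketch below implements that reading. The function names, field paths, and sample records are assumptions for illustration.

```python
# Field name -> path inside the parsed receipt dict (paths follow the CORD-v2 sample above).
FIELDS = {
    "total_price": ("total", "total_price"),
    "subtotal_price": ("sub_total", "subtotal_price"),
    "tax_price": ("sub_total", "tax_price"),
    "cashprice": ("total", "cashprice"),
    "changeprice": ("total", "changeprice"),
}


def get_field(parsed, path):
    """Walk nested dicts along `path`; return None if any level is missing."""
    node = parsed
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def field_accuracy(pairs):
    """pairs: iterable of (predicted_dict, ground_truth_dict). Returns {field: [correct, total]}."""
    stats = {name: [0, 0] for name in FIELDS}
    for pred, gt in pairs:
        for name, path in FIELDS.items():
            expected = get_field(gt, path)
            if expected is None:
                continue  # field absent from this receipt, so it is not counted
            stats[name][1] += 1
            stats[name][0] += int(get_field(pred, path) == expected)
    return stats


# Made-up (prediction, ground truth) pairs.
pairs = [
    ({"total": {"total_price": "87,000", "cashprice": "100,000"}},
     {"total": {"total_price": "87,000", "cashprice": "100,000"}}),
    ({"total": {"total_price": "91,000"}},
     {"total": {"total_price": "91000"}, "sub_total": {"tax_price": "8,273"}}),
]
stats = field_accuracy(pairs)
for name, (correct, total) in stats.items():
    if total:
        print(f"{name:<16}{correct} / {total}")
```

Exact matching is strict: "91,000" and "91000" count as a miss, so normalizing number formats before comparison would give a more lenient score.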