Fine-Tuning NemotronOmni on CORD-v2 Receipts — End-to-End Guide

A step-by-step guide for fine-tuning NemotronOmni (33B MoE) to extract structured receipt data from scanned images using NeMo Automodel. Covers both full SFT and LoRA PEFT.


What is NemotronOmni?

NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) is a ~33B multimodal MoE model supporting image, video, and audio inputs.

Key architectural details:

  • LLM backbone: NemotronV3 hybrid Mamba2 + Attention + MoE, 52 layers, hidden dim 2688
  • Vision encoder: RADIO v2.5-H (ViT-Huge), 256 vision tokens per tile
  • Audio encoder: Parakeet FastConformer (1024-dim)
  • MoE: 128 experts per MoE layer, top-6 routing with sigmoid gating
  • Total parameters: 33B (31.5B trainable with frozen vision/audio towers)
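
To make the routing bullet concrete, here is a minimal pure-Python sketch of top-k expert selection with sigmoid gating. It is illustrative only: the function name `route_token_sketch` and the made-up logits are hypothetical, and renormalizing the selected scores so they sum to 1 is an assumption rather than a confirmed detail of NemotronOmni.

```python
import math

def route_token_sketch(router_logits, top_k=6):
    """Select top_k experts for one token from per-expert router logits."""
    # Sigmoid gating: every expert gets an independent score in (0, 1).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in router_logits]
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected scores into mixing weights that sum to 1.
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]

# 128 experts with top-6 routing, matching the bullet list (logits are made up).
logits = [0.01 * i for i in range(128)]
routing = route_token_sketch(logits, top_k=6)
print([expert for expert, _ in routing])
```

Only the six selected experts run for a given token, which is why the active parameter count per token is much smaller than the total.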

Fine-Tune for Receipt Field Extraction

We fine-tune NemotronOmni on the CORD-v2 (Consolidated Receipt Dataset) to extract structured fields from scanned receipts:

| Field | Example |
|---|---|
| menu | Item names, quantities, prices |
| sub_total | Subtotal, tax, discount |
| total | Total price, cash paid, change |

The base model produces free-form descriptions. After fine-tuning, it outputs structured XML-like token sequences matching the receipt fields.

Guide Overview

| Step | Description |
|---|---|
| Step 0 | Environment setup |
| Step 1 | Explore the CORD-v2 dataset |
| Step 2 | Training configuration (SFT and LoRA) |
| Step 3 | Launch fine-tuning |
| Step 4 | Run inference on the fine-tuned model |
| Step 5 | Compare SFT vs LoRA results |

Hardware Requirements

  • 8x H100 80 GB GPUs required (MoE with EP=8)
  • SFT memory: ~49 GiB per GPU
  • LoRA memory: ~30 GiB per GPU
  • Estimated training time on 8x H100 (400 steps, 800 training samples): ~10 min for full SFT, ~6 min for LoRA

Step 0 — Set Up the Environment

# Inside the NeMo AutoModel container (26.04+):
$ cd /opt/Automodel

# Or from a source checkout:
$ git clone -b nemotron-omni ssh://git@gitlab-master.nvidia.com:12051/huiyingl/automodel-omni.git
$ cd automodel-omni

NemotronOmni requires the mamba_ssm, causal_conv1d, and decord packages, all of which are included in the NeMo AutoModel container.


Step 1 — Explore the CORD-v2 Dataset

CORD-v2 contains scanned receipts with structured ground-truth JSON labels.

import json
from datasets import load_dataset

dataset = load_dataset("naver-clova-ix/cord-v2")

print(f"Train : {len(dataset['train'])} samples")
print(f"Validation : {len(dataset['validation'])} samples")
print(f"Test : {len(dataset['test'])} samples")

# Inspect a sample
ex = dataset["train"][0]
gt = json.loads(ex["ground_truth"])["gt_parse"]
print(f"\nGround-truth keys: {list(gt.keys())}")

Expected output:

Train : 800 samples
Validation : 100 samples
Test : 100 samples
Ground-truth keys: ['menu', 'sub_total', 'total', 'void_menu']
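
Because each `ground_truth` value is a JSON string, the labels can be inspected without touching the images. The helper below is a small sketch: the name `summarize_receipt` and the inline sample string are made up for illustration, and it defensively assumes that `menu` may be either a single dict or a list of dicts.

```python
import json

def summarize_receipt(ground_truth: str) -> dict:
    """Summarize one CORD-v2 label given its raw `ground_truth` JSON string."""
    gt = json.loads(ground_truth)["gt_parse"]
    menu = gt.get("menu", [])
    # Assumption: a receipt with one line item may store `menu` as a dict.
    if isinstance(menu, dict):
        menu = [menu]
    return {
        "keys": sorted(gt.keys()),
        "num_items": len(menu),
        "item_names": [item.get("nm") for item in menu],
    }

# Hypothetical label shaped like the sample shown in this guide.
sample = json.dumps({
    "gt_parse": {
        "menu": [
            {"nm": "REAL GANACHE", "cnt": "1", "price": "16,500"},
            {"nm": "EGG TART", "cnt": "1", "price": "13,000"},
        ],
        "total": {"total_price": "45,500"},
    }
})
print(summarize_receipt(sample))
```

With the real dataset, the same function can be called as `summarize_receipt(dataset["train"][0]["ground_truth"])`.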

Target Format: JSON-to-Token Conversion

NeMo Automodel converts structured JSON into an XML-like token sequence using the json2token() function. This is the format the model is trained to produce:

<s_total><s_total_price>45,500</s_total_price><s_changeprice>4,500</s_changeprice>
<s_cashprice>50,000</s_cashprice></s_total><s_menu><s_price>16,500</s_price>
<s_nm>REAL GANACHE</s_nm><s_cnt>1</s_cnt><sep/><s_price>13,000</s_price>
<s_nm>EGG TART</s_nm><s_cnt>1</s_cnt></s_menu>
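
To show how a nested label turns into the token sequence above, here is a simplified sketch of the conversion. It is not the NeMo Automodel implementation: `json2token_sketch` is a hypothetical name, the logic follows the common Donut-style recipe, and the reverse-alphabetical key order is inferred from the sample sequence rather than from the library source.

```python
def json2token_sketch(obj, sort_json_key=True):
    """Flatten nested dicts/lists into an XML-like token string."""
    if isinstance(obj, dict):
        # Assumption: keys are emitted in reverse-alphabetical order when sorting.
        keys = sorted(obj.keys(), reverse=True) if sort_json_key else list(obj.keys())
        return "".join(
            f"<s_{k}>{json2token_sketch(obj[k], sort_json_key)}</s_{k}>" for k in keys
        )
    if isinstance(obj, list):
        # List items (for example menu entries) are separated by <sep/>.
        return "<sep/>".join(json2token_sketch(item, sort_json_key) for item in obj)
    return str(obj)

receipt = {
    "menu": [
        {"nm": "REAL GANACHE", "cnt": "1", "price": "16,500"},
        {"nm": "EGG TART", "cnt": "1", "price": "13,000"},
    ],
    "total": {"total_price": "45,500", "cashprice": "50,000", "changeprice": "4,500"},
}
print(json2token_sketch(receipt))
```

For this input the sketch reproduces the sample sequence shown above, printed as a single line.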

Step 2 — Training Configuration

Full SFT Config

Config file: examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2.yaml

recipe: FinetuneRecipeForVLM

step_scheduler:
  global_batch_size: 8
  local_batch_size: 1
  ckpt_every_steps: 100
  val_every_steps: 200
  max_steps: 400

model:
  _target_: nemo_automodel.NeMoAutoModelForImageTextToText.from_pretrained
  pretrained_model_name_or_path: <path_to_nemotron_omni_v2.0>
  trust_remote_code: true
  torch_dtype: torch.bfloat16
  backend:
    _target_: nemo_automodel.components.models.common.BackendConfig
    attn: sdpa
    linear: torch
    rms_norm: torch_fp32
    rope_fusion: false
    experts: gmm
    dispatcher: deepep
    fake_balanced_gate: false
    enable_hf_state_dict_adapter: true

distributed:
  strategy: fsdp2
  ep_size: 8 # 128 MoE experts across 8 GPUs

freeze_config:
  freeze_embeddings: true
  freeze_vision_tower: true
  freeze_audio_tower: true
  freeze_language_model: false

dataset:
  _target_: nemo_automodel.components.datasets.vlm.datasets.make_cord_v2_dataset
  path_or_dataset: naver-clova-ix/cord-v2
  split: train

dataloader:
  collate_fn:
    _target_: nemo_automodel.components.datasets.vlm.collate_fns.nemotron_omni_collate_fn
    max_length: 4096

optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-4
  weight_decay: 0.01
  betas: [0.9, 0.95]
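
The scheduler and parallelism settings imply a few numbers worth checking by hand. The sketch below works them out from the config values and the 800-sample training split; the helper names are hypothetical, and the checkpoint naming pattern is inferred from the checkpoint listing later in this guide rather than taken from the library.

```python
def schedule_summary(num_samples, global_batch_size, max_steps, num_experts, ep_size):
    """Derive epoch and expert-sharding figures from the config values."""
    steps_per_epoch = num_samples // global_batch_size
    return {
        "steps_per_epoch": steps_per_epoch,
        "epochs": max_steps / steps_per_epoch,
        "experts_per_gpu": num_experts // ep_size,
    }

def checkpoint_name(step, steps_per_epoch):
    # Assumption: steps are 0-indexed, as in the training log, and the
    # directory name encodes the epoch that the step belongs to.
    return f"epoch_{step // steps_per_epoch}_step_{step}"

summary = schedule_summary(
    num_samples=800, global_batch_size=8, max_steps=400, num_experts=128, ep_size=8
)
print(summary)
print([checkpoint_name(s, summary["steps_per_epoch"]) for s in (99, 199, 299, 399)])
```

So 400 steps at a global batch size of 8 correspond to four passes over the training split, and each GPU holds 16 of the 128 experts in every MoE layer.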

LoRA PEFT Config

Config file: examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2_peft.yaml

Adds a peft: block to apply LoRA to language model linear layers only:

peft:
  _target_: nemo_automodel.components._peft.lora.PeftConfig
  match_all_linear: false
  exclude_modules:
    - "*vision_tower*"
    - "*vision_model*"
    - "*audio*"
    - "*sound*"
    - "*lm_head*"
    - "*mlp1*"
  dim: 64
  alpha: 128
  use_triton: true

optimizer:
  _target_: torch.optim.AdamW
  lr: 1e-3
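
As a quick sanity check on `dim` and `alpha`, the sketch below computes the LoRA scaling factor and the number of adapter parameters added to one linear layer. The helper names are hypothetical, and the 2688 x 2688 layer is only an example built from the hidden dimension listed earlier, not a specific layer of the model.

```python
def lora_scale(alpha, dim):
    """Scaling applied to the low-rank update: delta_W = (alpha / dim) * B @ A."""
    return alpha / dim

def lora_params_for_linear(in_features, out_features, dim):
    """Adapter parameters for one linear layer: A is dim x in, B is out x dim."""
    return dim * (in_features + out_features)

print(lora_scale(alpha=128, dim=64))                 # 2.0
print(lora_params_for_linear(2688, 2688, dim=64))    # 344064
```

The adapter size grows with the sum of a layer's input and output widths, not their product, which is why the trainable fraction stays far below 1%.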

Collate function

NemotronOmni uses InternVL-style image handling where each <image> token in the input is replaced by 256 vision embeddings during the model’s forward pass. The collate function:

  1. Extracts images from the conversation
  2. Applies the chat template (which adds a <think></think> prefix to the assistant turn)
  3. Processes images through the NemotronOmni processor
  4. Builds image_flags tensors and creates training labels

Step 3 — Launch Fine-Tuning

Full SFT

$ torchrun --nproc-per-node=8 \
    examples/vlm_finetune/finetune.py \
    -c examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2.yaml

LoRA PEFT

$ torchrun --nproc-per-node=8 \
    examples/vlm_finetune/finetune.py \
    -c examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2_peft.yaml

Training log — Full SFT

Trainable parameters: 31,570,023,872
Trainable parameters percentage: 95.63%
step 0 | loss 0.6866 | grad_norm 7.57 | lr 1.00e-04 | mem 37.29 GiB | tps/gpu 33
step 10 | loss 0.0705 | grad_norm 1.00 | lr 1.00e-04 | mem 48.95 GiB | tps/gpu 2419
step 50 | loss 0.0173 | grad_norm 0.43 | lr 1.00e-04 | mem 48.72 GiB | tps/gpu 2615
step 100 | loss 0.0115 | grad_norm 0.37 | lr 1.00e-04 | mem 48.84 GiB | tps/gpu 2642
step 200 | loss 0.0099 | grad_norm 0.20 | lr 1.00e-04 | mem 48.76 GiB | tps/gpu 2616
step 300 | loss 0.0056 | grad_norm 0.15 | lr 1.00e-04 | mem 48.72 GiB | tps/gpu 2087
step 399 | loss 0.0039 | grad_norm 0.17 | lr 1.00e-04 | mem 48.79 GiB | tps/gpu 2616
Validation:
step 99 | val_loss 0.0363
step 199 | val_loss 0.0342 <-- LOWEST_VAL
step 299 | val_loss 0.0414
step 399 | val_loss 0.0425

Training log — LoRA PEFT

Trainable parameters: 55,422,976
Trainable parameters percentage: 0.17%
step 0 | loss 0.6866 | grad_norm 1.92 | lr 1.00e-03 | mem 30.26 GiB | tps/gpu 34
step 10 | loss 0.0557 | grad_norm 0.30 | lr 1.00e-03 | mem 30.16 GiB | tps/gpu 2455
step 50 | loss 0.0392 | grad_norm 0.32 | lr 1.00e-03 | mem 30.16 GiB | tps/gpu 3352
step 100 | loss 0.0309 | grad_norm 0.27 | lr 1.00e-03 | mem 30.20 GiB | tps/gpu 2456
step 200 | loss 0.0280 | grad_norm 0.23 | lr 1.00e-03 | mem 30.34 GiB | tps/gpu 2477
step 300 | loss 0.0326 | grad_norm 0.31 | lr 1.00e-03 | mem 30.52 GiB | tps/gpu 2737
step 399 | loss 0.0171 | grad_norm 0.24 | lr 1.00e-03 | mem 30.33 GiB | tps/gpu 3258
Validation:
step 99 | val_loss 0.0449 <-- LOWEST_VAL
step 199 | val_loss 0.0524
step 299 | val_loss 0.0482
step 399 | val_loss 0.0566

Checkpoints saved

checkpoint_dir/
  epoch_0_step_99/
  epoch_1_step_199/
  epoch_2_step_299/
  epoch_3_step_399/
    model/
      consolidated/          <-- HF-compatible checkpoint for inference
        config.json
        model.safetensors.index.json
        model-00001-of-00017.safetensors
        ...
    optim/
    rng/
    dataloader/
  LATEST -> epoch_3_step_399
  LOWEST_VAL -> epoch_1_step_199
  training.jsonl
  validation.jsonl

For LoRA, the checkpoint saves adapter weights instead:

model/
  adapter_model.safetensors   (~27 MB)
  adapter_config.json

Tip: The LOWEST_VAL symlink points to the checkpoint with the lowest validation loss.
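
If you want to double-check which step the symlink should point to, the validation lines from the console log are easy to parse. The sketch below assumes the `step N | val_loss X` line format shown in the logs above; `best_validation_step` is a hypothetical helper, not part of NeMo Automodel.

```python
import re

# Matches validation lines such as "step 199 | val_loss 0.0342".
VAL_LINE = re.compile(r"step\s+(\d+)\s*\|\s*val_loss\s+([0-9.]+)")

def best_validation_step(log_text):
    """Return (step, val_loss) for the lowest validation loss in the log."""
    results = [(float(loss), int(step)) for step, loss in VAL_LINE.findall(log_text)]
    if not results:
        raise ValueError("no validation lines found")
    loss, step = min(results)
    return step, loss

sft_log = """
step 99 | val_loss 0.0363
step 199 | val_loss 0.0342 <-- LOWEST_VAL
step 299 | val_loss 0.0414
step 399 | val_loss 0.0425
"""
print(best_validation_step(sft_log))
```

For the full SFT log this selects step 199, matching the `LOWEST_VAL -> epoch_1_step_199` entry in the listing.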


Step 4 — Run Inference on the Fine-Tuned Model

Full SFT inference

Load the consolidated checkpoint and run inference on a handful of validation samples to spot-check structured output.

import torch
import json
from transformers import AutoModel, AutoProcessor
from datasets import load_dataset
from nemo_automodel.components.datasets.vlm.utils import json2token

CKPT = "<checkpoint_dir>/LOWEST_VAL/model/consolidated"

# Load processor
processor = AutoProcessor.from_pretrained(CKPT, trust_remote_code=True)
tokenizer = processor.tokenizer

# `device_map` streams weights directly to GPU; skipping the AutoModel.from_config
# CPU-instantiation step saves ~5 min on the 30B v3 dump.
model = AutoModel.from_pretrained(
    CKPT, trust_remote_code=True, torch_dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)

# Reset RADIO's `summary_idxs` (non-persistent buffer; can be a meta tensor after load)
if hasattr(model, "vision_model") and hasattr(model.vision_model, "radio_model"):
    model.vision_model.radio_model.summary_idxs = None

model.eval()

# Load dataset
dataset = load_dataset("naver-clova-ix/cord-v2")

# v3 processor returns extra placeholder-expansion metadata that is NOT a generate() kwarg.
PROCESSOR_METADATA_KEYS = ("num_patches", "num_tokens", "imgs_sizes")

# Run inference on the first 5 validation samples
for i in range(5):
    sample = dataset["validation"][i]
    image = sample["image"].convert("RGB")
    gt = json.loads(sample["ground_truth"])["gt_parse"]
    gt_text = json2token(gt, sort_json_key=True)

    # Build prompt — enable_thinking=False for structured output
    messages = [{"role": "user", "content": "<image>\nDescribe this image."}]
    text = tokenizer.apply_chat_template(
        messages, tokenize=False,
        add_generation_prompt=True, enable_thinking=False,
    )
    inputs = processor(text=text, images=[image], return_tensors="pt")
    for k in PROCESSOR_METADATA_KEYS:
        inputs.pop(k, None)
    inputs = {k: v.to("cuda") if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

    generated = tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True,
    ).strip()

    print(f"\n=== Sample {i} ===")
    print(f"Ground truth: {gt_text}")
    print(f"Prediction: {generated}")
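
Long token strings are hard to compare by eye, so a small scoring helper can be useful alongside the printout. The sketch below is one possible way to score a prediction: exact string match plus an F1 over leaf `(field, value)` pairs. The function names are hypothetical, and this is not necessarily the metric behind the Match column in the results tables.

```python
import re
from collections import Counter

# A leaf field is a tag pair whose content contains no nested tags.
LEAF = re.compile(r"<s_([^>]+)>([^<]*)</s_\1>")

def leaf_fields(sequence):
    """Count (field, value) pairs such as ('nm', 'EGG TART') in a token sequence."""
    return Counter((key, value.strip()) for key, value in LEAF.findall(sequence))

def field_f1(prediction, ground_truth):
    """F1 over leaf fields, treating repeated fields as a multiset."""
    pred, truth = leaf_fields(prediction), leaf_fields(ground_truth)
    overlap = sum((pred & truth).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(truth.values())
    return 2 * precision * recall / (precision + recall)

truth = "<s_menu><s_price>16,500</s_price><s_nm>REAL GANACHE</s_nm><s_cnt>1</s_cnt></s_menu>"
pred = "<s_menu><s_price>16,500</s_price><s_nm>REAL GANACHE X</s_nm><s_cnt>1</s_cnt></s_menu>"
print(pred == truth, round(field_f1(pred, truth), 3))
```

Inside the inference loop, the same helper could be called as `field_f1(generated, gt_text)`.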

LoRA PEFT inference

NeMo Automodel saves LoRA adapters under its internal wrapper FQNs (e.g. language_model.model.layers.X.mixer.in_proj), which differ from the HF base model namespace (language_model.backbone.layers.X.mixer.in_proj). To apply the adapter, merge the delta weights directly into the base model with a small FQN translation:

import json, re
import torch
from pathlib import Path
from safetensors import safe_open
from transformers import AutoModel, AutoProcessor

BASE = "<path_to_nemotron_omni_v3>"
ADAPTER = "<ckpt_dir>/LOWEST_VAL/model"

# Load base directly to GPU. Skip AutoModel.from_config — instantiating a 30B
# model on CPU just to read the class type adds 5+ minutes.
processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
model = AutoModel.from_pretrained(
    BASE, trust_remote_code=True, dtype=torch.bfloat16,
    device_map={"": torch.cuda.current_device()},
)
if hasattr(model, "vision_model") and hasattr(model.vision_model, "radio_model"):
    model.vision_model.radio_model.summary_idxs = None

# Wrapper -> HF base FQN translation. vision_projector.* targets are listed in
# adapter_config.json but no tensors are saved for them, so we just skip those.
def translate(fqn):
    if fqn.startswith("language_model.model."):
        return "language_model.backbone." + fqn[len("language_model.model."):]
    return None

cfg = json.loads((Path(ADAPTER) / "adapter_config.json").read_text())
scale = cfg["lora_alpha"] / cfg["r"]

pairs = {}
with safe_open(str(Path(ADAPTER) / "adapter_model.safetensors"), framework="pt") as f:
    for k in f.keys():
        m = re.match(r"^base_model\.model\.(.+)\.lora_(A|B)\.weight$", k)
        if m:
            pairs.setdefault(m.group(1), {})[m.group(2)] = f.get_tensor(k)

modules = dict(model.named_modules())
for wrapper_fqn, ab in pairs.items():
    hf_fqn = translate(wrapper_fqn)
    if hf_fqn is None or hf_fqn not in modules:
        continue
    W = modules[hf_fqn].weight
    A = ab["A"].to(device=W.device, dtype=torch.float32)
    B = ab["B"].to(device=W.device, dtype=torch.float32)
    with torch.no_grad():
        W.add_(((B @ A) * scale).to(W.dtype))

model.eval()
# ... then run the same generate() loop as in the SFT example above.

Resources: a single GPU with ~60 GB of GPU memory for the bf16 30B base. Runtime: ~75 s to load the base model, ~1 s for the LoRA merge, and ~5–15 s per sample.
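
The key-name handling in the merge script can be checked in isolation, without loading any weights. The sketch below reuses the same regular expression and prefix translation on a few made-up adapter keys; `group_adapter_keys` and the example key names are hypothetical and only illustrate which entries are kept or skipped.

```python
import re

LORA_KEY = re.compile(r"^base_model\.model\.(.+)\.lora_(A|B)\.weight$")
WRAPPER_PREFIX = "language_model.model."
HF_PREFIX = "language_model.backbone."

def translate(fqn):
    """Map a wrapper FQN to the HF base namespace, or None if it has no counterpart."""
    if fqn.startswith(WRAPPER_PREFIX):
        return HF_PREFIX + fqn[len(WRAPPER_PREFIX):]
    return None

def group_adapter_keys(keys):
    """Group LoRA tensor names by translated target module, recording A/B presence."""
    grouped = {}
    for key in keys:
        match = LORA_KEY.match(key)
        if match is None:
            continue  # not a LoRA A/B weight
        target = translate(match.group(1))
        if target is None:
            continue  # module outside the language model namespace
        grouped.setdefault(target, set()).add(match.group(2))
    return grouped

# Made-up key names in the shape the merge script expects.
example_keys = [
    "base_model.model.language_model.model.layers.0.mixer.in_proj.lora_A.weight",
    "base_model.model.language_model.model.layers.0.mixer.in_proj.lora_B.weight",
    "base_model.model.some_other_module.proj.lora_A.weight",
]
print(group_adapter_keys(example_keys))
```

A target that ends up with only an A or only a B tensor would indicate an incomplete adapter file, so checking for `{"A", "B"}` on every entry is a cheap guard before merging.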


Step 5 — Results Comparison

Evaluation on 5 CORD-v2 Validation Samples

Full SFT (lr=1e-4, 400 steps, epoch_3_step_399)

| Sample | Ground Truth | Prediction | Match |
|---|---|---|---|
| 1 | <s_total>...<s_nm>REAL GANACHE</s_nm>...<s_nm>EGG TART</s_nm>...<s_nm>PIZZA TOAST</s_nm>... | Exact match | 100% |
| 2 | <s_total>...<s_nm>JAMUR</s_nm>...<s_nm>TAHU</s_nm>... | Exact match | 100% |
| 3 | <s_total>...<s_nm>Gojek Chicken Chilli Sauce H</s_nm>... | Correct values, slight name segmentation diff | 33% |
| 4 | <s_total>...<s_nm>VANILLA CHOCO HEART CAKE</s_nm>... | Exact match | 100% |
| 5 | <s_total>...<s_nm>Sate Padang</s_nm>... | Correct, extra <s_unitprice> field | ~0% |

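3/5 exact matches. All five samples produce well-formed structured output; samples 3 and 5 differ from the ground truth in item-name segmentation and in an extra <s_unitprice> field, respectively.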

LoRA PEFT (rank=64, lr=1e-3, 400 steps, epoch_0_step_99)

| Sample | Ground Truth | Prediction | Match |
|---|---|---|---|
| 1 | <s_total>...<s_nm>REAL GANACHE</s_nm>... | Exact match | 100% |
| 2 | <s_total>...<s_nm>JAMUR</s_nm>...<s_nm>TAHU</s_nm>... | Exact match | 100% |
| 3 | <s_total>...<s_nm>Gojek Chicken Chilli Sauce H</s_nm>... | Correct values, slight name segmentation diff | 33% |
| 4 | <s_total>...<s_nm>VANILLA CHOCO HEART CAKE</s_nm>... | Exact match | 100% |
| 5 | <s_total>...<s_nm>Sate Padang</s_nm>... | Exact match | 100% |

4/5 exact matches. All five samples produce well-formed structured output; sample 3 differs from the ground truth in item-name segmentation.

Summary

| Metric | Full SFT | LoRA PEFT |
|---|---|---|
| Trainable params | 31.5B (95.63%) | 55M (0.17%) |
| Learning rate | 1e-4 | 1e-3 |
| GPU memory | ~49 GiB | ~30 GiB |
| Training time (8x H100) | ~10 min | ~6 min |
| Best val loss | 0.034 (step 199) | 0.045 (step 99) |
| Final train loss | 0.004 | 0.017 |
| Checkpoint size | ~64 GB | ~27 MB |
| Exact matches (5 val) | 3/5 | 4/5 |