Fine-Tuning NemotronOmni on CORD-v2 Receipts — End-to-End Guide
A step-by-step guide for fine-tuning NemotronOmni (33B MoE) to extract structured receipt data from scanned images using NeMo Automodel. Covers both full SFT and LoRA PEFT.
What is NemotronOmni?
NemotronOmni (NemotronH_Nano_Omni_Reasoning_V3) is a ~33B multimodal MoE model supporting
image, video, and audio inputs.
Key architectural details:
- LLM backbone: NemotronV3 hybrid Mamba2 + Attention + MoE, 52 layers, hidden dim 2688
- Vision encoder: RADIO v2.5-H (ViT-Huge), 256 vision tokens per tile
- Audio encoder: Parakeet FastConformer (1024-dim)
- MoE: 128 experts per MoE layer, top-6 routing with sigmoid gating
- Total parameters: 33B (31.5B trainable with frozen vision/audio towers)
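To make "top-6 routing with sigmoid gating" concrete, here is a minimal, framework-free sketch of how a router could pick 6 of 128 experts for one token. The function name `route_top_k`, the example logits, and the renormalisation of the selected scores are assumptions for illustration; the model's actual router may differ in these details.

```python
import math

def route_top_k(logits, k=6):
    """Score every expert with an independent sigmoid, then keep the k best."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Assumption of this sketch: the k selected scores are renormalised to sum to 1.
    total = sum(scores[i] for i in top)
    return {i: scores[i] / total for i in top}

# Made-up router logits for a single token over 128 experts.
logits = [0.1 * (i % 13) - 0.5 for i in range(128)]
weights = route_top_k(logits, k=6)
print(sorted(weights))  # indices of the 6 experts that would process this token
```

Unlike softmax gating, each sigmoid score depends only on its own logit, so the experts do not compete for probability mass before the top-k selection.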
Fine-Tune for Receipt Field Extraction
We fine-tune NemotronOmni on CORD-v2 (Consolidated Receipt Dataset) to extract structured fields from scanned receipts.
The base model produces free-form descriptions. After fine-tuning, it outputs structured XML-like token sequences matching the receipt fields.
Guide Overview
Hardware Requirements
- 8x H100 80 GB GPUs required (MoE with EP=8)
- SFT memory: ~49 GiB per GPU
- LoRA memory: ~30 GiB per GPU
- Estimated training time: ~10 min on 8x H100 (400 steps, 800 training samples)
Step 0 — Set Up the Environment
NemotronOmni requires the mamba_ssm, causal_conv1d, and decord packages, all of which are included in the NeMo Automodel container.
Step 1 — Explore the CORD-v2 Dataset
CORD-v2 contains scanned receipts with structured ground-truth JSON labels.
Expected output:
Target Format: JSON-to-Token Conversion
NeMo Automodel converts structured JSON into an XML-like token sequence using
the json2token() function. This is the format the model is trained to produce:
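As a rough illustration of the conversion, the sketch below re-implements the idea in a few lines. It assumes the Donut-style convention of `<s_key>...</s_key>` tags with `<sep/>` between list items; the exact tags, key ordering, and signature of the real json2token() may differ, and the receipt values are made up.

```python
def json2token_sketch(obj):
    """Flatten nested JSON into an XML-like token string (illustrative only)."""
    if isinstance(obj, dict):
        # Each key wraps its converted value in an opening and a closing tag.
        return "".join(
            f"<s_{key}>{json2token_sketch(value)}</s_{key}>"
            for key, value in obj.items()
        )
    if isinstance(obj, list):
        # List items are joined with a separator token.
        return "<sep/>".join(json2token_sketch(item) for item in obj)
    return str(obj)

# Made-up receipt, not taken from CORD-v2.
receipt = {
    "menu": [
        {"nm": "Latte", "price": "4.50"},
        {"nm": "Bagel", "price": "3.00"},
    ],
    "total": {"total_price": "7.50"},
}
tokens = json2token_sketch(receipt)
print(tokens)
```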
Step 2 — Training Configuration
Full SFT Config
Config file: examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2.yaml
LoRA PEFT Config
Config file: examples/vlm_finetune/nemotron_omni/nemotron_omni_cord_v2_peft.yaml
Adds a peft: block to apply LoRA to language model linear layers only:
Collate function
NemotronOmni uses InternVL-style image handling where each <image> token in the
input is replaced by 256 vision embeddings during the model’s forward pass. The
collate function:
- Extracts images from the conversation
- Applies the chat template (which adds a <think></think> prefix for the assistant turn)
- Processes images through the NemotronOmni processor
- Builds image_flags tensors and creates training labels
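Two of the ideas above can be sketched without any framework: how one <image> placeholder corresponds to 256 sequence positions, and how training labels can hide the prompt from the loss. This is not the NeMo Automodel collate function. The helper names are hypothetical, and masking every position before the assistant turn with -100 is an assumption based on common SFT practice.

```python
IMAGE_TOKEN = "<image>"
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def expand_image_tokens(tokens, tokens_per_image=256):
    """Repeat each image placeholder once per vision embedding it stands for."""
    expanded = []
    for tok in tokens:
        expanded.extend([tok] * tokens_per_image if tok == IMAGE_TOKEN else [tok])
    return expanded

def build_labels(input_ids, assistant_start):
    """Copy input ids to labels, masking everything before the assistant turn."""
    return [
        IGNORE_INDEX if i < assistant_start else tok
        for i, tok in enumerate(input_ids)
    ]

prompt = ["Extract", "fields", IMAGE_TOKEN]
print(len(expand_image_tokens(prompt)))  # 2 text tokens + 256 image positions
print(build_labels([11, 12, 13, 14], assistant_start=2))
```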
Step 3 — Launch Fine-Tuning
Full SFT
LoRA PEFT
Training log — Full SFT
Training log — LoRA PEFT
Checkpoints saved
For LoRA, the checkpoint saves adapter weights instead:
Tip: The LOWEST_VAL symlink points to the checkpoint with the best validation loss.
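To find out which checkpoint the symlink currently selects, it can be resolved with the standard library. The sketch assumes LOWEST_VAL sits directly inside the checkpoint root next to the step directories; the temporary directory below only simulates that layout.

```python
import os
import tempfile

def resolve_best_checkpoint(checkpoint_root, link_name="LOWEST_VAL"):
    """Return the real path of the checkpoint the symlink points to."""
    link_path = os.path.join(checkpoint_root, link_name)
    if not os.path.islink(link_path):
        raise FileNotFoundError(f"{link_path} is not a symlink")
    return os.path.realpath(link_path)

# Simulated layout (assumption): one step directory plus the symlink.
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "epoch_0_step_99"))
    os.symlink("epoch_0_step_99", os.path.join(root, "LOWEST_VAL"))
    best = resolve_best_checkpoint(root)
    print(os.path.basename(best))
```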
Step 4 — Run Inference on the Fine-Tuned Model
Full SFT inference
Load the consolidated checkpoint and run inference on a handful of validation samples to spot-check structured output.
LoRA PEFT inference
NeMo Automodel saves LoRA adapters under its internal wrapper FQNs
(e.g. language_model.model.layers.X.mixer.in_proj), which differ from the HF
base model namespace (language_model.backbone.layers.X.mixer.in_proj).
To apply the adapter, merge the delta weights directly into the base model with
a small FQN translation:
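The two pieces of that merge are sketched below in plain Python: the prefix rewrite from the wrapper namespace to the HF namespace, and the standard LoRA update W' = W + (alpha / rank) * B @ A. The function names are hypothetical, nested lists stand in for tensors, and the adapter key layout of a real checkpoint is not shown here.

```python
def to_hf_fqn(automodel_fqn):
    """Rewrite the wrapper prefix to the HF base-model prefix."""
    return automodel_fqn.replace(
        "language_model.model.", "language_model.backbone.", 1
    )

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape (out, in), same as weight
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

# Toy 2x2 layer with a rank-1 adapter (values are made up).
merged = merge_lora(
    weight=[[1, 0], [0, 1]],
    lora_a=[[1, 2]],    # shape (rank, in)
    lora_b=[[1], [3]],  # shape (out, rank)
    alpha=2,
    rank=1,
)
print(to_hf_fqn("language_model.model.layers.0.mixer.in_proj"))
print(merged)
```

Merging the delta into the base weights avoids loading an adapter wrapper at inference time, at the cost of no longer being able to detach the adapter afterwards.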
Resources — single GPU; ~60 GB GPU RAM for the bf16 30B base. Runtime — ~75 s base load + ~1 s LoRA merge + ~5–15 s per sample.
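The ~60 GB figure follows from two bytes per bf16 parameter. The check below uses the 30B parameter count quoted above and counts weights only; activations, the KV cache, and framework overhead are not included.

```python
BYTES_PER_BF16_PARAM = 2

def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_BF16_PARAM):
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

base_gb = weight_memory_gb(30e9)
print(f"{base_gb:.0f} GB")
```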
Step 5 — Results Comparison
Evaluation on 5 CORD-v2 Validation Samples
Full SFT (lr=1e-4, 400 steps, epoch_3_step_399)
3/5 exact matches. All samples produce well-formed structured output.
LoRA PEFT (rank=64, lr=1e-3, 400 steps, epoch_0_step_99)
4/5 exact matches. All samples produce well-formed structured output.
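An exact-match count like the ones above can be computed with a few lines of Python. This sketch assumes a prediction counts as a match when it equals the ground-truth token string after trimming surrounding whitespace; the guide does not state the exact comparison rule, and the strings below are made up.

```python
def exact_match(predictions, references):
    """Return (number of exact matches, number of samples)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits, len(references)

# Made-up outputs, not results from the fine-tuned model.
refs = ["<s_total>7.50</s_total>", "<s_total>3.00</s_total>"]
preds = ["<s_total>7.50</s_total> ", "<s_total>3.10</s_total>"]
hits, total = exact_match(preds, refs)
print(f"{hits}/{total} exact matches")
```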