Understanding NeMo Customizer: Models, Training, and Resources#
Learn the fundamentals of how NeMo Customizer works to make informed decisions about your fine-tuning projects. This tutorial covers how models are organized, how adapters attach to base models, training types and GPU requirements, and how to choose the right approach for your use case.
Understanding these basics will help you navigate the fine-tuning process more effectively and avoid common issues. If you’re ready to start fine-tuning immediately, you can jump to SFT Customization Job after completing this tutorial.
Note
The time to complete this tutorial is approximately 15 minutes. This tutorial focuses on understanding and discovery—no actual training jobs are created.
Prerequisites#
Platform Prerequisites#
New to using NeMo Platform?
All platform resources—models, datasets, customization jobs, and more—must belong to a workspace. Workspaces provide organizational and authorization boundaries for your work. Within a workspace, you can optionally use projects to group related resources.
If you’re new to the platform, start with the Quickstart to learn how to deploy, customize, and evaluate models using the platform end-to-end.
If you’re already familiar with workspaces and how to upload datasets to the platform, you can proceed directly with this tutorial.
For more information, see Workspaces and Projects.
NeMo Customizer Prerequisites#
Platform Setup Requirements and Environment Variables
Before starting, make sure you have:
NeMo Platform deployed (see Quickstart)
The nemo-platform Python SDK installed (pip install nemo-platform)
(Optional) Weights & Biases account and API key for enhanced visualization
Set up environment variables:
# Set the base URL for NeMo Platform
export NMP_BASE_URL="http://localhost:8080" # Or your deployed platform URL
# Optional: Weights & Biases for experiment tracking
export WANDB_API_KEY="<your-wandb-api-key>"
Initialize the SDK:
import os
from nemo_platform import NeMoPlatform
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
Core Concepts#
What is a Model Entity?#
A Model Entity represents a model registered in the NeMo Platform. It contains:
FileSet Reference: Points to the model checkpoint files (weights, config, tokenizer)
Model Spec: Auto-populated metadata about the model architecture (layers, parameters, etc.)
Adapters: LoRA or other parameter-efficient fine-tuning weights attached to this model
Base Model Link: Optional reference to a parent model (for fine-tuned models)
Think of a Model Entity as a “model card” that tracks everything about a model—where its files are, what architecture it uses, and what adapters have been trained for it.
What is an Adapter?#
An Adapter is a set of parameter-efficient fine-tuning weights (like LoRA) that are attached to a Model Entity. Adapters:
Are nested within the parent Model Entity
Are enabled for inference by default after training completes
Have their own FileSet for storing the adapter weights
Track metadata like finetuning type, rank, and alpha values
What is a FileSet?#
A FileSet is a collection of files stored in the platform’s file service. For customization:
Model FileSet: Contains the base model checkpoint (config, weights, tokenizer)
Adapter FileSet: Contains the LoRA adapter weights
Dataset FileSet: Contains training and validation data
The Customization Workflow#
flowchart LR
A[1. Create FileSet<br/>with model files] --> B[2. Create Model Entity<br/>pointing to FileSet]
B --> C[3. Create Customization Job<br/>referencing Model Entity]
C --> D{Training Type?}
D -->|LoRA| E[Adapter created and<br/>attached to Model Entity]
D -->|Full SFT| F[New Model Entity<br/>with customized weights]
E --> G[Auto-deploy to NIM<br/>if enabled]
Step-by-Step Breakdown#
1. Create a FileSet for your base model
Upload your model checkpoint files (from HuggingFace, NGC, or local storage) to a FileSet:
import os
from nemo_platform import NeMoPlatform
from nemo_platform.types.files import HuggingfaceStorageConfigParam
client = NeMoPlatform(
base_url=os.environ.get("NMP_BASE_URL", "http://localhost:8080"),
workspace="default",
)
# Create a FileSet from HuggingFace
fileset = client.files.filesets.create(
workspace="default",
name="llama-3-2-1b",
description="Llama 3.2 1B base model",
storage=HuggingfaceStorageConfigParam(
type="huggingface",
repo_id="meta-llama/Llama-3.2-1B-Instruct",
repo_type="model",
token_secret="my-hf-token" # Secret containing HuggingFace token
)
)
2. Create a Model Entity pointing to the FileSet
model = client.models.create(
workspace="default",
name="llama-3-2-1b",
fileset="default/llama-3-2-1b", # Reference to the FileSet
description="Llama 3.2 1B base model"
)
# Wait for model spec to be auto-populated
import time
while not model.spec:
time.sleep(5)
model = client.models.retrieve(workspace="default", name="llama-3-2-1b")
print(f"Model architecture: {model.spec.family}")
print(f"Parameters: {model.spec.base_num_parameters:,}")
3. Create a Customization Job
job = client.customization.jobs.create(
workspace="default",
name="my-email-assistant-lora",
spec={
"model": "default/llama-3-2-1b",
"dataset": "fileset://default/email-training-data",
"training": {
"type": "sft",
"peft": {"type": "lora", "rank": 8, "alpha": 32},
"epochs": 3,
"batch_size": 32
},
"deployment_config": {"lora_enabled": True}
}
)
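As a rough sanity check before submitting a job like the one above, you can estimate how many optimizer steps it will run from the `epochs` and `batch_size` fields. A minimal sketch; the dataset size is illustrative, and it assumes one optimizer step per batch with no gradient accumulation:

```python
import math

def estimate_training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Estimate total optimizer steps for a simple SFT run.

    Assumes one optimizer step per batch and no gradient accumulation.
    """
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

# Illustrative: a 10,000-example dataset with the job spec above
# (epochs=3, batch_size=32)
print(estimate_training_steps(10_000, batch_size=32, epochs=3))  # 939
```

This kind of estimate is useful for judging whether a run will finish in minutes or hours before you commit GPU time.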
4. Access the Result
After training completes:
# For LoRA jobs - adapter is attached to the model
model = client.models.retrieve(workspace="default", name="llama-3-2-1b")
for adapter in model.adapters:
print(f"Adapter: {adapter.name}")
print(f" Type: {adapter.finetuning_type}")
print(f" Enabled: {adapter.enabled}")
print(f" Files: {adapter.fileset}")
Understanding Adapters and Deployment#
How Adapters Work#
When you run a LoRA customization job:
Training produces adapter weights (small compared to base model)
Adapter created and attached to the parent Model Entity
FileSet created with the adapter weights
Enabled by default so NIMs serving the base model automatically load the adapter
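The "small compared to base model" claim can be quantified. For a frozen linear layer of shape (d_out × d_in), a LoRA adapter of rank r adds only r × (d_in + d_out) trainable parameters. A sketch with illustrative dimensions (the layer shape below is an assumption for demonstration, not Llama's actual configuration):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a (d_out x d_in) linear layer.

    LoRA factorizes the weight update as B @ A, where A is (rank x d_in)
    and B is (d_out x rank).
    """
    return rank * d_in + d_out * rank

# Illustrative: a rank-8 adapter on a 2048x2048 projection layer
full_weight = 2048 * 2048               # parameters in the frozen base weight
adapter = lora_params(2048, 2048, rank=8)
print(adapter, adapter / full_weight)   # 32768 parameters, <1% of the base layer
```

This is why a single base model deployment can serve many adapters: each one is a small fraction of the base weights.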
Viewing Adapters on a Model#
model = client.models.retrieve(workspace="default", name="llama-3-2-1b")
print(f"Model: {model.name}")
print(f"Adapters:")
for adapter in model.adapters or []:
print(f" - {adapter.name}")
print(f" Type: {adapter.finetuning_type}")
print(f" Enabled: {adapter.enabled}")
print(f" Created: {adapter.created_at}")
Disabling and Re-enabling Adapters#
Adapters are enabled by default, but you can disable an adapter to remove it from inference without deleting it. When you set enabled=False, the sidecar running alongside the NIM automatically removes the adapter’s files on its next reconciliation pass (every few seconds). Re-enabling the adapter causes the sidecar to re-download and serve it again.
client.models.update_adapter(
model_name="llama-3-2-1b",
workspace="default",
adapter_name="my-custom-lora",
enabled=False,
)
To re-enable:
client.models.update_adapter(
model_name="llama-3-2-1b",
workspace="default",
adapter_name="my-custom-lora",
enabled=True,
)
Creating Adapters Manually#
You can also create adapters manually (e.g., from externally trained weights):
model = client.models.create_adapter(
model_name="llama-3-2-1b",
workspace="default",
name="my-custom-lora",
fileset="default/my-lora-weights",
finetuning_type="lora",
)
Training Types and Resource Requirements#
Available Training Approaches#
| Training Type | Output | GPUs Needed | Speed | Best For |
|---|---|---|---|---|
| LoRA | Adapter on parent Model | 1-2 GPUs | Fast | Experiments, quick iterations, multiple adapters per model |
| Full SFT | New Model Entity | 4-8+ GPUs | Slower | Production, maximum performance |
| DPO | New Model Entity | 2-4 GPUs | Medium | Preference alignment, RLHF-style training |
GPU Memory Guidelines#
| Model Size | LoRA (min) | Full Fine-tuning (min) | Notes |
|---|---|---|---|
| 1B | 1 × 16GB | 1 × 24GB | Small models, quick experiments |
| 3B | 1 × 24GB | 2 × 24GB | Good balance of capability/resources |
| 7-8B | 1 × 40GB | 2-4 × 80GB | Popular model size |
| 13B | 1 × 80GB | 4 × 80GB | Large models |
| 70B | 2 × 80GB | 8+ × 80GB | Very large models, enterprise use |
Storage Requirements#
Customization jobs consume disk space on the platform’s shared persistent volume for model files, finetuning checkpoints, and the final output artifact. Required space depends on the training type:
| Training Type | Peak Disk Usage | Breakdown |
|---|---|---|
| Full SFT | ~3× model size | Base model download (1×) + finetuning checkpoint (1×) + output model copy (1×) |
| LoRA | ~1.5× model size | Base model download (1×) + small adapter checkpoints + adapter output |
| LoRA (merged) | ~2× model size | Base model download (1×) + temporary merged weights (~1×) |
| DPO / GRPO | ~3× model size + | Similar to Full SFT, plus ephemeral node storage for Ray workers |
Important
These estimates cover model weights only and do not include training dataset size.
If the platform disk fills during a job, the job fails with an I/O
error and the job service may return a 500 status when you retrieve logs.
Ensure your platform’s shared persistent volume has at least 3× the base model size of free space before starting a full SFT job, or 1.5× for LoRA jobs.
For troubleshooting disk-related failures, see Troubleshooting NeMo Customizer.
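The table above can be turned into a quick pre-flight check before you submit a job. A sketch, using multipliers taken from the table (weights only, excluding the training dataset):

```python
# Approximate peak-disk multipliers from the storage table above.
PEAK_DISK_MULTIPLIER = {
    "full_sft": 3.0,
    "lora": 1.5,
    "lora_merged": 2.0,
    "dpo": 3.0,  # plus ephemeral node storage for Ray workers
}

def required_free_space_gb(model_size_gb: float, training_type: str) -> float:
    """Estimate free space needed on the shared persistent volume."""
    return model_size_gb * PEAK_DISK_MULTIPLIER[training_type]

# Illustrative: an 8B model checkpoint occupying ~16 GB, trained with full SFT
print(required_free_space_gb(16, "full_sft"))  # 48.0
print(required_free_space_gb(16, "lora"))      # 24.0
```

Comparing the result against the volume's actual free space before submission is cheaper than diagnosing an I/O failure mid-run.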
Parallelism Parameters Explained#
Parallelism is configured via training.parallelism. These parameters control how training workloads are distributed across GPUs:
| Parameter | Purpose | Impact on Training |
|---|---|---|
| `num_gpus_per_node` | Number of GPUs per node | Total GPUs = `num_gpus_per_node` × `num_nodes` |
| `tensor_parallel_size` | Distributes model tensors across GPUs | Higher values reduce memory per GPU but require more GPUs |
| `pipeline_parallel_size` | Distributes model layers across GPUs | Enables training larger models by splitting layers |
| `sequence_parallel` | Enables sequence-level parallelism | Reduces memory usage for long sequences |
| `expert_parallel_size` | Distributes expert layers across GPUs for Mixture of Experts (MoE) models | Controls how expert parameters are partitioned across devices |
data_parallel_size is automatically derived as total_gpus / (TP × PP × CP) and is not set directly.
Important
Recommended parallelism for Mixture of Experts (MoE) models:
The expert_parallel_size parameter is used to parallelize a Mixture of Experts (MoE) model’s experts across GPUs. For non-MoE models, this parameter is ignored. A model’s model card will indicate if it is a Mixture of Experts model and specifies its number of experts.
The number of experts in the model must be divisible by expert_parallel_size. For example, if a model has 8 experts, setting expert_parallel_size=4 results in each GPU processing 2 experts.
Also, the value of expert_parallel_size must evenly divide the derived data_parallel_size, which is automatically calculated as data_parallel_size = total GPUs / (tensor_parallel_size × pipeline_parallel_size × context_parallel_size).
For example, with 8 total GPUs, tensor_parallel_size=2, and pipeline_parallel_size=1:
Derived data_parallel_size = 8 / (2 × 1 × 1) = 4
Valid expert_parallel_size values: 1, 2, or 4 (must evenly divide 4)
Invalid expert_parallel_size value: 3 (does not evenly divide 4)
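The two divisibility rules above can be expressed as a small validator. A sketch, assuming the derived `data_parallel_size` formula from this section:

```python
def valid_expert_parallel_sizes(num_experts, total_gpus, tp=1, pp=1, cp=1):
    """Return every expert_parallel_size satisfying both divisibility rules.

    Rule 1: the model's expert count must be divisible by expert_parallel_size.
    Rule 2: expert_parallel_size must evenly divide the derived
            data_parallel_size = total_gpus / (tp * pp * cp).
    """
    data_parallel_size = total_gpus // (tp * pp * cp)
    return [ep for ep in range(1, data_parallel_size + 1)
            if num_experts % ep == 0 and data_parallel_size % ep == 0]

# The worked example above: 8 experts, 8 GPUs, tensor_parallel_size=2
print(valid_expert_parallel_sizes(num_experts=8, total_gpus=8, tp=2))  # [1, 2, 4]
```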
Resource Allocation Rules#
Training configurations must satisfy mathematical constraints to work properly:
Important
GPU Allocation Rule: The total number of GPUs (num_gpus_per_node × num_nodes) must be a multiple of:
tensor_parallel_size × pipeline_parallel_size × context_parallel_size
If this constraint isn’t met, your training job will fail with a validation error.
Example Calculations:
8 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ✅ Valid (8 = 4 × 2 × 1)
4 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ❌ Invalid (4 ≠ 4 × 2 × 1)
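A sketch of the allocation check, reproducing the two example calculations above so you can verify a configuration before submitting a job:

```python
def gpu_allocation_valid(total_gpus: int, tp: int = 1, pp: int = 1, cp: int = 1) -> bool:
    """Check that total GPUs is a multiple of tp * pp * cp."""
    return total_gpus % (tp * pp * cp) == 0

print(gpu_allocation_valid(8, tp=4, pp=2))  # True:  8 is a multiple of 4 * 2 * 1
print(gpu_allocation_valid(4, tp=4, pp=2))  # False: 4 is not a multiple of 8
```

Running this check locally turns a training-time validation error into an immediate answer.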
Choosing Your Training Approach#
Decision Framework#
flowchart TD
A[What's your goal?] --> B{Need maximum<br/>performance?}
B -->|Yes| C{Have 4+ GPUs?}
B -->|No| D[Choose LoRA]
C -->|Yes| E[Choose Full SFT]
C -->|No| D
D --> F{Multiple<br/>use cases?}
F -->|Yes| G[Train multiple<br/>LoRA adapters]
F -->|No| H[Single LoRA<br/>adapter]
E --> I[New Model Entity<br/>with full weights]
G --> J[Multiple adapters<br/>on same Model Entity]
H --> J
When to Use LoRA#
✅ Choose LoRA when:
You have limited GPU resources (1-2 GPUs)
You want fast training iterations
You need multiple specialized versions of the same base model
You want to auto-deploy adapters to existing NIM deployments
When to Use Full Fine-Tuning#
✅ Choose Full SFT when:
You need maximum model performance
You have sufficient GPU resources (4+ GPUs)
You want complete control over all model weights
You’re preparing a production deployment
Model Types and Capabilities#
Supported Language Models#
| Model Family | Description | Examples |
|---|---|---|
| Llama Models | General-purpose language models excellent for instruction following, conversation, and text generation tasks | |
| Llama Nemotron Models | NVIDIA's specialized variants optimized for specific use cases with enhanced reasoning capabilities | Various Nano and Super variants |
| Phi Models | Microsoft's efficient models designed for strong reasoning with optimized deployment characteristics | Phi model family configurations |
| GPT-OSS Models | Open-source GPT-based models supporting Full SFT customization workflows | Various GPT-OSS configurations |
Specialized Models#
| Model Type | Status | Details |
|---|---|---|
| Embedding Models | ✅ Supported | Model: Llama 3.2 NV EmbedQA 1B for question-answering and retrieval tasks. Use cases: semantic search, document retrieval, question-answering systems, RAG pipelines. Note: typically disabled by default—contact your administrator for access |
| Reranking Models | ❌ Not Supported | Alternative: use embedding models for retrieval tasks, or implement reranking in your application layer |
Importing Custom Models#
You can import any HuggingFace-compatible model:
from nemo_platform.types.files import HuggingfaceStorageConfigParam
# Create FileSet from HuggingFace
fileset = client.files.filesets.create(
workspace="default",
name="my-custom-model",
storage=HuggingfaceStorageConfigParam(
type="huggingface",
repo_id="organization/model-name",
repo_type="model",
token_secret="my-hf-token"
)
)
# Create Model Entity
model = client.models.create(
workspace="default",
name="my-custom-model",
fileset="default/my-custom-model"
)
For detailed guidance, see Import HuggingFace Model.
Next Steps#
Now that you understand how Model Entities and Adapters work, you’re ready to proceed:
Learn how to prepare your data for fine-tuning.
Create a parameter-efficient LoRA adapter.
Use full supervised fine-tuning for maximum performance.
Import and fine-tune private HuggingFace models.
Key Takeaways#
✅ Model Entities contain model metadata and point to FileSet with checkpoint files
✅ Adapters (LoRA) are attached to Model Entities, not stored separately
✅ FileSet is where actual model/adapter files are stored
✅ LoRA training creates an adapter on the parent Model Entity
✅ Full SFT training creates a new Model Entity with full weights
✅ Adapters are enabled by default and automatically loaded by NIMs serving the base model
✅ GPU requirements vary significantly between LoRA and full fine-tuning
✅ Custom HuggingFace models can be imported via FileSet + Model Entity
Quick Reference Commands#
# List all Model Entities
models = client.models.list(workspace="default")
# Get a specific Model Entity with adapters
model = client.models.retrieve(workspace="default", name="llama-3-2-1b")
# Create a customization job
job = client.customization.jobs.create(
workspace="default",
name="my-job",
spec={
"model": "default/llama-3-2-1b",
"dataset": "fileset://default/my-dataset",
"training": {
"type": "sft",
"peft": {"type": "lora"}
}
}
)
# Add an adapter to a model
client.models.create_adapter(
model_name="llama-3-2-1b",
workspace="default",
name="my-adapter",
fileset="default/adapter-weights",
finetuning_type="lora"
)
You now have the foundation to make informed decisions about your fine-tuning projects!