Understanding NeMo Customizer Configurations and Models#

Learn the fundamentals of NeMo Customizer configurations and models to make informed decisions about your fine-tuning projects. This tutorial covers what configurations are available, which models you can use, and how to choose the right approach for your use case.

Understanding these basics will help you navigate the fine-tuning process more effectively and avoid common configuration issues. Once you complete this tutorial, you can jump straight to Format Training Dataset to start fine-tuning.

Note

The time to complete this tutorial is approximately 15 minutes. This tutorial focuses on understanding and discovery—no actual training jobs are created.

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.


What Are Customization Configurations?#

Customization configurations are pre-built recipes that combine three key elements:

  1. Model: The AI model you want to customize (Llama, Phi, embedding models, etc.)

  2. Hardware: The GPU requirements and parallelization settings

  3. Training Options: Available training types (LoRA, Full SFT, DPO, etc.)

Think of configurations like cooking recipes—they specify the ingredients (model), equipment (hardware), and cooking methods (training types) needed to achieve your desired result.
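
As a quick illustration, the minimal sketch below pulls one enabled configuration and prints these three elements. It assumes the environment variables from the Prerequisites and the attribute names used in the examples later in this tutorial.

import os
from nemo_microservices import NeMoMicroservices

# Inspect the three elements of a single configuration (illustrative sketch)
client = NeMoMicroservices(base_url=os.environ["CUSTOMIZER_BASE_URL"])
config = client.customization.configs.list(filter={"enabled": True}).data[0]

print(f"Model: {config.target.base_model}")        # 1. the model being customized
for option in config.training_options:             # 2. hardware and 3. training options
    print(f"  {option.training_type}/{option.finetuning_type}: "
          f"{option.num_gpus} GPU(s) × {option.num_nodes} node(s)")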

Understanding Model Names vs. Configuration Names#

It’s important to understand the difference between how models are referenced and how configurations are named. These serve different purposes in the NeMo Customizer ecosystem.

Model names include the organization prefix and identify the actual AI model:

meta/llama-3.1-8b-instruct
│    │
│    └─ Model name and variant
└─ Organization (Hugging Face namespace)

This is how the model is referenced in Hugging Face and in the configuration’s base_model field.

Configuration names follow a specific pattern that tells you important information:

llama-3.1-8b-instruct@v1.0.0+A100
│                     │      │
│                     │      └─ Hardware target
│                     └─ Version
└─ Model identifier (without org prefix)

Configuration names use simplified model identifiers for brevity and consistency across the platform.
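
To see the two naming schemes side by side, the short sketch below (reusing the client from the example above, and assuming the namespace and target.base_model fields shown in the example response later in this tutorial) prints each configuration name next to the fully qualified model it references.

# Compare configuration names with the fully qualified model references
for config in client.customization.configs.list(filter={"enabled": True}).data:
    print(f"config:     {config.namespace}/{config.name}")   # simplified model identifier
    print(f"base_model: {config.target.base_model}")         # includes the org prefix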

Note

Namespace vs. Organization: Configuration namespaces and model organizations serve different purposes and are independent:

  • Configuration namespace: User/admin-defined namespace where the config is stored (often defaults to "default")

  • Model organization: The Hugging Face organization that owns the model (like meta/ in the base_model)

Important

Hardware Compatibility: Configurations marked as +A100 are fully compatible with B200 GPUs. The naming reflects the original target hardware, but the underlying resource requirements work across compatible GPU families.


Discovering Available Configurations#

List All Enabled Configurations#

Start by seeing what configurations are immediately available to you:

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)

# Get all enabled configurations
configs = client.customization.configs.list(
    filter={"enabled": True}
)

print(f"You have {len(configs.data)} configurations available:")
for config in configs.data:
    print(f"  • {config.name}")
    print(f"    Description: {config.description}")
    print(f"    Training options: {len(config.training_options)}")
    for option in config.training_options:
        print(f"      - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
    print()

# List all enabled configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=true" \
  --data-urlencode "page_size=50" | jq '.data[] | {name, description, training_options}'

Discover All Configurations (Including Disabled)#

To see the full range of possibilities, including configurations that might be available but currently disabled:

# List ALL configurations (enabled and disabled)
all_configs = client.customization.configs.list(page_size=50)

print(f"Total configurations in your environment: {len(all_configs.data)}")

enabled_count = sum(1 for config in all_configs.data if config.target.enabled)
disabled_count = len(all_configs.data) - enabled_count

print(f"  ✓ Enabled: {enabled_count}")
print(f"  ✗ Disabled: {disabled_count}")

# Show disabled configurations
print(f"\nDisabled configurations (contact admin to enable):")
for config in all_configs.data:
    if not config.target.enabled:
        print(f"  • {config.name} - {config.description}")

# List all configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "page_size=50" | jq

# List only disabled configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=false" \
  --data-urlencode "page_size=50" | jq '.data[] | {name, description}'

Example Response
{
  "object": "list",
  "data": [
    {
      "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
      "namespace": "default",
      "dataset_schemas": [
        {
          "title": "Newline-Delimited JSON File",
          "type": "array",
          "items": {
            "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
            "properties": {
              "prompt": {
                "description": "The prompt for the entry",
                "title": "Prompt",
                "type": "string"
              },
              "completion": {
                "description": "The completion to train on",
                "title": "Completion",
                "type": "string"
              }
            },
            "required": ["prompt", "completion"],
            "title": "SFTDatasetItemSchema",
            "type": "object"
          }
        }
      ],
      "training_options": [
        {
          "training_type": "sft",
          "finetuning_type": "lora",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        },
        {
          "training_type": "sft",
          "finetuning_type": "all_weights",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        }
      ]
    },
    {
      "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
      "namespace": "nvidia",
      "dataset_schemas": [
        {
          "title": "Newline-Delimited JSON File",
          "type": "array",
          "items": {
            "description": "Schema for embedding training data items.",
            "properties": {
              "query": {
                "description": "The query to use as an anchor",
                "title": "Query",
                "type": "string"
              },
              "pos_doc": {
                "description": "A document that should match positively with the anchor",
                "title": "Positive Document",
                "type": "string"
              },
              "neg_doc": {
                "description": "Documents that should not match with the anchor",
                "title": "Negative Documents",
                "type": "array",
                "items": {"type": "string"}
              }
            },
            "required": ["query", "pos_doc", "neg_doc"],
            "title": "EmbeddingDatasetItemSchema",
            "type": "object"
          }
        }
      ],
      "training_options": [
        {
          "training_type": "sft",
          "finetuning_type": "lora_merged",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        }
      ]
    }
  ]
}

Note

Configuration vs. Target Architecture: Configurations don’t have their own enabled field. Instead, they inherit their availability from their underlying customization targets. When you see config.target.enabled, you’re checking whether the target model that the configuration references is enabled. This architecture allows administrators to control model availability at the target level, which affects all configurations that use that target.
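
As a quick illustration of this relationship, you can reuse all_configs from the snippet above and surface the target-level flag explicitly:

# Availability is inherited from the underlying target, not stored on the config itself
for config in all_configs.data:
    status = "enabled" if config.target.enabled else "disabled"
    print(f"{config.name}: target model '{config.target.base_model}' is {status}")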


Understanding Model Types and Capabilities#

Language Models#

These models are designed for text generation, instruction following, and conversational AI:

Language Model Options#

| Model Family | Description | Examples |
|---|---|---|
| Llama Models | General-purpose language models excellent for instruction following, conversation, and text generation tasks | llama-3.1-8b-instruct, llama-3.2-1b-instruct |
| Llama Nemotron Models | NVIDIA’s specialized variants optimized for specific use cases with enhanced reasoning capabilities | Various Nano and Super variants |
| Phi Models | Microsoft’s efficient models designed for strong reasoning with optimized deployment characteristics | Phi model family configurations |
| GPT-OSS Models | Open-source GPT-based models supporting Full SFT customization workflows | Various GPT-OSS configurations |

Specialized Models#

Specialized Model Support#

| Model Type | Status | Details |
|---|---|---|
| Embedding Models | ✅ Supported | Model: Llama 3.2 NV EmbedQA 1B for question-answering and retrieval tasks. Use cases: semantic search, document retrieval, question-answering systems, RAG pipelines. Note: typically disabled by default; contact your administrator for access |
| Reranking Models | ❌ Not Supported | Alternative: use embedding models for retrieval tasks, or implement reranking in your application layer |

Custom Models#

You can import and fine-tune models from the Hugging Face Transformers library:

  • Supported: Any model compatible with the Hugging Face Transformers architecture

  • Process: Import via the private HuggingFace model tutorial

  • Limitations: Some architectures (like Conv1D-based models) are not compatible


Training Types and Resource Requirements#

Available Training Approaches#

Training Type Comparison#

| Training Type | Resource Usage | Training Speed | Flexibility | Best For |
|---|---|---|---|---|
| LoRA | Low (1-2 GPUs) | Fast | Good | Experiments, quick iterations |
| Full SFT | High (2-8 GPUs) | Slower | Maximum | Production, maximum performance |
| DPO | Medium (2-4 GPUs) | Medium | Specialized | Preference alignment |
| Knowledge Distillation | Medium (varies) | Medium | Specialized | Model compression |
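
To check which of these approaches your environment actually exposes, a small sketch like the following (reusing the client from earlier; the aggregation itself is only illustrative) counts the training_type/finetuning_type combinations across enabled configurations.

from collections import Counter

# Count how many enabled configurations offer each training approach
configs = client.customization.configs.list(filter={"enabled": True})
combo_counts = Counter(
    f"{option.training_type}/{option.finetuning_type}"
    for config in configs.data
    for option in config.training_options
)

for combo, count in combo_counts.most_common():
    print(f"{combo}: offered by {count} configuration(s)")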

Checking Resource Requirements#

Use this approach to understand the GPU requirements for different configurations:

# Analyze resource requirements across configurations
configs = client.customization.configs.list(filter={"enabled": True})

print("Resource Requirements Summary:")
print("=" * 50)

for config in configs.data:
    print(f"\n📋 {config.name}")
    print(f"   Base Model: {config.target.base_model}")

    for option in config.training_options:
        gpu_total = option.num_gpus * option.num_nodes
        print(f"   • {option.training_type.upper()}/{option.finetuning_type.upper()}: {gpu_total} total GPUs")
        print(f"     ({option.num_gpus} GPUs × {option.num_nodes} nodes)")

# Find configurations that fit your hardware
available_gpus = 2  # Adjust based on your setup
print(f"\n🔍 Configurations that fit {available_gpus} GPUs:")

for config in configs.data:
    for option in config.training_options:
        if option.num_gpus <= available_gpus:
            print(f"   ✓ {config.name} - {option.training_type}/{option.finetuning_type}")

# Get resource information for all configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=true" | \
  jq '.data[] | {
    name: .name,
    base_model: .target.base_model,
    training_options: .training_options | map({
      type: "\(.training_type)/\(.finetuning_type)",
      gpus: .num_gpus,
      nodes: .num_nodes,
      total_gpus: (.num_gpus * .num_nodes)
    })
  }'

Training Configuration Impact on Deployment#

Understanding how your training configuration choices affect deployment is crucial for planning your fine-tuning strategy. The parallelism and resource settings you choose during training have direct implications for how your models can be deployed and used.

Parallelism Parameters Explained#

When you examine training options in configurations, you’ll see several parallelism parameters that control how training workloads are distributed across GPUs:

Training Parallelism Parameters#

| Parameter | Purpose | Impact on Training |
|---|---|---|
| tensor_parallel_size | Distributes model tensors across GPUs | Higher values reduce memory per GPU but require more GPUs |
| pipeline_parallel_size | Distributes model layers across GPUs | Enables training larger models by splitting layers |
| use_sequence_parallel | Enables sequence-level parallelism | Reduces memory usage for long sequences |
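
These parameters are reported on each training option. The sketch below reuses the configs list from the earlier snippets; because pipeline_parallel_size does not appear in every example response, it is read defensively and falls back to 1.

# Print the parallelism settings exposed by each training option
for config in configs.data:
    for option in config.training_options:
        pp = getattr(option, "pipeline_parallel_size", 1)  # may be absent on some options
        print(f"{config.name} [{option.training_type}/{option.finetuning_type}]")
        print(f"  tensor_parallel_size:   {option.tensor_parallel_size}")
        print(f"  pipeline_parallel_size: {pp}")
        print(f"  use_sequence_parallel:  {option.use_sequence_parallel}")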

Resource Allocation Rules#

Training configurations must satisfy mathematical constraints to work properly:

Important

GPU Allocation Rule: The total number of GPUs (num_gpus × num_nodes) must be a multiple of: tensor_parallel_size × pipeline_parallel_size × expert_model_parallel_size

If this constraint isn’t met, your training job will fail with a validation error. A small helper that checks the rule follows the example calculations below.

Example Calculations:

  • 8 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ✅ Valid (8 = 4 × 2 × 1)

  • 4 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ❌ Invalid (4 ≠ 4 × 2 × 1)
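
As a sanity check before submitting a job, a small helper like the hypothetical validate_gpu_allocation below applies the same rule, treating any missing parallelism value as 1.

def validate_gpu_allocation(num_gpus, num_nodes, tensor_parallel_size=1,
                            pipeline_parallel_size=1, expert_model_parallel_size=1):
    """Return True if total GPUs is a multiple of the combined parallelism factor."""
    total_gpus = num_gpus * num_nodes
    parallel_factor = (tensor_parallel_size * pipeline_parallel_size
                       * expert_model_parallel_size)
    return total_gpus % parallel_factor == 0

print(validate_gpu_allocation(8, 1, tensor_parallel_size=4, pipeline_parallel_size=2))  # True
print(validate_gpu_allocation(4, 1, tensor_parallel_size=4, pipeline_parallel_size=2))  # False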

Model Artifact Types and Deployment Paths#

Your training choices determine how the resulting model can be deployed:

Training Type to Deployment Mapping#

| Training Type | Model Artifact | Deployment Method | Key Environment Variable |
|---|---|---|---|
| LoRA | Adapter weights only | Uses base model + adapters | NIM_PEFT_SOURCE |
| Full SFT | Complete model weights | Standalone model deployment | NIM_FT_MODEL |
| DPO | Complete model weights | Standalone model deployment | NIM_FT_MODEL |
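
If you want to plan deployment programmatically, a rough sketch along these lines can flag which environment variable will matter later. The mapping simply mirrors the table above, keyed on finetuning_type, and is not an official API.

# Hypothetical mapping that mirrors the deployment table above
DEPLOYMENT_PATHS = {
    "lora": ("adapter weights only", "NIM_PEFT_SOURCE"),
    "all_weights": ("complete model weights", "NIM_FT_MODEL"),  # Full SFT and DPO
}

for config in configs.data:
    for option in config.training_options:
        artifact, env_var = DEPLOYMENT_PATHS.get(
            option.finetuning_type, ("see the deployment docs", "n/a")
        )
        print(f"{config.name} [{option.training_type}/{option.finetuning_type}]: "
              f"{artifact} ({env_var})")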

Deployment Architecture Overview#

The platform uses different deployment strategies based on your training approach:

LoRA deployment architecture: base model + adapter loading

  • Base model remains unchanged

  • Adapters loaded dynamically from Entity Store

  • Multiple adapters can share the same base model

  • Lower storage and memory requirements

Environment Configuration:

NIM_PEFT_SOURCE=http://nemo-entity-store:8000
NIM_PEFT_REFRESH_INTERVAL=30

Full SFT and DPO deployment architecture: complete model replacement

  • Entire model weights replaced with fine-tuned version

  • Requires dedicated deployment resources

  • Higher storage and memory requirements

  • Maximum customization flexibility

Environment Configuration:

NIM_FT_MODEL=/model-store
NIM_CUSTOM_MODEL=/model-store

Planning Your Training Strategy#

Consider these factors when choosing training configurations:

For Experimentation:

  • Choose LoRA with lower parallelism settings

  • Faster iteration cycles

  • Lower resource requirements

  • Easy to compare multiple approaches

For Production Deployment:

  • Consider full SFT for maximum performance

  • Plan for higher deployment resource requirements

  • Factor in model storage and loading times

  • Evaluate whether adapter flexibility is needed

Note

Deployment Guidance: For detailed information about deploying your fine-tuned models, including manual deployment options outside the NeMo platform, refer to the inference deployment documentation.


Making Configuration Decisions#

Decision Framework#

Use this framework to choose the right configuration for your project:

flowchart TD
    A[What's your task?] --> B{Text Generation?}
    A --> C{Q&A/Retrieval?}
    A --> D{Custom Model?}
    B -->|Yes| E[Language Models<br/>Llama, Phi, Nemotron]
    C -->|Yes| F[Embedding Models<br/>Llama 3.2 NV EmbedQA]
    D -->|Yes| G[Import from<br/>Hugging Face]
    E --> H{Resource Constraints?}
    F --> I[Contact Admin if Disabled]
    G --> J[Follow Import Tutorial]
    H -->|Low Resources<br/>Quick Experiment| K[Choose LoRA Config]
    H -->|High Resources<br/>Production Use| L[Choose Full SFT Config]
    K --> M[1-2 GPUs Needed]
    L --> N[4-8 GPUs Needed]

Example: Choosing a Configuration#

Let’s walk through a realistic example:

Scenario: You want to create an email writing assistant

Step 1: Identify Task Type

  • Task: Text generation (email writing)

  • Model family: Language models (Llama, Phi, etc.)

Step 2: Assess Resources

  • Available hardware: 2 A100 GPUs

  • Timeline: Need results within a day

  • Budget: Limited GPU hours

Step 3: Choose Training Type

  • Constraint: Limited resources and time

  • Choice: LoRA training (parameter-efficient)

Step 4: Find Matching Configuration

# Find LoRA configurations that fit 2 GPUs
suitable_configs = []

for config in configs.data:
    for option in config.training_options:
        if (option.finetuning_type == "lora" and
            option.num_gpus <= 2):
            suitable_configs.append({
                'name': config.name,
                'base_model': config.target.base_model,
                'gpus': option.num_gpus
            })

print("Suitable configurations for your use case:")
for config in suitable_configs:
    print(f"  ✓ {config['name']} ({config['gpus']} GPUs)")

Result: Choose llama-3.2-1b-instruct@v1.0.0+A100 with LoRA training


Getting Help with Configurations#

When Configurations Are Disabled#

If you find a configuration that meets your needs but is disabled:

  1. Note the exact configuration name (e.g., llama-3.1-8b-instruct@v1.0.0+A100)

  2. Contact your cluster administrator with a specific request

  3. Provide context about your use case and why you need this configuration

Example Request Email:

Subject: Enable NeMo Customizer Configuration Request

Hi [Admin Name],

I need access to the following configuration for my project:
Configuration: llama-3.1-8b-instruct@v1.0.0+A100

Use Case: Fine-tuning a customer support chatbot
Training Type: LoRA (low resource requirements)
Timeline: Need to start training this week

This configuration appears to be available but disabled. Could you please enable it for our team?

Thanks!

Administrator Resources#

If you’re an administrator, refer to the configuration management documentation for guidance on:

  • Creating new configurations

  • Enabling/disabling configurations

  • Managing hardware resource allocation

  • Setting up configurations for different user groups


Next Steps#

Now that you understand NeMo Customizer configurations and models, you’re ready to proceed with fine-tuning:

  • Format Training Dataset: Learn how to prepare your data for the model type you’ve chosen.

  • Start a LoRA Model Customization Job: Begin with parameter-efficient fine-tuning if you chose a LoRA configuration.

  • Start a Full SFT Customization Job: Use full supervised fine-tuning if you chose an all_weights configuration.

  • Import and Fine-Tune Private HuggingFace Models: Import and fine-tune your own private Hugging Face models.

Key Takeaways#

  • Configurations combine model + hardware + training options in pre-built recipes.

  • A100 configurations work on B200 hardware; compatibility is built in.

  • LoRA requires fewer resources than Full SFT but offers less customization flexibility.

  • Training parallelism settings affect deployment requirements, so plan accordingly.

  • LoRA uses adapters (NIM_PEFT_SOURCE), while Full SFT uses complete models (NIM_FT_MODEL).

  • GPU allocation must satisfy mathematical constraints for training to succeed.

  • Disabled configurations can often be enabled by contacting your administrator.

  • Embedding models are supported for Q&A and retrieval tasks.

  • Reranking models are not currently supported; use embedding models instead.

  • Custom Hugging Face models can be imported, with some architectural limitations.

Quick Reference Commands#

# List enabled configurations
client.customization.configs.list(filter={"enabled": True})

# List all configurations (including disabled)  
client.customization.configs.list()

# List disabled configurations
client.customization.configs.list(filter={"enabled": False})

# Check resource requirements
for config in configs.data:
    for option in config.training_options:
        print(f"{config.name}: {option.finetuning_type} needs {option.num_gpus} GPUs")

You now have the foundation to make informed decisions about your fine-tuning projects and navigate the NeMo Customizer ecosystem effectively.