Understanding NeMo Customizer Configurations and Models#

Learn the fundamentals of NeMo Customizer configurations and models to make informed decisions about your fine-tuning projects. This tutorial covers what configurations are available, which models you can use, and how to choose the right approach for your use case.

Understanding these basics will help you navigate the fine-tuning process more effectively and avoid common configuration issues. Once you complete this tutorial, you can jump straight to Format Training Dataset to start fine-tuning.

Note

The time to complete this tutorial is approximately 15 minutes. This tutorial focuses on understanding and discovery—no actual training jobs are created.

Prerequisites#

Platform Prerequisites#

New to using NeMo microservices?

NeMo microservices use an entity management system to organize all resources—including datasets, models, and job artifacts—into namespaces and projects. Without setting up these organizational entities first, you cannot use the microservices.

If you’re new to the platform, complete these foundational tutorials first:

  1. Get Started Tutorials: Learn how to deploy, customize, and evaluate models using the platform end-to-end

  2. Set Up Organizational Entities: Learn how to create namespaces and projects to organize your work

If you’re already familiar with namespaces, projects, and how to upload datasets to the platform, you can proceed directly with this tutorial.

Learn more: Entity Concepts

NeMo Customizer Prerequisites#

Microservice Setup Requirements and Environment Variables

Before starting, make sure you have:

  • Access to NeMo Customizer

  • The huggingface_hub Python package installed

  • (Optional) Weights & Biases account and API key for enhanced visualization

Set up environment variables:

# Set up environment variables
export CUSTOMIZER_BASE_URL="<your-customizer-service-url>"
export ENTITY_HOST="<your-entity-store-url>"
export DS_HOST="<your-datastore-url>"
export NAMESPACE="default"
export DATASET_NAME="test-dataset"

# Hugging Face environment variables (for dataset/model file management)
export HF_ENDPOINT="${DS_HOST}/v1/hf"
export HF_TOKEN="dummy-unused-value"  # Or your actual HF token

# Optional monitoring
export WANDB_API_KEY="<your-wandb-api-key>"

Replace the placeholder values with your actual service URLs and credentials.


What Are Customization Configurations?#

Customization configurations are pre-built recipes that combine three key elements:

  1. Model: The AI model you want to customize (Llama, Phi, embedding models, etc.)

  2. Hardware: The GPU requirements and parallelization settings

  3. Training Options: Available training types (LoRA, Full SFT, DPO, etc.)

Think of configurations like cooking recipes—they specify the ingredients (model), equipment (hardware), and cooking methods (training types) needed to achieve your desired result.
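
As a quick illustration, the minimal sketch below pulls one enabled configuration and prints these three elements. It assumes the environment variables from the Prerequisites and the attribute names used in the examples later in this tutorial.

import os
from nemo_microservices import NeMoMicroservices

# Inspect the three elements of a single configuration (illustrative sketch)
client = NeMoMicroservices(base_url=os.environ["CUSTOMIZER_BASE_URL"])
config = client.customization.configs.list(filter={"enabled": True}).data[0]

print(f"Model: {config.target.base_model}")        # 1. the model being customized
for option in config.training_options:             # 2. hardware and 3. training options
    print(f"  {option.training_type}/{option.finetuning_type}: "
          f"{option.num_gpus} GPU(s) × {option.num_nodes} node(s)")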

Understanding Model Names vs. Configuration Names#

It’s important to understand the difference between how models are referenced and how configurations are named. These serve different purposes in the NeMo Customizer ecosystem.

Model names include the organization prefix and identify the actual AI model:

meta/llama-3.1-8b-instruct
│    │
│    └─ Model name and variant
└─ Organization (Hugging Face namespace)

This is how the model is referenced in Hugging Face and in the configuration’s base_model field.

Configuration names follow a specific pattern that tells you important information:

llama-3.1-8b-instruct@v1.0.0+A100
│                     │      │
│                     │      └─ Hardware target
│                     └─ Version
└─ Model identifier (without org prefix)

Configuration names use simplified model identifiers for brevity and consistency across the platform.
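
To see the two naming schemes side by side, the short sketch below (reusing the client from the example above, and assuming the namespace and target.base_model fields shown in the example response later in this tutorial) prints each configuration name next to the fully qualified model it references.

# Compare configuration names with the fully qualified model references
for config in client.customization.configs.list(filter={"enabled": True}).data:
    print(f"config:     {config.namespace}/{config.name}")   # simplified model identifier
    print(f"base_model: {config.target.base_model}")         # includes the org prefix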

Note

Namespace vs. Organization: Configuration namespaces and model organizations serve different purposes and are independent:

  • Configuration namespace: User/admin-defined namespace where the config is stored (often defaults to "default")

  • Model organization: The Hugging Face organization that owns the model (like meta/ in the base_model)

Important

Hardware Compatibility: Configurations marked as +A100 are fully compatible with B200 GPUs. The naming reflects the original target hardware, but the underlying resource requirements work across compatible GPU families.


Discovering Available Configurations#

List All Enabled Configurations#

Start by seeing what configurations are immediately available to you:

from nemo_microservices import NeMoMicroservices
import os

# Initialize the client
client = NeMoMicroservices(
    base_url=os.environ['CUSTOMIZER_BASE_URL']
)

# Get all enabled configurations
configs = client.customization.configs.list(
    filter={"enabled": True}
)

print(f"You have {len(configs.data)} configurations available:")
for config in configs.data:
    print(f"  • {config.name}")
    print(f"    Description: {config.description}")
    print(f"    Training options: {len(config.training_options)}")
    for option in config.training_options:
        print(f"      - {option.training_type}/{option.finetuning_type}: {option.num_gpus} GPUs")
    print()

# List all enabled configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=true" \
  --data-urlencode "page_size=50" | jq '.data[] | {name, description, training_options}'

Discover All Configurations (Including Disabled)#

To see the full range of possibilities, including configurations that might be available but currently disabled:

# List ALL configurations (enabled and disabled)
all_configs = client.customization.configs.list(page_size=50)

print(f"Total configurations in your environment: {len(all_configs.data)}")

enabled_count = sum(1 for config in all_configs.data if config.target.enabled)
disabled_count = len(all_configs.data) - enabled_count

print(f"  ✓ Enabled: {enabled_count}")
print(f"  ✗ Disabled: {disabled_count}")

# Show disabled configurations
print(f"\nDisabled configurations (contact admin to enable):")
for config in all_configs.data:
    if not config.target.enabled:
        print(f"  • {config.name} - {config.description}")

# List all configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "page_size=50" | jq

# List only disabled configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=false" \
  --data-urlencode "page_size=50" | jq '.data[] | {name, description}'

Example Response
{
  "object": "list",
  "data": [
    {
      "name": "meta/llama-3.2-1b-instruct@v1.0.0+A100",
      "namespace": "default",
      "dataset_schemas": [
        {
          "title": "Newline-Delimited JSON File",
          "type": "array",
          "items": {
            "description": "Schema for Supervised Fine-Tuning (SFT) training data items.",
            "properties": {
              "prompt": {
                "description": "The prompt for the entry",
                "title": "Prompt",
                "type": "string"
              },
              "completion": {
                "description": "The completion to train on",
                "title": "Completion",
                "type": "string"
              }
            },
            "required": ["prompt", "completion"],
            "title": "SFTDatasetItemSchema",
            "type": "object"
          }
        }
      ],
      "training_options": [
        {
          "training_type": "sft",
          "finetuning_type": "lora",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        },
        {
          "training_type": "sft",
          "finetuning_type": "all_weights",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        }
      ]
    },
    {
      "name": "nvidia/llama-3.2-nv-embedqa-1b@v2+A100",
      "namespace": "nvidia",
      "dataset_schemas": [
        {
          "title": "Newline-Delimited JSON File",
          "type": "array",
          "items": {
            "description": "Schema for embedding training data items.",
            "properties": {
              "query": {
                "description": "The query to use as an anchor",
                "title": "Query",
                "type": "string"
              },
              "pos_doc": {
                "description": "A document that should match positively with the anchor",
                "title": "Positive Document",
                "type": "string"
              },
              "neg_doc": {
                "description": "Documents that should not match with the anchor",
                "title": "Negative Documents",
                "type": "array",
                "items": {"type": "string"}
              }
            },
            "required": ["query", "pos_doc", "neg_doc"],
            "title": "EmbeddingDatasetItemSchema",
            "type": "object"
          }
        }
      ],
      "training_options": [
        {
          "training_type": "sft",
          "finetuning_type": "lora_merged",
          "num_gpus": 1,
          "num_nodes": 1,
          "tensor_parallel_size": 1,
          "use_sequence_parallel": false
        }
      ]
    }
  ]
}

Note

Configuration vs. Target Architecture: Configurations don’t have their own enabled field. Instead, they inherit their availability from their underlying customization targets. When you see config.target.enabled, you’re checking whether the target model that the configuration references is enabled. This architecture allows administrators to control model availability at the target level, which affects all configurations that use that target.
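
As a quick illustration of this relationship, you can reuse all_configs from the snippet above and surface the target-level flag explicitly:

# Availability is inherited from the underlying target, not stored on the config itself
for config in all_configs.data:
    status = "enabled" if config.target.enabled else "disabled"
    print(f"{config.name}: target model '{config.target.base_model}' is {status}")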


Understanding Model Types and Capabilities#

Language Models#

These models are designed for text generation, instruction following, and conversational AI:

Language Model Options#

| Model Family | Description | Examples |
|---|---|---|
| Llama Models | General-purpose language models excellent for instruction following, conversation, and text generation tasks | llama-3.1-8b-instruct, llama-3.2-1b-instruct |
| Llama Nemotron Models | NVIDIA’s specialized variants optimized for specific use cases with enhanced reasoning capabilities | Various Nano and Super variants |
| Phi Models | Microsoft’s efficient models designed for strong reasoning with optimized deployment characteristics | Phi model family configurations |
| GPT-OSS Models | Open-source GPT-based models supporting Full SFT customization workflows | Various GPT-OSS configurations |

Specialized Models#

Specialized Model Support#

| Model Type | Status | Details |
|---|---|---|
| Embedding Models | ✅ Supported | Model: Llama 3.2 NV EmbedQA 1B for question-answering and retrieval tasks. Use cases: semantic search, document retrieval, question-answering systems, RAG pipelines. Note: typically disabled by default; contact your administrator for access |
| Reranking Models | ❌ Not Supported | Alternative: use embedding models for retrieval tasks, or implement reranking in your application layer |

Custom Models#

You can import and fine-tune models from the Hugging Face Transformers library:

  • Supported: Any model compatible with the Hugging Face Transformers architecture

  • Process: Import via the private HuggingFace model tutorial

  • Limitations: Some architectures (like Conv1D-based models) are not compatible


Training Types and Resource Requirements#

Available Training Approaches#

Training Type Comparison#

| Training Type | Resource Usage | Training Speed | Flexibility | Best For |
|---|---|---|---|---|
| LoRA | Low (1-2 GPUs) | Fast | Good | Experiments, quick iterations |
| Full SFT | High (2-8 GPUs) | Slower | Maximum | Production, maximum performance |
| DPO | Medium (2-4 GPUs) | Medium | Specialized | Preference alignment |
| Knowledge Distillation | Medium (varies) | Medium | Specialized | Model compression |
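
To check which of these approaches your environment actually exposes, a small sketch like the following (reusing the client from earlier; the aggregation itself is only illustrative) counts the training_type/finetuning_type combinations across enabled configurations.

from collections import Counter

# Count how many enabled configurations offer each training approach
configs = client.customization.configs.list(filter={"enabled": True})
combo_counts = Counter(
    f"{option.training_type}/{option.finetuning_type}"
    for config in configs.data
    for option in config.training_options
)

for combo, count in combo_counts.most_common():
    print(f"{combo}: offered by {count} configuration(s)")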

Checking Resource Requirements#

Use this approach to understand the GPU requirements for different configurations:

# Analyze resource requirements across configurations
configs = client.customization.configs.list(filter={"enabled": True})

print("Resource Requirements Summary:")
print("=" * 50)

for config in configs.data:
    print(f"\n📋 {config.name}")
    print(f"   Base Model: {config.target.base_model}")

    for option in config.training_options:
        gpu_total = option.num_gpus * option.num_nodes
        print(f"   • {option.training_type.upper()}/{option.finetuning_type.upper()}: {gpu_total} total GPUs")
        print(f"     ({option.num_gpus} GPUs × {option.num_nodes} nodes)")

# Find configurations that fit your hardware
available_gpus = 2  # Adjust based on your setup
print(f"\n🔍 Configurations that fit {available_gpus} GPUs:")

for config in configs.data:
    for option in config.training_options:
        if option.num_gpus <= available_gpus:
            print(f"   ✓ {config.name} - {option.training_type}/{option.finetuning_type}")

# Get resource information for all configurations
curl -G "${CUSTOMIZER_BASE_URL}/v1/customization/configs" \
  --data-urlencode "filter[enabled]=true" | \
  jq '.data[] | {
    name: .name,
    base_model: .target.base_model,
    training_options: .training_options | map({
      type: "\(.training_type)/\(.finetuning_type)",
      gpus: .num_gpus,
      nodes: .num_nodes,
      total_gpus: (.num_gpus * .num_nodes)
    })
  }'

Training Configuration Impact on Deployment#

Understanding how your training configuration choices affect deployment is crucial for planning your fine-tuning strategy. The parallelism and resource settings you choose during training have direct implications for how your models can be deployed and used.

Parallelism Parameters Explained#

When you examine training options in configurations, you’ll see several parallelism parameters that control how training workloads are distributed across GPUs:

Training Parallelism Parameters#

| Parameter | Purpose | Impact on Training |
|---|---|---|
| tensor_parallel_size | Distributes model tensors across GPUs | Higher values reduce memory per GPU but require more GPUs |
| pipeline_parallel_size | Distributes model layers across GPUs | Enables training larger models by splitting layers |
| use_sequence_parallel | Enables sequence-level parallelism | Reduces memory usage for long sequences |
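
These parameters are reported on each training option. The sketch below reuses the configs list from the earlier snippets; because pipeline_parallel_size does not appear in every example response, it is read defensively and falls back to 1.

# Print the parallelism settings exposed by each training option
for config in configs.data:
    for option in config.training_options:
        pp = getattr(option, "pipeline_parallel_size", 1)  # may be absent on some options
        print(f"{config.name} [{option.training_type}/{option.finetuning_type}]")
        print(f"  tensor_parallel_size:   {option.tensor_parallel_size}")
        print(f"  pipeline_parallel_size: {pp}")
        print(f"  use_sequence_parallel:  {option.use_sequence_parallel}")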

Resource Allocation Rules#

Training configurations must satisfy mathematical constraints to work properly:

Important

GPU Allocation Rule: The total number of GPUs (num_gpus × num_nodes) must be a multiple of: tensor_parallel_size × pipeline_parallel_size × expert_model_parallel_size

If this constraint isn’t met, your training job will fail with a validation error. A small helper that checks the rule follows the example calculations below.

Example Calculations:

  • 8 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ✅ Valid (8 = 4 × 2 × 1)

  • 4 GPUs with tensor_parallel_size=4, pipeline_parallel_size=2 ❌ Invalid (4 ≠ 4 × 2 × 1)
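
As a sanity check before submitting a job, a small helper like the hypothetical validate_gpu_allocation below applies the same rule, treating any missing parallelism value as 1.

def validate_gpu_allocation(num_gpus, num_nodes, tensor_parallel_size=1,
                            pipeline_parallel_size=1, expert_model_parallel_size=1):
    """Return True if total GPUs is a multiple of the combined parallelism factor."""
    total_gpus = num_gpus * num_nodes
    parallel_factor = (tensor_parallel_size * pipeline_parallel_size
                       * expert_model_parallel_size)
    return total_gpus % parallel_factor == 0

print(validate_gpu_allocation(8, 1, tensor_parallel_size=4, pipeline_parallel_size=2))  # True
print(validate_gpu_allocation(4, 1, tensor_parallel_size=4, pipeline_parallel_size=2))  # False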

Model Artifact Types and Deployment Paths#

Your training choices determine how the resulting model can be deployed:

Training Type to Deployment Mapping#

| Training Type | Model Artifact | Deployment Method | Key Environment Variable |
|---|---|---|---|
| LoRA | Adapter weights only | Uses base model + adapters | NIM_PEFT_SOURCE |
| Full SFT | Complete model weights | Standalone model deployment | NIM_FT_MODEL |
| DPO | Complete model weights | Standalone model deployment | NIM_FT_MODEL |
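
If you want to plan deployment programmatically, a rough sketch along these lines can flag which environment variable will matter later. The mapping simply mirrors the table above, keyed on finetuning_type, and is not an official API.

# Hypothetical mapping that mirrors the deployment table above
DEPLOYMENT_PATHS = {
    "lora": ("adapter weights only", "NIM_PEFT_SOURCE"),
    "all_weights": ("complete model weights", "NIM_FT_MODEL"),  # Full SFT and DPO
}

for config in configs.data:
    for option in config.training_options:
        artifact, env_var = DEPLOYMENT_PATHS.get(
            option.finetuning_type, ("see the deployment docs", "n/a")
        )
        print(f"{config.name} [{option.training_type}/{option.finetuning_type}]: "
              f"{artifact} ({env_var})")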

Deployment Architecture Overview#

The platform uses different deployment strategies based on your training approach:

LoRA deployment architecture: base model + adapter loading

  • Base model remains unchanged

  • Adapters loaded dynamically from Entity Store

  • Multiple adapters can share the same base model

  • Lower storage and memory requirements

Environment Configuration:

NIM_PEFT_SOURCE=http://nemo-entity-store:8000
NIM_PEFT_REFRESH_INTERVAL=30

Full SFT and DPO deployment architecture: complete model replacement

  • Entire model weights replaced with fine-tuned version

  • Requires dedicated deployment resources

  • Higher storage and memory requirements

  • Maximum customization flexibility

Environment Configuration:

NIM_FT_MODEL=/model-store
NIM_CUSTOM_MODEL=/model-store

Planning Your Training Strategy#

Consider these factors when choosing training configurations:

For Experimentation:

  • Choose LoRA with lower parallelism settings

  • Faster iteration cycles

  • Lower resource requirements

  • Easy to compare multiple approaches

For Production Deployment:

  • Consider full SFT for maximum performance

  • Plan for higher deployment resource requirements

  • Factor in model storage and loading times

  • Evaluate whether adapter flexibility is needed

Note

Deployment Guidance: For detailed information about deploying your fine-tuned models, including manual deployment options outside the NeMo platform, refer to the inference deployment documentation.


Making Configuration Decisions#

Decision Framework#

Use this framework to choose the right configuration for your project:

flowchart TD
    A[What's your task?] --> B{Text Generation?}
    A --> C{Q&A/Retrieval?}
    A --> D{Custom Model?}
    B -->|Yes| E[Language Models<br/>Llama, Phi, Nemotron]
    C -->|Yes| F[Embedding Models<br/>Llama 3.2 NV EmbedQA]
    D -->|Yes| G[Import from<br/>Hugging Face]
    E --> H{Resource Constraints?}
    F --> I[Contact Admin if Disabled]
    G --> J[Follow Import Tutorial]
    H -->|Low Resources<br/>Quick Experiment| K[Choose LoRA Config]
    H -->|High Resources<br/>Production Use| L[Choose Full SFT Config]
    K --> M[1-2 GPUs Needed]
    L --> N[4-8 GPUs Needed]

Example: Choosing a Configuration#

Let’s walk through a realistic example:

Scenario: You want to create an email writing assistant

Step 1: Identify Task Type

  • Task: Text generation (email writing)

  • Model family: Language models (Llama, Phi, etc.)

Step 2: Assess Resources

  • Available hardware: 2 A100 GPUs

  • Timeline: Need results within a day

  • Budget: Limited GPU hours

Step 3: Choose Training Type

  • Constraint: Limited resources and time

  • Choice: LoRA training (parameter-efficient)

Step 4: Find Matching Configuration

# Find LoRA configurations that fit 2 GPUs
suitable_configs = []

for config in configs.data:
    for option in config.training_options:
        if (option.finetuning_type == "lora" and
            option.num_gpus <= 2):
            suitable_configs.append({
                'name': config.name,
                'base_model': config.target.base_model,
                'gpus': option.num_gpus
            })

print("Suitable configurations for your use case:")
for config in suitable_configs:
    print(f"  ✓ {config['name']} ({config['gpus']} GPUs)")

Result: Choose llama-3.2-1b-instruct@v1.0.0+A100 with LoRA training


Getting Help with Configurations#

When Configurations Are Disabled#

If you find a configuration that meets your needs but is disabled:

  1. Note the exact configuration name (e.g., llama-3.1-8b-instruct@v1.0.0+A100)

  2. Contact your cluster administrator with a specific request

  3. Provide context about your use case and why you need this configuration

Example Request Email:

Subject: Enable NeMo Customizer Configuration Request

Hi [Admin Name],

I need access to the following configuration for my project:
Configuration: llama-3.1-8b-instruct@v1.0.0+A100

Use Case: Fine-tuning a customer support chatbot
Training Type: LoRA (low resource requirements)
Timeline: Need to start training this week

This configuration appears to be available but disabled. Could you please enable it for our team?

Thanks!

Administrator Resources#

If you’re an administrator, refer to the configuration management documentation for guidance on:

  • Creating new configurations

  • Enabling/disabling configurations

  • Managing hardware resource allocation

  • Setting up configurations for different user groups


Next Steps#

Now that you understand NeMo Customizer configurations and models, you’re ready to proceed with fine-tuning:

  • Format Training Dataset: Learn how to prepare your data for the model type you’ve chosen.

  • Start a LoRA Model Customization Job: Begin with parameter-efficient fine-tuning if you chose a LoRA configuration.

  • Start a Full SFT Customization Job: Use full supervised fine-tuning if you chose an all_weights configuration.

  • Import and Fine-Tune Private HuggingFace Models: Import and fine-tune your own private Hugging Face models.

Key Takeaways#

  • Configurations combine model + hardware + training options in pre-built recipes.

  • A100 configurations work on B200 hardware; compatibility is built in.

  • LoRA requires fewer resources than Full SFT but offers less customization flexibility.

  • Training parallelism settings affect deployment requirements, so plan accordingly.

  • LoRA uses adapters (NIM_PEFT_SOURCE), while Full SFT uses complete models (NIM_FT_MODEL).

  • GPU allocation must satisfy mathematical constraints for training to succeed.

  • Disabled configurations can often be enabled by contacting your administrator.

  • Embedding models are supported for Q&A and retrieval tasks.

  • Reranking models are not currently supported; use embedding models instead.

  • Custom Hugging Face models can be imported, with some architectural limitations.

Quick Reference Commands#

# List enabled configurations
client.customization.configs.list(filter={"enabled": True})

# List all configurations (including disabled)  
client.customization.configs.list()

# List disabled configurations
client.customization.configs.list(filter={"enabled": False})

# Check resource requirements
for config in configs.data:
    for option in config.training_options:
        print(f"{config.name}: {option.finetuning_type} needs {option.num_gpus} GPUs")

You now have the foundation to make informed decisions about your fine-tuning projects and navigate the NeMo Customizer ecosystem effectively.