Managing NeMo Customizer#
About NeMo Customizer#
NeMo Customizer is a lightweight API server that runs managed training jobs on GPU nodes with the Volcano scheduler. Using this microservice, you can take LLMs, either from NVIDIA or open source, and customize them to fit your specific use cases. NeMo Customizer lets you provide examples of desired responses to prompts, and tailors the model so that future responses match your examples.
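The prompt/completion pairs that drive fine-tuning are commonly expressed as JSON Lines records, one example per line. The sketch below is illustrative only; the field names and exact dataset schema NeMo Customizer expects are described in the NeMo Customizer documentation:

```python
import json

# Illustrative prompt/completion training examples. The "prompt" and
# "completion" field names here are assumptions for illustration; check
# the NeMo Customizer documentation for the exact dataset schema.
examples = [
    {"prompt": "Summarize: NeMo Customizer runs managed training jobs.",
     "completion": "It fine-tunes LLMs on Kubernetes GPU nodes."},
    {"prompt": "What scheduler does it use?",
     "completion": "The Volcano scheduler."},
]

# Serialize to JSON Lines: one JSON object per line.
jsonl = "\n".join(json.dumps(rec) for rec in examples)
print(jsonl)
```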
Read the NeMo Customizer documentation for details on customizing your LLM models.
Prerequisites#
All the common NeMo microservice prerequisites.
Minimum system requirements
A single-node Kubernetes cluster on a Linux host and cluster-admin level permissions.
At least 200 GB of free disk space.
At least one dedicated GPU (A100 80 GB or H100 80 GB).
A NeMo Data Store and a NeMo Entity Store deployed on your cluster. The NeMo Entity Store and NeMo Data Store work closely together to hold information about the model entities on your cluster.
The NeMo Operator deployed on your cluster. This manages several training custom resources that are required to run customization jobs.
Note
You can use the NeMo Dependencies Ansible Playbook to deploy all the following NeMo Customizer microservice dependencies.
A Weights & Biases API key. The NeMo Customizer microservice uses the provided API key to send telemetry data, including the job ID, training loss, validation loss, and more, to Weights & Biases to create training and validation loss curves. Sign up for a W&B API key.
Volcano scheduler installed. Read the Volcano install documentation for details on installing with Helm.
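As a sketch, Volcano is typically installed with Helm along these lines; the repository URL, release name, and namespace below are assumptions, so confirm them against the Volcano install documentation:

```shell
# Add the Volcano Helm chart repository (URL assumed; verify in the
# Volcano install documentation) and install into its own namespace.
helm repo add volcano-sh https://volcano-sh.github.io/helm-charts
helm repo update
helm install volcano volcano-sh/volcano -n volcano-system --create-namespace
```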
OpenTelemetry Collector installed on your cluster. Read the OpenTelemetry documentation for details on installing the OpenTelemetry Collector with Helm.
Storage
Access to an external PostgreSQL database to store model customization objects.
Access to an NFS-backed Persistent Volume that supports the
ReadWriteMany
access mode to enable fast checkpointing and minimize network traffic. The NeMo Customizer microservice creates PVCs to hold data while completing fine-tuning jobs. The NeMo Customizer custom resource configures two persistent volumes that are created and used to hold training job and model data.
spec.trainingConfig.modelPVC
: A PVC used to hold model data while completing fine-tuning jobs. You can provide an existing PVC, or have the NIM Operator create the PVC for you. If you delete a NeMo Customizer resource that was created with spec.trainingConfig.modelPVC.create: true, the NIM Operator also deletes the persistent volume (PV) and persistent volume claim (PVC).
spec.trainingConfig.workspacePVC
: A PVC configuration for the NeMo Operator NemoTrainingJob custom resource. This object defines how the NeMo Operator automatically creates a PVC for each job.
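In the NemoCustomizer custom resource, these two volumes sit side by side under spec.trainingConfig; a minimal sketch (names and sizes are placeholders to adjust for your cluster):

```yaml
spec:
  trainingConfig:
    # PVC that holds model artifacts; created by the NIM Operator when
    # create: true, and deleted along with the NemoCustomizer resource.
    modelPVC:
      create: true
      name: finetuning-ms-models-pvc
      storageClass: ""              # empty selects the cluster default
      volumeAccessMode: ReadWriteMany
      size: 50Gi
    # Template for the per-job workspace PVC the NeMo Operator creates.
    workspacePVC:
      storageClass: ""
      volumeAccessMode: ReadWriteMany
      size: 10Gi
      mountPath: /pvc/workspace
```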
Kubernetes
Create the required secrets: a database credentials secret and a W&B API key secret.
Create a secret file, such as
nemo-customizer-secrets.yaml
, with contents like the following example:

---
apiVersion: v1
kind: Secret
metadata:
  name: <customizer-pg-existing-secret>
  namespace: nemo
type: Opaque
stringData:
  password: <ncspassword>
---
apiVersion: v1
kind: Secret
metadata:
  name: <wandb-secret>
  namespace: nemo
type: Opaque
stringData:
  wandb_api_key: <API-key>
Apply the secret file.
$ kubectl apply -n nemo -f nemo-customizer-secrets.yaml
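Equivalently, assuming the same secret and key names as in the file above, you could create the secrets imperatively (the values shown are placeholders):

```shell
# Create the database password secret and the W&B API key secret
# directly, without a manifest file. Names mirror the example above.
kubectl create secret generic customizer-pg-existing-secret \
  -n nemo --from-literal=password='<ncspassword>'
kubectl create secret generic wandb-secret \
  -n nemo --from-literal=wandb_api_key='<API-key>'
```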
Deploying a NeMo Customizer#
Update the <inputs> placeholders in the following sample manifests with values for your cluster configuration.
Refer to the Configure NeMo Customizer section for more details on configuration options.
Create a ConfigMap with your training configurations in a file, such as
nemo-customizer-training-config.yaml
, with contents like the following example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: nemo-training-config
  namespace: nemo
data:
  training: |
    # Optional additional configuration for training jobs
    container_defaults:
      imagePullPolicy: IfNotPresent
Apply the training ConfigMap file.
$ kubectl apply -n nemo -f nemo-customizer-training-config.yaml
Create a ConfigMap with your model configurations in a file, such as
nemo-customizer-model-config.yaml
, with contents like the following example. Refer to the Model Configurations section in the NeMo microservices documentation for details on configuring models.
Note
The default configuration in the sample below lists all the supported models and enables only the
meta/llama-3.1-8b-instruct
model to be downloaded. Update the configuration to enable the models you want to use in your customization jobs. Each enabled model is downloaded to a PVC by default, and downloading several models increases the storage requirements and startup time for NeMo Customizer.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nemo-model-config
  namespace: nemo
data:
  models: |
    # -- Llama 3.2 3B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3.2 3B Instruct model.
    meta/llama-3.2-3b-instruct:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI.
      model_uri: ngc://nvidia/nemo/llama-3_2-3b-instruct:2.0
      # -- Path where model files are stored.
      model_path: llama32_3b-instruct
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 3000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3.2 1B model configuration.
    # @default -- This object has the following default values for the Llama 3.2 1B model.
    meta/llama-3.2-1b:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Llama 3.2 1B model.
      model_uri: ngc://nvidia/nemo/llama-3_2-1b:2.0
      # -- Path where model files are stored.
      model_path: llama32_1b
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
        - training_type: sft
          finetuning_type: all_weights
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 1000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3.2 1B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3.2 1B Instruct model.
    meta/llama-3.2-1b-instruct:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Llama 3.2 1B Instruct model.
      model_uri: ngc://nvidia/nemo/llama-3_2-1b-instruct:2.0
      # -- Path where model files are stored.
      model_path: llama32_1b-instruct
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
        - training_type: sft
          finetuning_type: all_weights
          num_gpus: 1
          num_nodes: 1
          tensor_parallel_size: 1
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 1000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3 70B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3 70B Instruct model.
    meta/llama3-70b-instruct:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Llama 3 70B Instruct model.
      model_uri: ngc://nvidia/nemo/llama-3-70b-instruct-nemo:2.0
      # -- Path where model files are stored.
      model_path: llama-3-70b-bf16
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 4
          num_nodes: 1
          tensor_parallel_size: 4
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 70000000000
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3.1 8B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3.1 8B Instruct model.
    meta/llama-3.1-8b-instruct:
      # -- Whether to enable the model.
      enabled: true
      # -- NGC model URI for Llama 3.1 8B Instruct model.
      model_uri: ngc://nvidia/nemo/llama-3_1-8b-instruct-nemo:2.0
      # -- Path where model files are stored.
      model_path: llama-3_1-8b-instruct_0_0_1
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
        - training_type: sft
          finetuning_type: all_weights
          num_gpus: 8
          num_nodes: 1
          tensor_parallel_size: 4
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 8000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3.1 70B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3.1 70B Instruct model.
    meta/llama-3.1-70b-instruct:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Llama 3.1 70B Instruct model.
      model_uri: ngc://nvidia/nemo/llama-3_1-70b-instruct-nemo:2.0
      # -- Path where model files are stored.
      model_path: llama-3_1-70b-instruct_0_0_1
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 4
          num_nodes: 1
          tensor_parallel_size: 4
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 70000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Phi-4 model configuration.
    # @default -- This object has the following default values for the Phi-4 model.
    microsoft/phi-4:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Phi-4 model.
      model_uri: ngc://nvidia/nemo/phi-4:1.0
      # -- Path where model files are stored.
      model_path: phi-4
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 1
          num_nodes: 1
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 14659507200
      # -- Model precision format.
      precision: bf16
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
    # -- Llama 3.3 70B Instruct model configuration.
    # @default -- This object has the following default values for the Llama 3.3 70B Instruct model.
    meta/llama-3.3-70b-instruct:
      # -- Whether to enable the model.
      enabled: false
      # -- NGC model URI for Llama 3.3 70B Instruct model.
      model_uri: ngc://nvidia/nemo/llama-3_3-70b-instruct:2.0
      # -- Path where model files are stored.
      model_path: llama-3_3-70b-instruct_0_0_1
      # -- Training options for different fine-tuning methods.
      training_options:
        - training_type: sft
          finetuning_type: lora
          num_gpus: 4
          num_nodes: 1
          tensor_parallel_size: 4
      # -- Micro batch size for training.
      micro_batch_size: 1
      # -- Maximum sequence length for input tokens.
      max_seq_length: 4096
      # -- Number of model parameters.
      num_parameters: 70000000000
      # -- Model precision format.
      precision: bf16-mixed
      # -- Template for formatting prompts.
      prompt_template: "{prompt} {completion}"
Apply the ConfigMap file.
$ kubectl apply -n nemo -f nemo-customizer-model-config.yaml
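Each model entry in the ConfigMap above carries a prompt_template that controls how a training record's fields are concatenated before training. As an illustration only, the "{prompt} {completion}" template behaves like Python's str.format, whose placeholder syntax it mirrors:

```python
# Illustration of the prompt_template "{prompt} {completion}" from the
# model config, rendered with Python's str.format. This only mimics the
# template's substitution; NeMo Customizer applies it server-side.
template = "{prompt} {completion}"
record = {"prompt": "Translate 'hello' to French:", "completion": "bonjour"}
rendered = template.format(**record)
print(rendered)  # Translate 'hello' to French: bonjour
```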
Create a file, such as
nemo-customizer.yaml
, with contents like the following example:

apiVersion: apps.nvidia.com/v1alpha1
kind: NemoCustomizer
metadata:
  name: nemocustomizer-sample
  namespace: nemo
spec:
  # Scheduler configuration for training jobs. Currently, only volcano is supported.
  scheduler:
    type: "volcano"
  # Weights & Biases configuration for experiment tracking
  wandb:
    secretName: <wandb-secret>      # Kubernetes secret that stores WANDB_API_KEY and optionally an encryption key
    apiKeyKey: <apiKey>             # Key in the secret that holds the W&B API key
    encryptionKey: <encryptionKey>  # Key in the secret that holds the optional encryption key
  # OpenTelemetry tracing configuration
  otel:
    enabled: true
    exporterOtlpEndpoint: http://<customizer-otel-opentelemetry-collector>.<nemo>.svc.cluster.local:4317
  # PostgreSQL database connection configuration
  databaseConfig:
    credentials:
      user: <ncsuser>                              # Database username
      secretName: <customizer-pg-existing-secret>  # Secret containing the password
      passwordKey: <password>                      # Key inside the secret that contains the password
    host: <customizer-pg-postgresql>.<nemo>.svc.cluster.local
    port: 5432
    databaseName: <ncsdb>
  # Customizer API service exposure settings
  expose:
    service:
      type: ClusterIP
      port: 8000
  # Global image pull settings used in various subcomponents
  image:
    repository: nvcr.io/nvidia/nemo-microservices/customizer-api
    tag: "25.04"
    pullPolicy: IfNotPresent
    pullSecrets:
      - ngc-secret
  # URL of the NeMo Entity Store microservice
  entitystore:
    endpoint: http://<nemoentitystore-sample>.<nemo>.svc.cluster.local:8000
  # URL of the NeMo Data Store microservice
  datastore:
    endpoint: http://<nemodatastore-sample>.<nemo>.svc.cluster.local:8000
  # URL of the MLflow tracking server
  mlflow:
    endpoint: http://<mlflow-tracking>.<nemo>.svc.cluster.local:80
  # Configuration for the data store CLI tools
  nemoDatastoreTools:
    image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04
  # Configuration for model download jobs
  modelDownloadJobs:
    image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.04"
    ngcAPISecret:
      name: ngc-api-secret  # Secret that stores the NGC API key
      key: "NGC_API_KEY"    # Key inside the secret
    securityContext:
      fsGroup: 1000
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
    # Time (in seconds) to retain a job after completion
    ttlSecondsAfterFinished: 600
    # Polling frequency to check job status
    pollIntervalSeconds: 15
  # Name of the ConfigMap containing model definitions
  modelConfig:
    name: <nemo-model-config>
  # Training configuration
  trainingConfig:
    configMap:
      # Optional: Additional configuration to merge into the training config
      name: <nemo-training-config>
    # PVC where model artifacts are cached or used during training
    modelPVC:
      create: true
      name: <finetuning-ms-models-pvc>
      storageClass: ""  # StorageClass for the PVC (leave empty to use the default)
      volumeAccessMode: ReadWriteMany
      size: 50Gi
    # Workspace PVC automatically created per job
    workspacePVC:
      storageClass: ""
      volumeAccessMode: ReadWriteMany
      size: 10Gi
      mountPath: /pvc/workspace  # Mount path for the workspace inside the container
    image:
      repository: nvcr.io/nvidia/nemo-microservices/customizer
      tag: "25.04"
    env:
      - name: LOG_LEVEL
        value: INFO
    # Multi-node networking environment variables for training (CSPs)
    networkConfig:
      - name: NCCL_IB_SL
        value: "0"
      - name: NCCL_IB_TC
        value: "41"
      - name: NCCL_IB_QPS_PER_CONNECTION
        value: "4"
      - name: UCX_TLS
        value: TCP
      - name: UCX_NET_DEVICES
        value: eth0
      - name: HCOLL_ENABLE_MCAST_ALL
        value: "0"
      - name: NCCL_IB_GID_INDEX
        value: "3"
    # TTL for a training job after it completes
    ttlSecondsAfterFinished: 3600
    # Timeout duration (in seconds) for a training job
    timeout: 3600
  # Node tolerations
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
Apply the manifest:
$ kubectl apply -n nemo -f nemo-customizer.yaml
Note
The NeMo Customizer image is large, and it can take a few minutes to download from the registry.
Verify NeMo Customizer#
View NeMo Customizer status:
$ kubectl get nemocustomizer.apps.nvidia.com -n nemo
Partial Output
NAME                    STATUS   AGE
nemocustomizer-sample   Ready    7s
View information about the NeMo Customizer:
$ kubectl describe nemocustomizer.apps.nvidia.com nemocustomizer-sample -n nemo
Partial Output
...
Status:
  Conditions:
    Last Transition Time:  2025-04-24T17:40:04Z
    Message:               deployment "nemocustomizer-sample" successfully rolled out
    Reason:                Ready
    Status:                True
    Type:                  Ready
    Last Transition Time:  2025-04-24T17:39:34Z
    Message:
    Reason:                Ready
    Status:                False
    Type:                  Failed
  State:                   Ready
Check NeMo Customizer Service is Reachable#
Once you have NeMo Customizer deployed on your cluster, use the following steps to verify that the service is up and running.
Start a pod that has access to the
curl
command. Substitute any pod that has this command and meets your organization’s security requirements.

$ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
After the pod starts, you are connected to the
ash
shell in the pod.
Connect to the NeMo Customizer service:
$ curl -X GET "http://nemocustomizer-sample.nemo:8000/v1/customization/configs"
Press Ctrl+D to exit and delete the pod.
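Alternatively, assuming the service name and port from the sample manifest above, you can port-forward from your workstation instead of starting a pod:

```shell
# Forward local port 8000 to the NeMo Customizer service, then query
# the customization configs endpoint from your workstation.
kubectl port-forward -n nemo svc/nemocustomizer-sample 8000:8000 &
curl -X GET "http://localhost:8000/v1/customization/configs"
```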
Configure NeMo Customizer#
The following table describes the commonly modified fields for the NeMo Customizer custom resource.
Field |
Description |
Default Value |
---|---|---|
|
Specifies to add the user-supplied annotations to the pod. |
None |
|
Specifies the external PostgreSQL configuration details. |
None |
|
Specifies the password key used in the database credentials secret. |
|
|
Specifies the secret name for the database credentials. |
None |
|
Specifies the user for the database. |
None |
|
Specifies the name for the database. |
None |
|
Specifies the endpoint for the database. |
None |
|
Specifies the port for the database. |
|
|
Specifies the endpoint for the NeMo Data Store to use for customization jobs. |
none |
|
Specifies the endpoint for the NeMo Entity Store to use for customization jobs. |
none |
|
Specifies attributes to expose a service for this NeMo microservice. Use an expose object to specify Kubernetes Ingress and Service information. |
None |
|
When set to true, enables an ingress for the NeMo Customizer service. If you have an ingress controller, values like the following sample configure the ingress: ingress:
enabled: true
spec:
ingressClassName: nginx
host: nemo-customizer.example.com
paths:
- path: /
pathType: Prefix
|
|
|
Specifies the network port number for the NeMo Customizer microservice. |
|
|
Specifies the Kubernetes service type to create for the NeMo Customizer microservice. |
|
|
Specifies the group for the pods.
This value is used to set the security context of the pod in the |
|
|
Specifies repository, tag, pull policy, and pull secret for the container image. You must specify the repository and tag for the NeMo microservice image you are using. |
None |
|
Specifies the user-supplied labels to add to the pod. |
None |
|
When set to |
|
|
Specifies the MLFlow tracking endpoint deployed on your cluster. |
None |
|
Specifies the name of the ConfigMap containing your model definitions. |
None |
|
Specifies the image to use for model downloader jobs. |
None |
|
Specifies the image pull policy to use for the model downloader image. |
None |
|
Specifies the name of the key in your NGC secret that contains your NGC API Key. Refer to the Image Pull Secrets page for more details on creating this secret. |
None |
|
Specifies the secret name with your NGC API Key. Refer to the Image Pull Secrets page for more details on creating this secret. |
None |
|
Specifies the polling interval for model download status. |
None |
|
Specifies the Kubernetes security context for the model downloader. |
None |
|
Specifies the time to live after the model downloader job finishes in seconds. |
None |
|
Specifies the image to use for the NeMo Datastore CLI tools. |
None |
|
When set to |
None |
|
When set to |
None |
|
Specifies URLs to be excluded from tracing. |
None |
|
Specifies the log exporter. Values include |
None |
|
Specifies the metrics exporter. Values include |
None |
|
Specifies the trace exporter. Values include |
None |
|
Specifies the OpenTelemetry Protocol endpoint. |
None |
|
Specifies the log level for OpenTelemetry. Values include |
None |
|
Specifies the number of replicas to have on the cluster. |
None |
|
Specifies the memory and CPU request. |
None |
|
Specifies the memory and CPU limits. |
None |
|
Specifies the scheduler type to use for customization jobs.
Available values are |
None |
|
Specifies the tolerations for the pods. |
None |
|
Specifies a ConfigMap of your training configuration. It’s recommended that you create the ConfigMap with your training configurations before creating a NeMo Customizer. Note that if you adjust your training configurations after deploying NeMo Customizer, the service must be restarted. Refer to the NeMo Customizer configuration documentation for details on setting up your training configuration. |
None |
|
Specifies environment variables passed to training jobs. |
None |
|
Specifies the repository, tag, pull policy, and pull secret for the NeMo Customizer image used for training. You must specify the repository and tag image you are using. |
None |
|
When set to |
|
|
Specifies the PVC name.
This field is required if you specify |
The NeMo Customizer resource name with a |
|
Specifies the size, in Gi, for the PVC to create. This field is required if you specify |
None |
|
Specifies the Kubernetes StorageClass for the PVC. Leave this empty to use your cluster’s default StorageClass. |
None |
|
Specifies the subpath inside the PVC to mount. |
None |
|
Specifies the access mode for the PVC to create. NeMo Customizer requires a volume access mode of ReadWriteMany. |
None |
|
Specifies the network configuration for multi-node training.
Use - name: NCCL_IB_SL
value: "0"
- name: NCCL_IB_TC
value: "41"
- name: NCCL_IB_QPS_PER_CONNECTION
value: "4"
- name: UCX_TLS
value: TCP
- name: UCX_NET_DEVICES
value: eth0
- name: HCOLL_ENABLE_MCAST_ALL
value: "0"
- name: NCCL_IB_GID_INDEX
value: "3"
|
None |
|
Specifies the node selector labels for where to run training jobs. |
None |
|
Specifies the PodAffinity for the training jobs. |
None |
|
Specifies the resources for the training jobs. |
None |
|
Specifies the timeout limit for the training jobs to complete. |
None |
|
Specifies the time to live after the training job finishes in seconds. |
None |
|
Specifies PVC configuration for the NeMo Operator NemoTrainingJob custom resource.
A PVC is automatically created for each job.
Use the |
None |
|
Specifies the path where the workspace PVC is mounted within the training job. |
|
|
Specifies the size, in Gi, for the PVC to create. |
None |
|
Specifies the Kubernetes StorageClass for the PVC. Leave this empty to use your cluster’s default StorageClass. |
None |
|
Specifies the access mode for the PVC to create. NeMo Customizer requires a volume access mode of ReadWriteMany. |
None |
|
Specifies the user ID for the pod.
This value is used to set the security context of the pod in the |
|
|
Specifies the key in the secret that holds the Weights & Biases API key. |
None |
|
Specifies an optional key in the secret used for encrypting Weights & Biases credentials. You can use this for an additional layer of security if required. |
|
|
Specifies the name of the Kubernetes Secret containing the Weights & Biases API key. |
None |