Managing NeMo Customizer#

About NeMo Customizer#

NeMo Customizer is a lightweight API server that allows you to run managed training jobs on GPU nodes with the Volcano scheduler. Using this microservice, you can take LLMs, either from NVIDIA or open source, and customize them to fit your specific use cases. NeMo Customizer lets you provide examples of desired responses to prompts, and tailors the model so that future responses match your examples.

Read the NeMo Customizer documentation for details on customizing your models.

Prerequisites#

  • All the common NeMo microservice prerequisites.

  • Minimum system requirements

    • A single-node Kubernetes cluster on a Linux host and cluster-admin level permissions.

    • At least 200 GB of free disk space.

    • At least one dedicated GPU (A100 80 GB or H100 80 GB).

  • A NeMo Data Store and a NeMo Entity Store deployed on your cluster. The NeMo Entity Store and NeMo Data Store work closely together to hold information about the model entities on your cluster.

  • The NeMo Operator deployed on your cluster. The NeMo Operator manages several training custom resources that are required to run customization jobs.

Note

You can use the NeMo Dependencies Ansible Playbook to deploy all the following NeMo Customizer microservice dependencies.

  • A Weights & Biases API key. The NeMo Customizer microservice uses the provided API key to send telemetry data, including the job ID, training loss, validation loss, and more, to Weights & Biases to create training and validation loss curves. Sign up for a W&B API key.

  • Volcano scheduler installed. Read the Volcano install documentation for details on installing with Helm.

  • OpenTelemetry Collector installed on your cluster. Read the OpenTelemetry documentation for details on installing the OpenTelemetry Collector with Helm.

Storage

  • Access to an external PostgreSQL database to store model customization objects.

  • Access to an NFS-backed Persistent Volume that supports ReadWriteMany access mode to enable fast checkpointing and minimize network traffic.

    The NeMo Customizer microservice creates PVCs to hold data while completing fine-tuning jobs. The NeMo Customizer custom resource configures two persistent volumes that hold training job and model data.

    • spec.trainingConfig.modelPVC: A PVC that holds model data while completing fine-tuning jobs. You can provide an existing PVC, or have the NIM Operator create the PVC for you. If you delete a NeMo Customizer resource that was created with spec.trainingConfig.modelPVC.create: true, the NIM Operator also deletes the persistent volume (PV) and persistent volume claim (PVC).

    • spec.trainingConfig.workspacePVC: A PVC configuration for the NeMo Operator NemoTrainingJob custom resource. This object defines how the NeMo Operator automatically creates a PVC for each job.

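    Taken together, the PVC-related portion of a NemoCustomizer spec can be sketched as follows. This is only an illustration of the two fields described above; the PVC name and sizes are placeholders to adjust for your cluster:

    ```yaml
    spec:
      trainingConfig:
        # Model PVC: created and owned by the NIM Operator because create is true.
        modelPVC:
          create: true
          name: finetuning-ms-models-pvc   # placeholder name
          storageClass: ""                 # empty string selects the cluster default
          volumeAccessMode: ReadWriteMany
          size: 50Gi
        # Workspace PVC: a template the NeMo Operator uses to create one PVC per job.
        workspacePVC:
          storageClass: ""
          volumeAccessMode: ReadWriteMany
          size: 10Gi
          mountPath: /pvc/workspace
    ```
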
Kubernetes

  • Create the required secrets: a database user secret and a W&B API key secret.

    Create a secret file, such as nemo-customizer-secrets.yaml, with contents like the following example:

    ---
    apiVersion: v1
    stringData:
      password: <ncspassword>
    kind: Secret
    metadata:
      name: <customizer-pg-existing-secret>
      namespace: nemo
    type: Opaque
    ---
    apiVersion: v1
    stringData:
      wandb_api_key: <API-key>
    kind: Secret
    metadata:
      name: <wandb-secret>
      namespace: nemo
    type: Opaque
    

    Apply the secret file.

    $ kubectl apply -n nemo -f nemo-customizer-secrets.yaml
    
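    Alternatively, if you prefer not to write credential values into a manifest file, you can create equivalent secrets imperatively. The secret names and literal values below are the same placeholders used in the example above:

    ```shell
    $ kubectl create secret generic <customizer-pg-existing-secret> \
        -n nemo --from-literal=password=<ncspassword>
    $ kubectl create secret generic <wandb-secret> \
        -n nemo --from-literal=wandb_api_key=<API-key>
    ```
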

Deploying a NeMo Customizer#

Update the <inputs> in the following sample manifests with values for your cluster configuration.

Refer to the Configure NeMo Customizer section for more details on configuration options.

  1. Create a ConfigMap with your training configurations in a file such as nemo-customizer-training-config.yaml, with contents like the following example:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nemo-training-config
      namespace: nemo
    data:
      training: |
        # Optional additional configuration for training jobs
        container_defaults:
          imagePullPolicy: IfNotPresent
    
  2. Apply the training ConfigMap file.

      $ kubectl apply -n nemo -f nemo-customizer-training-config.yaml
    
  3. Create a ConfigMap with your model configurations in a file, such as nemo-customizer-model-config.yaml, with contents like the following example. Refer to Model Configurations in the NeMo microservices documentation for details on configuring models.

    Note

    The default configuration in the following sample lists all the supported models and enables the meta/llama-3.1-8b-instruct model for download. Update the configuration to enable the models that you want to use in your customization job. Each enabled model is downloaded to a PVC by default, and downloading several models increases the storage requirements and startup time for NeMo Customizer.

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: nemo-model-config
      namespace: nemo
    data:
      models: |
        # -- Llama 3.2 3B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3.2 3B Instruct model.
        meta/llama-3.2-3b-instruct:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI.
          model_uri: ngc://nvidia/nemo/llama-3_2-3b-instruct:2.0
          # -- Path where model files are stored.
          model_path: llama32_3b-instruct
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 3000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3.2 1B model configuration.
        # @default -- This object has the following default values for the Llama 3.2 1B model.
        meta/llama-3.2-1b:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Llama 3.2 1B model.
          model_uri: ngc://nvidia/nemo/llama-3_2-1b:2.0
          # -- Path where model files are stored.
          model_path: llama32_1b
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
            - training_type: sft
              finetuning_type: all_weights
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 1000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3.2 1B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3.2 1B Instruct model.
        meta/llama-3.2-1b-instruct:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Llama 3.2 1B Instruct model.
          model_uri: ngc://nvidia/nemo/llama-3_2-1b-instruct:2.0
          # -- Path where model files are stored.
          model_path: llama32_1b-instruct
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
            - training_type: sft
              finetuning_type: all_weights
              num_gpus: 1
              num_nodes: 1
              tensor_parallel_size: 1
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 1000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3 70B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3 70B Instruct model.
        meta/llama3-70b-instruct:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Llama 3 70B Instruct model.
          model_uri: ngc://nvidia/nemo/llama-3-70b-instruct-nemo:2.0
          # -- Path where model files are stored.
          model_path: llama-3-70b-bf16
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 4
              num_nodes: 1
              tensor_parallel_size: 4
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 70000000000
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3.1 8B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3.1 8B Instruct model.
        meta/llama-3.1-8b-instruct:
          # -- Whether to enable the model.
          enabled: true
          # -- NGC model URI for Llama 3.1 8B Instruct model.
          model_uri: ngc://nvidia/nemo/llama-3_1-8b-instruct-nemo:2.0
          # -- Path where model files are stored.
          model_path: llama-3_1-8b-instruct_0_0_1
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
            - training_type: sft
              finetuning_type: all_weights
              num_gpus: 8
              num_nodes: 1
              tensor_parallel_size: 4
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 8000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3.1 70B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3.1 70B Instruct model.
        meta/llama-3.1-70b-instruct:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Llama 3.1 70B Instruct model.
          model_uri: ngc://nvidia/nemo/llama-3_1-70b-instruct-nemo:2.0
          # -- Path where model files are stored.
          model_path: llama-3_1-70b-instruct_0_0_1
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 4
              num_nodes: 1
              tensor_parallel_size: 4
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 70000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Phi-4 model configuration.
        # @default -- This object has the following default values for the Phi-4.
        microsoft/phi-4:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Phi-4 model.
          model_uri: ngc://nvidia/nemo/phi-4:1.0
          # -- Path where model files are stored.
          model_path: phi-4
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 1
              num_nodes: 1
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 14659507200
          # -- Model precision format.
          precision: bf16
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
        # -- Llama 3.3 70B Instruct model configuration.
        # @default -- This object has the following default values for the Llama 3.3 70B Instruct model.
        meta/llama-3.3-70b-instruct:
          # -- Whether to enable the model.
          enabled: false
          # -- NGC model URI for Llama 3.3 70B Instruct model.
          model_uri: ngc://nvidia/nemo/llama-3_3-70b-instruct:2.0
          # -- Path where model files are stored.
          model_path: llama-3_3-70b-instruct_0_0_1
          # -- Training options for different fine-tuning methods.
          training_options:
            - training_type: sft
              finetuning_type: lora
              num_gpus: 4
              num_nodes: 1
              tensor_parallel_size: 4
          # -- Micro batch size for training.
          micro_batch_size: 1
          # -- Maximum sequence length for input tokens.
          max_seq_length: 4096
          # -- Number of model parameters.
          num_parameters: 70000000000
          # -- Model precision format.
          precision: bf16-mixed
          # -- Template for formatting prompts.
          prompt_template: "{prompt} {completion}"
    
  4. Apply the model ConfigMap file.

    $ kubectl apply -n nemo -f nemo-customizer-model-config.yaml
    
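    Before creating the NemoCustomizer resource, you can optionally confirm that both ConfigMaps exist. The names below match the examples in the preceding steps:

    ```shell
    $ kubectl get configmap nemo-training-config nemo-model-config -n nemo
    ```
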
  5. Create a file, such as nemo-customizer.yaml, with contents like the following example:

    apiVersion: apps.nvidia.com/v1alpha1
    kind: NemoCustomizer
    metadata:
      name: nemocustomizer-sample
      namespace: nemo
    spec:
      # Scheduler configuration for training jobs. Currently, only volcano is supported.
      scheduler:
        type: "volcano"
      # Weights & Biases configuration for experiment tracking
      wandb:
        secretName: <wandb-secret>       # Kubernetes secret that stores WANDB_API_KEY and optionally encryption key
        apiKeyKey: <apiKey>                 # Key in the secret that holds the W&B API key
        encryptionKey: <encryptionKey>   # Key in the secret that holds optional encryption key
      # OpenTelemetry tracing configuration
      otel:
        enabled: true
        exporterOtlpEndpoint: http://<customizer-otel-opentelemetry-collector>.<nemo>.svc.cluster.local:4317
      # PostgreSQL database connection configuration
      databaseConfig:
        credentials:
          user: <ncsuser>                        # Database username
          secretName: <customizer-pg-existing-secret>  # Secret containing password
          passwordKey: <password>               # Key inside secret that contains the password
        host: <customizer-pg-postgresql>.<nemo>.svc.cluster.local
        port: 5432
        databaseName: <ncsdb>
      # Customizer API service exposure settings
      expose:
        service:
          type: ClusterIP
          port: 8000
      # Global image pull settings used in various subcomponents
      image:
        repository: nvcr.io/nvidia/nemo-microservices/customizer-api
        tag: "25.04"
        pullPolicy: IfNotPresent
        pullSecrets:
          - ngc-secret
      # URL to the NeMo Entity Store microservice
      entitystore:
        endpoint: http://<nemoentitystore-sample>.<nemo>.svc.cluster.local:8000
      # URL to the NeMo Data Store microservice
      datastore:
        endpoint: http://<nemodatastore-sample>.<nemo>.svc.cluster.local:8000
      # URL for MLflow tracking server
      mlflow: 
        endpoint: http://<mlflow-tracking>.<nemo>.svc.cluster.local:80
      # Configuration for the data store CLI tools
      nemoDatastoreTools:
        image: nvcr.io/nvidia/nemo-microservices/nds-v2-huggingface-cli:25.04
      # Configuration for model download jobs
      modelDownloadJobs:
        image: "nvcr.io/nvidia/nemo-microservices/customizer-api:25.04"
        ngcAPISecret:
          # Secret that stores NGC API key
          name: ngc-api-secret
          # Key inside secret         
          key: "NGC_API_KEY"                 
        securityContext:
          fsGroup: 1000
          runAsNonRoot: true
          runAsUser: 1000
          runAsGroup: 1000
         # Time (in seconds) to retain job after completion
        ttlSecondsAfterFinished: 600   
        # Polling frequency to check job status     
        pollIntervalSeconds: 15              
      # Name to the ConfigMap containing model definitions
      modelConfig:
        name: <nemo-model-config>
      # Training configuration
      trainingConfig:
        configMap:
          # Optional: Additional configuration to merge into training config
          name: <nemo-training-config>         
        # PVC where model artifacts are cached or used during training
        modelPVC:
          create: true
          name: <finetuning-ms-models-pvc>
          # StorageClass for the PVC (can be empty to use default)
          storageClass: ""
          volumeAccessMode: ReadWriteMany
          size: 50Gi
        # Workspace PVC automatically created per job
        workspacePVC:
          storageClass: ""
          volumeAccessMode: ReadWriteMany
          size: 10Gi
          # Mount path for workspace inside container
          mountPath: /pvc/workspace          
        image:
          repository: nvcr.io/nvidia/nemo-microservices/customizer
          tag: "25.04"
        env:
          - name: LOG_LEVEL
            value: INFO                    
        # Multi-node networking environment variables for training (CSPs)
        networkConfig:
          - name: NCCL_IB_SL
            value: "0"
          - name: NCCL_IB_TC
            value: "41"
          - name: NCCL_IB_QPS_PER_CONNECTION
            value: "4"
          - name: UCX_TLS
            value: TCP
          - name: UCX_NET_DEVICES
            value: eth0
          - name: HCOLL_ENABLE_MCAST_ALL
            value: "0"
          - name: NCCL_IB_GID_INDEX
            value: "3"
        # TTL for training job after it completes
        ttlSecondsAfterFinished: 3600       
        # Timeout duration (in seconds) for training job
        timeout: 3600                       
        # Node tolerations
        tolerations:
          - key: "nvidia.com/gpu"
            operator: "Exists"
            effect: "NoSchedule"
    
  6. Apply the manifest:

    $ kubectl apply -n nemo -f nemo-customizer.yaml
    
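    To wait for the deployment to finish rolling out, you can optionally watch its rollout status. The deployment name matches the NemoCustomizer resource name:

    ```shell
    $ kubectl rollout status deployment/nemocustomizer-sample -n nemo
    ```
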

    Note

    The NeMo Customizer image is large, and it can take a few minutes to download from the registry.

Verify NeMo Customizer#

  1. View NeMo Customizer status:

    $ kubectl get nemocustomizer.apps.nvidia.com -n nemo
    

    Partial Output

     NAME                    STATUS   AGE
     nemocustomizer-sample   Ready    7s
    
  2. View information about the NeMo Customizer:

    $ kubectl describe nemocustomizer.apps.nvidia.com nemocustomizer-sample -n nemo
    

    Partial Output

    ...
    Status:
     Conditions:
       Last Transition Time:  2025-04-24T17:40:04Z
       Message:               deployment "nemocustomizer-sample" successfully rolled out
    
       Reason:                Ready
       Status:                True
       Type:                  Ready
       Last Transition Time:  2025-04-24T17:39:34Z
       Message:
       Reason:                Ready
       Status:                False
       Type:                  Failed
     State:                   Ready
    

Check That the NeMo Customizer Service Is Reachable#

After you have NeMo Customizer deployed on your cluster, use the following steps to verify that the service is up and running.

  1. Start a pod that has access to the curl command. Substitute any pod that has this command and meets your organization’s security requirements.

    $ kubectl run --rm -it -n default curl --image=curlimages/curl:latest -- ash
    

    After the pod starts, you are connected to the ash shell in the pod.

  2. Connect to the NeMo Customizer service:

    $ curl -X GET "http://nemocustomizer-sample.nemo:8000/v1/customization/configs"
    
  3. Press Ctrl+D to exit and delete the pod.
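As an alternative to an interactive shell, you can run the same check as a one-shot pod that is deleted when the command completes:

```shell
$ kubectl run --rm -i -n default curl --image=curlimages/curl:latest --restart=Never -- \
    curl -s "http://nemocustomizer-sample.nemo:8000/v1/customization/configs"
```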

Configure NeMo Customizer#

The following table shows more information about the commonly modified fields for the NeMo Customizer custom resource.

Field

Description

Default Value

spec.annotations

Specifies to add the user-supplied annotations to the pod.

None

spec.databaseConfig (required)

Specifies the external PostgreSQL configuration details.

None

spec.databaseConfig.credentials.passwordKey

Specifies the password key used in the database credentials secret.

password

spec.databaseConfig.credentials.secretName (required)

Specifies the secret name for the database credentials.

None

spec.databaseConfig.credentials.user (required)

Specifies the user for the database.

None

spec.databaseConfig.databaseName (required)

Specifies the name for the database.

None

spec.databaseConfig.host (required)

Specifies the endpoint for the database.

None

spec.databaseConfig.port

Specifies the port for the database.

5432

spec.datastore.endpoint (required)

Specifies the endpoint for the NeMo Data Store to use for customization jobs.

none

spec.entitystore.endpoint (required)

Specifies the endpoint for the NeMo Entity Store to use for customization jobs.

none

spec.expose

Specifies attributes to expose a service for this NeMo microservice. Use an expose object to specify Kubernetes Ingress and Service information.

None

spec.expose.ingress.enabled

When set to true, the Operator creates an Ingress resource for the NeMo Customizer. Specify the ingress specification in the spec.expose.ingress.spec field.

If you have an ingress controller, values like the following sample configure an ingress for the / endpoint.

ingress:
  enabled: true
  spec:
    ingressClassName: nginx
    host: nemo-customizer.example.com
    paths:
      - path: /
        pathType: Prefix

false

spec.expose.service.port

Specifies the network port number for the NeMo Customizer microservice.

8000

spec.expose.service.type

Specifies the Kubernetes service type to create for the NIM microservice.

ClusterIP

spec.groupID

Specifies the group for the pods. This value is used to set the security context of the pod in the runAsGroup and fsGroup fields.

2000

spec.image (required)

Specifies repository, tag, pull policy, and pull secret for the container image. You must specify the repository and tag for the NeMo microservice image you are using.

None

spec.labels

Specifies the user-supplied labels to add to the pod.

None

spec.metrics.enabled

When set to true, the Operator configures a Prometheus service monitor for the service. Specify the service monitor specification in the spec.metrics.serviceMonitor field. Refer to the Observability page for more details.

false

spec.mlflow (required)

Specifies the MLflow tracking endpoint deployed on your cluster.

None

spec.modelConfig.name (required)

Specifies the name of the ConfigMap containing your model definitions.

None

spec.modelDownloadJobs.image (required)

Specifies the image to use for model downloader jobs.

None

spec.modelDownloadJobs.imagePullPolicy

Specifies the image pull policy to use for model downloader image.

None

spec.modelDownloadJobs.ngcAPISecret.key (required)

Specifies the name of the key in your NGC secret that contains your NGC API Key. Refer to the Image Pull Secrets page for more details on creating this secret.

None

spec.modelDownloadJobs.ngcAPISecret.name (required)

Specifies the secret name with your NGC API Key. Refer to the Image Pull Secrets page for more details on creating this secret.

None

spec.modelDownloadJobs.pollIntervalSeconds (required)

Specifies the polling interval for model download status.

None

spec.modelDownloadJobs.securityContext

Specifies the Kubernetes security context for the model downloader.

None

spec.modelDownloadJobs.ttlSecondsAfterFinished (required)

Specifies the time to live after the model downloader job finishes in seconds.

None

spec.nemoDatastoreTools.image (required)

Specifies the image to use for the NeMo Datastore CLI tools.

None

spec.otel.disableLogging

When set to true, Python logging auto-instrumentation is disabled.

None

spec.otel.excludeUrls

Specifies URLs to be excluded from tracing.

None

spec.otel.exporterConfig.logsExporter

Specifies the log exporter. Values include otlp, console, none.

None

spec.otel.exporterConfig.metricsExporter

Specifies the metrics exporter. Values include otlp, console, none.

None

spec.otel.exporterConfig.traceExporter

Specifies the trace exporter. Values include otlp, console, none.

None

spec.otel.exporterOtlpEndpoint

Specifies the OpenTelemetry Protocol endpoint.

None

spec.otel.logLevel

Specifies the log level for OpenTelemetry. Values include INFO and DEBUG.

None

spec.replicas

Specifies the number of replicas to have on the cluster.

None

spec.resources.requests

Specifies the memory and CPU request.

None

spec.resources.limits

Specifies the memory and CPU limits.

None

spec.scheduler

Specifies the scheduler type to use for customization jobs. Available values are volcano.

None

spec.tolerations

Specifies the tolerations for the pods.

None

spec.trainingConfig.configMap.name (required)

Specifies a ConfigMap with your training configuration. It's recommended that you create the ConfigMap with your training configurations before creating a NeMo Customizer. Note that if you change your training configurations after deploying the NeMo Customizer, you must restart the service. Refer to the NeMo Customizer configuration documentation for details on setting up your training configuration.

None

spec.trainingConfig.env

Specifies environment variables passed to training jobs.

None

spec.trainingConfig.image (required)

Specifies the repository, tag, pull policy, and pull secret for the NeMo Customizer image used for training. You must specify the repository and tag image you are using.

None

spec.trainingConfig.modelPVC.create (required)

When set to true, the Operator creates the PVC where model artifacts are cached or used during training. If you delete a NeMo Customizer resource and this field was set to true, the Operator deletes the PVC and the cached models.

false

spec.trainingConfig.modelPVC.name (required)

Specifies the PVC name. This field is required if you specify create: false.

The NeMo Customizer resource name with a -pvc suffix.

spec.trainingConfig.modelPVC.size (required)

Specifies the size, in Gi, for the PVC to create. This field is required if you specify create: true.

None

spec.trainingConfig.modelPVC.storageClass (required)

Specifies the Kubernetes StorageClass for the PVC. Leave this empty to use your cluster’s default StorageClass.

None

spec.trainingConfig.modelPVC.subPath

Specifies the subpath inside the PVC to mount.

None

spec.trainingConfig.modelPVC.volumeAccessMode (required)

Specifies the access mode for the PVC to create. NeMo Customizer requires a volume access mode of ReadWriteMany.

None

spec.trainingConfig.networkConfig

Specifies the network configuration for multi-node training. Use name and value pairs to define your network. For example,

  - name: NCCL_IB_SL
    value: "0"
  - name: NCCL_IB_TC
    value: "41"
  - name: NCCL_IB_QPS_PER_CONNECTION
    value: "4"
  - name: UCX_TLS
    value: TCP
  - name: UCX_NET_DEVICES
    value: eth0
  - name: HCOLL_ENABLE_MCAST_ALL
    value: "0"
  - name: NCCL_IB_GID_INDEX
    value: "3"

None

spec.trainingConfig.nodeSelector

Specifies the node selector labels for where to run training jobs.

None

spec.trainingConfig.podAffinity

Specifies the PodAffinity for the training jobs.

None

spec.trainingConfig.resources

Specifies the resources for the training jobs.

None

spec.trainingConfig.timeout

Specifies the timeout limit for the training jobs to complete.

None

spec.trainingConfig.ttlSecondsAfterFinished

Specifies the time to live after the training job finishes in seconds.

None

spec.trainingConfig.workspacePVC (required)

Specifies PVC configuration for the NeMo Operator NemoTrainingJob custom resource. A PVC is automatically created for each job. Use the workspacePVC object to define how to deploy these PVCs.

None

spec.trainingConfig.workspacePVC.mountPath

Specifies the path where the workspace PVC is mounted within the training job.

/pvc/workspace

spec.trainingConfig.workspacePVC.size (required)

Specifies the size, in Gi, for the PVC to create.

None

spec.trainingConfig.workspacePVC.storageClass (required)

Specifies the Kubernetes StorageClass for the PVC. Leave this empty to use your cluster’s default StorageClass.

None

spec.trainingConfig.workspacePVC.volumeAccessMode (required)

Specifies the access mode for the PVC to create. NeMo Customizer requires a volume access mode of ReadWriteMany.

None

spec.userID

Specifies the user ID for the pod. This value is used to set the security context of the pod in the runAsUser fields.

1000

spec.wandbSecret.apiKeyKey (required)

Specifies the key in the secret that holds the Weights and Biases API key.

None

spec.wandbSecret.encryptionKey

Specifies an optional key in the secret used for encrypting Weights & Biases credentials. Use this key for an additional security layer if required.

encryptionKey

spec.wandbSecret.name (required)

Specifies the name of the Kubernetes Secret containing the Weights & Biases API key.

None