Custom Resource Definitions

When you install the entire NeMo platform or set up a NeMo Customizer workflow with NeMo Data and Entity Store, you need to install the NeMo Operator microservice. The NeMo Operator microservice uses the operator framework to automate the lifecycle management of NeMo custom resources within Kubernetes. You also need to install NIM Operator and other external dependencies such as Volcano and Run:AI, each of which also involves managing custom resources.

This section provides a reference for the custom resource definitions (CRDs) managed by NeMo Operator and references to the CRDs managed by NVIDIA NIM Operator.

Custom Resource Definitions Managed by NeMo Operator

NeMo Operator uses custom resource definitions (CRDs) to extend the native functionality of Kubernetes for the NeMo microservices. These CRDs offer a declarative interface for orchestrating NeMo microservice workflows.

Note

This reference covers the CRDs managed by NeMo Operator. The NeMo microservices use these CRDs on your behalf, so you don’t interact with them directly.

Important

CRDs are installed only once, during the initial NeMo Microservices Helm installation. After that, Helm doesn’t manage the CRD lifecycle, so upgrading the chart does not upgrade the CRDs. To properly upgrade to newer NeMo Operator versions while upgrading the NeMo Microservices Helm chart, see Upgrade NeMo Microservices Helm Chart.

NemoTrainingJob

  • Group/Version: nvidia.com/v1alpha1

  • Kind: NemoTrainingJob

  • Purpose: This custom resource references NemoTrainingWorkload and NemoEntityHandlers to manage the full lifecycle of a training job in the NeMo ecosystem. NemoTrainingJob provisions Kubernetes resources to orchestrate the end-to-end training workflow. This includes provisioning persistent volumes, transferring datasets and models between the NeMo Data Store and the persistent volume, and creating lower-level Kubernetes resources to execute the training workloads.

  • Key Fields:

    • spec.source: Defines which models and datasets to use for training and how to fetch them using NemoEntityHandlers.

    • spec.trainingWorkload: Defines the configuration for the actual training workload such as image, command, and args. A NemoTrainingWorkload reference can contain these configurations.

    • spec.output: Defines the configuration for exporting the training output using a NemoEntityHandler.

    • spec.pvc: Defines the PersistentVolumeClaim (PVC) configuration for the training lifecycle.
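As a rough illustration of how the four key fields fit together, the following Python sketch assembles a NemoTrainingJob manifest as a plain dictionary and checks that every key field is present. The helper names `build_training_job` and `missing_key_fields` are hypothetical, and the nested contents of each field are left as empty placeholders because this reference does not describe their schema.

```python
import json

# Top-level spec fields listed in this reference for NemoTrainingJob.
KEY_FIELDS = ("source", "trainingWorkload", "output", "pvc")


def build_training_job(name, source, training_workload, output, pvc):
    """Assemble a NemoTrainingJob manifest as a plain dict (hypothetical helper)."""
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "NemoTrainingJob",
        "metadata": {"name": name},
        "spec": {
            "source": source,
            "trainingWorkload": training_workload,
            "output": output,
            "pvc": pvc,
        },
    }


def missing_key_fields(manifest):
    """Return the key fields that are absent from manifest['spec']."""
    spec = manifest.get("spec", {})
    return [field for field in KEY_FIELDS if field not in spec]


# Empty dicts are placeholders: the nested schema is not covered in this reference.
job = build_training_job("example-job", {}, {}, {}, {})
print(json.dumps(job, indent=2))
```

Because the NeMo microservices create these resources for you, a sketch like this is mainly useful for reading or validating manifests that already exist in a cluster, not for authoring them by hand.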

NemoTrainingWorkload

  • Group/Version: nvidia.com/v1alpha1

  • Kind: NemoTrainingWorkload

  • Purpose: This custom resource defines the configuration required to execute a training workload. The NemoTrainingJob uses the configuration defined in the NemoTrainingWorkload when creating lower-level Kubernetes resources to perform the training.

  • Key Fields:

    • spec.image: Defines the Docker image to run the training workload.

    • spec.command: Defines the entrypoint command for the training workload.

    • spec.args: Defines additional arguments to pass to the entrypoint command.

    • spec.resources: Defines the resources needed by the training workload such as GPUs, CPU, and memory.
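To make the relationship to lower-level Kubernetes resources more concrete, the sketch below maps workload-style settings onto the container entry of a Pod spec, where `image`, `command`, `args`, and `resources` are standard container fields. This is an illustration only, not the operator's actual translation logic. The function name `workload_to_container`, the example values, and the treatment of everything except the image as optional are assumptions; the entrypoint key is written as `command`, following the "image, command, and args" wording in the NemoTrainingJob description.

```python
def workload_to_container(workload_spec, name="training"):
    """Map workload-style fields onto a Pod container dict (illustrative only)."""
    container = {"name": name, "image": workload_spec["image"]}
    # Assumption: every field other than the image is optional,
    # so it is copied only when present.
    for field in ("command", "args", "resources"):
        if field in workload_spec:
            container[field] = workload_spec[field]
    return container


# Hypothetical values for illustration only.
example_spec = {
    "image": "registry.example.com/training:latest",
    "command": ["python", "train.py"],
    "args": ["--epochs", "1"],
    "resources": {"limits": {"nvidia.com/gpu": 1, "cpu": "8", "memory": "32Gi"}},
}

container = workload_to_container(example_spec)
print(container)
```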

NemoEntityHandler

  • Kind: NemoEntityHandler

  • Purpose: NemoEntityHandlers define workloads for transferring entities (datasets and models) between the persistent volume and the NeMo Data Store during the training lifecycle. The commands defined in NemoEntityHandlers can reference certain environment variables that specify information such as the path of the entity to export, the identifier of the entity in the NeMo Data Store, and the entity type.

  • Key Fields:

    • spec.image: Defines image, customCommand, and additionalEnvVars required to perform the entity handling. The container mounts the training lifecycle’s persistent volume and uses predefined environment variables to specify entity locations.

    • spec.resources: Defines the resources that the pod needs for performing the entity handling.
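The following sketch shows one way a handler command might read the environment variables described above before transferring an entity. The variable names `ENTITY_PATH`, `ENTITY_ID`, and `ENTITY_TYPE`, and the `EntityTransfer` and `read_entity_env` helpers, are hypothetical: this reference does not list the real variable names, so check the NeMo Operator documentation for the ones that are actually injected.

```python
from dataclasses import dataclass

# Hypothetical variable names; this reference does not list the real ones.
REQUIRED_VARS = ("ENTITY_PATH", "ENTITY_ID", "ENTITY_TYPE")


@dataclass(frozen=True)
class EntityTransfer:
    path: str         # location of the entity on the persistent volume
    entity_id: str    # identifier of the entity in the NeMo Data Store
    entity_type: str  # for example, a dataset or a model


def read_entity_env(env):
    """Build an EntityTransfer from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return EntityTransfer(env["ENTITY_PATH"], env["ENTITY_ID"], env["ENTITY_TYPE"])


# Example with made-up values; a real handler would pass os.environ instead.
transfer = read_entity_env(
    {"ENTITY_PATH": "/mnt/pvc/output", "ENTITY_ID": "example-model", "ENTITY_TYPE": "model"}
)
print(transfer)
```

Failing fast on a missing variable keeps a misconfigured handler from silently transferring the wrong entity.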

CRDs in NIM Operator

See the following references for the CRDs you can manage through NIM Operator.

NIM Cache

Refer to NIM Cache Custom Resource Definition in the NVIDIA NIM Operator documentation.

NIM Service

Refer to NIM Service Custom Resource Definition in the NVIDIA NIM Operator documentation.

CRDs in Dependencies

The dependencies of the NeMo microservices, such as Volcano and Run:AI, also create CRDs. Refer to each dependency’s documentation for the CRDs you need to manage.