Custom Resource Definitions
When you install the entire NeMo platform or set up a NeMo Customizer workflow with NeMo Data and Entity Store, you need to install the NeMo Operator microservice. The NeMo Operator microservice leverages the operator framework to streamline and automate the lifecycle management of NeMo custom resources within Kubernetes. You also need to install NIM Operator and other external dependencies such as Volcano and Run:AI, which also involves managing custom resources.
This section provides a reference for the custom resource definitions (CRDs) managed by NeMo Operator and references to the CRDs managed by NVIDIA NIM Operator.
Custom Resource Definitions Managed by NeMo Operator
NeMo Operator uses custom resource definitions (CRDs) to extend the native functionality of Kubernetes for the NeMo microservices. These CRDs offer a declarative interface for orchestrating the NeMo microservice workflows.
Note
This reference covers the CRDs managed by NeMo Operator. The NeMo microservices utilize these CRDs, and you don’t interact with them directly.
Important
CRDs are installed only once, during the initial NeMo Microservices Helm installation. After that, Helm doesn’t manage the CRD lifecycle. To upgrade to a newer NeMo Operator version while upgrading the NeMo Microservices Helm chart, see Upgrade NeMo Microservices Helm Chart.
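Because Helm does not update CRDs after the first installation, it can be useful to check which CRDs are actually present in the cluster before and after an upgrade. The following is a minimal sketch, assuming kubectl is on the PATH and configured for the target cluster. The helper names are hypothetical and are not part of NeMo Operator.

```python
import subprocess


def crds_in_group(kubectl_output: str, group: str = "nvidia.com") -> list[str]:
    """Filter `kubectl get crds -o name` output down to CRDs in an API group.

    Each output line has the form
    "customresourcedefinition.apiextensions.k8s.io/<plural>.<group>".
    The match is a suffix match, so subgroups of `group` also match.
    """
    names = []
    for line in kubectl_output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep only the part after the resource-type prefix.
        name = line.split("/", 1)[-1]
        if name.endswith("." + group):
            names.append(name)
    return names


def list_cluster_crds(group: str = "nvidia.com") -> list[str]:
    """Query the current cluster (assumes kubectl is installed and configured)."""
    result = subprocess.run(
        ["kubectl", "get", "crds", "-o", "name"],
        check=True,
        capture_output=True,
        text=True,
    )
    return crds_in_group(result.stdout, group)
```

Calling list_cluster_crds() before and after a chart upgrade gives a quick way to confirm whether the set of installed CRDs changed.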
NemoTrainingJob
Group/Version: nvidia.com/v1alpha1
Kind: NemoTrainingJob
Purpose: This custom resource references NemoTrainingWorkload and NemoEntityHandler resources to manage the full lifecycle of a training job in the NeMo ecosystem. NemoTrainingJob provisions Kubernetes resources to orchestrate the end-to-end training workflow. This includes provisioning persistent volumes, transferring datasets and models between the NeMo Data Store and the persistent volume, and creating lower-level Kubernetes resources to execute the training workloads.
Key Fields:
spec.source: Defines which models and datasets to use for training and how to fetch them using NemoEntityHandlers.
spec.trainingWorkload: Defines the configuration for the actual training workload, such as image, command, and args. A NemoTrainingWorkload reference can contain these configurations.
spec.output: Defines the configuration for exporting the training output using a NemoEntityHandler.
spec.pvc: Defines the PersistentVolumeClaim (PVC) configuration for the training lifecycle.
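Although the NeMo microservices create these resources for you, it can help to picture the top-level shape of a NemoTrainingJob. The sketch below builds a skeleton as a Python dictionary using only the group/version, kind, and key field names listed above. The nested contents of each field are not covered in this reference, so they are left empty, and the helper names are hypothetical.

```python
# Key fields under `spec`, as listed in this reference.
KEY_SPEC_FIELDS = ("source", "trainingWorkload", "output", "pvc")


def training_job_skeleton(name: str) -> dict:
    """Return an illustrative NemoTrainingJob skeleton (not a complete manifest)."""
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "NemoTrainingJob",
        "metadata": {"name": name},
        "spec": {
            # Subfields are intentionally empty: their schema is not
            # described in this reference.
            "source": {},            # models/datasets and how to fetch them
            "trainingWorkload": {},  # image, command, args, or a workload reference
            "output": {},            # how to export the training output
            "pvc": {},               # PersistentVolumeClaim configuration
        },
    }


def missing_key_fields(manifest: dict) -> list[str]:
    """List the key spec fields that are absent from a manifest dictionary."""
    spec = manifest.get("spec", {})
    return [field for field in KEY_SPEC_FIELDS if field not in spec]
```

A check such as missing_key_fields(manifest) is only a convenience for inspecting resources retrieved from the cluster. The CRD schema installed by NeMo Operator remains the authoritative definition.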
NemoTrainingWorkload
Group/Version: nvidia.com/v1alpha1
Kind: NemoTrainingWorkload
Purpose: This custom resource defines the configuration required to execute a training workload. The NemoTrainingJob uses the configuration defined in the NemoTrainingWorkload when creating lower-level Kubernetes resources to perform the training.
Key Fields:
spec.image: Defines the Docker image to run the training workload.
spec.container: Defines the entrypoint command for the training workload.
spec.args: Defines additional arguments to pass to the entrypoint command.
spec.resources: Defines the resources needed by the training workload, such as GPUs, CPU, and memory.
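The key fields above can be modeled as a small data structure. This is a sketch only: the field names come from this reference, but the value types (a string for the image, lists of strings for the entrypoint and arguments, a free-form mapping for resources) are assumptions, and the class name and image reference are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingWorkloadSpec:
    """Illustrative model of the NemoTrainingWorkload key fields.

    Value types are assumptions; consult the installed CRD schema
    for the actual definitions.
    """

    image: str
    container: list[str] = field(default_factory=list)  # entrypoint command
    args: list[str] = field(default_factory=list)       # extra arguments
    resources: dict = field(default_factory=dict)       # GPUs, CPU, memory

    def to_manifest(self, name: str) -> dict:
        """Render the spec as a manifest-shaped dictionary."""
        return {
            "apiVersion": "nvidia.com/v1alpha1",
            "kind": "NemoTrainingWorkload",
            "metadata": {"name": name},
            "spec": {
                "image": self.image,
                "container": list(self.container),
                "args": list(self.args),
                "resources": dict(self.resources),
            },
        }


# Hypothetical values, for illustration only.
workload = TrainingWorkloadSpec(
    image="registry.example.invalid/train:latest",
    container=["python", "train.py"],
    args=["--epochs", "1"],
)
manifest = workload.to_manifest("example-workload")
```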
NemoEntityHandler
Kind: NemoEntityHandler
Purpose: NemoEntityHandlers define workloads for transferring entities (datasets and models) between the persistent volume and the NeMo Data Store during the training lifecycle. The commands defined in NemoEntityHandlers can reference certain environment variables that specify information such as the path of the entity to export, the identifier of the entity in the NeMo Data Store, and the entity type.
Key Fields:
spec.image: Defines the image, customCommand, and additionalEnvVars required to perform the entity handling. The container mounts the training lifecycle’s persistent volume and uses predefined environment variables to specify entity locations.
spec.resources: Defines the resources that the pod needs for performing the entity handling.
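To illustrate how a handler command can reference environment variables, the sketch below expands a command template in Python, mimicking what a shell inside the handler container would do. The variable names (ENTITY_PATH, ENTITY_ID, ENTITY_TYPE), their values, and the upload command are all hypothetical. This reference does not list the actual predefined variable names.

```python
from string import Template

# Hypothetical stand-ins for the predefined environment variables that
# describe the entity path, its identifier in NeMo Data Store, and its type.
HYPOTHETICAL_ENV = {
    "ENTITY_PATH": "/mnt/pvc/output/model",
    "ENTITY_ID": "example-namespace/example-model",
    "ENTITY_TYPE": "model",
}


def render_command(template: str, env: dict[str, str]) -> str:
    """Expand $VAR references in a command template.

    Raises KeyError if the template references a variable that is not
    defined in `env`, which surfaces typos early.
    """
    return Template(template).substitute(env)


# A made-up export command that references the variables above.
command = render_command(
    "upload --type $ENTITY_TYPE --src $ENTITY_PATH --dest $ENTITY_ID",
    HYPOTHETICAL_ENV,
)
```

In a real handler, the container runtime injects the variables and the command reads them at run time, so no separate rendering step is needed.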
CRDs in NIM Operator
See the following references for the CRDs that you can manage through NIM Operator.
NIM Cache
Refer to NIM Cache Custom Resource Definition in the NVIDIA NIM Operator documentation.
NIM Service
Refer to NIM Service Custom Resource Definition in the NVIDIA NIM Operator documentation.
CRDs in Dependencies
The dependencies of the NeMo microservices also create CRDs. Refer to each dependency’s documentation for the CRDs that you need to manage.
CRDs in Volcano: Refer to the Volcano documentation.
CRDs in Run:AI: Refer to the Run:AI documentation.
CRDs in Argo Workflows: Refer to the Argo Workflows documentation.
CRDs in Milvus: Refer to the Milvus documentation.