Configure Jobs#

This section describes how to configure the jobs component of the NeMo Core microservice. This component schedules jobs and is the foundation that the following functional microservices use to execute them:

NeMo Auditor
NeMo Data Designer
NeMo Evaluator
NeMo Safe Synthesizer

Executors#

You can configure jobs through executors, which target different job execution backends and provide CPU and GPU compute.

You can define job executors as a combination of the following attributes:

A profile name (profile)
A compute provider (provider), e.g. cpu or gpu
An execution backend (backend), e.g. docker or kubernetes_job

The Core microservice supports the following execution backends:

Docker (docker)
Kubernetes Jobs (kubernetes_job)
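
The three attributes map onto the fields of a single entry in the executors list. As a minimal sketch, assuming the Docker backend and the storage volume name used in the Docker example later in this section (your volume name may differ), one executor entry could look like this:

```yaml
jobs:
  executors:
    - provider: cpu        # compute provider: cpu or gpu
      profile: default     # profile name
      backend: docker      # execution backend: docker or kubernetes_job
      config:              # backend-specific settings
        storage:
          volume_name: nemo-microservices_jobs_storage
```

The contents of config depend on the backend. The Docker and Kubernetes Jobs sections describe the keys each backend accepts.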

Profiles#

You can configure multiple execution profiles to suit the shape of your compute environment.

By default, the Core microservice defines a default CPU provider, a default GPU provider, or both, which launch CPU-bound and GPU-bound jobs respectively.

You can define any number of execution profiles. For example, if your compute environment has heterogeneous infrastructure (e.g., two types of GPU hardware such as A100 and H200), you can define a list of execution profiles as follows:

jobs:
  executors:
    # Allow any CPU-bound job to run anywhere
    - provider: cpu
      profile: default
      backend: kubernetes_job
      config: {...}
    
    # Run on A100 hardware
    - provider: gpu
      profile: a100-pool
      backend: kubernetes_job
      config:
        # use the appropriate pod scheduling for a100-pool
        node_selector:
          node-pool-name: a100-pool

    # Run on H200 hardware
    - provider: gpu
      profile: h200-pool
      backend: kubernetes_job
      config:
        # use the appropriate pod scheduling for h200-pool
        node_selector:
          node-pool-name: h200-pool

Execution Backends#

Execution backends are the containerized systems that run jobs scheduled by the NeMo microservices platform.

Docker#

The Core microservice supports Docker as an execution backend for CPU-based and GPU-based jobs.

Note: When you run the NeMo microservices platform quickstart, Docker-based job execution is enabled by default and requires no configuration.

jobs:
  executors:
    # Define the default CPU provider
    - provider: cpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-microservices_jobs_storage

    # Define the default GPU provider
    - provider: gpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-microservices_jobs_storage

Kubernetes Jobs#

The Core microservice supports Kubernetes Jobs as an execution backend for CPU-based and GPU-based jobs.

Note: When you deploy with the NeMo Microservices Helm Chart, the logging, storage, and image_pull_secrets settings are configured automatically. They are documented here for transparency.

Note: See the following Core microservice config for advanced configuration.

jobs:
  executors:
    # Define the default CPU provider
    - profile: default
      backend: kubernetes_job
      provider: cpu
      config:
        # Storage is the kubernetes_job storage configuration
        storage:
          # Define the name of a persistent volume claim that will be used by launched jobs
          pvc_name: nemo-core-jobs-storage
        # Logging is the logging configuration for a kubernetes_job
        logging:
          configmap: nemo-core-jobs-logsidecar  
          image:
            repository: fluent/fluent-bit
            tag: 4.0.7
        # image_pull_secrets lists the secrets needed to pull images from container registries
        image_pull_secrets:
          - name: nvcrimagepullsecret

    # Define the default GPU provider
    - profile: default
      backend: kubernetes_job
      provider: gpu
      config:
        storage:
          pvc_name: nemo-core-jobs-storage
        logging:
          configmap: nemo-core-jobs-logsidecar  
          image:
            repository: fluent/fluent-bit
            tag: 4.0.7
        image_pull_secrets:
          - name: nvcrimagepullsecret
        # You can configure custom labels and annotations on Kubernetes Jobs and their pods.
        # This can be useful in environments that integrate with service meshes or similar cluster-level tooling.
        job_metadata:
          labels:
            my-custom-label: "value"
          annotations:
            example.com/annotation: "value"
        pod_metadata:
          labels:
            sidecar.istio.io/inject: "false"
          annotations:
            example.com/annotation: "value"
        # You can configure typical Kubernetes pod scheduling behavior via node selectors, tolerations, and node/pod affinities.
        node_selector:
          kubernetes.io/arch: amd64
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: highmem
                      operator: In
                      values:
                        - "true"

Secrets for Jobs#

Jobs that need secret values, such as API keys, use ephemeral secret storage.

When a job launches, secrets are stored in ephemeral secret storage and then accessed by the job. When a job terminates, the secrets are purged from ephemeral secret storage.

Vault / OpenBao#

The Core microservice supports Vault and OpenBao as backends for job secret storage.

Note: In this release, the Core microservice supports only token-based authentication. Using Vault/OpenBao in a production deployment is not currently recommended.

jobs:
  secrets:
    # Configure job secret storage to use Vault backend
    backend: vault

    # Vault is the Vault/OpenBao configuration
    vault:
      address: http://openbao:8200
      token: your-vault-token
      prefix: /nemo/jobs

Kubernetes#

The Core microservice supports Kubernetes Secrets as a backend for job secret storage.

Note: When you deploy with the NeMo Microservices Helm Chart, this is configured automatically. The following configuration shows the default values and is provided for reference.

jobs:
  secrets:
    # Configure job secret storage to use Kubernetes backend
    backend: kubernetes

    # Kubernetes is the Kubernetes secret storage config
    kubernetes:

      # Configure access via in-cluster or kubeconfig. Defaults to in-cluster.
      config_type: in-cluster

      # The namespace where jobs are created and managed.
      # Defaults to the namespace where the NeMo microservices platform (NMP) is deployed.
      namespace: your-deployment-namespace
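
The config_type comment above indicates that access through a kubeconfig is also possible, for example when the Core microservice is not using in-cluster credentials. A minimal sketch, assuming the value is spelled kubeconfig: the exact spelling, and any additional keys such as a path to the kubeconfig file, are not documented in this section and should be confirmed against the Core microservice config reference.

```yaml
jobs:
  secrets:
    backend: kubernetes
    kubernetes:
      # Assumed value; this section only names "in-cluster or kubeconfig" as the options
      config_type: kubeconfig
      # Set the namespace explicitly rather than relying on the default
      namespace: your-deployment-namespace
```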