# Manage Jobs
This section describes how to configure jobs in NeMo Platform. The Jobs service is responsible for scheduling batch jobs, collecting telemetry, and managing job results.
## Execution Profiles
You can configure jobs to run on different hardware by specifying an execution profile. This allows you to define different compute resources and job execution backends for different job types and use cases.
An execution profile is defined by the following attributes:

- A profile name (`profile`)
- A compute provider (e.g., `cpu` or `gpu`)
- An execution backend (e.g., `docker`, `kubernetes_job`, or `volcano_job`)
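To make the schema concrete, an execution profile can be modeled as a small record with exactly these attributes plus a backend-specific config. The class and validation below are purely illustrative (they are not part of the platform); the provider and backend values come from the lists in this section, including the `gpu_distributed` provider used in the Volcano example later on.

```python
from dataclasses import dataclass, field

# Values taken from this section's examples; the platform may accept others.
PROVIDERS = {"cpu", "gpu", "gpu_distributed"}
BACKENDS = {"docker", "kubernetes_job", "volcano_job"}

@dataclass
class ExecutionProfile:
    """Illustrative model of one entry under jobs.executors."""
    profile: str          # profile name, e.g. "default"
    provider: str         # compute provider, e.g. "cpu" or "gpu"
    backend: str          # execution backend, e.g. "kubernetes_job"
    config: dict = field(default_factory=dict)  # backend-specific settings

    def __post_init__(self):
        if self.provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider}")
        if self.backend not in BACKENDS:
            raise ValueError(f"unknown backend: {self.backend}")

# Example: a GPU profile pinned to an A100 node pool
a100 = ExecutionProfile(
    profile="a100-pool",
    provider="gpu",
    backend="kubernetes_job",
    config={"node_selector": {"node-pool-name": "a100-pool"}},
)
```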
By default, the NeMo Platform defines a default CPU provider and a default GPU provider for launching CPU-bound and GPU-bound jobs.
You can configure multiple execution profiles to suit the shape of your compute environment. For example, if you have a compute environment with heterogeneous infrastructure (e.g., two types of GPU hardware such as A100 and H200), you can define a list of execution profiles as follows:
```yaml
jobs:
  executors:
    # Allow any CPU-bound job to run anywhere
    - provider: cpu
      profile: default
      backend: kubernetes_job
      config: {...}
    # Run on A100 hardware
    - provider: gpu
      profile: a100-pool
      backend: kubernetes_job
      config:
        # Use the appropriate pod scheduling for a100-pool
        node_selector:
          node-pool-name: a100-pool
    # Run on H200 hardware
    - provider: gpu
      profile: h200-pool
      backend: kubernetes_job
      config:
        # Use the appropriate pod scheduling for h200-pool
        node_selector:
          node-pool-name: h200-pool
```
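When a job requests a given provider and profile, the Jobs service selects the matching executor from this list. The exact matching rules are internal to the platform; the helper below is only a plausible sketch of that lookup, and the fallback to the provider's `default` profile is an assumption, not documented behavior.

```python
def select_executor(executors, provider, profile=None):
    """Pick the executor entry matching (provider, profile).

    Falls back to the provider's "default" profile when no profile is
    given or the named profile is not found. This fallback is an
    assumption for illustration, not documented platform behavior.
    """
    wanted = profile or "default"
    for entry in executors:
        if entry["provider"] == provider and entry["profile"] == wanted:
            return entry
    if wanted != "default":
        return select_executor(executors, provider, "default")
    raise LookupError(f"no executor for provider={provider!r}")

# The heterogeneous configuration above, reduced to its routing fields
executors = [
    {"provider": "cpu", "profile": "default", "backend": "kubernetes_job"},
    {"provider": "gpu", "profile": "a100-pool", "backend": "kubernetes_job"},
    {"provider": "gpu", "profile": "h200-pool", "backend": "kubernetes_job"},
]
```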
For full configuration details, see the platform configuration reference.
### Default Execution Profiles
The NeMo Platform defines a default execution profile for each execution backend depending on the platform’s control plane. The default execution profile is used when no specific execution profile is specified for a job.
- Default CPU Execution Profile (`cpu`): The default execution profile for CPU-based jobs.
- Default GPU Execution Profile (`gpu`): The default execution profile for GPU-based jobs.
You can override the default execution profiles by updating the `executor_defaults` section under `jobs` in the platform configuration. Each entry in `executor_defaults` uses the same structure as the corresponding execution backend's configuration.
```yaml
jobs:
  executor_defaults:
    # Override the default Docker execution profile configuration
    docker:
      storage:
        volume_name: nemo-jobs-storage
    # Override the default Kubernetes execution profile configuration
    kubernetes_job:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      node_selector:
        kubernetes.io/arch: amd64
    # Override the default Volcano execution profile configuration
    volcano_job:
      storage:
        pvc_name: nemo-jobs-storage
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      node_selector:
        kubernetes.io/arch: amd64
```
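Because `executor_defaults` mirrors the per-backend config structure, one way to think about it is a recursive merge in which a profile's own `config` overrides the backend defaults key by key. The merge function below is only an illustration of that mental model, not the platform's actual implementation.

```python
def deep_merge(defaults, overrides):
    """Recursively merge two config dicts; overrides win on conflicts.

    Illustrative only -- how the platform actually combines
    executor_defaults with a profile's config is not documented here.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Backend defaults (from the kubernetes_job example above) combined with
# one profile's node-pool-specific config
kubernetes_defaults = {
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    ],
    "node_selector": {"kubernetes.io/arch": "amd64"},
}
profile_config = {"node_selector": {"node-pool-name": "a100-pool"}}

effective = deep_merge(kubernetes_defaults, profile_config)
# effective["node_selector"] now carries both the default arch selector
# and the profile's node-pool selector; tolerations come from defaults.
```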
## Execution Backends
Execution backends are the containerized job execution systems that NeMo Platform jobs are scheduled to run on. Each execution backend is responsible for launching and managing the job containers, and the Jobs service communicates with the execution backend to schedule and manage jobs.
NeMo Platform currently supports the following execution backends:
- Docker (`docker`)
- Kubernetes Jobs (`kubernetes_job`)
- Volcano Jobs (`volcano_job`)
### Docker
The NeMo Platform supports Docker as an execution backend for CPU- and GPU-based jobs. When the shared GPU pool is configured (see the tip below), Docker uses only the configured GPU devices and does not over-schedule jobs when not enough GPUs are currently available.
Note: When running NeMo Platform in quickstart mode, Docker-based job execution is enabled by default and requires no configuration.
Tip: If you are running GPU jobs with Docker, see GPU Configuration Overview for information on configuring the shared GPU pool. This configuration is shared between the Jobs and Models services to prevent GPU resource conflicts.
```yaml
jobs:
  executors:
    # Define the default CPU provider
    - provider: cpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-platform_jobs_storage
    # Define the default GPU provider
    - provider: gpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-microservices_jobs_storage
```
### Kubernetes Jobs
The NeMo Platform supports Kubernetes Jobs as an execution backend for CPU- and GPU-based jobs.
```yaml
jobs:
  executors:
    # Define the default CPU provider
    - profile: default
      backend: kubernetes_job
      provider: cpu
      config:
        # The kubernetes_job storage configuration
        storage:
          # Name of the persistent volume claim used by launched jobs
          pvc_name: nemo-core-jobs-storage
    # Define the default GPU provider
    - profile: default
      backend: kubernetes_job
      provider: gpu
      config:
        storage:
          pvc_name: nemo-core-jobs-storage
        # You can configure custom labels and annotations on Kubernetes
        # Jobs and their pods. This is useful in environments that require
        # integration with service meshes or similar cluster-level systems.
        job_metadata:
          labels:
            my-custom-label: "value"
          annotations:
            example.com/annotation: "value"
        pod_metadata:
          labels:
            sidecar.istio.io/inject: "false"
          annotations:
            example.com/annotation: "value"
        # You can configure typical Kubernetes pod scheduling behavior via
        # node selectors, tolerations, and node/pod affinities.
        node_selector:
          kubernetes.io/arch: amd64
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: highmem
                      operator: In
                      values:
                        - "true"
```
### Volcano Jobs
The NeMo Platform supports Volcano Jobs as an execution backend for launching distributed GPU jobs.
Volcano Jobs use the same configuration as Kubernetes Jobs, plus the following additional options:
- `queue`: The Volcano queue to submit the job to.
- `scheduler_name`: The Volcano scheduler to use for the job.
- `plugins`: The Volcano plugins to use for the job.
- `max_retry`: The maximum number of retries for the job.
- `enable_multi_node_networking`: Enable multi-node networking injection. Sets annotations that trigger Kyverno policy mutations. This option is only available if the platform is configured to use multi-node networking (see Multinode Networking).
```yaml
jobs:
  executors:
    - profile: default
      backend: volcano_job
      provider: gpu_distributed
      config:
        queue: default
        scheduler_name: volcano
        plugins:
          pytorch: ["--master=leader", "--worker=worker", "--port=23456"]
        max_retry: 0
        # Additional configuration for the Volcano Job, shared with Kubernetes Jobs
        ...
```
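Inside the launched pods, Volcano's pytorch plugin typically injects rendezvous environment variables (for example `MASTER_ADDR` and `MASTER_PORT`, the latter corresponding to the `--port` flag above) that a training script can read to initialize distributed communication. The snippet below sketches that pattern; treat the variable names and fallback values as assumptions about typical plugin behavior rather than guarantees from this platform.

```python
import os

# These variables are assumed to be injected by the Volcano pytorch
# plugin; the fallbacks only let the snippet run outside a pod.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "23456"))

# A training script would pass this rendezvous address (plus rank and
# world size) to torch.distributed.init_process_group; torch itself is
# omitted here to keep the sketch self-contained.
init_method = f"tcp://{master_addr}:{master_port}"
print(init_method)
```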