# Manage Jobs
This section describes how to configure jobs in NeMo Platform. The Jobs service is responsible for scheduling batch jobs, collecting telemetry, and managing job results.
## Execution Profiles
You can configure jobs to run on different hardware by specifying an execution profile. This allows you to define different compute resources and job execution backends for different job types and use cases.
An execution profile is defined by the following attributes:

- A profile name (`profile`)
- A compute provider (e.g., `cpu` or `gpu`)
- An execution backend (e.g., `docker`, `kubernetes_job`, or `volcano_job`)
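To make the schema concrete, an execution profile can be modeled as a small record with exactly these attributes plus a backend-specific config. The class and validation below are purely illustrative (they are not part of the platform); the provider and backend values come from the lists in this section, including the `gpu_distributed` provider used in the Volcano example later on.

```python
from dataclasses import dataclass, field

# Values taken from this section's examples; the platform may accept others.
PROVIDERS = {"cpu", "gpu", "gpu_distributed"}
BACKENDS = {"docker", "kubernetes_job", "volcano_job"}

@dataclass
class ExecutionProfile:
    """Illustrative model of one entry under jobs.executors."""
    profile: str          # profile name, e.g. "default"
    provider: str         # compute provider, e.g. "cpu" or "gpu"
    backend: str          # execution backend, e.g. "kubernetes_job"
    config: dict = field(default_factory=dict)  # backend-specific settings

    def __post_init__(self):
        if self.provider not in PROVIDERS:
            raise ValueError(f"unknown provider: {self.provider}")
        if self.backend not in BACKENDS:
            raise ValueError(f"unknown backend: {self.backend}")

# Example: a GPU profile pinned to an A100 node pool
a100 = ExecutionProfile(
    profile="a100-pool",
    provider="gpu",
    backend="kubernetes_job",
    config={"node_selector": {"node-pool-name": "a100-pool"}},
)
```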
By default, the NeMo Platform defines a default CPU provider and a default GPU provider for launching CPU-bound and GPU-bound jobs.
You can configure multiple execution profiles to suit the shape of your compute environment. For example, if you have a compute environment with heterogeneous infrastructure (e.g., two types of GPU hardware such as A100 and H200), you can define a list of execution profiles as follows:
```yaml
jobs:
  executors:
    # Allow any CPU-bound job to run anywhere
    - provider: cpu
      profile: default
      backend: kubernetes_job
      config: {...}
    # Run on A100 hardware
    - provider: gpu
      profile: a100-pool
      backend: kubernetes_job
      config:
        # Use the appropriate pod scheduling for a100-pool
        node_selector:
          node-pool-name: a100-pool
    # Run on H200 hardware
    - provider: gpu
      profile: h200-pool
      backend: kubernetes_job
      config:
        # Use the appropriate pod scheduling for h200-pool
        node_selector:
          node-pool-name: h200-pool
```
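When a job requests a given provider and profile, the Jobs service selects the matching executor from this list. The exact matching rules are internal to the platform; the helper below is only a plausible sketch of that lookup, and the fallback to the provider's `default` profile is an assumption, not documented behavior.

```python
def select_executor(executors, provider, profile=None):
    """Pick the executor entry matching (provider, profile).

    Falls back to the provider's "default" profile when no profile is
    given or the named profile is not found. This fallback is an
    assumption for illustration, not documented platform behavior.
    """
    wanted = profile or "default"
    for entry in executors:
        if entry["provider"] == provider and entry["profile"] == wanted:
            return entry
    if wanted != "default":
        return select_executor(executors, provider, "default")
    raise LookupError(f"no executor for provider={provider!r}")

# The heterogeneous configuration above, reduced to its routing fields
executors = [
    {"provider": "cpu", "profile": "default", "backend": "kubernetes_job"},
    {"provider": "gpu", "profile": "a100-pool", "backend": "kubernetes_job"},
    {"provider": "gpu", "profile": "h200-pool", "backend": "kubernetes_job"},
]
```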
For full configuration details, see the platform configuration reference.
### Default Execution Profiles
The NeMo Platform defines a default execution profile for each execution backend depending on the platform’s control plane. The default execution profile is used when no specific execution profile is specified for a job.
- Default CPU Execution Profile (`cpu`): The default execution profile for CPU-based jobs.
- Default GPU Execution Profile (`gpu`): The default execution profile for GPU-based jobs.
You can override the default execution profiles by updating the `executor_defaults` section under `jobs` in the platform configuration. Each entry in `executor_defaults` uses the same structure as the corresponding execution backend's configuration.
```yaml
jobs:
  executor_defaults:
    # Override the default Docker execution profile configuration
    docker:
      storage:
        volume_name: nemo-jobs-storage
    # Override the default Kubernetes execution profile configuration
    kubernetes_job:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      node_selector:
        kubernetes.io/arch: amd64
    # Override the default Volcano execution profile configuration
    volcano_job:
      storage:
        pvc_name: nemo-jobs-storage
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      node_selector:
        kubernetes.io/arch: amd64
```
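Because `executor_defaults` mirrors the per-backend config structure, one way to think about it is a recursive merge in which a profile's own `config` overrides the backend defaults key by key. The merge function below is only an illustration of that mental model, not the platform's actual implementation.

```python
def deep_merge(defaults, overrides):
    """Recursively merge two config dicts; overrides win on conflicts.

    Illustrative only -- how the platform actually combines
    executor_defaults with a profile's config is not documented here.
    """
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# Backend defaults (from the kubernetes_job example above) combined with
# one profile's node-pool-specific config
kubernetes_defaults = {
    "tolerations": [
        {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"},
    ],
    "node_selector": {"kubernetes.io/arch": "amd64"},
}
profile_config = {"node_selector": {"node-pool-name": "a100-pool"}}

effective = deep_merge(kubernetes_defaults, profile_config)
# effective["node_selector"] now carries both the default arch selector
# and the profile's node-pool selector; tolerations come from defaults.
```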
## Execution Backends
Execution backends are the containerized job execution systems that NeMo Platform jobs are scheduled to run on. Each execution backend is responsible for launching and managing the job containers, and the Jobs service communicates with the execution backend to schedule and manage jobs.
NeMo Platform currently supports the following execution backends:
- Docker (`docker`)
- Kubernetes Jobs (`kubernetes_job`)
- Volcano Jobs (`volcano_job`)
### Docker
The NeMo Platform supports Docker as an execution backend for CPU- and GPU-based jobs. When the shared GPU pool is configured (see the tip below), Docker uses only the configured GPU devices and does not over-schedule jobs when not enough GPUs are currently available.
Note: When running NeMo Platform in quickstart mode, Docker-based job execution is enabled by default and requires no configuration.
Tip: If you are running GPU jobs with Docker, see GPU Configuration Overview for information on configuring the shared GPU pool. This configuration is shared between the Jobs and Models services to prevent GPU resource conflicts.
```yaml
jobs:
  executors:
    # Define the default CPU provider
    - provider: cpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-platform_jobs_storage
    # Define the default GPU provider
    - provider: gpu
      profile: default
      backend: docker
      config:
        storage:
          volume_name: nemo-microservices_jobs_storage
```
### Kubernetes Jobs
The NeMo Platform supports Kubernetes Jobs as an execution backend for CPU- and GPU-based jobs.
```yaml
jobs:
  executors:
    # Define the default CPU provider
    - profile: default
      backend: kubernetes_job
      provider: cpu
      config:
        # The kubernetes_job storage configuration
        storage:
          # Name of the persistent volume claim used by launched jobs
          pvc_name: nemo-core-jobs-storage
    # Define the default GPU provider
    - profile: default
      backend: kubernetes_job
      provider: gpu
      config:
        storage:
          pvc_name: nemo-core-jobs-storage
        # You can configure custom labels and annotations on Kubernetes
        # Jobs and their pods. This is useful in environments that require
        # integration with service meshes or similar cluster-level systems.
        job_metadata:
          labels:
            my-custom-label: "value"
          annotations:
            example.com/annotation: "value"
        pod_metadata:
          labels:
            sidecar.istio.io/inject: "false"
          annotations:
            example.com/annotation: "value"
        # You can configure typical Kubernetes pod scheduling behavior via
        # node selectors, tolerations, and node/pod affinities.
        node_selector:
          kubernetes.io/arch: amd64
        tolerations:
          - key: nvidia.com/gpu
            operator: Exists
            effect: NoSchedule
        affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: highmem
                      operator: In
                      values:
                        - "true"
```
### Volcano Jobs
The NeMo Platform supports Volcano Jobs as an execution backend for launching distributed GPU jobs.
Volcano Jobs use the same configuration as Kubernetes Jobs, plus the following additional options:
- `queue`: The Volcano queue to submit the job to.
- `scheduler_name`: The Volcano scheduler to use for the job.
- `plugins`: The Volcano plugins to use for the job.
- `max_retry`: The maximum number of retries for the job.
- `enable_multi_node_networking`: Enable multi-node networking injection. Sets annotations that trigger Kyverno policy mutations. This option is only available if the platform is configured to use multi-node networking (see Multinode Networking).
```yaml
jobs:
  executors:
    - profile: default
      backend: volcano_job
      provider: gpu_distributed
      config:
        queue: default
        scheduler_name: volcano
        plugins:
          pytorch: ["--master=leader", "--worker=worker", "--port=23456"]
        max_retry: 0
        # Additional configuration for the Volcano Job, shared with Kubernetes Jobs
        ...
```
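Inside the launched pods, Volcano's pytorch plugin typically injects rendezvous environment variables (for example `MASTER_ADDR` and `MASTER_PORT`, the latter corresponding to the `--port` flag above) that a training script can read to initialize distributed communication. The snippet below sketches that pattern; treat the variable names and fallback values as assumptions about typical plugin behavior rather than guarantees from this platform.

```python
import os

# These variables are assumed to be injected by the Volcano pytorch
# plugin; the fallbacks only let the snippet run outside a pod.
master_addr = os.environ.get("MASTER_ADDR", "localhost")
master_port = int(os.environ.get("MASTER_PORT", "23456"))

# A training script would pass this rendezvous address (plus rank and
# world size) to torch.distributed.init_process_group; torch itself is
# omitted here to keep the sketch self-contained.
init_method = f"tcp://{master_addr}:{master_port}"
print(init_method)
```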