KubeRayExecutor#

Configure Ray clusters and jobs on Kubernetes via the KubeRay operator.

Note

KubeRayExecutor is not used directly with run.Experiment. Instead, it is passed to the RayCluster and RayJob helpers. For the full Ray workflow, see Ray Clusters & Jobs.

Prerequisites#

  • kubectl configured with access to your Kubernetes cluster (kubectl cluster-info should succeed; a programmatic version of this check is sketched below)

  • KubeRay operator installed in the cluster

  • A container image with Ray installed (e.g. anyscale/ray:2.43.0-py312-cu125)
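
To verify the kubectl prerequisite from Python before building an executor, a minimal sketch (this is a plain subprocess call, not NeMo-Run API):

import subprocess

# Fail fast if kubectl cannot reach the cluster.
result = subprocess.run(
    ["kubectl", "cluster-info"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    raise RuntimeError(f"kubectl cannot reach the cluster:\n{result.stderr}")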

Executor configuration#

from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup

executor = KubeRayExecutor(
    namespace="my-k8s-namespace",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    head_cpu="4",
    head_memory="12Gi",
    worker_groups=[
        KubeRayWorkerGroup(
            group_name="worker",
            replicas=2,
            gpus_per_worker=8,
        )
    ],
    env_vars={
        "HF_HOME": "/workspace/hf_cache",
    },
)

Key parameters:

  • namespace: Kubernetes namespace for Ray resources

  • ray_version: Ray version string (must match the image)

  • image: Ray container image

  • head_cpu / head_memory: Resources for the head pod

  • worker_groups: List of KubeRayWorkerGroup definitions

KubeRayWorkerGroup parameters:

  • group_name: Arbitrary name for the worker group

  • replicas: Number of worker pods

  • gpus_per_worker: GPUs per worker pod
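
Because worker_groups is a list, a single cluster can define several worker groups. A hypothetical two-group configuration (the group names, replica counts, and GPU counts below are illustrative, not required values):

from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup

executor = KubeRayExecutor(
    namespace="my-k8s-namespace",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    worker_groups=[
        # Larger GPU pool, e.g. for training.
        KubeRayWorkerGroup(group_name="train-workers", replicas=2, gpus_per_worker=8),
        # Smaller GPU pool, e.g. for evaluation.
        KubeRayWorkerGroup(group_name="eval-workers", replicas=1, gpus_per_worker=1),
    ],
)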

E2E workflow#

Use KubeRayExecutor with RayCluster and RayJob from nemo_run.run.ray:

from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup
from nemo_run.run.ray.cluster import RayCluster
from nemo_run.run.ray.job import RayJob

executor = KubeRayExecutor(
    namespace="ml-team",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    worker_groups=[
        KubeRayWorkerGroup(group_name="worker", replicas=2, gpus_per_worker=8),
    ],
)

# 1. Start the cluster
cluster = RayCluster(name="my-kuberay-cluster", executor=executor)
cluster.start(timeout=900)
cluster.port_forward(port=8265, target_port=8265, wait=False)  # dashboard at http://localhost:8265

# 2. Submit a job
job = RayJob(name="my-job", executor=executor)
job.start(
    command="python train.py --config cfgs/train.yaml",
    workdir="/path/to/project/",
)
job.logs(follow=True)

# 3. Clean up
cluster.stop()
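
If the job can fail, it is worth guaranteeing teardown. A sketch of the same workflow with cleanup moved into a finally block (identical API calls to the example above, just reordered for safety):

cluster = RayCluster(name="my-kuberay-cluster", executor=executor)
cluster.start(timeout=900)
try:
    job = RayJob(name="my-job", executor=executor)
    job.start(
        command="python train.py --config cfgs/train.yaml",
        workdir="/path/to/project/",
    )
    job.logs(follow=True)
finally:
    # Stop the cluster even if job submission or log streaming raises.
    cluster.stop()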

Advanced options#

Persistent volume mounts#

executor = KubeRayExecutor(
    ...,
    volume_mounts=[{"name": "workspace", "mountPath": "/workspace"}],
    volumes=[{
        "name": "workspace",
        "persistentVolumeClaim": {"claimName": "my-workspace-pvc"},
    }],
    reuse_volumes_in_worker_groups=True,  # also mount PVCs on workers
)
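
Written out in full with the basic configuration from the top of this page (the claim name my-workspace-pvc is a placeholder; the PVC must already exist in the target namespace):

executor = KubeRayExecutor(
    namespace="my-k8s-namespace",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    worker_groups=[
        KubeRayWorkerGroup(group_name="worker", replicas=2, gpus_per_worker=8),
    ],
    # Mount the PVC at /workspace in the head pod ...
    volume_mounts=[{"name": "workspace", "mountPath": "/workspace"}],
    volumes=[{
        "name": "workspace",
        "persistentVolumeClaim": {"claimName": "my-workspace-pvc"},
    }],
    # ... and in every worker group as well.
    reuse_volumes_in_worker_groups=True,
)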

Custom scheduler (e.g. Run:ai)#

executor = KubeRayExecutor(
    ...,
    spec_kwargs={"schedulerName": "runai-scheduler"},
)

Pre-Ray commands#

The following commands are injected into both the head and worker containers before Ray starts:

cluster.start(
    timeout=900,
    pre_ray_start_commands=[
        "pip install uv",
        "echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc",
    ],
)