KubeRayExecutor#
Configure Ray clusters and jobs on Kubernetes via the KubeRay operator.
Note

`KubeRayExecutor` is not used directly with `run.Experiment`. It is passed to the `RayCluster` and `RayJob` helpers. For the full Ray workflow, see Ray Clusters & Jobs.
Prerequisites#
- `kubectl` configured with access to your Kubernetes cluster (`kubectl cluster-info` should succeed)
- KubeRay operator installed in the cluster
- A container image with Ray installed (e.g. `anyscale/ray:2.43.0-py312-cu125`)
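The two cluster-side prerequisites can be sanity-checked before you configure anything. The sketch below wraps the `kubectl` calls in Python; the CRD name `rayclusters.ray.io` is the KubeRay default and is an assumption about your install:

```python
import subprocess

def passes(cmd: list[str]) -> bool:
    """Return True if the command exits 0; False if it fails or is missing."""
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        return False

# kubectl is configured and can reach the cluster
cluster_ok = passes(["kubectl", "cluster-info"])
# the KubeRay operator's CRDs are installed (assumed default CRD name)
operator_ok = passes(["kubectl", "get", "crd", "rayclusters.ray.io"])

print(f"cluster reachable: {cluster_ok}, KubeRay CRDs present: {operator_ok}")
```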
Executor configuration#
```python
from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup

executor = KubeRayExecutor(
    namespace="my-k8s-namespace",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    head_cpu="4",
    head_memory="12Gi",
    worker_groups=[
        KubeRayWorkerGroup(
            group_name="worker",
            replicas=2,
            gpus_per_worker=8,
        )
    ],
    env_vars={
        "HF_HOME": "/workspace/hf_cache",
    },
)
```
Key parameters:
| Parameter | Description |
|---|---|
| `namespace` | Kubernetes namespace for Ray resources |
| `ray_version` | Ray version string (must match the image) |
| `image` | Ray container image |
| `head_cpu`, `head_memory` | Resources for the head pod |
| `worker_groups` | List of `KubeRayWorkerGroup` objects |
`KubeRayWorkerGroup` parameters:

| Parameter | Description |
|---|---|
| `group_name` | Arbitrary name for the worker group |
| `replicas` | Number of worker pods |
| `gpus_per_worker` | GPUs per worker pod |
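The total accelerator request implied by a worker-group list is simply replicas × gpus_per_worker, summed over groups. A quick sanity-check sketch using plain dicts rather than the real `KubeRayWorkerGroup` class:

```python
def total_gpus(worker_groups):
    """Total GPUs the worker groups will request from the cluster."""
    return sum(g["replicas"] * g["gpus_per_worker"] for g in worker_groups)

groups = [
    {"group_name": "worker", "replicas": 2, "gpus_per_worker": 8},
]
print(total_gpus(groups))  # 2 pods x 8 GPUs each = 16
```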
E2E workflow#
Use `KubeRayExecutor` with `RayCluster` and `RayJob` from `nemo_run.run.ray`:
```python
from nemo_run.core.execution.kuberay import KubeRayExecutor, KubeRayWorkerGroup
from nemo_run.run.ray.cluster import RayCluster
from nemo_run.run.ray.job import RayJob

executor = KubeRayExecutor(
    namespace="ml-team",
    ray_version="2.43.0",
    image="anyscale/ray:2.43.0-py312-cu125",
    worker_groups=[
        KubeRayWorkerGroup(group_name="worker", replicas=2, gpus_per_worker=8),
    ],
)

# 1. Start the cluster
cluster = RayCluster(name="my-kuberay-cluster", executor=executor)
cluster.start(timeout=900)
cluster.port_forward(port=8265, target_port=8265, wait=False)  # dashboard

# 2. Submit a job
job = RayJob(name="my-job", executor=executor)
job.start(
    command="python train.py --config cfgs/train.yaml",
    workdir="/path/to/project/",
)
job.logs(follow=True)

# 3. Clean up
cluster.stop()
```
Advanced options#
Persistent volume mounts#
```python
executor = KubeRayExecutor(
    ...,
    volume_mounts=[{"name": "workspace", "mountPath": "/workspace"}],
    volumes=[{
        "name": "workspace",
        "persistentVolumeClaim": {"claimName": "my-workspace-pvc"},
    }],
    reuse_volumes_in_worker_groups=True,  # also mount PVCs on workers
)
```
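Kubernetes pairs each entry in `volume_mounts` with an entry in `volumes` by `name`, so a typo in either list produces a pod that fails to start. A minimal consistency check (a hypothetical helper, not part of NeMo-Run):

```python
def unmatched_mounts(volumes, volume_mounts):
    """Names referenced by mounts that no declared volume provides."""
    declared = {v["name"] for v in volumes}
    return [m["name"] for m in volume_mounts if m["name"] not in declared]

volumes = [{
    "name": "workspace",
    "persistentVolumeClaim": {"claimName": "my-workspace-pvc"},
}]
mounts = [{"name": "workspace", "mountPath": "/workspace"}]
print(unmatched_mounts(volumes, mounts))  # [] -> every mount has a volume
```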
Custom scheduler (e.g. Run:ai)#
```python
executor = KubeRayExecutor(
    ...,
    spec_kwargs={"schedulerName": "runai-scheduler"},
)
```
Pre-Ray commands#
Use `pre_ray_start_commands` to inject shell commands into the head and worker containers before Ray starts:
```python
cluster.start(
    timeout=900,
    pre_ray_start_commands=[
        "pip install uv",
        "echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc",
    ],
)
```
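Conceptually, a list of pre-start commands behaves like a shell sequence where the first failure aborts the rest; chaining with `&&` is the standard way to get that semantics. A sketch of the general pattern, not NeMo-Run's actual injection mechanism:

```python
def chain_commands(commands):
    """Join setup commands so the first failure stops the sequence."""
    return " && ".join(commands)

print(chain_commands([
    "pip install uv",
    "echo 'unset RAY_RUNTIME_ENV_HOOK' >> /home/ray/.bashrc",
]))
```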