Per-Stage Runtime Environments
Run pipeline stages with different Python package versions in the same pipeline. Each stage can declare a runtime_env that tells Ray to create an isolated virtualenv for that stage’s workers, so incompatible library versions coexist without conflicts.
Some curation pipelines require stages that depend on different versions of the same library. For example, one stage might need transformers==4.40.0 for a specific model checkpoint, while another stage needs transformers==4.45.0 for a newer model. Without isolation, these stages cannot coexist in the same pipeline.
Per-stage runtime environments solve this by using Ray’s native runtime_env support. When a stage declares a runtime_env, Ray creates and caches an isolated virtualenv under /tmp/ray/session_latest/runtime_resources/pip/<hash>/virtualenv. Each unique dependency set gets its own cached virtualenv, reused for the lifetime of the Ray session. No driver-side virtualenv creation or PYTHONPATH manipulation is needed.
Set runtime_env as a class variable on your ProcessingStage subclass:
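A minimal sketch of the class-variable form. The ProcessingStage class below is a simplified stand-in defined only so the snippet runs on its own (in a real pipeline the base class would come from NeMo Curator), and the two stage names are hypothetical. The {"pip": [...]} dictionary follows Ray's runtime_env format.

```python
from typing import Any, ClassVar, Optional


class ProcessingStage:
    """Simplified stand-in for the real base class (assumption, for illustration)."""

    # None means "run in the base Python environment".
    runtime_env: ClassVar[Optional[dict[str, Any]]] = None


class LegacyModelStage(ProcessingStage):
    """Hypothetical stage pinned to the older transformers release."""

    runtime_env = {"pip": ["transformers==4.40.0"]}


class NewModelStage(ProcessingStage):
    """Hypothetical stage pinned to the newer transformers release."""

    runtime_env = {"pip": ["transformers==4.45.0"]}
```

Because the two dictionaries differ, each stage would get its own cached virtualenv, while any stage that leaves runtime_env unset keeps using the base environment.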
Use with_() to change the runtime environment for a specific pipeline without modifying the stage class:
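The sketch below illustrates the copy-and-override behavior that with_() implies. The stand-in base class and its with_() implementation are assumptions written for illustration; the real method may accept other keyword arguments and differ in detail. TokenizeStage is a hypothetical stage name.

```python
import copy
from typing import Any, Optional


class ProcessingStage:
    """Simplified stand-in for the real base class (assumption)."""

    runtime_env: Optional[dict[str, Any]] = None

    def with_(self, *, runtime_env: Optional[dict[str, Any]] = None) -> "ProcessingStage":
        # Return a modified shallow copy so the original instance
        # and the class-level default both stay unchanged.
        new_stage = copy.copy(self)
        if runtime_env is not None:
            new_stage.runtime_env = runtime_env
        return new_stage


class TokenizeStage(ProcessingStage):
    """Hypothetical stage with a default environment."""

    runtime_env = {"pip": ["transformers==4.40.0"]}


default_stage = TokenizeStage()
# Override the environment for this pipeline only.
pinned_stage = default_stage.with_(runtime_env={"pip": ["transformers==4.45.0"]})
```

The override lives on the returned instance, so other pipelines that construct TokenizeStage directly still see the class default.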
uv as the package installer
Ray also supports uv as the package installer inside worker virtualenvs. Use the "uv" key instead of "pip" for faster installs:
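For illustration, here is the same dependency list expressed under each installer key. The helper name use_uv is hypothetical, and it assumes the "pip" value is a plain list of requirement strings; only the "pip" and "uv" keys themselves come from the text.

```python
from typing import Any


def use_uv(runtime_env: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of runtime_env with a "pip" package list moved under "uv"."""
    converted = dict(runtime_env)  # shallow copy; the input is not modified
    if "pip" in converted:
        converted["uv"] = converted.pop("pip")
    return converted


pip_env = {"pip": ["transformers==4.45.0"]}
uv_env = use_uv(pip_env)  # {"uv": ["transformers==4.45.0"]}
```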
Both "pip" and "uv" keys work regardless of which package manager your local environment uses. The key only controls which installer Ray uses inside the worker virtualenv.
Per-stage runtime environments work with all three execution backends:
[...] runtime_env specification.
The first task dispatched to a new runtime_env triggers virtualenv creation; subsequent tasks reuse the cached environment.
Stages without a runtime_env (the default) run in the base Python environment with no isolation overhead.
The NeMo Curator container image creates its virtualenv with uv venv --seed, which ensures that pip is available inside the venv. This is required because Ray’s pip-based runtime_env plugin clones the current virtualenv and needs pip to install stage-specific packages in the clone.
If you are running outside the official container, make sure your Python environment has pip available:
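One way to check this from Python is sketched below. The function name pip_available is hypothetical; importlib.util.find_spec and the ensurepip module are part of the standard library, although some Linux distributions package ensurepip separately. If you create the environment with uv, the --seed flag mentioned above is what installs pip into it.

```python
import importlib.util
import sys


def pip_available() -> bool:
    """Return True if the pip module can be imported by the current interpreter."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    if pip_available():
        print(f"pip is available in {sys.prefix}")
    else:
        # ensurepip ships with CPython and can bootstrap pip into the active environment.
        print("pip is missing; try: python -m ensurepip --upgrade")
```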
This example runs three stages in a single pipeline, each seeing a different version of the packaging library. It uses RecordPackagingVersionStage, a test stage from PR #1623 that records the packaging library version visible to each worker:
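A minimal sketch of the idea rather than the code from the PR: it assumes the stage reads the installed version through importlib.metadata, and the actual RecordPackagingVersionStage implementation may differ. The three version pins and the names STAGE_ENVS and visible_packaging_version are illustrative assumptions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Three hypothetical environments, each pinning a different packaging release.
# The version numbers are illustrative, not taken from the original example.
STAGE_ENVS = [
    {"pip": ["packaging==23.2"]},
    {"pip": ["packaging==24.0"]},
    {"pip": ["packaging==24.1"]},
]


def visible_packaging_version() -> Optional[str]:
    """Return the packaging version installed in the current environment, if any."""
    try:
        return version("packaging")
    except PackageNotFoundError:
        return None
```

Called inside a worker, visible_packaging_version() would report whichever release that worker's virtualenv installed, which is how the three stages can each observe a different value.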
The first task dispatched to each new runtime_env incurs virtualenv creation time. Subsequent tasks reuse the cached environment.
Each unique runtime_env creates a separate virtualenv on each worker node. Monitor disk space under /tmp/ray/ for large clusters with many distinct environments.
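To keep an eye on disk usage, a small script along these lines could be run on each node. It is a sketch that assumes the cache lives at the path given earlier on this page; the function names are hypothetical.

```python
import os
from pathlib import Path

# Cache location named earlier on this page; adjust if Ray uses a different temp dir.
PIP_ENV_ROOT = Path("/tmp/ray/session_latest/runtime_resources/pip")


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def env_sizes(root: Path = PIP_ENV_ROOT) -> dict[str, int]:
    """Map each cached environment directory (one per hash) to its size in bytes."""
    if not root.is_dir():
        return {}
    return {
        child.name: dir_size_bytes(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


if __name__ == "__main__":
    for env_hash, size in env_sizes().items():
        print(f"{env_hash}: {size / 1e6:.1f} MB")
```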