Lepton AI Deployment via Launcher
Deploy and evaluate models on the Lepton AI cloud platform using NeMo Evaluator Launcher orchestration. This approach provides scalable cloud inference on managed infrastructure.
Overview
Lepton launcher-orchestrated deployment:
Deploys models on Lepton AI cloud platform
Provides managed infrastructure and scaling
Supports various resource shapes and configurations
Handles deployment lifecycle in the cloud
Quick Start
# Deploy and evaluate on Lepton AI
nv-eval run \
--config-dir examples \
--config-name lepton_vllm_llama_3_1_8b_instruct \
-o deployment.checkpoint_path=meta-llama/Llama-3.1-8B-Instruct \
-o deployment.lepton_config.resource_shape=gpu.1xh200
This command:
Deploys a vLLM endpoint on Lepton AI
Runs the configured evaluation tasks
Returns an invocation ID for monitoring
The launcher handles endpoint creation, evaluation execution, and provides cleanup commands.
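The -o flag takes dotted configuration paths, as in the command above; other keys from the example configurations on this page can be overridden the same way. A sketch with placeholder values:

# Override model identity and replica bounds at launch time
nv-eval run \
  --config-dir examples \
  --config-name lepton_vllm_llama_3_1_8b_instruct \
  -o deployment.served_model_name=my-llama-3.1-8b \
  -o deployment.lepton_config.min_replicas=1 \
  -o deployment.lepton_config.max_replicas=2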
Prerequisites
Lepton AI Setup
# Install Lepton AI CLI
pip install leptonai
# Authenticate with Lepton AI
lep login
Refer to the Lepton AI documentation for authentication and workspace configuration.
Deployment Types
vLLM Lepton Deployment
High-performance inference with cloud scaling:
Refer to the complete working configuration in examples/lepton_vllm_llama_3_1_8b_instruct.yaml. Key configuration sections:
deployment:
  type: vllm
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 1
  lepton_config:
    resource_shape: gpu.1xh200
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600

execution:
  type: lepton
  evaluation_tasks:
    timeout: 3600

evaluation:
  tasks:
    - name: ifeval
The launcher automatically retrieves the endpoint URL after deployment, eliminating the need for manual URL configuration.
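If you raise tensor_parallel_size, the resource shape must supply at least that many GPUs. A minimal sketch; the gpu.2xh200 shape name is illustrative, so check your workspace for the actual multi-GPU shapes:

deployment:
  type: vllm
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 2       # shard the model across two GPUs
  lepton_config:
    resource_shape: gpu.2xh200  # illustrative multi-GPU shape; verify availability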
NIM Lepton Deployment
Enterprise-grade serving in the cloud. Refer to the complete working configuration in examples/lepton_nim_llama_3_1_8b_instruct.yaml:
deployment:
  type: nim
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  lepton_config:
    resource_shape: gpu.1xh200
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600

execution:
  type: lepton

evaluation:
  tasks:
    - name: ifeval
SGLang Deployment
SGLang is also supported as a deployment type. Use deployment.type: sglang with a configuration similar to vLLM's.
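A minimal sketch of an SGLang deployment block, assuming the same fields apply as in the vLLM example above (this page ships no complete SGLang configuration, so treat the block as a starting point):

deployment:
  type: sglang
  checkpoint_path: meta-llama/Llama-3.1-8B-Instruct
  served_model_name: llama-3.1-8b-instruct
  tensor_parallel_size: 1
  lepton_config:
    resource_shape: gpu.1xh200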
Resource Shapes
Resource shapes are Lepton platform-specific identifiers that determine the compute resources allocated to your deployment. Available shapes depend on your Lepton workspace configuration and quota.
Configure in your deployment:
deployment:
  lepton_config:
    resource_shape: gpu.1xh200  # Example: check your Lepton workspace for available shapes
Refer to the Lepton AI documentation or check your workspace settings for available resource shapes in your environment.
Configuration Examples
Auto-Scaling Configuration
Configure auto-scaling behavior through the lepton_config.auto_scaler section:
deployment:
  lepton_config:
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600  # Seconds before scaling down
      scale_from_zero: false
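To pin a deployment at a fixed size, set min_replicas equal to max_replicas; with equal bounds the platform has no room to scale in either direction. A sketch:

deployment:
  lepton_config:
    min_replicas: 2  # fixed pool of two replicas
    max_replicas: 2  # equal bounds disable scaling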
Using Existing Endpoints
To evaluate against an already-deployed Lepton endpoint without creating a new deployment, use deployment.type: none and provide the endpoint URL in the target.api_endpoint section.
Refer to examples/lepton_none_llama_3_1_8b_instruct.yaml for a complete example.
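A minimal sketch of such a configuration; the url and model_id field names are assumptions based on common target.api_endpoint layouts, so verify them against the shipped example:

deployment:
  type: none

target:
  api_endpoint:
    url: https://your-endpoint.lepton.run/v1/chat/completions  # placeholder URL
    model_id: llama-3.1-8b-instruct

evaluation:
  tasks:
    - name: ifeval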
Advanced Configuration
Environment Variables
Pass environment variables to deployment containers through lepton_config.envs:
deployment:
  lepton_config:
    envs:
      HF_TOKEN:
        value_from:
          secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
      CUSTOM_VAR: "direct_value"
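The secret_name_ref above resolves against a secret stored in your Lepton workspace, so the secret must exist before deployment. Assuming your CLI version provides the lep secret subcommand, creating it might look like:

# Store a Hugging Face token as a workspace secret; the name must match secret_name_ref
lep secret create -n HUGGING_FACE_HUB_TOKEN -v hf_xxxxxxxxxxxx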
Storage Mounts
Configure persistent storage for model caching:
deployment:
  lepton_config:
    mounts:
      enabled: true
      cache_path: "/path/to/storage"
      mount_path: "/opt/nim/.cache"
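The /opt/nim/.cache mount path above targets the NIM container's model cache. For a vLLM deployment you would mount the Hugging Face cache instead; a sketch, using the conventional HF cache location rather than anything this page prescribes:

deployment:
  lepton_config:
    mounts:
      enabled: true
      cache_path: "/path/to/storage"
      mount_path: "/root/.cache/huggingface"  # conventional HF cache dir inside the container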
Monitoring and Management
Check Evaluation Status
Use NeMo Evaluator Launcher commands to monitor your evaluations:
# Check status using invocation ID
nv-eval status <invocation_id>
# Kill running evaluations and cleanup endpoints
nv-eval kill <invocation_id>
Monitor Lepton Resources
Use Lepton AI CLI commands to monitor platform resources:
# List all deployments in your workspace
lep deployment list
# Get details about a specific deployment
lep deployment get <deployment-name>
# View deployment logs
lep deployment logs <deployment-name>
# Check resource availability
lep resource list --available
Refer to the Lepton AI CLI documentation for the complete command reference.
Exporting Results
After evaluation completes, export results using the export command:
# Export results to MLflow
nv-eval export <invocation_id> --dest mlflow
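Other destinations follow the same pattern. For instance, assuming your installation includes the Weights & Biases exporter:

# Export the same invocation to Weights & Biases
nv-eval export <invocation_id> --dest wandb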
Refer to the Exporters documentation for additional export options and configurations.
Troubleshooting
Common Issues
Deployment Timeout:
If endpoints take too long to become ready, check deployment logs:
# Check deployment logs via Lepton CLI
lep deployment logs <deployment-name>
# Increase readiness timeout in configuration
# (in execution.lepton_platform.deployment.endpoint_readiness_timeout)
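A sketch of raising that timeout, using the configuration path named in the comment above (the 7200-second value is arbitrary):

execution:
  lepton_platform:
    deployment:
      endpoint_readiness_timeout: 7200  # seconds to wait for the endpoint to report ready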
Resource Unavailable:
If your requested resource shape is unavailable:
# Check available resources in your workspace
lep resource list --available
# Try a different resource shape in your config
Authentication Issues:
# Re-authenticate with Lepton
lep login
Endpoint Not Found:
If evaluation jobs cannot connect to the endpoint:
Verify the endpoint is in the “Ready” state using lep deployment get <deployment-name>
Confirm the endpoint URL is accessible
Verify API tokens are properly set in Lepton secrets
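A quick connectivity check, assuming an OpenAI-compatible endpoint (both vLLM and NIM expose /v1/models); the URL and environment variable name are placeholders:

# Should return a JSON object listing the served model
curl -H "Authorization: Bearer $LEPTON_API_TOKEN" \
  https://your-endpoint.lepton.run/v1/models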
Next Steps
Compare with Slurm Deployment via Launcher for HPC cluster deployments
Explore Local Execution for local development and testing
Review complete configuration examples in the examples/ directory