NIM Deployment#
NIM (NVIDIA Inference Microservices) provides optimized inference microservices with OpenAI-compatible APIs. NIM deployments automatically handle model optimization, scaling, and resource management on supported platforms.
Configuration#
Basic Settings#
deployment:
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  port: 8000
image
: NIM container image from NVIDIA NIM Containers (required)
served_model_name
: Name used for serving the model (required)
port
: Port for the NIM server (default: 8000)
Endpoints#
endpoints:
  chat: /v1/chat/completions
  completions: /v1/completions
  health: /health
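Combined with the configured port, these paths resolve to full URLs on the deployed endpoint. An annotated sketch (the host placeholder is illustrative; Lepton assigns the actual endpoint URL):

endpoints:
  chat: /v1/chat/completions      # http://<endpoint-host>:8000/v1/chat/completions
  completions: /v1/completions    # http://<endpoint-host>:8000/v1/completions
  health: /health                 # health-check path used to verify the server is up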
Integration with Lepton#
NIM deployment with Lepton executor:
defaults:
  - execution: lepton/default
  - deployment: nim
  - _self_

deployment:
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct

  # Platform-specific settings
  lepton_config:
    endpoint_name: nim-llama-3-1-8b-eval
    resource_shape: gpu.1xh200
    # ... additional platform settings
Environment Variables#
Configure environment variables for NIM container operation:
deployment:
  lepton_config:
    envs:
      HF_TOKEN:
        value_from:
          secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
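Secret references keep tokens out of the config file; literal values can also be set directly. A hedged sketch, assuming envs accept the same value/value_from forms shown for api_tokens below (NIM_LOG_LEVEL is an illustrative variable; consult your NIM image's documentation for the variables it honors):

deployment:
  lepton_config:
    envs:
      # Literal value, stored as plain text in the config
      NIM_LOG_LEVEL:
        value: "INFO"
      # Secret reference, resolved from the Lepton workspace secret store
      HF_TOKEN:
        value_from:
          secret_name_ref: "HUGGING_FACE_HUB_TOKEN"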
Auto-populated Variables:
The launcher automatically sets these environment variables from your deployment configuration:
SERVED_MODEL_NAME
: Set from deployment.served_model_name
NIM_MODEL_NAME
: Set from deployment.served_model_name
MODEL_PORT
: Set from deployment.port (default: 8000)
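For example, given the basic settings shown earlier, the mapping works out as follows (an illustrative restatement, not additional configuration):

# This deployment configuration ...
deployment:
  served_model_name: meta/llama-3.1-8b-instruct
  port: 8000
# ... yields these container environment variables:
#   SERVED_MODEL_NAME=meta/llama-3.1-8b-instruct
#   NIM_MODEL_NAME=meta/llama-3.1-8b-instruct
#   MODEL_PORT=8000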
Resource Management#
Auto-scaling Configuration#
deployment:
  lepton_config:
    min_replicas: 1
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 3600
        scale_from_zero: false
      target_gpu_utilization_percentage: 0
      target_throughput:
        qpm: 2.5
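To let an idle endpoint scale down to zero replicas between runs, a hedged variant (assuming your Lepton workspace permits scale-from-zero; cold starts add latency to the first request after scale-down):

deployment:
  lepton_config:
    min_replicas: 0              # allow the endpoint to idle at zero replicas
    max_replicas: 3
    auto_scaler:
      scale_down:
        no_traffic_timeout: 1800   # scale down after 30 minutes without traffic
        scale_from_zero: true      # start a replica on the next incoming request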
Storage Mounts#
Enable model caching for faster startup:
deployment:
  lepton_config:
    mounts:
      enabled: true
      cache_path: "/path/to/model/cache"
      mount_path: "/opt/nim/.cache"
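When cache_path points at persistent storage, subsequent deployments reuse the downloaded model weights instead of fetching them again, which is what makes startup faster on relaunch.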
Security Configuration#
API Tokens#
deployment:
  lepton_config:
    api_tokens:
      - value: "UNIQUE_ENDPOINT_TOKEN"
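To keep the token itself out of the config file, reference a token stored in the workspace instead, as the complete example below does:

deployment:
  lepton_config:
    api_tokens:
      - value_from:
          token_name_ref: "ENDPOINT_API_KEY"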
Image Pull Secrets#
Images hosted on nvcr.io require registry credentials:

execution:
  lepton_platform:
    tasks:
      image_pull_secrets:
        - "lepton-nvidia-registry-secret"
Complete Example#
defaults:
  - execution: lepton/default
  - deployment: nim
  - _self_

execution:
  output_dir: lepton_nim_llama_3_1_8b_results

deployment:
  image: nvcr.io/nim/meta/llama-3.1-8b-instruct:1.8.6
  served_model_name: meta/llama-3.1-8b-instruct
  lepton_config:
    endpoint_name: llama-3-1-8b
    resource_shape: gpu.1xh200
    min_replicas: 1
    max_replicas: 3
    api_tokens:
      - value_from:
          token_name_ref: "ENDPOINT_API_KEY"
    envs:
      HF_TOKEN:
        value_from:
          secret_name_ref: "HUGGING_FACE_HUB_TOKEN"
    mounts:
      enabled: true
      cache_path: "/path/to/model/cache"
      mount_path: "/opt/nim/.cache"

evaluation:
  tasks:
    - name: ifeval
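The task list can name additional benchmarks; a hedged sketch (available task names depend on the evaluation containers shipped with your launcher installation, so verify them before running):

evaluation:
  tasks:
    - name: ifeval
    - name: gpqa_diamond   # hypothetical addition; confirm the task exists in your setup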
Examples#
Refer to packages/nemo-evaluator-launcher/examples/lepton_nim_llama_3_1_8b_instruct.yaml for a complete NIM deployment example.