## Getting Started
Choose the components that fit your needs. Here are common adoption paths:
### Full Stack Deployment
- Infrastructure: Deploy GPU Operator and Network Operator to your Kubernetes cluster
- Containers: Pull optimized containers from nvcr.io
- Optimize: Use Model Optimizer with TensorRT or TensorRT-LLM (a quantization sketch follows this list)
- Plan: Use AIConfigurator to estimate performance and plan deployment topology
- Deploy: Use KAI Scheduler (add Grove for multi-node deployments) to deploy Triton or Dynamo
- Tune: Use AIPerf for benchmarking and Planner for runtime optimization
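
For the Optimize step, the following is a minimal sketch of post-training quantization with Model Optimizer's PyTorch API (`modelopt.torch.quantization`). The toy model, calibration batches, and the `FP8_DEFAULT_CFG` choice are illustrative assumptions; pick the configuration that matches your target precision and hardware.

```python
import torch
import modelopt.torch.quantization as mtq

# Stand-ins: any torch.nn.Module headed for a TensorRT or TensorRT-LLM build,
# plus a handful of representative calibration batches.
model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()).cuda()
calib_batches = [torch.randn(8, 1024, device="cuda") for _ in range(16)]

def forward_loop(m):
    # Run calibration data through the model so the inserted quantizers
    # can collect activation ranges.
    with torch.no_grad():
        for batch in calib_batches:
            m(batch)

# Apply one of the predefined quantization configs (FP8 here as an example).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```

The quantized module then goes through your usual TensorRT or TensorRT-LLM build flow before deployment.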
### Traditional ML Inference Only
- Optimize: Use TensorRT to optimize your models
- Serve: Deploy with Triton Inference Server (a client request sketch follows this list)
- Optional: Add DALI for GPU-accelerated preprocessing
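
Once a model is live in Triton Inference Server, it can be queried over HTTP with the `tritonclient` package. This is a minimal sketch; the endpoint, model name, and tensor names/shapes below are placeholders that must match your deployed model's configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Placeholder assumptions: Triton listens on localhost:8000 and serves a model
# named "my_model" with one FP32 input "input" and one output "output".
client = httpclient.InferenceServerClient(url="localhost:8000")

data = np.random.rand(1, 3, 224, 224).astype(np.float32)
inputs = [httpclient.InferInput("input", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [httpclient.InferRequestedOutput("output")]

result = client.infer(model_name="my_model", inputs=inputs, outputs=outputs)
print(result.as_numpy("output").shape)
```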
### GenAI/LLM Inference Only
- Optimize: Use TensorRT-LLM to optimize your LLM (see the sketch after this list)
- Serve: Deploy with Dynamo
- Scale: Add KV Block Manager, NIXL, and Router for distributed inference
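
As a quick way to exercise an optimized LLM before wiring it into Dynamo, TensorRT-LLM includes a high-level Python `LLM` API that builds an engine from a Hugging Face checkpoint and runs generation in-process. This is a minimal sketch, assuming that API is available in your TensorRT-LLM release; the model ID and sampling settings are placeholders.

```python
from tensorrt_llm import LLM, SamplingParams

# Placeholder checkpoint: substitute any Hugging Face model supported by
# TensorRT-LLM that fits on your GPU.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Summarize what a KV cache does."], params)

for output in outputs:
    print(output.outputs[0].text)
```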
### Kubernetes Integration Only
- Deploy: GPU Operator + Network Operator for infrastructure management
- Schedule: KAI Scheduler for GPU-aware scheduling (a pod spec sketch follows this list)
- Scale: Add Grove for gang scheduling if needed
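
To confirm GPU-aware scheduling is wired up end to end, one option is to submit a small GPU pod through the official `kubernetes` Python client and point it at KAI Scheduler. The scheduler name, image tag, and namespace below are assumptions; KAI Scheduler setups also typically assign pods to a queue via a label, so check its documentation for the exact key.

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-sched-smoke-test"),
    spec=client.V1PodSpec(
        # Assumption: KAI Scheduler is registered under this scheduler name.
        scheduler_name="kai-scheduler",
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                # Assumption: any CUDA base image from nvcr.io works for this check.
                image="nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(limits={"nvidia.com/gpu": "1"}),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

If the pod lands on a GPU node and logs `nvidia-smi` output, the operator-provisioned drivers and the scheduler are working together.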