# Getting Started
Choose the components that fit your needs. Here are common adoption paths:
## Full Stack Deployment

- **Infrastructure**: Deploy the GPU Operator and Network Operator to your Kubernetes cluster
- **Containers**: Pull optimized containers from nvcr.io
- **Optimize**: Use Model Optimizer with TensorRT or TensorRT-LLM
- **Plan**: Use AIConfigurator to estimate performance and plan your deployment topology
- **Deploy**: Use KAI Scheduler (add Grove for multi-node deployments) to deploy Triton or Dynamo
- **Tune**: Use AIPerf for benchmarking and Planner for runtime optimization
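The infrastructure and container steps above can be sketched as shell commands. The Helm chart names come from NVIDIA's public Helm repository; the container tag shown is a placeholder, so substitute a current release from nvcr.io:

```shell
# Add NVIDIA's Helm repository (hosts both operators)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator (manages drivers, device plugin, monitoring)
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Install the Network Operator (high-speed networking stack)
helm install --wait network-operator nvidia/network-operator \
  --namespace nvidia-network-operator --create-namespace

# Pull an optimized inference container from NGC
# (the tag below is illustrative -- pick a current one from nvcr.io)
docker pull nvcr.io/nvidia/tritonserver:25.01-py3
```

These commands target a live Kubernetes cluster and NGC registry, so run them from a workstation with `helm`, `kubectl`, and `docker` configured.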
## Traditional ML Inference Only

- **Optimize**: Use TensorRT to optimize your models
- **Serve**: Deploy with Triton Inference Server
- **Optional**: Add DALI for GPU-accelerated preprocessing
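To connect the optimize and serve steps: a TensorRT engine (e.g., built with `trtexec --onnx=model.onnx --saveEngine=model.plan --fp16`) is placed in a Triton model repository as `model_repository/<model>/1/model.plan`, alongside a `config.pbtxt` describing its interface. A minimal sketch, with the model name and tensor shapes as illustrative assumptions for a ResNet-50 classifier:

```
name: "resnet50"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```

Triton then serves the repository with `tritonserver --model-repository=/path/to/model_repository`.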
## GenAI/LLM Inference Only

- **Optimize**: Use TensorRT-LLM to optimize your LLM
- **Serve**: Deploy with Dynamo
- **Scale**: Add KV Block Manager, NIXL, and Router for distributed inference
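The optimize step with TensorRT-LLM typically has two stages: convert the source model weights into a TensorRT-LLM checkpoint, then compile an engine from it. A minimal sketch; the model directory and script location are illustrative (per-model conversion scripts ship under `examples/` in the TensorRT-LLM repository):

```shell
# Convert source weights into a TensorRT-LLM checkpoint
# (script path and model directory are placeholders)
python convert_checkpoint.py \
  --model_dir ./my-llm \
  --output_dir ./ckpt \
  --dtype float16

# Compile an optimized inference engine from the checkpoint
trtllm-build --checkpoint_dir ./ckpt --output_dir ./engine
```

The resulting engine directory is what you point your serving layer at for deployment.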
## Kubernetes Integration Only

- **Deploy**: GPU Operator + Network Operator for infrastructure management
- **Schedule**: KAI Scheduler for GPU-aware scheduling
- **Scale**: Add Grove for gang scheduling if needed
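Once KAI Scheduler is installed, a workload opts into GPU-aware scheduling by setting `schedulerName` on its pod spec and referencing a scheduling queue. A minimal sketch; the queue label key, queue name, and container image are assumptions that depend on your KAI Scheduler version and cluster setup:

```
apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload
  labels:
    # Queue label read by KAI Scheduler; exact key and queue name
    # depend on your installation (assumption, verify against your setup)
    kai.scheduler/queue: default
spec:
  schedulerName: kai-scheduler   # route this pod through KAI Scheduler
  containers:
    - name: worker
      image: nvcr.io/nvidia/tritonserver:25.01-py3   # illustrative tag
      resources:
        limits:
          nvidia.com/gpu: 1      # one GPU, exposed by the device plugin
```

The `nvidia.com/gpu` resource is advertised by the device plugin that the GPU Operator installs, which is why the infrastructure step comes first.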