Getting Started

Choose the components that fit your needs. Here are common adoption paths:

Full Stack Deployment

  1. Infrastructure: Deploy GPU Operator and Network Operator to your Kubernetes cluster
  2. Containers: Pull optimized containers from nvcr.io
  3. Optimize: Use Model Optimizer with TensorRT or TensorRT-LLM
  4. Plan: Use AIConfigurator to estimate performance and plan deployment topology
  5. Deploy: Use KAI Scheduler (add Grove for multi-node deployments) to deploy Triton or Dynamo; a quick endpoint smoke test is sketched after this list
  6. Tune: Use AIPerf for benchmarking and Planner for runtime optimization
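Before moving on to formal benchmarking with AIPerf, it helps to confirm the deployed service answers requests at all. The following is a minimal sketch, assuming the frontend exposes an OpenAI-compatible /v1/chat/completions route on localhost:8000; the endpoint URL and model name are placeholders for your deployment.

```python
# Hypothetical smoke test against an OpenAI-compatible inference endpoint.
import requests

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # assumption: adjust to your service
payload = {
    "model": "my-model",  # assumption: use the name your deployment registers
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
    "max_tokens": 32,
}

resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```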

Traditional ML Inference Only

  1. Optimize: Use TensorRT to optimize your models
  2. Serve: Deploy with Triton Inference Server (a client request sketch follows this list)
  3. Optional: Add DALI for GPU-accelerated preprocessing
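To illustrate step 2, here is a minimal client-side sketch that sends an inference request to a running Triton server over HTTP using the tritonclient package. The model name, input/output tensor names, and shape are assumptions and must match your model's config.pbtxt.

```python
# Hypothetical client request to a Triton server listening on localhost:8000.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Input name, shape, and datatype are assumptions; match them to your model config.
batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

# "resnet50_trt" is a placeholder for a TensorRT-optimized model in your repository.
response = client.infer(model_name="resnet50_trt", inputs=[infer_input])
print(response.as_numpy("output__0").shape)  # "output__0" is also an assumption
```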

GenAI/LLM Inference Only

  1. Optimize: Use TensorRT-LLM to build an optimized engine for your LLM (see the sketch after this list)
  2. Serve: Deploy with Dynamo
  3. Scale: Add KV Block Manager, NIXL, and Router for distributed inference
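For step 1, TensorRT-LLM's high-level Python LLM API builds an optimized engine and runs generation locally, which is a convenient sanity check before serving the model with Dynamo. The sketch below is illustrative; the model identifier is a placeholder for the checkpoint you intend to deploy.

```python
# Offline sanity check with the TensorRT-LLM LLM API before deploying behind Dynamo.
from tensorrt_llm import LLM, SamplingParams

# Placeholder model: substitute a supported checkpoint or a prebuilt engine path.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(max_tokens=64, temperature=0.7)
outputs = llm.generate(["Summarize what a KV cache does in one sentence."], params)
for output in outputs:
    print(output.outputs[0].text)
```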

Kubernetes Integration Only

  1. Deploy: GPU Operator + Network Operator for infrastructure management (a verification sketch follows this list)
  2. Schedule: KAI Scheduler for GPU-aware scheduling
  3. Scale: Add Grove for gang scheduling if needed
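Once the operators are installed, one quick check that the GPU Operator is advertising GPUs to the scheduler is to list each node's allocatable nvidia.com/gpu resources. This sketch uses the Kubernetes Python client and assumes your current kubeconfig context points at the cluster.

```python
# Verify that the GPU Operator has exposed GPUs as a schedulable resource.
from kubernetes import client, config

config.load_kube_config()  # uses the active kubeconfig context
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")
```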