Getting Started#

Choose the components that fit your needs. Here are common adoption paths:

Full Stack Deployment#

  1. Infrastructure: Deploy GPU Operator and Network Operator to your Kubernetes cluster

  2. Containers: Pull optimized containers from nvcr.io

  3. Optimize: Use Model Optimizer with TensorRT or TensorRT-LLM

  4. Plan: Use AIConfigurator to estimate performance and plan deployment topology

  5. Deploy: Use KAI Scheduler (add Grove for multi-node deployments) to deploy Triton or Dynamo

  6. Tune: Use AIPerf for benchmarking and Planner for runtime optimization
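Step 4 (Plan) is, at its core, capacity math. A minimal sketch of the kind of back-of-envelope estimate involved, assuming decode is memory-bandwidth-bound; the numbers below are illustrative assumptions, not AIConfigurator output:

```python
# Back-of-envelope decode throughput: each generated token must stream the
# full model weights from GPU memory, so per-sequence decode speed is roughly
# memory_bandwidth / weight_bytes. All figures here are illustrative.

def decode_tokens_per_s(params_billion: float, bytes_per_param: float,
                        hbm_bandwidth_gbs: float) -> float:
    weight_gb = params_billion * bytes_per_param  # total weight bytes in GB
    return hbm_bandwidth_gbs / weight_gb

# e.g. a 70B-parameter model in FP8 (1 byte/param) on a GPU with ~3350 GB/s:
print(round(decode_tokens_per_s(70, 1.0, 3350), 1))  # ~47.9 tokens/s per sequence
```

A tool like AIConfigurator refines this with batching, parallelism topology, and KV-cache effects, but the bandwidth bound is the right first-order sanity check before choosing a deployment shape.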

Traditional ML Inference Only#

  1. Optimize: Use TensorRT to optimize your models

  2. Serve: Deploy with Triton Inference Server

  3. Optional: Add DALI for GPU-accelerated preprocessing
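Once Triton is serving your optimized model, clients talk to it over the KServe v2 inference protocol. A minimal sketch of such a request built with only the standard library; the model name, input name, shape, and endpoint are placeholders for your deployment:

```python
import json
import urllib.request

# KServe v2 inference request body: one FP32 input tensor of shape [1, 4].
# "my_model" and "INPUT0" are placeholders; use your model's actual names.
payload = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 4], "datatype": "FP32",
         "data": [0.1, 0.2, 0.3, 0.4]}
    ]
}
req = urllib.request.Request(
    "http://localhost:8000/v2/models/my_model/infer",  # Triton's default HTTP port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:            # uncomment with a live server
#     print(json.load(resp)["outputs"])
```

In practice you would use the `tritonclient` Python package rather than raw HTTP, but the wire format above is what every client ultimately sends.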

GenAI/LLM Inference Only#

  1. Optimize: Use TensorRT-LLM to optimize your LLM

  2. Serve: Deploy with Dynamo

  3. Scale: Add KV Block Manager, NIXL, and Router for distributed inference
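The Router's job in step 3 is to send requests whose prompts share a prefix to the worker that already holds that prefix's KV blocks, so the prefill can be skipped. A toy sketch of the idea; the hashing scheme and worker model are illustrative assumptions, not Dynamo's implementation:

```python
import hashlib

BLOCK = 16  # toy KV block size, in tokens

def block_hashes(tokens):
    """Chained hashes of each full token block, as in prefix caching:
    a block's hash depends on all blocks before it."""
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
        h.update(repr(tokens[i:i + BLOCK]).encode())
        hashes.append(h.copy().hexdigest())
    return hashes

class ToyRouter:
    def __init__(self, workers):
        self.cached = {w: set() for w in workers}  # block hashes each worker holds

    def route(self, tokens):
        hashes = block_hashes(tokens)
        # Pick the worker with the most cached blocks from this prompt.
        best = max(self.cached,
                   key=lambda w: sum(h in self.cached[w] for h in hashes))
        self.cached[best].update(hashes)  # that worker now caches these blocks
        return best

router = ToyRouter(["worker-0", "worker-1"])
first = router.route(list(range(64)))          # cold start: ties broken arbitrarily
again = router.route(list(range(64)) + [99])   # shares a 64-token prefix
print(first == again)  # True: routed to the worker with the warm cache
```

A production router also weighs load and KV memory pressure against cache overlap; this sketch only shows the affinity signal.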

Kubernetes Integration Only#

  1. Deploy: GPU Operator + Network Operator for infrastructure management

  2. Schedule: KAI Scheduler for GPU-aware scheduling

  3. Scale: Add Grove for gang scheduling if needed
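Once KAI Scheduler is installed, workloads opt in per pod. A minimal sketch of what that looks like; the queue name and image tag are placeholders, and the exact label key may vary by release, so verify against the KAI Scheduler documentation:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
  labels:
    kai.scheduler/queue: team-a        # queue name is a placeholder
spec:
  schedulerName: kai-scheduler         # hand this pod to KAI, not the default scheduler
  containers:
    - name: main
      image: nvcr.io/nvidia/tritonserver:<yy.mm>-py3  # substitute a current tag
      resources:
        limits:
          nvidia.com/gpu: 1            # exposed by the GPU Operator's device plugin
```

The `nvidia.com/gpu` resource only exists once the GPU Operator from step 1 has deployed its device plugin, which is why the infrastructure step comes first.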