NVIDIA NIM for Large Language Models
- Introduction
- Release Notes
- Getting Started
- Deployment Guide
- Air Gap Deployment
- Multi-Node Deployment
- Deploying with Helm
- Tutorials
- Configuring a NIM
- Model Profiles
  - Overview
  - Benchmarking
- Models
  - Supported Models
    - GPUs
  - Optimized Models
    - Code Llama 13B Instruct
    - Code Llama 34B Instruct
    - Code Llama 70B Instruct
    - DeepSeek R1
    - Gemma 2 2B
    - Gemma 2 9B
    - (Meta) Llama 2 7B Chat
    - (Meta) Llama 2 13B Chat
    - (Meta) Llama 2 70B Chat
    - Llama 3 SQLCoder 8B
    - Llama 3 Swallow 70B Instruct V0.1
    - Llama 3 Taiwan 70B Instruct
    - Llama 3.1 8B Base
    - Llama 3.1 8B Instruct
    - Llama 3.1 70B Instruct
    - Llama 3.1 405B Instruct
    - Llama 3.1 Nemotron 70B Instruct
    - Llama 3.1 Swallow 8B Instruct v0.1
    - Llama 3.1 Swallow 70B Instruct v0.1
    - Llama 3.3 70B Instruct
    - Meta Llama 3 8B Instruct
    - Meta Llama 3 70B Instruct
    - Mistral 7B Instruct V0.3
    - Mistral NeMo 12B Instruct RTX
    - Mistral NeMo Minitron 8B 8K Instruct
    - Mistral NeMo 12B Instruct
    - Mixtral 8x7B Instruct V0.1
    - Mixtral 8x22B Instruct V0.1
    - Nemotron 4 340B Instruct
    - Nemotron 4 340B Instruct 128K
    - Nemotron 4 340B Reward
    - Phi 3 Mini 4K Instruct
    - Phind Codellama 34B V2 Instruct
    - StarCoderBase 15.5B
- Examples with system role
- API Reference
- Function Calling
- Using Reward Models
- Llama Stack API (Experimental)
- Utilities
- Fine-tuned model support
- Observability
- Structured Generation
- Custom Guided Decoding Backend (Experimental)
- Parameter-Efficient Fine-Tuning
- KV Cache Reuse (a.k.a. prefix caching)
- Acknowledgements
- EULA