NVIDIA NIM for Large Language Models

NVIDIA NIM for LLMs

Introduction
- High Performance Features
- Applications
- Architecture
  - NIM Deployment Lifecycle
- The NVIDIA Developer Program
Release Notes
- Release 1.4.0
- Release 1.3.0
- Release 1.2.3
- Release 1.2.1
- Release 1.2.0
- Release 1.1.2
- Release 1.1.1
  - Summary
  - Known Issues
- Release 1.1.0
- Release 1.0
Getting Started
- Prerequisites
  - Setup
    - Installing WSL2 for Windows
- Launch NVIDIA NIM for LLMs
  - Option 1: From API Catalog
  - Option 2: From NGC
- Docker Run Parameters
- Run Inference
- Stopping the container
- Kubernetes Installation
- Serving models from local assets
Deployment Guide
- Deploying on other platforms
Air Gap Deployment
- Air Gap Deployment (offline cache route)
- Air Gap Deployment (local model directory route)
Multi-node Deployment
- Multi-Node Deployment with Kubernetes
Deploying with Helm
- Prerequisites
- Configuring Helm
  - Minimal example
- Storage
- Multi-node Models
  - LeaderWorkerSet
  - MPI Job
- Enabling OpenTelemetry Tracing and Metrics
- Launching NIM in Kubernetes
- Running inference
- Troubleshooting FAQ
- Additional information
- Parameters
Tutorials
- Playbooks
- Platform Deployment Guides
Configuring a NIM
- GPU Selection
  - How many GPUs do I need?
- Shared memory flag
- Environment Variables
- Volumes
Model Profiles
- Profile Selection
- Automatic Profile Selection
- Profile Details
  - Optimized Profiles vs. Local Build Profiles
  - Optimization Targets
- Quantization
Benchmarking
Models
Support Matrix
- Hardware
  - CPU
  - GPU
- Software
- GPUs
- Supported Models
Examples with system role
- Message roles
  - OpenAI Chat Completion Request with Single User Question
  - OpenAI Chat Completion Request with Additional Context and Response
API Reference
- OpenAPI Schema
- Experimental APIs
  - Experimental support for Llama Stack (LS) API
- Reference
Function Calling
- Supported Models
- Parameters
  - tool_choice options
- Example Workflows
Using Reward Models
Llama Stack API (Experimental)
- Installation
- Common Components
- Basic Usage
- Streaming Responses
- Tool Calling
Utilities
- List available model profiles
  - Example
- Download model profiles to NIM cache
  - Example
- Create model store
  - Example
- Check NIM cache
  - Example
- Set cache environment variables
  - Example
Fine-tuned model support
- Usage
Observability
- Prometheus
- Grafana
Structured Generation
- JSON Schema
- Regular Expressions
- Choices
- Context-free Grammar
Parameter-Efficient Fine-Tuning
- LoRA Setup Overview
- LoRA Adapters
  - NeMo Format
  - Hugging Face Transformers Format
- LoRA Model Directory Structure
- Obtaining LoRA models
- PEFT Environment Variables
  - PEFT Caching and Dynamic Mixed-batch LoRA (Multi-LoRA)
    - PEFT Cache memory requirements
- Launch NIM for LLMs with PEFT
  - Run Multi-LoRA Inference
KV Cache Reuse (a.k.a. prefix caching)
- How to use
- When to use
Acknowledgements
EULA