NVIDIA NIM for Large Language Models

Introduction
- High Performance Features
- Applications
- Architecture
  - NIM Deployment Lifecycle
Release Notes
- Release 1.1.2
- Release 1.1.1
  - Summary
  - Known Issues
- Release 1.1.0
- Release 1.0
Getting Started
- Prerequisites
  - Setup
    - Installing WSL2 for Windows
- Launch NVIDIA NIM for LLMs
  - Option 1: From API Catalog
  - Option 2: From NGC
- Docker Run Parameters
- Run Inference
- Stopping the container
- Kubernetes Installation
- Serving models from local assets
Tutorials
- Playbooks
- Deployment Guides
Multi-node Deployment
- Multi-Node Deployment with Kubernetes
Deploying with Helm
- Prerequisites
- Configuring Helm
  - Minimal example
- Storage
- Multi-node Models
  - LeaderWorkerSet
  - MPI Job
- Launching NIM in Kubernetes
- Running inference
- Troubleshooting FAQ
- Additional information
- Parameters
Configuring a NIM
- GPU Selection
- Shared memory flag
- Environment Variables
- Volumes
Model Profiles
- Profile Selection
- Automatic Profile Selection
- How Profiles are Created
  - Optimization Targets
  - Quantization
Benchmarking
Models
Support Matrix
- Hardware
- Software
- GPUs
- General Guidelines
  - GPUs
  - Disk Space
- Supported Models
API Reference
- OpenAPI Schema
- Experimental APIs
  - Experimental support of Llama Stack API
Function Calling
- Supported Models
- Parameters
  - tool_choice options
- Example Workflows
Llama Stack API (Experimental)
- Installation
- Basic Usage
- Streaming Responses
- Tool Calling
Utilities
- List available model profiles
  - Example
- Download model profiles to NIM cache
  - Example
- Create model store
  - Example
- Check NIM cache
  - Example
- Set cache environment variables
  - Example
Observability
- Prometheus
- Grafana
Structured Generation
- JSON Schema
- Regular Expressions
- Choices
- Context-free Grammar
Parameter-Efficient Fine-Tuning
- LoRA Setup Overview
- LoRA Adapters
  - NeMo Format
  - Hugging Face Transformers Format
- LoRA Model Directory Structure
- Obtaining LoRA models
- PEFT Environment Variables
  - PEFT Caching and Dynamic Mixed-batch LoRA (Multi-LoRA)
    - PEFT Cache memory requirements
- Launch NIM for LLMs with PEFT
  - Run Multi-LoRA Inference
Acknowledgements
- accelerate
- aiohttp
- aiosignal
- annotated-types
- anyio
- apscheduler
- async-timeout
- attrs
- build
- certifi
- charset-normalizer
- click
- cmake
- coloredlogs
- datasets
- dill
- diskcache
- dnspython
- einops
- email_validator
- exceptiongroup
- fastapi
- fastapi-cli
- filelock
- flash-attn
- frozenlist
- fsspec
- h11
- h5py
- httpcore
- httptools
- httpx
- huggingface-hub
- humanfriendly
- idna
- importlib_metadata
- interegular
- jinja2
- joblib
- jsonschema
- jsonschema-specifications
- lark
- llvmlite
- lm-format-enforcer
- markdown-it-py
- markupsafe
- mdurl
- ml-dtypes
- mpi4py
- mpmath
- msgpack
- multidict
- multiprocess
- nest-asyncio
- networkx
- ninja
- numba
- numpy
- onnx
- optimum
- orjson
- outlines
- packaging
- pandas
- polygraphy
- prometheus_client
- protobuf
- psutil
- pulp
- py-cpuinfo
- pyarrow
- pyarrow-hotfix
- pydantic
- pydantic_core
- pygments
- pynvml
- python-dateutil
- python-dotenv
- python-multipart
- pytz
- pyyaml
- ray
- referencing
- regex
- requests
- rich
- rpds-py
- scipy
- shellingham
- six
- sniffio
- starlette
- strenum
- sympy
- tensorstore
- tiktoken
- tomli
- torch
- tqdm
- transformers
- typer
- typing_extensions
- tzdata
- tzlocal
- ujson
- urllib3
- uvicorn
- uvloop
- vllm
- watchfiles
- websockets
- xformers
- xxhash
- yarl
- zipp
EULA