Introduction

NVIDIA NIM is a set of easy-to-use microservices designed to accelerate the deployment of generative AI models across the cloud, data center, and workstations. NIMs are categorized by model family and on a per-model basis. For example, NVIDIA NIM for large language models (LLMs) brings the power of state-of-the-art LLMs to enterprise applications, providing unmatched natural language processing and understanding capabilities.

NIM makes it easy for IT and DevOps teams to self-host large language models (LLMs) in their own managed environments while still providing developers with industry-standard APIs that allow them to build powerful copilots, chatbots, and AI assistants that can transform their business. Leveraging NVIDIA’s cutting-edge GPU acceleration and scalable deployment, NIM offers the fastest path to inference with unparalleled performance.

NIM abstracts away model inference internals such as the execution engine and runtime operations. It is also the most performant option available, whether the backend is TRT-LLM, vLLM, or another engine. NIM offers the following high-performance features:

Scalable Deployment: performant and able to scale easily and seamlessly from a few users to millions.

Advanced Language Model support: pre-generated optimized engines for a diverse range of cutting-edge LLM architectures.

Flexible Integration: easily incorporate the microservice into existing workflows and applications. Developers are provided with an OpenAI API-compatible programming model and custom NVIDIA extensions for additional functionality (see the request sketch after this list).

Enterprise-Grade Security: emphasizes security by using safetensors, continuously monitoring and patching CVEs in our stack, and conducting internal penetration tests.
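
As an illustration of the OpenAI-compatible programming model, the following is a minimal sketch that sends a chat completion request to a locally running NIM with curl. The host, port (8000), and the meta/llama3-8b-instruct model name are assumptions for this example, not values taken from this page.

```bash
# Minimal sketch: assumes a NIM is already serving meta/llama3-8b-instruct on
# localhost port 8000; the endpoint path follows the OpenAI API convention.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta/llama3-8b-instruct",
        "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
        "max_tokens": 64
      }'
```

Because the API surface is OpenAI-compatible, existing OpenAI client libraries can typically be pointed at the NIM endpoint by overriding the base URL.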

NIM can power a wide range of use cases, including:

Chatbots & Virtual Assistants: Empower bots with human-like language understanding and responsiveness.

Content Generation & Summarization: Generate high-quality content or distill lengthy articles into concise summaries with ease.

Sentiment Analysis: Understand user sentiments in real-time, driving better business decisions.

Language Translation: Break language barriers with efficient and accurate translation services.

And many more… The potential applications of NIM are vast, spanning various industries and use cases.

NIMs are packaged as container images on a per-model or per-model-family basis. Each NIM is its own Docker container with a model, such as meta/llama3-8b-instruct. These containers include a runtime that runs on any NVIDIA GPU with sufficient GPU memory, but some model/GPU combinations are optimized. NIM automatically downloads the model from NGC, leveraging a local filesystem cache if one is available. Each NIM is built from a common base, so once a NIM has been downloaded, downloading additional NIMs is extremely fast.
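
The sketch below shows what such a deployment might look like with docker run. The image tag, the NGC_API_KEY environment variable, the published port, and the cache mount path are assumptions for illustration rather than values taken from this page.

```bash
# Illustrative sketch only: image tag, cache path, and environment variable
# name are assumptions, not taken from this page.
export NGC_API_KEY="<paste your NGC API key here>"

# Mounting a host directory as the cache lets later runs (and other NIMs)
# reuse previously downloaded model files instead of re-fetching from NGC.
docker run --rm --gpus all \
  -e NGC_API_KEY \
  -v "$HOME/.cache/nim:/opt/nim/.cache" \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama3-8b-instruct:latest
```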


When a NIM is first deployed, it inspects the local hardware configuration and the optimized models available in the model registry, then automatically chooses the best version of the model for the available hardware. For a subset of NVIDIA GPUs (see the Support Matrix), NIM downloads the optimized TRT engine and runs inference using the TRT-LLM library. For all other NVIDIA GPUs, NIM downloads a non-optimized model and runs it using the vLLM library.

NIMs are distributed as NGC container images through the NVIDIA NGC Catalog. A security scan report is available for each container within the NGC catalog, providing the security rating of that image, a breakdown of CVE severity by package, and links to detailed information on the CVEs.

NIM Deployment Lifecycle

flowchart TD
    A[User runs NIM<br/>`docker run ...`] --> B[Docker container downloads]
    B --> C{Is model on<br/>local filesystem?}
    C -->|No| D[Download model from NGC]
    C -->|Yes| E[Run the model]
    D --> E
    E --> F[Start OpenAI compliant<br/>Completions REST API server]
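
Once the container logs show the server is ready, a quick check against the standard OpenAI model-listing route can confirm that the lifecycle completed and which model is being served. The host and port below are assumptions consistent with the sketches above.

```bash
# Assumed host/port; the /v1/models route is the standard OpenAI "list models"
# endpoint exposed by the OpenAI-compliant REST API server.
curl http://localhost:8000/v1/models
```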
