Inference Concepts

The NeMo platform offers two microservices for adding inference capabilities to your Kubernetes cluster: NIM Proxy and NeMo Deployment Management.

NeMo Deployment Management Microservice

The NeMo Deployment Management microservice provides APIs for managing the lifecycle of NIM for LLMs deployments within a Kubernetes cluster, covering configuration, deployment, and maintenance of models. Its key functionalities include:

  • Model Deployment: Deploys NIM for LLMs from configuration details and deployment requests that you submit to the API, including model details, container images, resource requirements, and environment variables (see the request sketch after this list).

  • Configuration Management: Lets you create and manage deployment configurations that can be reused to deploy multiple models or multiple variants of the same model.

  • Integration with External Endpoints: Supports model deployments from external endpoints such as OpenAI ChatGPT and NVIDIA Integrate, enabling the use of third-party models within the NeMo platform.

  • Model Deployment Management: Provides APIs to retrieve deployment metadata, update configurations, and delete deployed models, ensuring comprehensive control over the deployment lifecycle.
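For illustration, a minimal deployment request might look like the following Python sketch. The service URL, endpoint path, payload fields, and model and image names are assumptions chosen for this example; check the Deployment Management API reference for the exact schema in your release.

```python
import os

import requests

# Base URL of the NeMo Deployment Management service (assumed value for illustration).
DEPLOYMENT_BASE_URL = os.environ.get(
    "DEPLOYMENT_BASE_URL", "http://nemo-deployment-management:8000"
)

# Deployment request covering the pieces described above: model details,
# container image, resource requirements, and environment variables.
# Field names and values are illustrative, not an exact schema.
payload = {
    "name": "llama-3.1-8b-instruct",
    "namespace": "default",
    "config": {
        "model": "meta/llama-3.1-8b-instruct",
        "nim_deployment": {
            "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
            "image_tag": "1.3.3",
            "gpu": 1,
            "pvc_size": "25Gi",
            "additional_envs": {"NIM_LOG_LEVEL": "INFO"},
        },
    },
}

# Submit the request; the microservice then creates and manages the NIM pods.
response = requests.post(
    f"{DEPLOYMENT_BASE_URL}/v1/deployment/model-deployments",
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```

In a typical REST layout, retrieving deployment metadata, updating the configuration, or deleting the deployment would use GET, PATCH, or DELETE requests against the same resource path.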

NIM Proxy Microservice

The NIM Proxy microservice acts as a central gateway to all NIM for LLMs deployments, providing a unified endpoint through which every deployed model can be accessed for inference. The key functionalities of the NIM Proxy microservice include:

  • Centralized Model Access: Exposes all deployed NIM for LLMs through a single NeMo host endpoint, simplifying the process of model discovery and inference.

  • Auto-Detection of Models: Automatically detects and lists models that are uploaded by the NeMo Customizer microservice, deployed through the NeMo Deployment Management microservice, or manually labeled within the NIM for LLMs specification.

  • Simplified Inference Requests: Lets you send inference requests to a single unified endpoint, streamlining interaction with multiple models (see the request sketch after this list).
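As an illustration, the sketch below lists the models the proxy has detected and then sends a chat completion request through the same host. The service URL and model name are placeholders, and the OpenAI-compatible paths shown (/v1/models, /v1/chat/completions) are assumed to match your deployment.

```python
import os

import requests

# Unified NIM Proxy endpoint (assumed service name and port for illustration).
NIM_PROXY_BASE_URL = os.environ.get("NIM_PROXY_BASE_URL", "http://nemo-nim-proxy:8000")

# Discover the models the proxy has auto-detected.
models = requests.get(f"{NIM_PROXY_BASE_URL}/v1/models", timeout=30)
models.raise_for_status()
print([m["id"] for m in models.json().get("data", [])])

# Send an inference request to one of the listed models through the same endpoint.
completion = requests.post(
    f"{NIM_PROXY_BASE_URL}/v1/chat/completions",
    json={
        "model": "meta/llama-3.1-8b-instruct",  # placeholder: use an ID from /v1/models
        "messages": [{"role": "user", "content": "Summarize what NIM Proxy does."}],
        "max_tokens": 128,
    },
    timeout=60,
)
completion.raise_for_status()
print(completion.json()["choices"][0]["message"]["content"])
```

Because the request format is OpenAI-compatible, existing client code typically needs only the base URL and model name changed to target models behind the proxy.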

Benefits of Using NIM Proxy and NeMo Deployment Management Microservices

  • Simplified Operations: Centralized endpoints and automated model discovery reduce complexity and streamline operations.

  • Scalability: Efficient management of deployment configurations and resources supports scalable LLM operations.

  • Flexibility: Integration with external endpoints and customizable deployment configurations provide flexibility to meet diverse LLM needs.

In summary, the NIM Proxy and NeMo Deployment Management microservices collectively provide tools for managing and utilizing LLMs within Kubernetes environments, enhancing operational efficiency, scalability, and flexibility for LLM inference tasks.