Deploy LLMs as NIM Microservices#

You can deploy LLMs as NIM microservices through the NeMo Deployment Management API. This tutorial shows how to deploy the meta/llama-3.1-8b-instruct model. In the subsequent tutorials, you'll use the deployed NIM microservice for fine-tuning, evaluation, and inference tasks with the NeMo microservices platform.

Prerequisites#

Before you begin, complete the following prerequisites:

Deploy a NIM Microservice#

Start deploying the meta/llama-3.1-8b-instruct NIM microservice. You’ll use this NIM microservice for evaluation and inference tasks in the subsequent tutorials.

Note

NIM image versions 1.10 and later use "NIM_GUIDED_DECODING_BACKEND": "outlines" instead of "fast_outlines".

  1. To deploy the NIM, call the NeMo Deployment Management API with the Python SDK or an equivalent curl request, as follows:

    from nemo_microservices import NeMoMicroservices

    # base_url targets the NeMo platform API; inference_base_url targets the NIM Proxy
    client = NeMoMicroservices(
        base_url="http://nemo.test",
        inference_base_url="http://nim.test",
    )
    
    deployment = client.deployment.model_deployments.create(
        name="llama-3.1-8b-instruct",
        namespace="meta",
        config={
            "model": "meta/llama-3.1-8b-instruct",
            "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {
                    "NIM_GUIDED_DECODING_BACKEND": "outlines"
                }
            }
        }
    )
    print(deployment)
    
    curl --location "http://nemo.test/v1/deployment/model-deployments" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "name": "llama-3.1-8b-instruct",
          "namespace": "meta",
          "config": {
             "model": "meta/llama-3.1-8b-instruct",
             "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "outlines"
                }
             }
          }
       }'
    

    The response indicates that the deployment is pending, as shown in the following example output. It might take about 10 minutes for the NIM to fully deploy.

    Example Output
    {
       "async_enabled": false,
       "config": {
          "model": "meta/llama-3.1-8b-instruct",
          "nim_deployment": {
             "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "outlines"
             },
             "gpu": 1,
             "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
             "image_tag": "1.8"
          }
       },
       "created_at": "2025-04-01T21:38:59.494256552Z",
       "deployed": false,
       "name": "llama-3.1-8b-instruct",
       "namespace": "meta",
       "status_details": {
          "description": "Model deployment created",
          "status": "pending"
       },
       "url": ""
    }
    
  2. Check the NIM status until it shows ready. Use the following request to verify the status. It relies on the deployment object from the previous step, so run it in the same session. A polling sketch follows the example output.

    # Using the deployment object from the previous step
    deployment_status = client.deployment.model_deployments.retrieve(
        namespace=deployment.namespace,
        deployment_name=deployment.name,
    )
    print(deployment_status)
    
    curl --location "http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" | jq
    
    Example Output
    {
       "async_enabled": false,
       "config": {
          "model": "meta/llama-3.1-8b-instruct",
          "nim_deployment": {
             "additional_envs": {
                "NIM_GUIDED_DECODING_BACKEND": "outlines"
             },
             "gpu": 1,
             "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
             "image_tag": "1.8"
          }
       },
       "created_at": "2025-04-01T20:49:20.467335766Z",
       "deployed": true,
       "name": "llama-3.1-8b-instruct",
       "namespace": "meta",
       "status_details": {
          "description": "Deployment \"modeldeployment-meta-llama-3-1-8b-instruct\" successfully rolled out.",
          "status": "ready"
       },
       "url": ""
    }
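
    Instead of re-running the request by hand, you can poll until the status becomes ready. The following is a minimal sketch that reuses the client and deployment object from step 1. It assumes the retrieved object exposes the status_details.status field shown in the example output; the 30-second interval and 20-minute timeout are arbitrary example values.

    import time

    # Poll the deployment until it reports "ready" or the timeout elapses.
    deadline = time.time() + 20 * 60  # example timeout: 20 minutes
    while True:
        deployment_status = client.deployment.model_deployments.retrieve(
            namespace=deployment.namespace,
            deployment_name=deployment.name,
        )
        if deployment_status.status_details.status == "ready":
            print("NIM deployment is ready")
            break
        if time.time() > deadline:
            raise TimeoutError("NIM deployment did not become ready in time")
        time.sleep(30)  # example polling interval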
    
  3. When deployment is complete, you can interact with it as you would with a standard NIM, through the NIM Proxy microservice endpoint. For example, you can use the chat completions endpoint as follows.

Create a NeMoMicroservices client instance that uses the NIM Proxy microservice base URL as its inference endpoint, then send a chat completion request as follows.

import os
from nemo_microservices import NeMoMicroservices

# inference_base_url points at the NIM Proxy microservice
client = NeMoMicroservices(
    base_url=os.environ["NEMO_BASE_URL"],
    inference_base_url=os.environ["NIM_PROXY_BASE_URL"],
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "what can you do?"}],
    temperature=0.7,
    max_tokens=100,
    stream=False,
)
# With stream=False the call returns a single completion object
print(response.choices[0].message.content)

Make a POST request to the /v1/chat/completions endpoint.

For more details on the request body, refer to the NIM for LLMs API reference and look up the v1/chat/completions endpoint. The NIM Proxy microservice routes your requests to the NIM for LLMs microservice.

curl -X POST \
  "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {
            "role":"user",
            "content":"Hello! How are you?"
        }
    ],
    "max_tokens": 32
  }' | jq
Example Response
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 12,
    "total_tokens": 27
  }
}
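
The chat endpoint also supports token streaming through its OpenAI-compatible API. The following is a minimal sketch; it assumes the same environment variables as above and that streamed chunks follow the OpenAI delta format.

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url=os.environ["NEMO_BASE_URL"],
    inference_base_url=os.environ["NIM_PROXY_BASE_URL"],
)

# With stream=True the call yields incremental chunks instead of a single response.
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "what can you do?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text.
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()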

You can deploy multiple NIM microservices in the same manner, as long as you have enough GPU resources. You can access all deployed NIM microservices through the NIM Proxy microservice under the nim.test host name. If you don't have enough GPUs, you might need to remove a NIM first. For example, if you have only one GPU available for inference, remove the deployed NIM before deploying another. To delete a deployed NIM microservice, follow the instructions at Delete Deployed Models.
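
If you need to free the GPU, you can remove a deployment through the same Deployment Management API. The following is a minimal sketch; it assumes the SDK exposes a delete call alongside the create and retrieve calls used above (the REST equivalent would be a DELETE request to the same path used when checking the status).

# Hypothetical delete call; refer to Delete Deployed Models for the documented procedure.
client.deployment.model_deployments.delete(
    namespace="meta",
    deployment_name="llama-3.1-8b-instruct",
)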