Deploy LLMs as NIM Microservices#

You can deploy LLMs as NIM microservices through the NeMo Deployment Management API. This tutorial shows how to deploy the meta/llama-3.1-8b-instruct model. In the subsequent tutorials, you'll use the deployed NIM microservice for fine-tuning, evaluation, and inference tasks with the NeMo microservices platform.

Prerequisites#

Before you begin, complete the following prerequisites:

Deploy a NIM Microservice#

Start deploying the meta/llama-3.1-8b-instruct NIM microservice. You’ll use this NIM microservice for evaluation and inference tasks in the subsequent tutorials.

Note

NIM image versions 1.10 and later use "NIM_GUIDED_DECODING_BACKEND": "outlines" instead of "fast_outlines".

  1. To deploy the NIM, call the NeMo Deployment Management API with the Python SDK or an equivalent curl request, as follows:

    from nemo_microservices import NeMoMicroservices

    # base_url targets the NeMo platform API; inference_base_url targets the NIM Proxy
    client = NeMoMicroservices(
        base_url="http://nemo.test",
        inference_base_url="http://nim.test",
    )
    
    deployment = client.deployment.model_deployments.create(
        name="llama-3.1-8b-instruct",
        namespace="meta",
        config={
            "model": "meta/llama-3.1-8b-instruct",
            "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {
                    "NIM_GUIDED_DECODING_BACKEND": "outlines"
                }
            }
        }
    )
    print(deployment)
    
    curl --location "http://nemo.test/v1/deployment/model-deployments" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "name": "llama-3.1-8b-instruct",
          "namespace": "meta",
          "config": {
             "model": "meta/llama-3.1-8b-instruct",
             "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size": "25Gi",
                "gpu": 1,
                "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "outlines"
                }
             }
          }
       }'
    

    The response indicates that the deployment is pending, as shown in the following example output. It might take about 10 minutes for the NIM to fully deploy.

    Example Output
    {
       "async_enabled": false,
       "config": {
          "model": "meta/llama-3.1-8b-instruct",
          "nim_deployment": {
             "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "outlines"
             },
             "gpu": 1,
             "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
             "image_tag": "1.8"
          }
       },
       "created_at": "2025-04-01T21:38:59.494256552Z",
       "deployed": false,
       "name": "llama-3.1-8b-instruct",
       "namespace": "meta",
       "status_details": {
          "description": "Model deployment created",
          "status": "pending"
       },
       "url": ""
    }
    
  2. Check the NIM status until it shows ready. Use the following request to verify the status. It relies on the deployment object from the previous step, so run it in the same session. A polling sketch follows the example output.

    # Using the deployment object from the previous step
    deployment_status = client.deployment.model_deployments.retrieve(
        namespace=deployment.namespace,
        deployment_name=deployment.name,
    )
    print(deployment_status)
    
    curl --location "http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" | jq
    
    Example Output
    {
       "async_enabled": false,
       "config": {
          "model": "meta/llama-3.1-8b-instruct",
          "nim_deployment": {
             "additional_envs": {
                "NIM_GUIDED_DECODING_BACKEND": "outlines"
             },
             "gpu": 1,
             "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
             "image_tag": "1.8"
          }
       },
       "created_at": "2025-04-01T20:49:20.467335766Z",
       "deployed": true,
       "name": "llama-3.1-8b-instruct",
       "namespace": "meta",
       "status_details": {
          "description": "Deployment \"modeldeployment-meta-llama-3-1-8b-instruct\" successfully rolled out.",
          "status": "ready"
       },
       "url": ""
    }
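
    Instead of re-running the request by hand, you can poll until the status becomes ready. The following is a minimal sketch that reuses the client and deployment object from step 1. It assumes the retrieved object exposes the status_details.status field shown in the example output; the 30-second interval and 20-minute timeout are arbitrary example values.

    import time

    # Poll the deployment until it reports "ready" or the timeout elapses.
    deadline = time.time() + 20 * 60  # example timeout: 20 minutes
    while True:
        deployment_status = client.deployment.model_deployments.retrieve(
            namespace=deployment.namespace,
            deployment_name=deployment.name,
        )
        if deployment_status.status_details.status == "ready":
            print("NIM deployment is ready")
            break
        if time.time() > deadline:
            raise TimeoutError("NIM deployment did not become ready in time")
        time.sleep(30)  # example polling interval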
    
  3. When deployment is complete, you can interact with it as you would with a standard NIM, through the NIM Proxy microservice endpoint. For example, you can use the chat completions endpoint as follows.

Create a NeMoMicroservices client instance that uses the NIM Proxy microservice base URL as its inference endpoint, then send a chat completion request as follows.

import os
from nemo_microservices import NeMoMicroservices

# inference_base_url points at the NIM Proxy microservice
client = NeMoMicroservices(
    base_url=os.environ["NEMO_BASE_URL"],
    inference_base_url=os.environ["NIM_PROXY_BASE_URL"],
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "what can you do?"}],
    temperature=0.7,
    max_tokens=100,
    stream=False,
)
# With stream=False the call returns a single completion object
print(response.choices[0].message.content)

Make a POST request to the /v1/chat/completions endpoint.

For more details on the request body, refer to the NIM for LLMs API reference and look up the v1/chat/completions endpoint. The NIM Proxy microservice routes your requests to the NIM for LLMs microservice.

curl -X POST \
  "${NIM_PROXY_BASE_URL}/v1/chat/completions" \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
        {
            "role":"user",
            "content":"Hello! How are you?"
        }
    ],
    "max_tokens": 32
  }' | jq
Example Response
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1677652288,
  "model": "meta/llama-3.1-8b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 12,
    "total_tokens": 27
  }
}
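
The chat endpoint also supports token streaming through its OpenAI-compatible API. The following is a minimal sketch; it assumes the same environment variables as above and that streamed chunks follow the OpenAI delta format.

import os
from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url=os.environ["NEMO_BASE_URL"],
    inference_base_url=os.environ["NIM_PROXY_BASE_URL"],
)

# With stream=True the call yields incremental chunks instead of a single response.
stream = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "what can you do?"}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    # Each chunk carries a delta with the next piece of generated text.
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()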

You can deploy multiple NIM microservices in the same manner, as long as you have enough GPU resources. You can access all deployed NIM microservices through the NIM Proxy microservice under the nim.test host name. If you don't have enough GPUs, you might need to remove a NIM first. For example, if you have only one GPU available for inference, remove the deployed NIM before deploying another. To delete a deployed NIM microservice, follow the instructions at Delete Deployed Models.
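
If you need to free the GPU, you can remove a deployment through the same Deployment Management API. The following is a minimal sketch; it assumes the SDK exposes a delete call alongside the create and retrieve calls used above (the REST equivalent would be a DELETE request to the same path used when checking the status).

# Hypothetical delete call; refer to Delete Deployed Models for the documented procedure.
client.deployment.model_deployments.delete(
    namespace="meta",
    deployment_name="llama-3.1-8b-instruct",
)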