Deploy NIM for Llama 3.1 8B Instruct

Start by deploying a NIM for Llama 3.1 8B Instruct. You’ll use this NIM for evaluation and inference tasks in the subsequent tutorials.

  1. To deploy the NIM, send the following request to the NeMo Deployment Management API:

    curl --location "http://nemo.test/v1/deployment/model-deployments" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "name": "llama-3.1-8b-instruct",
          "namespace": "meta",
          "config": {
             "model": "meta/llama-3.1-8b-instruct",
             "nim_deployment": {
                "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
                "image_tag": "1.8",
                "pvc_size":   "25Gi",
                "gpu":       1,
                "additional_envs": {
                   "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
                }
             }
          }
       }'
    

    The response indicates that the deployment is pending. It takes about 10 minutes for the NIM to fully deploy.

    Example Output
    {"async_enabled":false,"config":{"model":"meta/llama-3.1-8b-instruct","nim_deployment":{"additional_envs":{"NIM_GUIDED_DECODING_BACKEND":"fast_outlines"},"gpu":1,"image_name":"nvcr.io/nim/meta/llama-3.1-8b-instruct","image_tag":"1.8"}},"created_at":"2025-04-01T21:38:59.494256552Z","deployed":false,"name":"llama-3.1-8b-instruct","namespace":"meta","status_details":{"description":"Model deployment created","status":"pending"},"url":""}
    
  2. Check the NIM status until it reports ready, using the following command (a sketch that automates this polling appears after these steps):

    curl --location "http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" | jq
    
    Example Output
    {
      "async_enabled": false,
      "config": {
        "model": "meta/llama-3.2-1b-instruct",
        "nim_deployment": {
          "additional_envs": {
            "NIM_GUIDED_DECODING_BACKEND": "fast_outlines"
          },
          "gpu": 1,
          "image_name": "nvcr.io/nim/meta/llama-3.1-8b-instruct",
          "image_tag": "1.8"
        }
      },
      "created_at": "2025-04-01T20:49:20.467335766Z",
      "deployed": true,
      "name": "llama-3.1-8b-instruct",
      "namespace": "meta",
      "status_details": {
        "description": "deployment \"modeldeployment-meta-llama-3-2-1b-instruct\" successfully rolled out\n",
        "status": "ready"
      },
      "url": ""
    }
    

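Rather than re-running the status command by hand, you can poll until the status field flips to ready. The following Bash sketch extracts status_details.status with jq from the endpoint shown in step 2; the 30-second interval and 20-attempt limit are illustrative choices, not part of the API.

    # Poll the deployment status until it reports "ready".
    # Interval and attempt limit are illustrative, not prescribed by the API.
    for i in $(seq 1 20); do
      STATUS=$(curl -s "http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct" \
        | jq -r '.status_details.status')
      if [ "$STATUS" = "ready" ]; then
        echo "NIM is ready."
        break
      fi
      echo "Status: $STATUS; retrying in 30 seconds..."
      sleep 30
    done
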
You can deploy multiple NIM microservices in the same manner, as long as you have enough GPU resources. All deployed NIMs are accessible through the NIM Proxy microservice under the nim.test host name, as shown in the sketch below. If you don’t have enough GPUs, you may need to remove a NIM first: for example, with only one GPU available for inference, you must remove the deployed NIM before deploying another. To delete a deployed NIM microservice, follow the instructions at Delete Deployed Models.
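
Once the status reads ready, you can verify inference end to end through the NIM Proxy. The sketch below assumes the proxy exposes the OpenAI-compatible chat completions endpoint under the nim.test host name; the prompt and max_tokens value are placeholders.

    curl --location "http://nim.test/v1/chat/completions" \
       -H 'accept: application/json' \
       -H 'Content-Type: application/json' \
       -d '{
          "model": "meta/llama-3.1-8b-instruct",
          "messages": [
             {"role": "user", "content": "Write a haiku about GPUs."}
          ],
          "max_tokens": 128
       }'

If you need to free the GPU, the Deployment Management API follows REST conventions, so deleting the deployment presumably takes the form below; defer to Delete Deployed Models for the authoritative procedure.

    curl -X DELETE "http://nemo.test/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct"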