6. Inference Examples#
In this section, we give examples of deploying NVIDIA NIM on your DGX Cloud Create cluster. The examples use the llama-3.1-8b-instruct model, but the process is similar for other models.
6.1. Using the UI#
6.1.1. Prerequisites and Requirements#
An NVIDIA Run:ai project namespace identified for use (either new or existing)
Credentials with access to the NIM on NGC added to the NVIDIA Run:ai cluster. See Credentials for details.
Note
Both Docker registry and Generic secret Credentials will be needed for this example. NIM container images will be pulled using the Docker registry credentials, and the NIM will access resources on NGC using the Generic secret credentials.
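These UI Credentials are backed by Kubernetes secrets in the project namespace. If you prefer to verify them or create rough equivalents from the command line, a minimal sketch follows; the secret names are illustrative, runai-<project> stands for your project namespace, and $NGC_API_KEY is assumed to hold your NGC API key:

# Docker registry secret used to pull NIM images from nvcr.io
# (NGC uses the literal username $oauthtoken with the API key as password)
kubectl create secret docker-registry ngc-registry-secret \
  --namespace runai-<project> \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Generic secret holding the NGC API key for the running NIM
kubectl create secret generic ngc-api-key-secret \
  --namespace runai-<project> \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"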
6.1.2. Create an Environment#
Create a separate Environment for each model.
To create a new environment using the NVIDIA Run:ai UI:
In the left navigation menu, select Environments. You will be taken to the Environments overview page which shows information on existing environments in the cluster.
Click + NEW ENVIRONMENT in the top left of the page. You will be taken to the New environment creation page.
Select a Scope for the environment. This determines which clusters, departments, groups or projects can deploy the environment.
Enter an Environment name and description. For example:
llama3-8b-nim
Insert the URL of the container image. For example:
nvcr.io/nim/meta/llama-3.1-8b-instruct
Select Inference as the workload type.
Add HTTP port 8000 as the inference serving endpoint.
In the runtime settings, add a new environment variable. Type NGC_API_KEY for the name, select Credentials as the source, select the key with your NGC credentials for the Credential name, and type NGC_API_KEY for the Secret Key.
Click CREATE ENVIRONMENT. You will be taken back to the Environments overview page, where your environment is listed in the table.
6.1.3. Optional: Create a PVC Datastore#
You can create a PVC Datastore for model caching. Refer to PVC for more information.
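If you create the PVC outside the UI, a minimal manifest for a model cache might look like the following sketch; the name, size, access mode, and (omitted) storage class are assumptions to adapt to your cluster. NIM images typically cache downloaded weights under /opt/nim/.cache, so that is a natural mount path when you later attach the PVC as a data source:

# Hypothetical PVC for caching NIM model weights
kubectl apply -n runai-<project> -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nim-cache
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
EOF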
6.1.4. Create Workload#
Go to the Workloads overview page and click the + New Workload button in the top left. From the drop-down menu that appears, select Inference. You will be taken to the New inference creation page.
Select the desired cluster to run your NIM.
Select the desired project to run your job in.
Add an Inference name. For example:
nim-llama3
Click CONTINUE in the bottom right of the page.
Select the Environment you created. Make sure the NGC credentials are properly populated as an environment variable.
Select a compute resource with one GPU.
Optional: Increase the maximum number of replicas for autoscaling. If you do, additional autoscaling options will appear (you can inspect the resulting configuration with the sketch after these steps).
Optional: In the Data Sources section, select your PVC.
Go to the bottom of the page and click CREATE.
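If you enabled autoscaling in the optional step above, the settings typically surface as autoscaling annotations on the underlying Knative service (the same ksvc object queried below). A minimal sketch for inspecting them, assuming kubectl access to the project namespace and the workload name nim-llama3:

# Show autoscaling-related annotations on the Knative service
kubectl get ksvc nim-llama3 -o yaml | grep autoscaling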
Optional: Monitor the NIM Pod logs
$ kubectl get pod
NAME                                           READY   STATUS    RESTARTS   AGE
nim-llama3-00001-deployment-5ffcfd48c8-kbmbg   2/2     Running   0          6m2s
$ kubectl logs nim-llama3-00001-deployment-5ffcfd48c8-kbmbg -f
Get the external URL at which the NIM is exposed. The URL is the value in the second column of the output, in this case https://nim-llama3-runai-nemo-ms.inference.<cluster>.ai:

$ kubectl get ksvc
NAME         URL                                                        LATESTCREATED            LATESTREADY              READY   REASON
nim-llama3   https://nim-llama3-runai-nemo-ms.inference.<cluster>.ai   nim-llama3-yasen-00001   nim-llama3-yasen-00001   True
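For scripting, the same URL can be read from the Knative service status rather than parsed from the table; a minimal sketch, assuming the workload name nim-llama3:

# Capture the external URL of the inference service
NIM_URL=$(kubectl get ksvc nim-llama3 -o jsonpath='{.status.url}')
echo "$NIM_URL"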
Test the service:

curl -X POST https://nim-llama3-runai-nemo-ms.inference.<cluster>.ai/v1/chat/completions \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
        "messages": [
          {
            "content": "You are a polite and respectful chatbot helping people plan a vacation.",
            "role": "system"
          },
          {
            "content": "What should I do for a 4 day vacation in Spain?",
            "role": "user"
          }
        ],
        "model": "meta/llama-3.1-8b-instruct",
        "max_tokens": 16,
        "top_p": 1,
        "n": 1,
        "stream": false,
        "stop": "\n",
        "frequency_penalty": 0.0
      }' | jq
The expected result is:
{ "id": "chat-69e0148a09344875a1cfff1919f3aa46", "object": "chat.completion", "created": 1731544970, "model": "meta/llama-3.1-8b-instruct", "choices": [ { "index": 0, "message": { "role": "assistant", "content": "Spain is a wonderful destination! With 4 days, you'll have a great" }, "logprobs": null, "finish_reason": "length", "stop_reason": null } ], "usage": { "prompt_tokens": 43, "total_tokens": 59, "completion_tokens": 16 } }