Configure the NIMs

The LLM NIM can be swapped for a different model size or version.

The default Docker Compose local deployment launches the Llama 3.1 70B NIM as the LLM. Depending on your needs, you can switch to a different LLM NIM by modifying the docker run command.

Refer to the snippet below: add your NGC API key, then run the command to launch a different LLM NIM (Llama 3.1 8B Instruct in this example) on GPUs 1 and 2.

export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -d -it --rm \
   --gpus '"device=1,2"' \
   --shm-size=16GB \
   -e NGC_API_KEY \
   -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
   -u $(id -u) \
   -p 8000:8000 \
   nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3
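
Because the container runs detached, you can follow the model download and startup progress with standard Docker commands (docker ps to find the container ID, then docker logs to stream its output):

docker ps --filter "ancestor=nvcr.io/nim/meta/llama-3.1-8b-instruct:1.3.3"
docker logs -f <CONTAINER_ID>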

The first launch takes some time because the model must be downloaded and deployed. Once the NIM is up, ensure the LLM works by running a sample curl command.

curl -X 'POST' \
   'http://0.0.0.0:8000/v1/chat/completions' \
   -H 'accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
      "model": "meta/llama-3.1-8b-instruct",
      "messages": [{"role":"user", "content":"Write a limerick about the wonders of GPU computing."}],
      "max_tokens": 64
   }'
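
If the request fails because the model is still loading, you can poll the NIM readiness endpoint (this assumes the standard NIM for LLMs health route, /v1/health/ready, is exposed on the mapped port) until it reports ready:

curl http://0.0.0.0:8000/v1/health/ready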

Similarly, the CA-RAG config and guardrails/config.yml can be updated to point at the new model, as shown below:

CA-RAG config:

summarization:
   llm:
      model: "meta/llama-3.1-8b-instruct"
      base_url: "http://host.docker.internal:8000/v1" #FIXME - update url to running LLM Instance
...
chat:
   llm:
      model: "meta/llama-3.1-8b-instruct"
      base_url: "http://host.docker.internal:8000/v1" #FIXME - update url to running LLM Instance
...
notification:
   llm:
      model: "meta/llama-3.1-8b-instruct"
      base_url: "http://host.docker.internal:8000/v1" #FIXME - update url to running LLM Instance

Guardrails config:

models:
   - type: main
     engine: nim
     model: meta/llama-3.1-8b-instruct
     parameters:
        base_url: "http://host.docker.internal:8000/v1"
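
To confirm that the model name used in these configs matches what the NIM actually serves, you can list the available models from the host through the OpenAI-compatible endpoint (the address assumes the port mapping used above; from inside other containers, use the base_url values shown in the configs):

curl http://0.0.0.0:8000/v1/models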