Using Local LLMs#
NeMo Agent toolkit has the ability to interact with locally hosted LLMs. In this guide we will demonstrate how to adapt the simple example (examples/getting_started/simple_web_query) to use locally hosted LLMs with two different approaches: NVIDIA NIM and vLLM.
Using NIM#
In the NeMo Agent toolkit simple example, the meta/llama-3.1-70b-instruct model was used. For the purposes of this guide we will be using a smaller model, nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1, which is more likely to be runnable on a local workstation.
Regardless of the model you choose, the process for downloading the model's container from build.nvidia.com is the same. Navigate to the model you wish to run locally; if it can be downloaded, it will be labeled with the RUN ANYWHERE tag, and the exact commands will be specified on the Deploy tab for the model.
Requirements#
An NVIDIA GPU with CUDA support (exact requirements depend on the model you are using)
An NVIDIA API key; refer to Obtaining API Keys for more information.
Install the Simple Web Query Example#
First, ensure the current working directory is the root of the NeMo Agent toolkit repository. Then, install the simple web query example so the webpage_query tool is available.
pip install -e examples/getting_started/simple_web_query
Downloading the NIM Containers#
Log in to nvcr.io with Docker:
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>
Download the container for the LLM:
docker pull nvcr.io/nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:latest
Download the container for the embedding model:
docker pull nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest
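Optionally, confirm that both images are now available locally (a quick sanity check, not required for the rest of the guide):
docker images --filter=reference='nvcr.io/nim/nvidia/*'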
Running the NIM Containers#
Note
The --gpus flag is used to specify the GPUs to use for the LLM and embedding model. Each user's setup may vary, so adjust the commands to suit the system.
Run the LLM container listening on port 8000:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
--gpus device=0 \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8000:8000 \
nvcr.io/nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:latest
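Once the container logs report that the model is ready, you can optionally verify the OpenAI-compatible endpoint from another terminal. This is only a sanity check; the model name in the request is assumed to match the name the NIM reports via /v1/models:
curl http://localhost:8000/v1/models

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/llama3.1-nemotron-nano-4b-v1.1", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 32}'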
Open a new terminal and run the embedding model container, listening on port 8001:
export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
docker run -it --rm \
--gpus device=1 \
--shm-size=16GB \
-e NGC_API_KEY \
-v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
-u $(id -u) \
-p 8001:8000 \
nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest
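Similarly, once the embedding container is ready, an optional test request confirms that embeddings are returned. The input_type field shown here is assumed behavior of the NVIDIA retrieval embedding NIMs (it accepts query or passage) and is not part of the base OpenAI embeddings API:
curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "nvidia/nv-embedqa-e5-v5", "input": ["What is LangSmith?"], "input_type": "query"}'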
NeMo Agent Toolkit Configuration#
To define the pipeline configuration, we will start with the examples/getting_started/simple_web_query/configs/config.yml file and modify it to use the locally hosted models. The only changes needed are to define the base_url for the LLM and embedding models, along with the names of the models to use.
examples/documentation_guides/locally_hosted_llms/nim_config.yml:
functions:
  webpage_query:
    _type: webpage_query
    webpage_url: https://docs.smith.langchain.com
    description: "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!"
    embedder_name: nv-embedqa-e5-v5
    chunk_size: 512
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    base_url: "http://localhost:8000/v1"
    model_name: nvidia/llama3.1-nemotron-nano-4b-v1.1

embedders:
  nv-embedqa-e5-v5:
    _type: nim
    base_url: "http://localhost:8001/v1"
    model_name: nvidia/nv-embedqa-e5-v5

workflow:
  _type: react_agent
  tool_names: [webpage_query, current_datetime]
  llm_name: nim_llm
  verbose: true
  parse_agent_response_max_retries: 3
Running the NeMo Agent Toolkit Workflow#
To run the workflow using the locally hosted LLMs, run the following command:
nat run --config_file examples/documentation_guides/locally_hosted_llms/nim_config.yml --input "What is LangSmith?"
Using vLLM#
vLLM provides an OpenAI-Compatible Server, allowing us to reuse our existing OpenAI clients. If you have not already done so, install vLLM following the Quickstart guide. Similar to the previous example, we will be using the same nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 LLM, along with the ssmits/Qwen2-7B-Instruct-embed-base embedding model.
Install the Simple Web Query Example#
First, ensure the current working directory is the root of the NeMo Agent toolkit repository. Then, install the simple web query example so the webpage_query tool is available.
pip install -e examples/getting_started/simple_web_query
Serving the Models#
Similar to the NIM approach, we will run the LLM on the default port of 8000 and the embedding model on port 8001.
Note
The CUDA_VISIBLE_DEVICES environment variable is used to specify the GPUs to use for the LLM and embedding model. Each user's setup may vary, so adjust the commands to suit the system.
In a terminal from within the vLLM environment, run the following command to serve the LLM:
CUDA_VISIBLE_DEVICES=0 vllm serve nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1
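After vLLM finishes loading the model, you can optionally confirm that the OpenAI-compatible server is responding:
curl http://localhost:8000/v1/models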
In a second terminal also from within the vLLM environment, run the following command to serve the embedding model:
CUDA_VISIBLE_DEVICES=1 vllm serve --task embed --override-pooler-config '{"pooling_type": "MEAN"}' --port 8001 ssmits/Qwen2-7B-Instruct-embed-base
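Likewise, an optional test request against the embedding server verifies that embeddings are returned (the model name must match the one passed to vllm serve):
curl http://localhost:8001/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "ssmits/Qwen2-7B-Instruct-embed-base", "input": ["What is LangSmith?"]}'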
Note
The --override-pooler-config flag is taken from the vLLM Supported Models documentation.
NeMo Agent Toolkit Configuration#
The pipeline configuration will be similar to the NIM example, with the key differences being the selection of openai as the _type for the LLM and embedding models. The OpenAI clients we use to communicate with the vLLM server expect an API key; since the vLLM server does not require authentication, we simply provide a placeholder value.
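Before wiring the endpoints into the toolkit, you can optionally verify that the placeholder key is accepted by querying the vLLM server directly with the OpenAI Python client. This is a minimal sketch that assumes the openai package is installed in your environment:
from openai import OpenAI

# Point the client at the local vLLM server; any api_key value is accepted.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)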
examples/documentation_guides/locally_hosted_llms/vllm_config.yml:
functions:
  webpage_query:
    _type: webpage_query
    webpage_url: https://docs.smith.langchain.com
    description: "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!"
    embedder_name: vllm_embedder
    chunk_size: 512
  current_datetime:
    _type: current_datetime

llms:
  vllm_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"
    model_name: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

embedders:
  vllm_embedder:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8001/v1"
    model_name: ssmits/Qwen2-7B-Instruct-embed-base

workflow:
  _type: react_agent
  tool_names: [webpage_query, current_datetime]
  llm_name: vllm_llm
  verbose: true
  parse_agent_response_max_retries: 3
Running the NeMo Agent Toolkit Workflow#
To run the workflow using the locally hosted LLMs, run the following command:
nat run --config_file examples/documentation_guides/locally_hosted_llms/vllm_config.yml --input "What is LangSmith?"