Using Local LLMs#

NeMo Agent toolkit has the ability to interact with locally hosted LLMs, in this guide we will demonstrate how to adapt the simple example (examples/getting_started/simple_web_query) to use locally hosted LLMs using two different approaches using NVIDIA NIM and vLLM, though any locally hosted LLM with an OpenAI-compatible API can be used.

Using NIM#

In the NeMo Agent toolkit simple example the meta/llama-3.1-70b-instruct model was used. For the purposes of this guide we will be using a smaller model, the nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 which is more likely to be runnable on a local workstation.

Regardless of the model you choose, the process is the same for downloading the model’s container from build.nvidia.com. Navigate to the model you wish to run locally, if it is able to be downloaded it will be labeled with the RUN ANYWHERE tag, the exact commands will be specified on the Deploy tab for the model.

Requirements#

An NVIDIA GPU with CUDA support (exact requirements depend on the model you are using)
The NVIDIA Container Toolkit
An NVIDIA API key, refer to Obtaining API Keys for more information.

Install the Simple Web Query Example#

First, ensure the current working directory is the root of the NeMo Agent toolkit repository. Then, install NAT and the simple web query example.

uv pip install -e .
uv pip install -e examples/getting_started/simple_web_query

Downloading the NIM Containers#

$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>

Download the container for the LLM:

docker pull nvcr.io/nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:latest

Download the container for the embedding Model:

docker pull nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest

Running the NIM Containers#

Note

The --gpus flag is used to specify the GPUs to use for the LLM and embedding model. Each user’s setup may vary, so adjust the commands to suit the system.

Run the LLM container listening on port 8000:

export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
docker run -it --rm \
    --gpus 0 \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:latest

Open a new terminal and run the embedding model container, listening on port 8001:

export NGC_API_KEY=<PASTE_API_KEY_HERE>
export LOCAL_NIM_CACHE=~/.cache/nim
docker run -it --rm \
    --gpus 1 \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8001:8000 \
    nvcr.io/nim/nvidia/nv-embedqa-e5-v5:latest

NeMo Agent Toolkit Configuration#

To define the pipeline configuration, we will start with the examples/getting_started/simple_web_query/configs/config.yml file and modify it to use the locally hosted LLMs, the only changes needed are to define the base_url for the LLM and embedding models, along with the names of the models to use.

examples/documentation_guides/locally_hosted_llms/nim_config.yml:

functions:
  webpage_query:
    _type: webpage_query
    webpage_url: https://docs.smith.langchain.com
    description: "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!"
    embedder_name: nv-embedqa-e5-v5
    chunk_size: 512
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    base_url: "http://localhost:8000/v1"
    model_name: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

embedders:
  nv-embedqa-e5-v5:
    _type: nim
    base_url: "http://localhost:8001/v1"
    model_name: nvidia/nv-embedqa-e5-v5

workflow:
  _type: react_agent
  tool_names: [webpage_query, current_datetime]
  llm_name: nim_llm
  verbose: true
  parse_agent_response_max_retries: 3

Running the NeMo Agent Toolkit Workflow#

To run the workflow using the locally hosted LLMs, run the following command:

nat run --config_file examples/documentation_guides/locally_hosted_llms/nim_config.yml --input "What is LangSmith?"

Using vLLM#

vLLM provides an OpenAI-Compatible Server allowing us to re-use our existing OpenAI clients.

If you have not already done so, install vLLM following the Quickstart guide. It is recommended to use a separate virtual environment for vLLM due to potential conflicts with NeMo Agent toolkit dependencies.

Similar to the previous example we will be using the same nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 LLM model. Along with the ssmits/Qwen2-7B-Instruct-embed-base embedding model.

Install the Simple Web Query Example#

First, ensure the current working directory is the root of the NeMo Agent toolkit repository. Then, install NAT and the simple web query example.

uv pip install -e .
uv pip install -e examples/getting_started/simple_web_query

Serving the Models#

Similar to the NIM approach we will be running the LLM on the default port of 8000 and the embedding model on port 8001.

Note

The CUDA_VISIBLE_DEVICES environment variable is used to specify the GPUs to use for the LLM and embedding model. Each user’s setup may vary, so adjust the commands to suit the system.

In a terminal from within the vLLM environment, run the following command to serve the LLM:

CUDA_VISIBLE_DEVICES=0 vllm serve nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

In a second terminal also from within the vLLM environment, run the following command to serve the embedding model:

CUDA_VISIBLE_DEVICES=1 vllm serve --task embed --override-pooler-config '{"pooling_type": "MEAN"}' --port 8001 ssmits/Qwen2-7B-Instruct-embed-base

Note

The --override-pooler-config flag is taken from the vLLM Supported Models documentation.

NeMo Agent Toolkit Configuration#

The pipeline configuration will be similar to the NIM example, with the key differences being the selection of openai as the _type for the LLM and embedding models. The OpenAI clients we are using to communicate with the vLLM server expect an API key, we simply need to provide a value key, as the vLLM server does not require authentication. examples/documentation_guides/locally_hosted_llms/vllm_config.yml:

functions:
  webpage_query:
    _type: webpage_query
    webpage_url: https://docs.smith.langchain.com
    description: "Search for information about LangSmith. For any questions about LangSmith, you must use this tool!"
    embedder_name: vllm_embedder
    chunk_size: 512
  current_datetime:
    _type: current_datetime

llms:
  vllm_llm:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"
    model_name: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1

embedders:
  vllm_embedder:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8001/v1"
    model_name: ssmits/Qwen2-7B-Instruct-embed-base

workflow:
  _type: react_agent
  tool_names: [webpage_query, current_datetime]
  llm_name: vllm_llm
  verbose: true
  parse_agent_response_max_retries: 3

Running the NeMo Agent Toolkit Workflow#

To run the workflow using the locally hosted LLMs, run the following command:

nat run --config_file examples/documentation_guides/locally_hosted_llms/vllm_config.yml --input "What is LangSmith?"

Other Locally Hosted LLMs#

Any locally hosted LLM with an OpenAI-compatible API can be used with the NeMo Agent toolkit. The only changes needed are to define the base_url for the LLM and embedding models, along with the names of the models to use.

For example, to use the gpt-oss-20b model, the following configuration can be used:

llms:
  gpt-oss:
    _type: openai
    api_key: "EMPTY"
    base_url: "http://localhost:8000/v1"
    model_name: gpt-oss-20b