Edge Deployment#
This guide covers deploying the base vision agent with NVIDIA Nemotron Edge 4B or NVIDIA Nemotron Nano 9B v2 as the LLM on edge platforms (DGX Spark, IGX Thor, AGX Thor).
For small models like NVIDIA Nemotron Edge 4B, a dedicated configuration file (config_edge.yml) is provided.
It inherits all settings from the standard config.yml and overrides only the planning prompt with a simplified
version optimized for smaller models.
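As a rough illustration of the override pattern, a hypothetical sketch is shown below. The actual keys and prompt text in `config_edge.yml` may differ; per the description above, only the planning prompt is overridden, and everything else is inherited from `config.yml`:

```yaml
# config_edge.yml -- hypothetical sketch, not the shipped file.
# Inherits all settings from the standard config.yml.
base_config: config.yml

planning:
  # Simplified planning prompt tuned for small models such as Nemotron Edge 4B.
  prompt: |
    You are a vision agent. Break the user's request into short,
    concrete steps and call the available tools directly.
```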
Note
The edge configuration's prompt does not encourage the agent to ask clarifying questions. In ambiguous situations,
an agent running with config_edge.yml therefore tends not to request more context from the user.
If this behavior is not desired, use a larger model such as NVIDIA Nemotron Nano 9B v2, which asks clarifying questions
when the user's request does not contain enough context.
Prerequisites#
Before deploying, complete the prerequisites for your platform. Also reboot the system and free up the GPU completely before deploying.
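One quick way to confirm the GPU is actually free is to check its reported memory usage. The snippet below is guarded so it degrades gracefully on systems where `nvidia-smi` is not on the PATH:

```shell
# Report total vs. used GPU memory before deploying anything.
if command -v nvidia-smi > /dev/null; then
  nvidia-smi --query-gpu=memory.total,memory.used --format=csv
else
  echo "nvidia-smi not found; skipping GPU memory check"
fi
```

If the used column is not near zero, stop any leftover containers or processes before continuing.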
You will also need:
NGC CLI API Key — for pulling container images
HF Token — for pulling model weights from Hugging Face
NVIDIA Nemotron Edge 4B running locally or on a reachable endpoint (e.g. via vLLM on port 30081)
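To sanity-check the last prerequisite, you can query the LLM endpoint before deploying. The `/v1/models` route is the standard model listing exposed by vLLM's OpenAI-compatible server; the host and port below match the example deployment and should be adjusted if yours differ:

```shell
# List the models served by the OpenAI-compatible endpoint; prints a hint
# if the server is not up yet.
curl -sf --max-time 5 http://localhost:30081/v1/models \
  || echo "LLM endpoint not reachable yet at localhost:30081"
```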
For more details on VSS configurations and how to get API tokens, refer to:
DGX Spark#
Start the Edge 4B model. For example, using vLLM:
```shell
export HF_TOKEN=$HF_TOKEN

docker run --gpus all -d --name nemotron-edge -p 30081:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  nvcr.io/nvidia/vllm:26.02-py3 \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-Edge-4B-v2.1-EA-020126_FP8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.25 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --port 8000
```
Deploy the agent workflow:
```shell
export NVIDIA_API_KEY=$NVIDIA_API_KEY
export NGC_CLI_API_KEY=$NGC_CLI_API_KEY
export LLM_ENDPOINT_URL=http://localhost:30081
export VSS_AGENT_CONFIG_FILE=./deployments/developer-workflow/dev-profile-base/vss-agent/configs/config_edge.yml

deployments/dev-profile.sh up -p base \
  --use-remote-llm \
  --llm nvidia/NVIDIA-Nemotron-Edge-4B-v2.1-EA-020126_FP8 \
  --hardware-profile DGX-SPARK \
  --vlm-env-file deployments/nim/cosmos-reason2-8b/hw-DGX-SPARK-shared.env
```
The `--vlm-env-file` option limits the VLM KV-cache (`NIM_KVCACHE_PERCENT=0.4`) so both models fit in GPU memory.
Note
If you want the agent to ask clarifying questions when a request is ambiguous or incomplete, use the Nemotron Nano 9B model instead. This larger model also handles more complex queries and situations. In that case, skip running the vLLM container entirely; the blueprint deploys both the LLM and the VLM as NIM containers.
```shell
# Free up GPU for a full blueprint deployment
# docker stop nemotron-edge && docker rm nemotron-edge

deployments/dev-profile.sh up -p base \
  --hardware-profile DGX-SPARK \
  --llm nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8 \
  --vlm nvidia/cosmos-reason2-8b
```
This uses the standard `config.yml` (no `config_edge.yml` override needed), since the 9B model handles complex and ambiguous queries better.
AGX Thor / IGX Thor#
On AGX Thor and IGX Thor, the Edge 4B LLM runs locally alongside rtvi-vlm (the VLM
used by the blueprint on Thor). Both models share the same GPU, so their memory budgets
must be tuned to coexist.
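As a back-of-envelope illustration of how the two budgets coexist, the sketch below combines the `--gpu-memory-utilization 0.25` used for the Edge 4B vLLM server with the default 0.35 GPU memory utilization for rtvi-vlm. The 128 GB total is an assumed unified-memory size for illustration, not a value reported by the deployment scripts:

```shell
# Hypothetical sizing sketch; adjust TOTAL_GB to your device.
TOTAL_GB=128
LLM_PCT=25   # vLLM --gpu-memory-utilization 0.25 for Nemotron Edge 4B
VLM_PCT=35   # default GPU memory utilization 0.35 for rtvi-vlm

LLM_GB=$(( TOTAL_GB * LLM_PCT / 100 ))
VLM_GB=$(( TOTAL_GB * VLM_PCT / 100 ))
echo "LLM budget: ${LLM_GB} GB"
echo "VLM budget: ${VLM_GB} GB"
echo "Headroom:   $(( TOTAL_GB - LLM_GB - VLM_GB )) GB"
```

If the two fractions sum close to 1.0, the remaining headroom for the OS and other processes shrinks, which is why the LLM fraction is kept low here.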
Start the Edge 4B model on the device:
```shell
export HF_TOKEN=$HF_TOKEN

docker run --gpus all -d --name nemotron-edge -p 30081:8000 \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES=0 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/nvidia-ai-iot/vllm:latest-jetson-thor \
  python3 -m vllm.entrypoints.openai.api_server \
    --model nvidia/NVIDIA-Nemotron-Edge-4B-v2.1-EA-020126_FP8 \
    --trust-remote-code \
    --gpu-memory-utilization 0.25 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_coder \
    --port 8000
```
Deploy the agent workflow:
```shell
export NVIDIA_API_KEY=$NVIDIA_API_KEY
export NGC_CLI_API_KEY=$NGC_CLI_API_KEY
export LLM_ENDPOINT_URL=http://localhost:30081
export VSS_AGENT_CONFIG_FILE=./deployments/developer-workflow/dev-profile-base/vss-agent/configs/config_edge.yml

# Uses the default GPU memory utilization of 0.35 for rtvi-vlm
deployments/dev-profile.sh up -p base \
  --use-remote-llm \
  --llm nvidia/NVIDIA-Nemotron-Edge-4B-v2.1-EA-020126_FP8 \
  --hardware-profile AGX-THOR
```
Note
If you want the agent to ask clarifying questions when a request is ambiguous or incomplete, use the Nemotron Nano 9B model instead. This larger model also handles more complex queries and situations. In that case, skip running the vLLM container entirely; the blueprint deploys both the LLM and the VLM as NIM containers.
```shell
# Free up GPU for a full blueprint deployment
# docker stop nemotron-edge && docker rm nemotron-edge

deployments/dev-profile.sh up -p base \
  --hardware-profile AGX-THOR \
  --llm nvidia/NVIDIA-Nemotron-Nano-9B-v2-FP8
```
This uses the standard `config.yml` (no `config_edge.yml` override needed), since the 9B model handles complex and ambiguous queries better.
Note
For IGX Thor, replace `AGX-THOR` with `IGX-THOR` in the commands above.