NVIDIA NeMo Agent Toolkit API Server Endpoints#

There are currently four workflow transactions that can be initiated using HTTP or WebSocket when the NeMo Agent toolkit server is running: generate non-streaming,generate streaming, chat non-streaming, and chat streaming. The following are types of interfaces you can use to interact with your running workflows.

  • Generate Interface: Uses the transaction schema defined by your workflow. The interface documentation is accessible using Swagger while the server is running http://localhost:8000/docs.

  • Chat Interface: OpenAI API Documentation provides details on chat formats compatible with the NeMo Agent toolkit server.

Generate Non-Streaming Transaction#

  • Route: /generate

  • Description: A non-streaming transaction that waits until all workflow data is available before sending the result back to the client. The transaction schema is defined by the workflow.

  • HTTP Request Example:

    curl --request POST \
      --url http://localhost:8000/generate \
      --header 'Content-Type: application/json' \
      --data '{
        "input_message": "Is 4 + 4 greater than the current hour of the day"
      }'
    
  • HTTP Response Example:

    {
      "value":"No, 4 + 4 is not greater than the current hour of the day."
    }
    

Generate Streaming Transaction#

  • Route: /generate/stream

  • Description: A streaming transaction that allows data to be sent in chunks as it becomes available from the workflow, rather than waiting for the complete response to be available.

  • HTTP Request Example:

    curl --request POST \
      --url http://localhost:8000/generate/stream \
      --header 'Content-Type: application/json' \
      --data '{
        "input_message": "Is 4 + 4 greater than the current hour of the day"
      }'
    
  • HTTP Intermediate Step Stream Example:

    "intermediate_data": {
      "id": "ba5191e6-b818-4206-ac14-863112e597fe",
      "parent_id": "5db32854-d9b2-4e75-9001-543da6a55dd0",
      "type": "markdown",
      "name": "meta/llama-3.1-70b-instruct",
      "payload": "**Input:**\n```python\n[SystemMessage(content='\\nAnswer the following questions as best you can. You
                  may ask the human to use the following tools:\\n\\ncalculator_multiply: This is a mathematical tool used to multiply
                  two numbers together. It takes 2 numbers as an input and computes their numeric product as the output.. . Arguments
                  must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str,
                  required=True)}\\ncalculator_inequality: This is a mathematical tool used to perform an inequality comparison
                  between two numbers. It takes two numbers as an input and determines if one is greater or are equal.. . Arguments
                  must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str,
                  required=True)}\\ncurrent_datetime: Returns the current date and time in human readable format.. . Arguments must
                  be provided as a valid JSON object following this format: {\\'unused\\': FieldInfo(annotation=str, required=True)}
                  \\ncalculator_divide: This is a mathematical tool used to divide one number by another. It takes 2 numbers as an
                  input and computes their numeric quotient as the output.. . Arguments must be provided as a valid JSON object
                  following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\n\\nYou may respond in one of two
                  formats.\\nUse the following format exactly to ask the human to use a tool:\\n\\nQuestion: the input question you
                  must answer\\nThought: you should always think about what to do\\nAction: the action to take, should be one of
                  [calculator_multiply,calculator_inequality,current_datetime,calculator_divide]\\nAction Input: the input to the
                  action (if there is no required input, include \"Action Input: None\")  \\nObservation: wait for the human to
                  respond with the result from the tool, do not assume the response\\n\\n... (this Thought/Action/Action
                  Input/Observation can repeat N times. If you do not need to use a tool, or after asking the human to use any tools
                  and waiting for the human to respond, you might know the final answer.)\\nUse the following format once you have
                  the final answer:\\n\\nThought: I now know the final answer\\nFinal Answer: the final answer to the original input
                  question\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='\\nQuestion: Is 4 + 4 greater
                  than the current hour of the day\\n', additional_kwargs={}, response_metadata={}), AIMessage(content='Thought:
                  To answer this question, I need to know the current hour of the day and compare it to 4 + 4.\\n\\nAction:
                  current_datetime\\nAction Input: None\\n\\n', additional_kwargs={}, response_metadata={}), HumanMessage(content='The
                  current time of day is 2025-03-11 16:05:11', additional_kwargs={}, response_metadata={}),
                  AIMessage(content=\"Thought: Now that I have the current time, I can extract the hour and compare it to 4 + 4.
                  \\n\\nAction: calculator_multiply\\nAction Input: {'text': '4 + 4'}\", additional_kwargs={}, response_metadata={}),
                  HumanMessage(content='The product of 4 * 4 is 16', additional_kwargs={}, response_metadata={}),
                  AIMessage(content=\"Thought: Now that I have the result of 4 + 4, which is 8, I can compare it to the current
                  hour.\\n\\nAction: calculator_inequality\\nAction Input: {'text': '8 > 16'}\", additional_kwargs={},
                  response_metadata={}), HumanMessage(content='First number 8 is less than the second number 16',
                  additional_kwargs={}, response_metadata={})]\n```\n\n**Output:**\nThought: I now know the final answer\n\nFinal
                  Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16)."
    }
    
  • HTTP Response Example:

    "data": { "value": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 15)." }
    

Generate Streaming Full Transaction#

  • Route: /generate/full

  • Description: Same as /generate/stream but provides raw IntermediateStep objects without any step adaptor translations. Use the filter_steps query parameter to filter steps by type (comma-separated list) or set to ‘none’ to suppress all intermediate steps.

  • HTTP Request Example:

    curl --request POST \
    --url http://localhost:8000/generate/full \
    --header 'Content-Type: application/json' \
    --data '{
      "input_message": "Is 4 + 4 greater than the current hour of the day"
    }'
    
  • HTTP Intermediate Step Stream Example:

    "intermediate_data": {"id":"dda55b33-edd1-4dde-b938-182676a42a19","parent_id":"8282eb42-01dd-4db6-9fd5-915ed4a2a032","type":"LLM_END","name":"meta/llama-3.1-70b-instruct","payload":"{\"event_type\":\"LLM_END\",\"event_timestamp\":1744051441.449566,\"span_event_timestamp\":1744051440.5072863,\"framework\":\"langchain\",\"name\":\"meta/llama-3.1-70b-instruct\",\"tags\":null,\"metadata\":{\"chat_responses\":[{\"text\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"generation_info\":null,\"type\":\"ChatGenerationChunk\",\"message\":{\"content\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"additional_kwargs\":{},\"response_metadata\":{\"finish_reason\":\"stop\",\"model_name\":\"meta/llama-3.1-70b-instruct\"},\"type\":\"AIMessageChunk\",\"name\":null,\"id\":\"run-dda55b33-edd1-4dde-b938-182676a42a19\"}}],\"chat_inputs\":null,\"tool_inputs\":null,\"tool_outputs\":null,\"tool_info\":null},\"data\":{\"input\":\"First number 8 is less than the second number 11\",\"output\":\"Thought: I now know the final answer\\n\\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11).\",\"chunk\":null},\"usage_info\":{\"token_usage\":{\"prompt_tokens\":37109,\"completion_tokens\":902,\"total_tokens\":38011},\"num_llm_calls\":0,\"seconds_between_calls\":0},\"UUID\":\"dda55b33-edd1-4dde-b938-182676a42a19\"}"}
    
  • HTTP Response Example:

    "data": {"value":"No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 11)."}
    
  • HTTP Request Example with Filter: By default, all intermediate steps are streamed. Use the filter_steps query parameter to filter steps by type (comma-separated list) or set to none to suppress all intermediate steps.

    Suppress all intermediate steps (only get final output):

    curl --request POST \
      --url 'http://localhost:8000/generate/full?filter_steps=none' \
      --header 'Content-Type: application/json' \
      --data '{"input_message": "Is 4 + 4 greater than the current hour of the day"}'
    

    Get only specific step types:

    curl --request POST \
      --url 'http://localhost:8000/generate/full?filter_steps=LLM_END,TOOL_END' \
      --header 'Content-Type: application/json' \
      --data '{"input_message": "Is 4 + 4 greater than the current hour of the day"}'
    

Chat Non-Streaming Transaction#

  • Route: /chat

  • Description: An OpenAI compatible non-streaming chat transaction.

  • HTTP Request Example:

    curl --request POST \
    --url http://localhost:8000/chat \
    --header 'Content-Type: application/json' \
    --data '{
      "messages": [
        {
          "role": "user",
          "content":  "Is 4 + 4 greater than the current hour of the day"
        }
      ],
      "use_knowledge_base": true
    }'
    
  • HTTP Response Example:

    {
      "id": "b92d1f05-200a-4540-a9f1-c1487bfb3685",
      "object": "chat.completion",
      "model": "",
      "created": "2025-03-11T21:12:43.671665Z",
      "choices": [
          {
              "message": {
                  "content": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16).",
                  "role": null
              },
              "finish_reason": "stop",
              "index": 0
          }
      ],
      "usage": {
          "prompt_tokens": 0,
          "completion_tokens": 20,
          "total_tokens": 20
      }
    }
    

Chat Streaming Transaction#

  • Route: /chat/stream

  • Description: An OpenAI compatible streaming chat transaction.

  • HTTP Request Example:

    curl --request POST \
    --url http://localhost:8000/chat/stream \
    --header 'Content-Type: application/json' \
    --data '{
      "messages": [
        {
          "role": "user",
          "content":  "Is 4 + 4 greater than the current hour of the day"
        }
      ],
      "use_knowledge_base": true
    }'
    
  • HTTP Intermediate Step Example:

    "intermediate_data": {
      "id": "9ed4bce7-191c-41cb-be08-7a72d30166cc",
      "parent_id": "136edafb-797b-42cd-bd11-29153359b193",
      "type": "markdown",
      "name": "meta/llama-3.1-70b-instruct",
      "payload": "**Input:**\n```python\n[SystemMessage(content='\\nAnswer the following questions as best you can. You
                  may ask the human to use the following tools:\\n\\ncalculator_multiply: This is a mathematical tool used to multiply
                  two numbers together. It takes 2 numbers as an input and computes their numeric product as the output.. . Arguments
                  must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str,
                  required=True)}\\ncalculator_inequality: This is a mathematical tool used to perform an inequality comparison
                  between two numbers. It takes two numbers as an input and determines if one is greater or are equal.. .
                  Arguments must be provided as a valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str,
                  required=True)}\\ncurrent_datetime: Returns the current date and time in human readable format.. . Arguments
                  must be provided as a valid JSON object following this format: {\\'unused\\': FieldInfo(annotation=str,
                  required=True)}\\ncalculator_divide: This is a mathematical tool used to divide one number by another. It takes
                  2 numbers as an input and computes their numeric quotient as the output.. . Arguments must be provided as a
                  valid JSON object following this format: {\\'text\\': FieldInfo(annotation=str, required=True)}\\n\\nYou may
                  respond in one of two formats.\\nUse the following format exactly to ask the human to use a tool:\\n\\nQuestion:
                  the input question you must answer\\nThought: you should always think about what to do\\nAction: the action to
                  take, should be one of [calculator_multiply,calculator_inequality,current_datetime,calculator_divide]\\nAction
                  Input: the input to the action (if there is no required input, include \"Action Input: None\")  \\nObservation:
                  wait for the human to respond with the result from the tool, do not assume the response\\n\\n...
                  (this Thought/Action/Action Input/Observation can repeat N times. If you do not need to use a tool, or after
                  asking the human to use any tools and waiting for the human to respond, you might know the final answer.)\\nUse
                  the following format once you have the final answer:\\n\\nThought: I now know the final answer\\nFinal Answer:
                  the final answer to the original input question\\n', additional_kwargs={}, response_metadata={}),
                  HumanMessage(content='\\nQuestion: Is 4 + 4 greater than the current hour of the day\\n', additional_kwargs={},
                  response_metadata={}), AIMessage(content='Thought: To answer this question, I need to know the current hour of
                  the day and compare it to 4 + 4.\\n\\nAction: current_datetime\\nAction Input: None\\n\\n', additional_kwargs={},
                  response_metadata={}), HumanMessage(content='The current time of day is 2025-03-11 16:24:52',
                  additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the current time, I can
                  extract the hour and compare it to 4 + 4.\\n\\nAction: calculator_multiply\\nAction Input: {'text': '4 + 4'}\",
                  additional_kwargs={}, response_metadata={}), HumanMessage(content='The product of 4 * 4 is 16',
                  additional_kwargs={}, response_metadata={}), AIMessage(content=\"Thought: Now that I have the result of 4 + 4,
                  which is 8, I can compare it to the current hour.\\n\\nAction: calculator_inequality\\nAction Input:
                  {'text': '8 > 16'}\", additional_kwargs={}, response_metadata={}), HumanMessage(content='First number 8 is
                  less than the second number 16', additional_kwargs={}, response_metadata={})]\n```\n\n**Output:**\nThought: I now
                  know the final answer\n\nFinal Answer: No, 4 + 4 (which is 8) is not greater than the current hour of the day
                  (which is 16)."
    }
    
  • HTTP Response Example:

    "data": {
      "id": "194d22dc-6c1b-44ee-a8d7-bf2b59c1cb6b",
      "choices": [
          {
              "message": {
                  "content": "No, 4 + 4 (which is 8) is not greater than the current hour of the day (which is 16).",
                  "role": null
              },
              "finish_reason": "stop",
              "index": 0
          }
      ],
      "created": "2025-03-11T21:24:56.961939Z",
      "model": "",
      "object": "chat.completion.chunk"
    }
    

OpenAI Chat Completions API Compatible Endpoint#

The NeMo Agent Toolkit provides full OpenAI Chat Completions API compatibility through a dedicated endpoint that enables seamless integration with existing OpenAI-compatible client libraries and workflows.

Overview#

When the OpenAI v1 compatible endpoint is configured, the toolkit creates a single endpoint that fully implements the OpenAI Chat Completions API specification. This endpoint handles both streaming and non-streaming requests based on the stream parameter, exactly like the official OpenAI API.

Key Benefits#

  • Drop-in Replacement: Works with existing OpenAI client libraries without code changes

  • Full API Compatibility: Supports all OpenAI Chat Completions API parameters

  • Industry Standard: Familiar interface for developers already using OpenAI

  • Future-Proof: Aligned with established API patterns and ecosystem tools

Configuration#

To enable the OpenAI v1 compatible endpoint, set openai_api_v1_path in your FastAPI front-end configuration:

general:
  front_end:
    _type: fastapi
    workflow:
      method: POST
      openai_api_v1_path: /v1/chat/completions

Configuration Options#

Parameter

Type

Default

Description

openai_api_v1_path

string

null

Path for the OpenAI v1 compatible endpoint

openai_api_path

string

/chat

Path for legacy OpenAI endpoints

method

string

POST

HTTP method for the endpoint

Endpoint Behavior#

OpenAI v1 Compatible Mode (openai_api_v1_path configured)#

Creates a single endpoint that handles both streaming and non-streaming requests:

  • Route: /v1/chat/completions (configurable via openai_api_v1_path)

  • Method: POST

  • Content-Type: application/json

  • Behavior: Routes to streaming or non-streaming based on stream parameter

Legacy Mode (openai_api_v1_path not configured)#

Creates separate endpoints for different request types:

  • Non-streaming: /<openai_api_path>

  • Streaming: <openai_api_path>/stream

Request Format#

The endpoint accepts all standard OpenAI Chat Completions API parameters:

Parameter

Type

Description

Validation

messages

array

Required. List of messages in conversation format

min 1 item

model

string

Model identifier

-

frequency_penalty

number

Decreases likelihood of repeating tokens

-2.0 to 2.0

logit_bias

object

Modify likelihood of specific tokens

token ID → bias

logprobs

boolean

Return log probabilities

-

top_logprobs

integer

Number of most likely tokens to return

0 to 20

max_tokens

integer

Maximum tokens to generate

≥ 1

n

integer

Number of completions to generate

1 to 128

presence_penalty

number

Increases likelihood of new topics

-2.0 to 2.0

response_format

object

Specify response format

-

seed

integer

Random seed for deterministic outputs

-

service_tier

string

Service tier selection

“auto” or “default”

stop

string/array

Stop sequences

-

stream

boolean

Enable streaming responses

default: false

stream_options

object

Streaming configuration options

-

temperature

number

Sampling temperature

0.0 to 2.0

top_p

number

Nucleus sampling parameter

0.0 to 1.0

tools

array

Available function tools

-

tool_choice

string/object

Tool selection strategy

-

parallel_tool_calls

boolean

Enable parallel tool execution

default: true

user

string

End-user identifier

-

Usage Examples#

cURL Examples#

Non-Streaming Request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "stream": false,
    "temperature": 0.7,
    "max_tokens": 100
  }'

Streaming Request:

curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/llama-3.1-8b-instruct",
    "messages": [
      {"role": "user", "content": "Tell me a short story"}
    ],
    "stream": true,
    "temperature": 0.7
  }'

Client Library Examples#

OpenAI Python Client:

from openai import OpenAI

# Initialize client pointing to your NeMo Agent Toolkit server
client = OpenAI(
    api_key="not-needed",  # API key not required for local deployment
    base_url="http://localhost:8000/v1"
)

# Non-streaming chat completion
response = client.chat.completions.create(
    model="nvidia/llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    stream=False,
    temperature=0.7,
    max_tokens=150
)

print(response.choices[0].message.content)

AI SDK (JavaScript/TypeScript):

import { openai } from '@ai-sdk/openai';
import { generateText } from 'ai';

// Configure custom OpenAI provider
const customOpenAI = openai({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed'
});

// Non-streaming generation
const { text } = await generateText({
  model: customOpenAI('nvidia/llama-3.1-8b-instruct'),
  prompt: 'Explain the benefits of renewable energy',
  temperature: 0.7,
  maxTokens: 200
});

console.log(text);

Migration Guide#

From Legacy Mode#

If you’re currently using legacy mode with separate endpoints:

  1. Update Configuration: Set openai_api_v1_path: /v1/chat/completions

  2. Update Client Code: Use single endpoint with stream parameter

  3. Test Thoroughly: Verify both streaming and non-streaming functionality

From OpenAI API#

If you’re migrating from OpenAI’s API:

  1. Update Base URL: Point to your NeMo Agent Toolkit server

  2. Update Model Names: Use your configured model identifiers

  3. Test Compatibility: Verify all features work as expected

Evaluation Endpoint#

You can also evaluate workflows via the NeMo Agent toolkit evaluate endpoint. For more information, refer to the NeMo Agent toolkit Evaluation Endpoint documentation.

Choosing between Streaming and Non-Streaming#

Use streaming if you need real-time updates or live communication where users expect immediate feedback. Use non-streaming if your workflow responds with simple updates and less feedback is needed.

NeMo Agent Toolkit API Server Interaction Guide#

A custom user interface can communicate with the API server using both HTTP requests and WebSocket connections. For details on proper WebSocket messaging integration, refer to the WebSocket Messaging Interface documentation.