API Reference#

Endpoints Schema#

The following are endpoints for the NVIDIA NIM for Cosmos WFM (World Foundation Models):

  • /v1/infer

  • /v1/health/ready

  • /v1/health/live

  • /v1/license

  • /v1/metrics

  • /v1/metadata

  • /v1/manifest

Note

Cosmos3-Generator additionally exposes /v1/version and a machine-readable OpenAPI document at /openapi.json. For Cosmos3-Generator, the GET /v1/metadata response also includes a checkpoint field, whose value reflects the NIM_FT_CHECKPOINT override when the feature described in Bring your own checkpoint for Cosmos3-Generator is active.

Note

For Cosmos-Reason1 NIM API documentation, refer to the NVIDIA NIM for VLMs site.

API Examples#

Use the examples in this section to get started with the API.

Check Health#

Use the following command to check server health.

cURL Request

curl -X 'GET' 'http://0.0.0.0:8000/v1/health/ready'

Response

{
   "description":"Triton readiness check",
   "status":"ready"
}
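The readiness check above can also be polled from a script before inference requests are sent. The following is a minimal sketch, assuming the same host and port as the cURL example; the helper names, attempt count, and delay are illustrative and not part of the NIM API.

```python
import json
import time
import urllib.error
import urllib.request

# Assumption: the NIM listens on the same host and port as the cURL example.
READY_URL = "http://0.0.0.0:8000/v1/health/ready"


def is_ready(body: str) -> bool:
    """Return True when a readiness response body reports status "ready"."""
    try:
        return json.loads(body).get("status") == "ready"
    except (json.JSONDecodeError, AttributeError):
        return False


def fetch_body(url: str = READY_URL, timeout: float = 5.0) -> str:
    """GET the readiness endpoint; return an empty string if it is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except (urllib.error.URLError, OSError):
        return ""


def wait_until_ready(fetch=fetch_body, attempts: int = 30, delay: float = 2.0) -> bool:
    """Poll until the server reports ready or the attempts run out."""
    for _ in range(attempts):
        if is_ready(fetch()):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() with its defaults polls the endpoint up to 30 times, two seconds apart. The fetch parameter exists so the polling logic can be exercised without a running server.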

Generate Sample#

Use the following command to generate a video sample. The generation process can take several minutes, depending on the hardware used and the selected profile. For more information on performance characteristics, refer to the Supported Models section.

cURL Request

curl -X 'POST' \
'http://0.0.0.0:8000/v1/infer' \
   -H 'Accept: application/json' \
   -H 'Content-Type: application/json' \
   -d '{
         "prompt": "first person view from a camera in a car driving down a two lane neighborhood street, viewed from the dashcam as we drive down the street. The camera faces forward. There are nice houses and sidewalks in this suburban area with green grass front yards and flower gardens and large oak trees. It is a rainy day and there are grey clouds overhead. The road has puddles on it, reflecting the sky overhead. The windshield wipers flash by.",
         "negative_prompt": "blurry, low quality, artifacts, people",
         "prompt_upsampling": true,
         "seed": 4,
         "guidance_scale": 7.5,
         "steps": 50,
         "video_params": {
            "height": 704,
            "width": 1280,
            "frames_count": 121,
            "frames_per_sec": 24
         }
   }'

Response

The server will respond with the following:

{
   "b64_video": "<base64EncodedVideoString>",
   "upsampled_prompt": "first person view from a camera in a car driving down a two lane neighborhood street, viewed from the dashcam as we drive down the street. The camera faces forward. There are nice houses and sidewalks in this suburban area with green grass front yards and flower gardens and large oak trees. It is a rainy day and there are grey clouds overhead. The road has puddles on it, reflecting the sky overhead. The windshield wipers flash by.",
   "seed": 4
}
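Because the video is returned as a base64 string in the b64_video field, a client has to decode it before it can be played. The following is a minimal sketch of that step; the function name and output path are hypothetical, and the container format of the decoded bytes is not stated here, so the file extension is the caller's choice.

```python
import base64
from pathlib import Path


def save_video(response: dict, path: str) -> int:
    """Decode the b64_video field and write it to path; return the bytes written."""
    video_bytes = base64.b64decode(response["b64_video"])
    Path(path).write_bytes(video_bytes)
    return len(video_bytes)
```

For example, after parsing the JSON response into a dictionary named result, save_video(result, "sample.mp4") writes the decoded bytes to sample.mp4.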

cURL Request

curl -X POST \
http://0.0.0.0:8000/v1/infer \
   -H 'Content-Type: application/json' \
   -d '{
      "prompt": "The video is a wide shot of a large industrial facility, likely a chemical plant or factory, situated in a rural or semi-industrial area. The scene is set during a partly cloudy day, with the sky showing patches of blue and white clouds. The facility is surrounded by a vast expanse of green fields, indicating its location in a countryside or suburban area. The factory itself is a large, rectangular building with a flat roof, constructed from concrete and metal. It features several large cylindrical tanks and pipes, suggesting the processing of chemicals or liquids. The tanks are arranged in a linear fashion along the side of the building, and there are several smaller structures and equipment scattered around the premises. The camera remains static throughout the video, capturing the entire facility from a distance, allowing viewers to observe the layout and scale of the operations. The lighting is natural, with sunlight casting shadows on the ground, enhancing the details of the industrial setup. There are no visible human activities or movements, indicating that the video might be a documentary or an informational piece about industrial processes.",
      "negative_prompt": "blurry, low quality, artifacts, people",
      "image": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/industry_01_prompt.jpg",
      "seed": 42,
      "guidance_scale": 7.5,
      "steps": 35,
      "video_params": {
         "height": 704,
         "width": 1280,
         "frames_count": 121,
         "frames_per_sec": 24
      }
   }'

The image field must be either a URL that points to the image or a base64-encoded image. If the NIM_ALLOW_URL_INPUT environment variable is set to 0, the image field does not accept URLs and a base64-encoded image must be provided.

Response

The server will respond with the following:

{
   "b64_video": "<base64EncodedVideoString>",
   "seed": 42
}
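When URL input is disabled, the image has to be inlined in the request body as base64. The following sketch shows one way to build such a body from a local file; the function names and the default seed are illustrative, and only the prompt, image, and seed fields from the example request are used.

```python
import base64
import json
from pathlib import Path


def encode_file_b64(path: str) -> str:
    """Read a local file and return its contents as a base64 ASCII string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def build_image_request(prompt: str, image_path: str, seed: int = 42) -> str:
    """Build a JSON request body with the image inlined as base64."""
    payload = {
        "prompt": prompt,
        "image": encode_file_b64(image_path),  # base64 string instead of a URL
        "seed": seed,
    }
    return json.dumps(payload)
```

The returned string can be sent as the body of a POST to /v1/infer with the Content-Type: application/json header, in place of the -d argument in the cURL example. The same encoding helper applies to the video field.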

For Video2World, the video field is required instead of image.

cURL Request

curl -X POST \
http://0.0.0.0:8000/v1/infer \
   -H 'Content-Type: application/json' \
   -d '{
      "prompt": "A first person view from the perspective from a human sized robot as it works in a chemical plant. The robot has many boxes and supplies nearby on the industrial shelves. The camera on moving forward, at a height of 1m above the floor. Photorealistic",
      "video": "https://assets.ngc.nvidia.com/products/api-catalog/cosmos/ar_result_default_robot.mp4",
      "seed": 42,
      "guidance_scale": 7.5,
      "steps": 35,
      "video_params": {
         "height": 704,
         "width": 1280,
         "frames_count": 121,
         "frames_per_sec": 24
      }
   }'

The video field must be either a URL that points to the video or a base64-encoded video. If the NIM_ALLOW_URL_INPUT environment variable is set to 0, the video field does not accept URLs and a base64-encoded video must be provided.

Response

The server will respond with the following:

{
   "b64_video": "<base64EncodedVideoString>",
   "seed": 42
}

cURL Request

curl -H 'Content-Type: application/json' -X POST http://0.0.0.0:8000/v1/infer -d '{
   "prompt": "Two robotic arms manipulate blue fabric on a yellow cushion in a neutral lab setting.",
   "video": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/robot_input.mp4",
   "resolution": "480",
   "edge": {
      "control_weight": 1.0,
      "control": "https://raw.githubusercontent.com/abhinavg4/cosmos-transfer2.5/main/assets_nim/low/edge/robot_edge.mp4"
   }
}'

The video field should be a URL to the video location or a base64-encoded video. At least one control field (edge, depth, vis, or seg) must be provided.

Response

The server will respond with the following:

{
   "b64_video": "<base64EncodedVideoString>",
   "seed": 42
}
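The control-field requirement can be checked on the client before a long-running request is submitted. The following is a minimal sketch of such a check, using the four field names listed above; the function names are hypothetical and the server's own validation may differ.

```python
CONTROL_FIELDS = ("edge", "depth", "vis", "seg")


def control_fields_present(payload: dict) -> list:
    """Return the names of the control fields that are set in the payload."""
    return [name for name in CONTROL_FIELDS if payload.get(name)]


def require_control(payload: dict) -> None:
    """Raise ValueError when no control field is provided."""
    if not control_fields_present(payload):
        raise ValueError(
            "at least one of edge, depth, vis, or seg must be provided"
        )
```

For the example request above, control_fields_present returns ["edge"], so require_control passes without raising.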

Cosmos3-Generator infers the generation mode automatically from the request fields: non-empty prompt (no image) → TEXT2VIDEO; image provided → IMAGE2VIDEO.
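The mode rule above can be mirrored on the client, for example to log which mode a request is expected to trigger. This is a sketch of the stated rule only, not the server's implementation; what the server does when neither field is set is not described here, so the sketch returns None in that case as a placeholder.

```python
from typing import Optional


def infer_mode(payload: dict) -> Optional[str]:
    """Mirror the documented rule: an image selects IMAGE2VIDEO, else a prompt selects TEXT2VIDEO."""
    if payload.get("image"):
        return "IMAGE2VIDEO"
    if payload.get("prompt"):
        return "TEXT2VIDEO"
    return None  # Assumption: behavior for this case is not documented.
```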

cURL Request

curl -X POST 'http://0.0.0.0:8000/v1/infer' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "prompt": "The robot walks forward and picks up the box on the shelf.",
        "seed": 42,
        "guidance_scale": 6.0,
        "steps": 35,
        "resolution": "480_16_9",
        "num_output_frames": 121,
        "fps": 24.0
    }'

Response

{
   "b64_video": "<base64EncodedVideoString>"
}

For IMAGE2VIDEO (I2V), the image field is required and the prompt field is optional.

cURL Request

curl -X POST 'http://0.0.0.0:8000/v1/infer' \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "prompt": "The robot walks forward and picks up the box on the shelf.",
        "image": "https://example.com/first_frame.jpg",
        "seed": 42
    }'

The image field accepts raw base64, a data URI (data:image/png;base64,...), or a public URL. URLs are disabled when NIM_ALLOW_URL_INPUT=0.

Response

{
   "b64_video": "<base64EncodedVideoString>"
}

See Sampling Control (Cosmos3-Generator tab) for the full request schema.

Error Handling#

The API returns standard HTTP status codes to indicate success or failure:

  • 200 OK: Request successful

  • 400 Bad Request: Invalid input parameters

  • 500 Internal Server Error: Server-side error

Common error scenarios include the following:

  • Invalid input dimensions (height/width must be multiples of 8)

  • Malformed JSON in the request body
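Some of these errors can be caught before a request is sent. The following sketch checks the multiple-of-8 rule for height and width and maps the listed status codes to their meanings; the function names are hypothetical, and rejecting non-positive values is an added assumption rather than a documented rule.

```python
# Meanings copied from the status code list above.
STATUS_MEANINGS = {
    200: "Request successful",
    400: "Invalid input parameters",
    500: "Server-side error",
}


def dimension_errors(height: int, width: int) -> list:
    """Return one message per dimension that is not a positive multiple of 8."""
    errors = []
    for name, value in (("height", height), ("width", width)):
        # Assumption: non-positive values are treated as invalid as well.
        if value <= 0 or value % 8 != 0:
            errors.append(f"{name}={value} is not a positive multiple of 8")
    return errors


def describe_status(code: int) -> str:
    """Return the documented meaning of a status code, if it is listed."""
    return STATUS_MEANINGS.get(code, "Status not listed in this reference")
```

The 704 by 1280 shape used in the examples passes this check, while a height of 700 does not.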

Tip

Refer to the troubleshooting page for additional steps to debug errors.

Reference#

Note

A running Cosmos3-Generator NIM also exposes the live OpenAPI 3.1 spec at GET /openapi.json; use that endpoint for the always-fresh schema between doc releases.

Output resolution shapes for Cosmos3-Generator#

The resolution request field on Cosmos3-Generator accepts a tier prefix (256, 480, 720) plus an optional aspect-ratio suffix. Bare tier keys are aliases for <tier>_16_9; the default is "720" (≡ "720_16_9" → 1280 × 720).

Output shapes as W × H (request key → pixels):

| Aspect suffix | 256_* | 480_* | 720_* |
| --- | --- | --- | --- |
| _16_9 (landscape, default for bare key) | 320 × 192 | 832 × 480 | 1280 × 720 |
| _1_1 (square) | 256 × 256 | 640 × 640 | 960 × 960 |
| _9_16 (portrait) | 192 × 320 | 480 × 832 | 720 × 1280 |
| _4_3 (landscape 4:3) | 320 × 256 | 736 × 544 | 1104 × 832 |
| _3_4 (portrait 3:4) | 256 × 320 | 544 × 736 | 832 × 1104 |
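The key-to-shape mapping above can be expressed as a small lookup, which is useful for predicting the output size of a request. The following is a sketch that copies the values from the table; the dictionary and function names are hypothetical, and the authoritative mapping is the one the running NIM applies.

```python
# (width, height) per tier and aspect suffix, copied from the table above.
SHAPES = {
    "256": {"16_9": (320, 192), "1_1": (256, 256), "9_16": (192, 320),
            "4_3": (320, 256), "3_4": (256, 320)},
    "480": {"16_9": (832, 480), "1_1": (640, 640), "9_16": (480, 832),
            "4_3": (736, 544), "3_4": (544, 736)},
    "720": {"16_9": (1280, 720), "1_1": (960, 960), "9_16": (720, 1280),
            "4_3": (1104, 832), "3_4": (832, 1104)},
}


def resolve_resolution(key: str = "720") -> tuple:
    """Map a resolution key such as "480_9_16" or a bare tier to (width, height)."""
    tier, _, aspect = key.partition("_")
    aspect = aspect or "16_9"  # a bare tier key is an alias for <tier>_16_9
    try:
        return SHAPES[tier][aspect]
    except KeyError:
        raise ValueError(f"unknown resolution key: {key!r}") from None
```

For example, resolve_resolution("480_16_9") and resolve_resolution("480") both return (832, 480).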

Per-tier maximum num_output_frames:

| Tier | Maximum num_output_frames |
| --- | --- |
| 256_* | 397 |
| 480_* | 297 |
| 720_* | 197 |

Pixel shapes are taken from the Cosmos3 canonical resolution table.
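The per-tier frame limits can be checked on the client in the same spirit. The following is a sketch that copies the three maxima listed above; the names are hypothetical, and the lower bound of one frame is an assumption, since no minimum is given here.

```python
# Maximum num_output_frames per tier, copied from the list above.
MAX_OUTPUT_FRAMES = {"256": 397, "480": 297, "720": 197}


def max_output_frames(resolution: str = "720") -> int:
    """Return the frame limit for a resolution key such as "480_16_9"."""
    tier = resolution.split("_", 1)[0]
    if tier not in MAX_OUTPUT_FRAMES:
        raise ValueError(f"unknown resolution tier: {tier!r}")
    return MAX_OUTPUT_FRAMES[tier]


def frames_within_limit(resolution: str, num_output_frames: int) -> bool:
    """Check a requested frame count against the tier maximum."""
    # Assumption: at least one frame must be requested.
    return 1 <= num_output_frames <= max_output_frames(resolution)
```

The 121-frame request in the Cosmos3-Generator example is within the limit for every tier.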