# API Reference
For the full OpenAPI 3.1 schema, you can access the interactive documentation or the raw JSON file:
- Interactive Docs: `http://<host>:8000/docs`
- OpenAPI JSON: `http://<host>:8000/openapi.json`
## Endpoints
### POST /v1/embeddings
Generates embedding vectors for text or video inputs. This is the primary inference endpoint.
#### Request Body

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `input` | string or array[string] | Yes | | The text or video input(s) to embed. See Input String Formats below. |
| `request_type` | string | Yes | | Specifies the processing mode, which can be `query`, `bulk_text`, or `bulk_video`. |
| `model` | string | Yes | | The ID of the embedding model to use, which currently must be `nvidia/cosmos-embed1`. |
| `encoding_format` | string | No | `float` | The format for the returned embeddings (for example, `float`). |
#### Input String Formats

The `input` field accepts the following formats:

- **Plain Text:** `"A sentence to be embedded."`
- **Base64-Encoded Video:** `"data:video/mp4;base64,<base64-encoded-video-data>"` (`query` mode only)
- **Presigned URL for Video:** `"data:video/mp4;presigned_url,https://your-url/video.mp4"` (for `bulk_video` and `query` modes)
- **Video Frames (Presigned URLs):** `"data:video_frames/jpg;presigned_url,{<frame_url_0>,<frame_url_1>,<frame_url_2>,<frame_url_3>,<frame_url_4>,<frame_url_5>,<frame_url_6>,<frame_url_7>}"` (for `bulk_video` and `query` modes)
- **Video Frames (Base64):** `"data:video_frames/jpg;base64,{<frame_0_b64>,<frame_1_b64>,<frame_2_b64>,<frame_3_b64>,<frame_4_b64>,<frame_5_b64>,<frame_6_b64>,<frame_7_b64>}"` (for `bulk_video` and `query` modes)
#### Notes and constraints

- `query` supports a single input item (text, video, or video_frames).
- `bulk_text` supports up to 64 plain-text strings per request.
- `bulk_video` supports up to 64 items, where each item must be one of the following: `data:video/<type>;presigned_url,<url>`, `data:video_frames/<type>;presigned_url,{...}`, or `data:video_frames/<type>;base64,{...}`.
- `video_frames` requires exactly 8 frames per item.
- Maximum inputs per request: 64 items for `bulk_text` and `bulk_video` modes.
- Recommended video duration: 15 seconds.
- Maximum recommended video duration: 1-2 minutes; no strict maximum is enforced by the NIM.
- Supported codecs depend on the runtime decode stack (PyNvVideoCodec 2.0.3 / NVDEC) and the host GPU/driver; common codecs include H.264, HEVC, AV1, VP8, VP9, VC1, MPEG4, MPEG2, and MPEG1.
- Base64-encoded full videos are accepted only in `query` mode.
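Building the longer input strings by hand is error-prone, so it can help to assemble them programmatically. The sketch below uses only the Python standard library; the helper names (`video_b64_input`, `video_frames_input`) are illustrative and not part of the API:

```python
import base64


def video_b64_input(video_bytes: bytes, video_type: str = "mp4") -> str:
    """Build a base64 full-video input string (accepted in query mode only)."""
    encoded = base64.b64encode(video_bytes).decode("ascii")
    return f"data:video/{video_type};base64,{encoded}"


def video_frames_input(frame_urls: list[str], image_type: str = "jpg") -> str:
    """Build a presigned-URL video_frames input string."""
    # The service requires exactly 8 frames per video_frames item.
    if len(frame_urls) != 8:
        raise ValueError("video_frames requires exactly 8 frames per item")
    joined = ",".join(frame_urls)
    return f"data:video_frames/{image_type};presigned_url,{{{joined}}}"
```

Either helper's output can be placed directly in the `input` field of a `query` or `bulk_video` request.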
#### Text Query Payload

```json
{
  "input": "A fluffy white cat basking in the sun.",
  "model": "nvidia/cosmos-embed1",
  "request_type": "query"
}
```
#### Bulk Text Payload

```json
{
  "input": [
    "This is the first sentence.",
    "Here is a second one for batch processing."
  ],
  "model": "nvidia/cosmos-embed1",
  "request_type": "bulk_text"
}
```
#### Bulk Video Payload

```json
{
  "input": [
    "data:video/webm;presigned_url,https://upload.wikimedia.org/wikipedia/commons/3/3d/Branko_Paukovic%2C_javelin_throw.webm",
    "data:video/webm;presigned_url,https://upload.wikimedia.org/wikipedia/commons/3/3d/Branko_Paukovic%2C_javelin_throw.webm"
  ],
  "model": "nvidia/cosmos-embed1",
  "request_type": "bulk_video"
}
```
#### Video Frames Payload (Query)

```json
{
  "input": "data:video_frames/jpg;presigned_url,{<frame_url_0>,<frame_url_1>,<frame_url_2>,<frame_url_3>,<frame_url_4>,<frame_url_5>,<frame_url_6>,<frame_url_7>}",
  "model": "nvidia/cosmos-embed1",
  "request_type": "query"
}
```
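Because the payloads above are plain JSON, any HTTP client can send them. As a minimal sketch using only the Python standard library, assuming the NIM is reachable at `http://localhost:8000` (substitute your host and port); the `embed` helper name is illustrative:

```python
import json
import urllib.request


def embed(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST a payload to /v1/embeddings and return the parsed JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = {
    "input": "A fluffy white cat basking in the sun.",
    "model": "nvidia/cosmos-embed1",
    "request_type": "query",
}
# result = embed(payload)  # requires a running NIM at the base URL
```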
#### Response Body

The response is a JSON object containing the generated embeddings and usage statistics.

| Parameter | Type | Description |
|---|---|---|
| `object` | string | The type of object, always `list`. |
| `data` | array[object] | An array of embedding objects. |
| `model` | string | The model used for the request (e.g. `nvidia/cosmos-embed1`). |
| `usage` | object | An object containing token and video counts. |
#### Embedding Object

| Parameter | Type | Description |
|---|---|---|
| `object` | string | The type of object, always `embedding`. |
| `index` | integer | The index of this embedding in the `data` array. |
| `embedding` | array[float] | The embedding vector. |
#### Usage Object

| Parameter | Type | Description |
|---|---|---|
| `num_videos` | integer | Number of videos processed in the request. |
| `prompt_tokens` | integer | Number of text tokens in the prompt. |
| `total_tokens` | integer | Total tokens processed. |
#### Example Response

```json
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.0123456789,
        -0.0987654321
      ]
    }
  ],
  "model": "nvidia/cosmos-embed1",
  "usage": {
    "num_videos": 0,
    "prompt_tokens": 10,
    "total_tokens": 10
  }
}
```
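Once a response is parsed, a common first step is comparing the returned vectors, for example scoring a query embedding against video embeddings with cosine similarity. A short sketch (the helper names are illustrative, not part of the API):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def extract_embeddings(response: dict) -> list[list[float]]:
    """Pull embedding vectors out of a /v1/embeddings response, ordered by index."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```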
### GET /v1/health/ready
Checks if the service is ready to accept inference requests.
#### Example Response

```json
{
  "object": "health.response",
  "message": "NIM Service is ready"
}
```
### GET /v1/health/live
Checks if the service is running (live). It may not yet be ready for inference.
#### Example Response

```json
{
  "object": "health.response",
  "message": "NIM Service is live"
}
```
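Model load can take a while, so clients typically poll the readiness endpoint before sending inference traffic. A minimal polling sketch, assuming the NIM is at `http://localhost:8000` and that readiness is signaled by an HTTP 200; the `wait_until_ready` name is illustrative:

```python
import time
import urllib.request


def wait_until_ready(base_url: str = "http://localhost:8000",
                     timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Poll /v1/health/ready until the service is ready or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/health/ready",
                                        timeout=5) as response:
                if response.status == 200:
                    return True
        except OSError:
            pass  # service not reachable yet; keep polling
        time.sleep(interval_s)
    return False
```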
### GET /health/metrics
Returns a JSON snapshot of service metrics (requests, latency percentiles, throughput, errors, and business counters). Latency values are in seconds.
This endpoint complements the Prometheus-compatible metrics at `GET /v1/metrics`.
#### Example Response

```json
{
  "service": {
    "uptime_seconds": 12.3,
    "start_time": 1730000000.0
  },
  "requests": {
    "total": 10,
    "success": 9,
    "error": 1,
    "success_rate_percent": 90.0,
    "error_rate_percent": 10.0,
    "in_flight": 0,
    "in_flight_by_type": {
      "query": 0,
      "bulk_text": 0,
      "bulk_video": 0
    }
  },
  "requests_by_type": {
    "success": {
      "query": 8,
      "bulk_video": 1
    },
    "error": {
      "bulk_video": 1
    }
  },
  "requests_by_input_type": {
    "total": {
      "text": 8,
      "video_presigned_url": 2
    },
    "success": {
      "text": 8,
      "video_presigned_url": 1
    },
    "error": {
      "video_presigned_url": 1
    }
  },
  "encoding_format_distribution": {
    "float": 10
  },
  "status_codes": {
    "200": 9,
    "400": 1
  },
  "errors_by_classification": {
    "download_error": 1
  },
  "latency": {
    "p50": 0.12,
    "p95": 0.30,
    "p99": 0.35,
    "p99.9": 0.35,
    "min": 0.05,
    "max": 0.40,
    "avg": 0.15
  },
  "latency_by_type": {
    "query": {
      "p50": 0.10,
      "p95": 0.20,
      "p99": 0.25,
      "avg": 0.12,
      "count": 8
    },
    "bulk_video": {
      "p50": 0.40,
      "p95": 0.40,
      "p99": 0.40,
      "avg": 0.40,
      "count": 1
    }
  },
  "latency_by_input_type": {
    "text": {
      "p50": 0.10,
      "p95": 0.20,
      "p99": 0.25,
      "avg": 0.12,
      "count": 8
    },
    "video_presigned_url": {
      "p50": 0.40,
      "p95": 0.40,
      "p99": 0.40,
      "avg": 0.40,
      "count": 1
    }
  },
  "error_latency": {
    "p50": 0.08,
    "p95": 0.08,
    "p99": 0.08,
    "min": 0.08,
    "max": 0.08,
    "avg": 0.08,
    "count": 1
  },
  "error_latency_by_type": {
    "bulk_video": {
      "p50": 0.08,
      "p95": 0.08,
      "p99": 0.08,
      "avg": 0.08,
      "count": 1
    }
  },
  "error_latency_by_input_type": {
    "video_presigned_url": {
      "p50": 0.08,
      "p95": 0.08,
      "p99": 0.08,
      "avg": 0.08,
      "count": 1
    }
  },
  "throughput": {
    "requests_per_minute": {
      "1min": 60.0,
      "5min": 12.0,
      "15min": 4.0
    }
  },
  "business_metrics": {
    "total_embeddings": 10,
    "total_tokens": 100,
    "total_videos": 2,
    "total_video_frames": 16,
    "failed_inputs_total": 0
  },
  "retries": {
    "total": 1,
    "success": 0,
    "failure": 1,
    "success_rate_percent": 0.0,
    "last_retry_time": 1730000005.0
  }
}
```
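Because the snapshot is plain JSON, it is easy to condense into a few headline numbers for dashboards or smoke tests. A small sketch over the fields shown above; the `summarize_metrics` name is illustrative:

```python
def summarize_metrics(snapshot: dict) -> dict:
    """Condense a /health/metrics snapshot into a few headline numbers."""
    requests = snapshot.get("requests", {})
    latency = snapshot.get("latency", {})
    return {
        "total_requests": requests.get("total", 0),
        "error_rate_percent": requests.get("error_rate_percent", 0.0),
        "p95_latency_s": latency.get("p95"),  # latency values are in seconds
        "in_flight": requests.get("in_flight", 0),
    }


# Trimmed-down version of the example snapshot above.
snapshot = {
    "requests": {"total": 10, "error_rate_percent": 10.0, "in_flight": 0},
    "latency": {"p95": 0.30},
}
summary = summarize_metrics(snapshot)
```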
### GET /v1/metadata
Provides metadata about the NIM container, including version and model information.
### GET /v1/manifest
Returns the NIM manifest file content.
### GET /v1/license
Returns the license information for the NIM.
### GET /v1/metrics
Returns Prometheus-compatible metrics for monitoring.