Overview#
Cosmos Embed1 is an NVIDIA Inference Microservice (NIM) that generates joint video-text embeddings for short-form videos, enabling tasks such as text-to-video retrieval, semantic deduplication, zero-shot classification, and k-nearest-neighbors search. It exposes a standard HTTP/REST API (compatible with the OpenAI Embeddings API) and can be deployed either as a standalone downloadable container or as an NVCF (NVIDIA Cloud Functions) function.
It supports two deployment options:
Downloadable NIM: Pull and run the container locally or in your cloud environment.
NVCF Function: Deploy the same service as a managed cloud function via NVIDIA Cloud Functions (NVCF).
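Because the HTTP surface follows the OpenAI embeddings schema, a minimal client needs nothing beyond an HTTP library. The sketch below assumes a local deployment on port 8000; the model identifier, the readiness path, and the exact request fields for video inputs are assumptions to verify against this NIM's API reference (a text input is shown here).

```python
import requests

NIM_URL = "http://localhost:8000"  # assumed port for a local container

# Optionally confirm the service is ready before sending work. This
# readiness path is the common NIM convention; verify it for this NIM.
requests.get(f"{NIM_URL}/v1/health/ready", timeout=10).raise_for_status()

# Text inputs follow the familiar OpenAI embeddings request shape.
resp = requests.post(
    f"{NIM_URL}/v1/embeddings",
    json={
        "model": "nvidia/cosmos-embed1-224p",  # hypothetical model identifier
        "input": ["a drone flyover of a coastal city at sunset"],
    },
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # 256 for this release
```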
Advantages of NIMs#
Unified Video-Text Embedding: Generates both video and text embeddings in a single, shared vector space, simplifying downstream tasks like cross-modal retrieval, semantic search, and zero-shot classification without needing separate pipelines (see the retrieval sketch after this list).
Low-Latency Query Mode: Optimized “query” pipeline handles single or small batches of mixed text/video inputs with minimal overhead, delivering fast responses for interactive applications (e.g. chatbots, live search).
High-Throughput Bulk Mode: Dedicated bulk processor leverages multi-process GPU decoding (PyNVCodec) and CUDA-accelerated preprocessing (CV-CUDA) to sustain high embedding throughput on large video workloads. This is ideal for offline indexing and batch analytics.
Standards-Compliant HTTP API: Follows the OpenAI embeddings schema, so existing clients can integrate with minimal changes. Exposes familiar endpoints (/v1/embeddings, health checks, metadata) and interactive Swagger docs (/docs).
Seamless Deployment Options: Can be run as a self-hosted container or as an NVIDIA Cloud Function (NVCF), giving flexibility to deploy on-prem, at the edge, or in the cloud with identical behavior.
Automatic Error Recovery & Monitoring: Built-in health and readiness endpoints, standardized error codes, and integration with NVIDIA logging tools ensure that failures are detected and recovered without manual intervention.
GPU-Accelerated Preprocessing: End-to-end GPU pipelines (from video decoding through embedding) minimize CPU-GPU data transfers and avoid Python GIL bottlenecks, maximizing hardware utilization and reducing end-to-end latency.
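As a concrete illustration of what the shared embedding space buys you, the sketch below ranks a set of video embeddings against a text-query embedding by cosine similarity. The vectors here are random placeholders standing in for real /v1/embeddings responses; only the ranking logic is the point.

```python
import numpy as np

def cosine_rank(query_vec: np.ndarray, matrix: np.ndarray):
    """Rank rows of `matrix` by cosine similarity to `query_vec`."""
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    order = np.argsort(scores)[::-1]  # best match first
    return order, scores

# Placeholders standing in for embeddings returned by the service:
# one 256-d text-query vector and an (N, 256) matrix of video vectors
# produced offline with the bulk pipeline.
rng = np.random.default_rng(0)
text_query = rng.standard_normal(256)
video_index = rng.standard_normal((1000, 256))

order, scores = cosine_rank(text_query, video_index)
print("top-5 video indices:", order[:5])
```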
Model Variant and Embeddings#
This release uses the 224p variant of the Cosmos-Embed1 model and produces 256-dimensional output vectors.
Input videos in supported codecs and sizes are automatically resized by the NIM; no manual resizing is required.
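Because text and video land in the same 256-dimensional space, zero-shot classification reduces to embedding candidate labels as text and picking the one nearest the video embedding. A minimal sketch, again with placeholder vectors in place of real API responses:

```python
import numpy as np

LABELS = ["cooking", "sports", "driving", "animals"]

def classify(video_vec: np.ndarray, label_vecs: np.ndarray, labels: list[str]) -> str:
    """Zero-shot: return the label whose text embedding is most similar."""
    v = video_vec / np.linalg.norm(video_vec)
    lab = label_vecs / np.linalg.norm(label_vecs, axis=1, keepdims=True)
    return labels[int(np.argmax(lab @ v))]

# Placeholder 256-d vectors; in practice both come from /v1/embeddings.
rng = np.random.default_rng(1)
video_vec = rng.standard_normal(256)
label_vecs = rng.standard_normal((len(LABELS), 256))

print(classify(video_vec, label_vecs, LABELS))
```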
Who is this NIM for?#
Teams building video search, dataset curation/deduplication, media recommendation, safety/taxonomy classification, and retrieval-augmented generation workflows that rely on cross-modal similarity.
Developers who want a production-grade service with consistent APIs and deployment portability (on-prem, edge, cloud).
Why Use This NIM Instead of the Hugging Face Model?#
Production-grade inference: TensorRT-LLM optimized pipelines, GPU-accelerated decoding, and CV-CUDA preprocessing for low latency and high throughput.
Operational APIs: OpenAI Embeddings-compatible HTTP surface with health, readiness, and metrics endpoints out of the box.
Portability & support: Run the same container locally, on Kubernetes, or as a managed NVCF function, with consistent NVIDIA support.
Standards & stability: Versioned images and predictable behavior across environments, without per-host Python/driver inconsistencies.