Overview#

Cosmos Embed1 is an NVIDIA Inference Microservice (NIM) that generates joint video-text embeddings for short-form videos, enabling tasks such as text-to-video retrieval, semantic deduplication, zero-shot classification, and k-nearest-neighbors search. It exposes a standard HTTP/REST API compatible with the OpenAI Embeddings API and can be deployed either as a standalone downloadable container or as an NVIDIA Cloud Functions (NVCF) function.
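
As a quick illustration of the API surface, the sketch below embeds a text query against a locally running instance over the `/v1/embeddings` endpoint described later on this page. The host, port, and `model` identifier are placeholder assumptions, and the exact encoding for video inputs is defined in the API reference rather than here.

```python
import requests

# Hypothetical local deployment of the Cosmos Embed1 NIM.
NIM_URL = "http://localhost:8000/v1/embeddings"

payload = {
    "model": "nvidia/cosmos-embed1-224p",  # assumed model identifier
    "input": ["a drone flyover of a coastal city at sunset"],
}

resp = requests.post(NIM_URL, json=payload, timeout=30)
resp.raise_for_status()

# OpenAI embeddings schema: each item in "data" carries one vector.
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # 256 for this release (see "Model Variant and Embeddings")
```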

It supports two deployment options:

  • Downloadable NIM: Pull and run the container locally or in your cloud environment.

  • NVCF Function: Deploy the same service as a managed cloud function via NVIDIA Cloud Functions.

Advantages of NIMs#

  • Unified Video-Text Embedding: Generates both video and text embeddings in a single, shared vector space, simplifying downstream tasks like cross-modal retrieval, semantic search, and zero-shot classification without the need for separate pipelines (see the similarity sketch after this list).

  • Low-Latency Query Mode: Optimized “query” pipeline handles single or small batches of mixed text/video inputs with minimal overhead, delivering fast responses for interactive applications (e.g., chatbots, live search).

  • High-Throughput Bulk Mode: Dedicated bulk processor leverages multi-process GPU decoding (PyNVCodec) and CUDA-accelerated preprocessing (CV-CUDA) to sustain high embedding throughput on large video workloads. This is ideal for offline indexing and batch analytics.

  • Standards-Compliant HTTP API: Follows the OpenAI embeddings schema, so existing clients can integrate with minimal changes (see the client sketch after this list). Exposes familiar endpoints (/v1/embeddings, health checks, metadata) and interactive Swagger docs (/docs).

  • Seamless Deployment Options: Can be run as a self-hosted container or as an NVIDIA Cloud Function (NVCF), giving flexibility to deploy on-prem, at the edge, or in the cloud with identical behavior.

  • Automatic Error Recovery & Monitoring: Built-in health and readiness endpoints, standardized error codes, and integration with NVIDIA logging tools ensure that failures are detected and recovered without manual intervention (see the health-check sketch after this list).

  • GPU-Accelerated Preprocessing: End-to-end GPU pipelines (from video decoding through embedding) minimize CPU-GPU data transfers and avoid Python GIL bottlenecks, maximizing hardware utilization and reducing end-to-end latency.
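
To make the shared-embedding-space advantage concrete, here is a minimal retrieval sketch: given one text embedding and several video embeddings returned by the service, rank the videos by cosine similarity. The random vectors below are stand-ins for real `/v1/embeddings` responses.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings returned by the NIM (256-dim in this release).
rng = np.random.default_rng(0)
text_embedding = rng.normal(size=256)
video_embeddings = {f"video_{i}.mp4": rng.normal(size=256) for i in range(4)}

# Because text and video live in one vector space, a single similarity
# ranking implements text-to-video retrieval.
ranked = sorted(
    video_embeddings.items(),
    key=lambda kv: cosine_similarity(text_embedding, kv[1]),
    reverse=True,
)
for name, vec in ranked:
    print(name, round(cosine_similarity(text_embedding, vec), 3))
```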
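
Because the surface follows the OpenAI embeddings schema, the stock `openai` Python client can also be pointed at the service by overriding `base_url`. This is a sketch with a placeholder host, API key, and model name; a self-hosted NIM may not validate the key at all.

```python
from openai import OpenAI

# Placeholder base URL and key for a self-hosted deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.embeddings.create(
    model="nvidia/cosmos-embed1-224p",  # assumed model identifier
    input=["two dogs playing fetch in a park"],
)
print(len(response.data[0].embedding))
```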
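
For monitoring, an orchestrator or load balancer can poll the health and readiness endpoints before routing traffic. The paths below follow the common NIM convention and should be confirmed against this service's interactive docs at /docs.

```python
import requests

BASE = "http://localhost:8000"  # placeholder host for a local deployment

# Commonly exposed NIM health endpoints (verify in /docs for this service).
for path in ("/v1/health/live", "/v1/health/ready"):
    r = requests.get(BASE + path, timeout=5)
    print(path, r.status_code)  # 200 indicates live/ready
```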

Model Variant and Embeddings#

  • This release uses the 224p variant of the Cosmos-Embed1 model and produces 256-dimensional output vectors.

  • Input videos in supported codecs and sizes are automatically resized by the NIM; no manual resizing is required.
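
One practical consequence of the fixed 256-dimensional output is that any downstream vector index must be created with a matching dimension. Below is a sketch using FAISS, one of many possible stores, for the k-nearest-neighbors search mentioned in the overview; the corpus here is random stand-in data.

```python
import faiss  # pip install faiss-cpu
import numpy as np

DIM = 256  # output dimension of the 224p Cosmos-Embed1 variant

# Stand-ins for video embeddings produced by the NIM.
rng = np.random.default_rng(0)
video_vectors = rng.normal(size=(1000, DIM)).astype("float32")
faiss.normalize_L2(video_vectors)  # cosine similarity via inner product

index = faiss.IndexFlatIP(DIM)
index.add(video_vectors)

# Query with a normalized text embedding to find the top-5 nearest videos.
query = rng.normal(size=(1, DIM)).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)
print(ids[0], scores[0])
```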

Who is this NIM for?#

  • Teams building video search, dataset curation/deduplication, media recommendation, safety/taxonomy classification, and retrieval-augmented generation workflows that rely on cross-modal similarity.

  • Developers who want a production-grade service with consistent APIs and deployment portability (on-prem, edge, cloud).

Why Use This NIM Instead of the Hugging Face Model?#

  • Production-grade inference: TensorRT-LLM optimized pipelines, GPU-accelerated decoding, and CV-CUDA preprocessing for low latency and high throughput.

  • Operational APIs: OpenAI Embeddings-compatible HTTP surface with health, readiness, and metrics endpoints out of the box.

  • Portability & support: Run the same container locally, on Kubernetes, or as a managed NVCF function, backed by NVIDIA support.

  • Standards & stability: Versioned images and predictable behavior across environments, without per-host Python/driver inconsistencies.