Introduction

NeMo Curator on DGX Cloud provides a cloud-based, GPU-accelerated solution for curating video datasets for AI workflows. This curator service can automatically segment videos of various lengths into semantically consistent clips, generate embedding data, and create text prompts for each video.

You can provide your video datasets in two ways:

AWS S3: Link two AWS S3 buckets to the NeMo Curator on DGX Cloud service. The input bucket contains your original video dataset; the output bucket receives the curated dataset generated by the curator service.
ZIP File: Provide your video dataset as a ZIP upload. The NeMo Curator on DGX Cloud service will store the curated dataset on NVIDIA DGX Cloud.
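
For the ZIP option, the archive can be built with any standard tool. The following is a minimal sketch using Python's standard `zipfile` module. The function name `zip_videos`, the extension list, and the choice to preserve subdirectories are illustrative assumptions, not requirements documented by the service.

```python
import zipfile
from pathlib import Path

# Assumption: illustrative extensions only; check which formats the service accepts.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".mkv"}


def zip_videos(source_dir, archive_path):
    """Pack every video file under source_dir into a ZIP, keeping relative paths.

    Returns the list of archive member names that were written.
    """
    source = Path(source_dir)
    added = []
    # ZIP_STORED skips recompression, since video files are already compressed.
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for path in sorted(source.rglob("*")):
            if path.is_file() and path.suffix.lower() in VIDEO_EXTENSIONS:
                relative = path.relative_to(source).as_posix()
                archive.write(path, arcname=relative)
                added.append(relative)
    return added
```

Calling `zip_videos("my_videos", "dataset.zip")` (both paths hypothetical) would produce an archive that can then be supplied as the ZIP upload.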

You can use NeMo Curator on DGX Cloud in two ways:

UI: Provide your dataset and configure the curation process using the NGC WebUI.
API: Communicate with the NeMo Curator on DGX Cloud API to perform dataset curation programmatically.

Curation Pipeline Overview

This diagram provides a high-level outline of the video curation architecture. NeMo Curator offers a collection of pipelines that read video data and metadata from, and write them to, DGX Cloud or S3 storage. These pipelines use Ray for multi-node, multi-GPU scaling and stream the data through the pipeline efficiently. All computational stages are GPU-accelerated with NVIDIA libraries to maximize throughput.

The pipelines are also tuned so that each stage has an appropriate number of workers, which prevents any single stage from becoming a bottleneck. For example, in the splitting pipeline, the captioning stage is computationally intensive and has lower throughput than the other stages. To compensate, the autoscaling system automatically creates more workers for the captioning stage, raising its throughput to match the rest of the pipeline.
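
The balancing idea can be illustrated with a small calculation. The sketch below is not the service's autoscaler. It assumes hypothetical stage names and made-up per-worker throughputs, and it distributes a fixed worker budget by repeatedly giving one more worker to whichever stage is currently slowest.

```python
def allocate_workers(per_worker_throughput, total_workers):
    """Greedy allocation: give each extra worker to the current bottleneck stage."""
    if total_workers < len(per_worker_throughput):
        raise ValueError("need at least one worker per stage")
    # Every stage starts with one worker.
    workers = {stage: 1 for stage in per_worker_throughput}
    for _ in range(total_workers - len(workers)):
        # The bottleneck is the stage with the lowest aggregate throughput.
        bottleneck = min(workers, key=lambda s: workers[s] * per_worker_throughput[s])
        workers[bottleneck] += 1
    return workers


def pipeline_throughput(per_worker_throughput, workers):
    """A streaming pipeline runs only as fast as its slowest stage."""
    return min(workers[s] * per_worker_throughput[s] for s in workers)


# Hypothetical clips per second handled by a single worker of each stage.
rates = {"split": 8.0, "embed": 4.0, "caption": 1.0}

balanced = allocate_workers(rates, total_workers=12)
even = {stage: 4 for stage in rates}

print(balanced, pipeline_throughput(rates, balanced))
print(even, pipeline_throughput(rates, even))
```

With these made-up rates, the greedy plan assigns 8 of the 12 workers to captioning and reaches 8 clips per second, whereas an even split of 4 workers per stage is limited to 4 clips per second by the captioning stage.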