Overview#
Semantic Split C-RADIO is a Computer Vision (CV) NIM designed to process one or more videos and extract scenes of interest, providing semantically distinct data that can be leveraged to train video models.
The Semantic Split C-RADIO NIM provides GPU-accelerated capabilities to detect scene transitions and filter out still and duplicate scenes. It can decode various input formats and re-encode them into a homogeneous format.
The Semantic Split C-RADIO NIM follows the algorithmic design from the Panda-70M paper. It leverages the following:
PyNvVideoCodec for accelerated decoding and encoding capabilities and CV-CUDA for video pre-processing
A GPU-accelerated version of PySceneDetect for initial detection of scenes (splitting)
The commercial RADIO model for semantic understanding (frame embeddings), to perform scene stitching and filtering (filtering of scene transitions, still and redundant scenes)
FFmpeg to read the video metadata and to perform audio muxing and file splitting (no decoding/encoding)
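The scene stitching step listed above can be illustrated with a minimal sketch. It is not the NIM's implementation: it only assumes that each detected scene carries an embedding for its first and last frame, and that adjacent scenes are merged when those boundary embeddings are similar enough. The `Scene` class, the `stitch` function, and the `0.9` threshold are hypothetical names and values chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Scene:
    """Hypothetical container for one detected scene (not a NIM type)."""
    start: int                  # index of the first frame (inclusive)
    end: int                    # index of the last frame (inclusive)
    first_emb: Sequence[float]  # embedding of the first frame
    last_emb: Sequence[float]   # embedding of the last frame


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def stitch(scenes: List[Scene], threshold: float = 0.9) -> List[Scene]:
    """Merge adjacent scenes whose boundary embeddings are similar.

    The threshold is an assumed value, not one documented for the NIM.
    """
    merged: List[Scene] = []
    for scene in scenes:
        if merged and cosine(merged[-1].last_emb, scene.first_emb) >= threshold:
            prev = merged[-1]
            # Extend the previous scene to cover the current one.
            merged[-1] = Scene(prev.start, scene.end, prev.first_emb, scene.last_emb)
        else:
            merged.append(scene)
    return merged
```

In this sketch, a detector that splits one continuous shot into two pieces would see them rejoined, while a genuine cut (dissimilar boundary embeddings) stays split.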
Advantages of NIMs#
NIMs offer a simple and easy-to-deploy route for self-hosted AI applications. Two major advantages that NIMs offer for system administrators and developers are the following:
Increased productivity — NIMs allow developers to build generative AI applications quickly, in minutes rather than weeks, by providing a standardized way to add AI capabilities to their applications.
Simplified deployment — NIMs provide containers that can be easily deployed on various platforms, including clouds, data centers, or workstations, making it convenient for developers to test, deploy and scale their applications.
This NIM provides a fast, efficient set of methods behind a consistent API for processing large volumes of video data into semantically distinct video clips that can be leveraged to train video models.
Limitations of Early Access Release#
The following are the limitations of this Early Access release:
Limited to only certain hardware
Tested only on hardware that has NvDecoder and NvEncoder. The current configuration is optimized for L40S and L20 GPUs.
Limited output format support
Only H.264 MP4 is supported. The resolution must be the same as the input.
Limited input format support
See Supported Formats.
Limited in reporting current progress
Only one call to the video splitting endpoint is possible at a time.
All requests are synchronous blocking requests.
No partial or streaming responses are returned. The client must wait until the full request is processed.
Limited load balancing
For optimal performance and hardware utilization, each video split request should meet the following guidelines:
The number of input paths should be a multiple of 6, with recommended sizes of [36, 42, ..., 126].
Videos should be of roughly the same overall length.
Requests that do not conform to the above guidelines should still be processed correctly, provided all other requirements are met, but may deliver sub-optimal performance.
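The batch-size guideline above can be checked on the client side before a request is sent. The following is a minimal sketch: the helper names `check_batch` and `chunk_paths` are hypothetical, and the recommended sizes are assumed to be every multiple of 6 from 36 through 126, which is one reading of `[36, 42, ..., 126]`.

```python
from typing import List, Sequence, Tuple

# Assumption: the recommended sizes are 36, 42, ..., 126 in steps of 6.
RECOMMENDED_SIZES = range(36, 127, 6)


def check_batch(num_paths: int) -> Tuple[bool, bool]:
    """Return (is_multiple_of_6, is_recommended_size) for a request."""
    is_multiple = num_paths > 0 and num_paths % 6 == 0
    is_recommended = num_paths in RECOMMENDED_SIZES
    return is_multiple, is_recommended


def chunk_paths(paths: Sequence[str], size: int = 36) -> List[List[str]]:
    """Split paths into consecutive batches of `size`; the last may be smaller."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(paths[i:i + size]) for i in range(0, len(paths), size)]
```

A final batch that is smaller than the chosen size does not follow the guideline; per the text above, such a request should still be processed, only with potentially lower throughput.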
No support for videos with a variable frame rate (VFR)
Videos with VFR produce undefined behavior.
ffmpeg can be used to check whether a video has a variable frame rate.
Refer to the Troubleshooting section for instructions on checking whether a video has VFR.
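As a rough pre-check, the frame rates reported by `ffprobe` (shipped with FFmpeg) can be compared. This sketch is a heuristic and is not necessarily the procedure given in the Troubleshooting section: it assumes a stream is likely VFR when its `r_frame_rate` and `avg_frame_rate` differ by more than a tolerance. The function names and the 1% tolerance are hypothetical.

```python
import json
import subprocess
from fractions import Fraction
from typing import Optional, Tuple


def parse_rate(rate: str) -> Optional[Fraction]:
    """Parse a rate string such as '30000/1001'; None if the denominator is 0."""
    num, _, den = rate.partition("/")
    den = den or "1"
    if int(den) == 0:
        return None
    return Fraction(int(num), int(den))


def looks_vfr(r_frame_rate: str, avg_frame_rate: str,
              tolerance: Fraction = Fraction(1, 100)) -> bool:
    """Heuristic: flag as VFR when the two reported rates disagree."""
    r, avg = parse_rate(r_frame_rate), parse_rate(avg_frame_rate)
    if r is None or avg is None or r == 0:
        return True  # cannot confirm a constant frame rate
    return abs(r - avg) / r > tolerance


def probe_rates(path: str) -> Tuple[str, str]:
    """Read r_frame_rate and avg_frame_rate of the first video stream."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=r_frame_rate,avg_frame_rate",
         "-of", "json", path],
        capture_output=True, text=True, check=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return stream["r_frame_rate"], stream["avg_frame_rate"]
```

For example, `looks_vfr(*probe_rates("input.mp4"))` (with a placeholder file name) gives a quick signal. Because this heuristic can miss some VFR files, a video that passes it is not guaranteed to have a constant frame rate.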
Not optimized for long form video processing
The Early Access release has been optimized for processing short-form videos less than 30 minutes long.
Long-form videos longer than 30 minutes may deliver sub-optimal performance.