Creating a Dataset

The first step in curating a dataset is to provide it, either by linking an S3 bucket or by uploading the dataset directly as a ZIP file.

Once you’ve provided your dataset, NeMo Curator on DGX Cloud processes the dataset and optimizes it for AI applications. This process is often referred to as curation.

Dataset Guidelines

Videos in your dataset should be formatted as follows:

  • Format: MP4

  • Color Space: RGB

  • Resolution: 1280x720 – The curator will resize videos to this resolution.

  • Frame Rate: 24 FPS – The curator will resample videos to this frame rate.
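Before uploading, it can help to check each video against these guidelines. The following is a minimal sketch and not part of NeMo Curator: `VideoInfo` and `check_video` are hypothetical names, and the metadata values are assumed to have been read already with a tool of your choice. Because the curator resizes and resamples videos itself, resolution and frame-rate differences are reported only as informational notes.

```python
from dataclasses import dataclass

# Target values taken from the Dataset Guidelines above.
TARGET_CONTAINER = "mp4"
TARGET_RESOLUTION = (1280, 720)
TARGET_FPS = 24.0


@dataclass
class VideoInfo:
    """Hypothetical holder for metadata you have already extracted."""
    container: str  # file format, e.g. "mp4"
    width: int      # frame width in pixels
    height: int     # frame height in pixels
    fps: float      # frames per second


def check_video(info: VideoInfo) -> list:
    """Return notes describing how a video differs from the guidelines."""
    notes = []
    if info.container.lower() != TARGET_CONTAINER:
        notes.append(f"container is {info.container}, expected MP4")
    if (info.width, info.height) != TARGET_RESOLUTION:
        notes.append(
            f"resolution is {info.width}x{info.height}; "
            "the curator will resize to 1280x720"
        )
    if abs(info.fps - TARGET_FPS) > 1e-6:
        notes.append(
            f"frame rate is {info.fps:g} FPS; "
            "the curator will resample to 24 FPS"
        )
    return notes


# Example: a 1080p, 30 FPS MP4 yields two informational notes.
for note in check_video(VideoInfo("mp4", 1920, 1080, 30.0)):
    print(note)
```

An empty list means the video already matches the stated format, resolution, and frame rate. Color space is not checked in this sketch.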

Adding a Dataset

Follow these steps to add a dataset to your Cosmos Managed Service account:

  1. Click the Datasets option from the left navbar.

  2. Enter the following information in the Curate New Dataset section:

    • Dataset Name: Enter a name for the dataset.

    _images/create-dataset-1.png
  3. Under the Data Source section, add your dataset by either linking to an AWS S3 bucket or uploading a ZIP file.

    • Bring Your S3: Follow these steps to link the dataset to an S3 bucket:

      1. Paste the credentials from your AWS credentials file into the field. Optionally, click the Show Credentials box to check that the credentials were entered correctly.

      2. Enter the Input S3 Prefix where the dataset files are stored.

      3. Enter the Output S3 Prefix where processed data will be stored.

        Important

        When using an AWS Access Key, ensure you provide only the minimum S3 permissions required for NeMo Curator on DGX Cloud operations. The curator service should have read-only permissions for the input data bucket and read/write permissions for the output data bucket. The AWS Access Key should not provide read/write permissions to any buckets except those associated with dataset input/output operations.

    • Upload ZIP File

      1. To upload the dataset as a ZIP file, drag and drop the ZIP file onto the Click or Drag to Upload File box. You can also click the box to find your ZIP file with a local file explorer.

        Note

        The maximum ZIP file size for upload is 20 GB.

  4. Under the Pipeline Definition section, specify the following:

    • Video Type: Select an option for dataset curation from the dropdown list. The option you choose determines the JSON curation parameters shown in the text editor. You can then modify the parameters to better match curation to your use case.

      • Default: A general text prompt is used, and the curator service generates captions for each video using the prompt. The transnetv2 algorithm is used to determine how to segment videos with cuts/transitions.

      • Autonomous Vehicle: A text prompt specific to recordings from an autonomous vehicle camera is provided. The curator service will generate captions for each video using the prompt. A fixed-stride splitting algorithm is used to segment videos into segments with uniform length.

      • Fixed Camera: A text prompt specific to recordings from a fixed camera (e.g. a surveillance camera) is provided. The curator will generate captions for each video using the prompt. A fixed-stride splitting algorithm is used to segment videos into segments with uniform length.

    Refer to the Curation Parameters page for a description of all available curation pipeline parameters.

  5. Click the Start Curation button to begin processing the dataset.
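The Important note in step 3 asks for least-privilege S3 access: read-only on the input bucket and read/write on the output bucket. One way to express that as an IAM policy document is sketched below. The bucket names and the `build_curator_policy` helper are hypothetical, and the exact set of S3 actions the curator service needs is an assumption, so verify it against your own setup before attaching the policy to the identity that owns the access key.

```python
import json


def build_curator_policy(input_bucket: str, output_bucket: str) -> dict:
    """Build a least-privilege IAM policy document (assumed action set)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the two dataset buckets only.
                "Sid": "ListDatasetBuckets",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{input_bucket}",
                    f"arn:aws:s3:::{output_bucket}",
                ],
            },
            {
                # Input data: read-only.
                "Sid": "ReadInput",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{input_bucket}/*"],
            },
            {
                # Output data: read and write.
                "Sid": "ReadWriteOutput",
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{output_bucket}/*"],
            },
        ],
    }


# Placeholder bucket names; replace them with your own.
policy = build_curator_policy("my-input-bucket", "my-output-bucket")
print(json.dumps(policy, indent=2))
```

Because no other buckets appear in any `Resource` list, a key limited to this policy cannot read or write data outside the dataset input and output buckets.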

Managing Datasets

Once you’ve added a dataset, you can view it in the table at the bottom of the Datasets page.

_images/create-dataset-3.png

Click the My Datasets tab to view datasets created by your user account, or click the All Datasets tab to view datasets created by all users in your organization. The table contains the following columns:

  • Name: The name of the dataset

  • Type: The storage modality of the dataset, either “S3” or “ZipFile”

  • User Name: The name of the user who created the dataset

  • ID: A unique identifier for the dataset

  • Date Added: The date the dataset was created

  • Status: The processing status of the dataset. Once the status reads “PROCESSED”, you can continue with other actions, such as viewing the generated captions.

  • Actions: Click to open a menu of actions you can perform on the dataset:

    • Show Captions: View the captions generated for each video, along with a video preview.

    • Download Original Dataset: Download the dataset in its original form, without augmentation from the curator service.

    • Download Clips and Captions: Download the curated dataset, which contains segmented videos and generated captions. Refer to the Curated Dataset Structure page for more details.

    • Delete Dataset: Delete the dataset from your Cosmos Managed Service account.