Create Dataset

Create a dataset to use in compatible customization jobs.

Prerequisites

Before you can create a dataset, make sure that you have:

  • The Hugging Face CLI or SDK installed.

  • Your dataset files ready to upload, either locally or accessible from a remote location.

  • Access to a NeMo Data Store service URL.

  • Access to a NeMo Entity Store service URL.
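
The checklist above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of the NeMo tooling: the function name `missing_prerequisites` is hypothetical, it only covers dataset files stored locally, and it assumes the two service URLs are supplied through environment variables named `DATA_STORE_BASE_URL` and `ENTITY_STORE_BASE_URL`.

```python
import importlib.util
import os
from pathlib import Path


def missing_prerequisites(env, dataset_path):
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []

    # The Hugging Face SDK ships as the `huggingface_hub` package
    if importlib.util.find_spec("huggingface_hub") is None:
        problems.append("huggingface_hub is not installed")

    # Assumed variable names for the two service URLs
    for var in ("DATA_STORE_BASE_URL", "ENTITY_STORE_BASE_URL"):
        if not env.get(var):
            problems.append(f"{var} is not set")

    # Local dataset files must exist before they can be uploaded
    if not Path(dataset_path).exists():
        problems.append(f"{dataset_path} does not exist")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites(os.environ, "./data/training_data.json"):
        print(f"Missing prerequisite: {problem}")
```

Passing the environment as an argument, rather than reading `os.environ` inside the function, keeps the check easy to exercise with a plain dictionary.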


How to Create a Dataset

Hugging Face

Creating a dataset programmatically takes two steps: uploading the files to NeMo Data Store and registering the dataset with NeMo Entity Store.

Upload Dataset

You can use either the Hugging Face CLI or the Hugging Face Python SDK to upload dataset files for use in the NeMo Microservices Platform.

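Using the Hugging Face CLI:

# Set the namespace and dataset name
NAMESPACE=<your_namespace>
DATASET_NAME=<your_dataset_name>

# Point the CLI at NeMo Data Store rather than the public Hugging Face Hub
# (assumes DATA_STORE_BASE_URL is already set in your shell)
export HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"

# Upload the dataset
huggingface-cli upload \
  --repo-type dataset "${NAMESPACE}/${DATASET_NAME}" \
  /your/local/file remote/filename

Using the Hugging Face Python SDK:

from huggingface_hub import HfApi
import os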

# Configure microservice host URLs
DATA_STORE_BASE_URL = os.getenv("DATA_STORE_BASE_URL")

# Define entity details
NAMESPACE = os.getenv("NAMESPACE", "default")
DATASET_NAME = "test-dataset"

# Provide HF token
HF_TOKEN = os.getenv("HF_TOKEN")

try:
    # Initialize Hugging Face API client
    # Note: A valid token is required for most operations
    hf_api = HfApi(endpoint=f"{DATA_STORE_BASE_URL}/v1/hf", token=HF_TOKEN)

    # Set the dataset repository details
    repo_id = f"{NAMESPACE}/{DATASET_NAME}"
    path_to_local_file = "./data/training_data.json"
    file_name = "training_data.json"

    # Upload the dataset
    # This will create the repository if it doesn't exist
    hf_api.upload_file(
        repo_type="dataset",
        repo_id=repo_id,
        revision="main",
        path_or_fileobj=path_to_local_file,
        path_in_repo=file_name,
    )
    print(f"Successfully uploaded dataset to {repo_id}")

except Exception as e:
    print(f"Error uploading dataset: {e}")
    raise
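
When a dataset consists of several files, the single-file upload can be repeated for each one. The sketch below is an illustration under assumptions: `plan_uploads` and `upload_all` are hypothetical helper names, and `hf_api` is expected to be an `HfApi` client pointed at NeMo Data Store. It relies only on the `upload_file` call already used on this page.

```python
from pathlib import Path


def plan_uploads(local_dir):
    """Map every file under local_dir to the path it should have in the repo."""
    root = Path(local_dir)
    plan = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Repo paths always use forward slashes, regardless of the local OS
            plan.append((str(path), path.relative_to(root).as_posix()))
    return plan


def upload_all(hf_api, repo_id, local_dir):
    """Upload each planned file with the same call used for a single file."""
    for local_path, path_in_repo in plan_uploads(local_dir):
        hf_api.upload_file(
            repo_type="dataset",
            repo_id=repo_id,
            revision="main",
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
        )
        print(f"Uploaded {local_path} to {repo_id}/{path_in_repo}")
```

For example, `upload_all(hf_api, "default/test-dataset", "./data")` would upload everything under `./data`, preserving subdirectories in the repository.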

Register Dataset

Register the dataset with NeMo Entity Store so that other microservices can retrieve it. At minimum, provide the name, namespace, and files_url fields, where files_url has the form hf://datasets/{namespace}/{dataset_name}.
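
To make the required fields concrete, here is a minimal sketch that builds the smallest registration payload from a namespace and a dataset name. The helper name `minimal_dataset_payload` is hypothetical; the field names and the `files_url` pattern are taken from the paragraph above.

```python
import json


def minimal_dataset_payload(namespace, dataset_name):
    """Build the smallest payload described above: name, namespace, files_url."""
    if not namespace or not dataset_name:
        raise ValueError("namespace and dataset_name must be non-empty")
    return {
        "name": dataset_name,
        "namespace": namespace,
        # Points at the repository uploaded to NeMo Data Store
        "files_url": f"hf://datasets/{namespace}/{dataset_name}",
    }


payload = minimal_dataset_payload("team-docs", "documentation-test-dataset")
print(json.dumps(payload, indent=2))
```

Deriving `files_url` from the same two values used for the upload helps keep the registered entry pointing at the repository that actually holds the files.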

  1. Make a POST request to the /v1/datasets endpoint.

    Using curl:
    
    export ENTITY_STORE_BASE_URL="<URL for NeMo Entity Store>"
    
    curl -X POST "${ENTITY_STORE_BASE_URL}/v1/datasets" \
        -H 'Accept: application/json' \
        -H 'Content-Type: application/json' \
        -d '{
          "name": "documentation-test-dataset",
          "namespace": "team-docs",
          "description": "A dataset for documentation testing",
          "format": "json",
          "files_url": "hf://datasets/team-docs/documentation-test-dataset",
          "project": "documentation-test-project",
          "custom_fields": {},
          "ownership": {
            "created_by": "user@nvidia.com",
            "access_policies": {}
          }
        }' | jq
    
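    Using Python:
    
    # Continuing from the Hugging Face SDK example
    # (reuses NAMESPACE, DATASET_NAME, and repo_id)
    
    import os
    import requests
    
    # Base URL of NeMo Entity Store, including the scheme (http:// or https://)
    ENTITY_STORE_BASE_URL = os.getenv("ENTITY_STORE_BASE_URL")
    
    # Create dataset in entity store
    try:
        entity_store_url = f"{ENTITY_STORE_BASE_URL}/v1/datasets"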
        payload = {
            "name": DATASET_NAME,
            "namespace": NAMESPACE,
            "description": "Description of your dataset",
            "format": "json",  # or other format of your dataset
            "files_url": f"hf://datasets/{repo_id}",
            "project": "your-project-name",
            "custom_fields": {},
            "ownership": {
                "created_by": "user@domain.com",
                "access_policies": {}
            }
        }
    
        resp = requests.post(entity_store_url, json=payload)
        if resp.status_code in (200, 201):
            print(f"Successfully registered dataset: {resp.json()['id']}")
        elif resp.status_code == 409:
            print("Dataset already exists")
        elif resp.status_code == 422:
            print(f"Invalid request payload: {resp.json()}")
        else:
            print(f"Unexpected response: {resp.status_code}")
            resp.raise_for_status()
    
    except requests.exceptions.RequestException as e:
        print(f"Error registering dataset: {e}")
        raise
    
  2. Verify that the dataset was registered by reviewing the response.

    Example Response
    {
      "schema_version": "1.0",
      "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
      "description": "A dataset for documentation testing",
      "type_prefix": null,
      "namespace": "team-docs",
      "project": "documentation-test-project",
      "created_at": "2025-02-14T20:47:20.798490",
      "updated_at": "2025-02-14T20:47:20.798492",
      "custom_fields": {},
      "ownership": {
        "created_by": "user@nvidia.com",
        "access_policies": {}
      },
      "name": "documentation-test-dataset",
      "version_id": "main",
      "version_tags": [],
      "format": "json",
      "files_url": "hf://datasets/team-docs/documentation-test-dataset"
    }
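
Reviewing the response can also be done in code. The sketch below compares the fields sent in the request with those returned by the service. `registration_mismatches` is a hypothetical helper, and the assumption that the service echoes these fields unchanged and assigns an `id` is based only on the example response.

```python
def registration_mismatches(payload, response_body):
    """Return {field: (sent, received)} for every field that differs."""
    mismatches = {}
    for field in ("name", "namespace", "files_url"):
        sent_value = payload.get(field)
        received_value = response_body.get(field)
        if sent_value != received_value:
            mismatches[field] = (sent_value, received_value)
    # A registered dataset is expected to carry a server-assigned id
    if not response_body.get("id"):
        mismatches["id"] = (None, response_body.get("id"))
    return mismatches


sent = {
    "name": "documentation-test-dataset",
    "namespace": "team-docs",
    "files_url": "hf://datasets/team-docs/documentation-test-dataset",
}
received = dict(sent, id="dataset-81RSQp7FKX3rdBtKvF9Skn")
print(registration_mismatches(sent, received) or "Response matches the request")
```

An empty result means the registered entry matches what was sent; any entry in the returned dictionary names a field worth inspecting by hand.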