Create Dataset#

Upload a dataset to NeMo Data Store and register it with NeMo Entity Store.

Prerequisites#

Before you can create a dataset, make sure that you have:

  • The Hugging Face CLI or SDK installed.

  • Your dataset files ready to upload, either locally or accessible from a remote location.

  • Access to a NeMo Data Store service URL.

  • Access to a NeMo Entity Store service URL.
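
The service URLs in the list above can be checked before running any upload. The following is a minimal sketch, assuming the URLs are supplied through the `DATA_STORE_BASE_URL` and `ENTITY_STORE_BASE_URL` environment variables that the examples on this page read; the helper name `missing_settings` is hypothetical and not part of any NeMo or Hugging Face library.

```python
import os

# Variable names taken from the examples on this page; adjust to your deployment.
REQUIRED_SETTINGS = ("DATA_STORE_BASE_URL", "ENTITY_STORE_BASE_URL")


def missing_settings(env, required=REQUIRED_SETTINGS):
    """Return the required variable names that are unset or blank in env."""
    return [name for name in required if not (env.get(name) or "").strip()]


# Report which settings still need to be exported in the current shell.
missing = missing_settings(os.environ)
if missing:
    print("Missing settings: " + ", ".join(missing))
else:
    print("All required settings are present.")
```

The check only confirms that the variables are set; it does not contact either service.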

To Create a Dataset#

Upload your dataset files to NeMo Data Store, then register the dataset with NeMo Entity Store.

Upload Dataset Using Hugging Face CLI or SDK#

Creating a dataset programmatically takes two steps: uploading the files and registering the dataset.

You can use either the Hugging Face (HF) CLI or the SDK to upload datasets for use in the NeMo microservices platform.

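Using the Hugging Face CLI:

# Point the CLI at NeMo Data Store (the same endpoint the SDK example uses).
# Assumes DATA_STORE_BASE_URL is already set in your shell.
export HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"

# Set the namespace and dataset name
NAMESPACE=<your_namespace> # Namespace that you create using NeMo Entity Store
DATASET_NAME=<your_dataset_name>

# Upload the dataset
huggingface-cli upload \
 --repo-type dataset "${NAMESPACE}/${DATASET_NAME}" \
 /your/local/file remote/filename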

Using the Hugging Face SDK (Python):

import os

from huggingface_hub import HfApi

# Configure microservice host URLs
DATA_STORE_BASE_URL = os.environ["DATA_STORE_BASE_URL"]  # Raises KeyError if unset

# Define entity details
NAMESPACE = os.getenv("NAMESPACE", "default") # Namespace that you create using NeMo Entity Store
DATASET_NAME = "test-dataset"

# Provide HF token
HF_TOKEN = os.getenv("HF_TOKEN")

try:
   # Initialize Hugging Face API client
   # Note: A valid token is required for most operations
   hf_api = HfApi(endpoint=f"{DATA_STORE_BASE_URL}/v1/hf", token=HF_TOKEN)

   # Set the dataset repository details
   repo_id = f"{NAMESPACE}/{DATASET_NAME}"
   path_to_local_file = "./data/training_data.json"
   file_name = "training_data.json"  

   # Upload the dataset
   # This will create the repository if it doesn't exist
   hf_api.upload_file(
       repo_type="dataset",
       repo_id=repo_id,
       revision="main",
       path_or_fileobj=path_to_local_file,
       path_in_repo=file_name
   )
   print(f"Successfully uploaded dataset to {repo_id}")

except Exception as e:
   print(f"Error uploading dataset: {str(e)}")
   raise
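
After an upload, it can be useful to confirm that the expected files are present in the repository. The following is a minimal sketch, assuming NeMo Data Store answers the file-listing call of the Hugging Face API in the same way it answers uploads; that compatibility is an assumption, not something this page confirms. The helper `missing_from_repo` is a hypothetical name; `list_repo_files` is the existing `HfApi` method.

```python
def missing_from_repo(api, repo_id, expected_files):
    """Return the expected file names that the dataset repository does not list.

    `api` is any object with a `list_repo_files(repo_id=..., repo_type=...)`
    method, such as a `huggingface_hub.HfApi` instance.
    """
    listed = set(api.list_repo_files(repo_id=repo_id, repo_type="dataset"))
    return [name for name in expected_files if name not in listed]


# Example usage with the client from the upload example (needs a reachable
# NeMo Data Store, so it is left commented out):
# missing = missing_from_repo(hf_api, "default/test-dataset", ["training_data.json"])
# print("Missing files:", missing)
```

An empty list means every expected file was found in the listing.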

Register Dataset to NeMo Entity Store#

Register the dataset with NeMo Entity Store so it can be retrieved by other microservices. You must at least provide the name, namespace, and files_url (formulated as hf://datasets/{namespace}/{dataset_name}).
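
The `files_url` pattern described above can be built from the same namespace and dataset name used during upload, which keeps the two steps consistent. The following is a minimal sketch; the helper names `build_files_url` and `minimal_registration` are hypothetical and only illustrate the three required fields.

```python
def build_files_url(namespace, dataset_name):
    """Build the files_url in the form hf://datasets/{namespace}/{dataset_name}."""
    if not namespace or not dataset_name:
        raise ValueError("namespace and dataset_name must both be non-empty")
    return f"hf://datasets/{namespace}/{dataset_name}"


def minimal_registration(namespace, dataset_name):
    """Return the smallest set of fields the registration step requires."""
    return {
        "name": dataset_name,
        "namespace": namespace,
        "files_url": build_files_url(namespace, dataset_name),
    }


print(minimal_registration("default", "test-dataset"))
```

Optional fields such as the description, format, or project can be added to the returned dictionary before it is sent.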

Choose one of the following options to register a dataset.

Set up a NeMoMicroservices client instance using the base URL of the NeMo Entity Store microservice and perform the task as follows.

import os

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url=os.environ["ENTITY_STORE_BASE_URL"]
)

response = client.datasets.create(
    name="your-dataset-name",
    namespace="your-namespace", # Namespace that you create using NeMo Entity Store
    description="your-dataset-description",
    format="json",
    files_url="hf://datasets/your-namespace/your-dataset-name",
    project="your-project-name",
    custom_fields={},
    ownership={
        "created_by": "user@domain.com",
        "access_policies": {}
    }
)
print(response)

Make a POST request to the /v1/datasets endpoint.

export ENTITY_STORE_BASE_URL=<URL for NeMo Entity Store>

# The namespace must be one that you created using NeMo Entity Store
curl -X POST "${ENTITY_STORE_BASE_URL}/v1/datasets" \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "name": "your-dataset-name",
    "namespace": "your-namespace",
    "description": "your-dataset-description",
    "format": "json",
    "files_url": "hf://datasets/your-namespace/your-dataset-name",
    "project": "entity-store-project",
    "custom_fields": {},
    "ownership": {
      "created_by": "user@domain.com",
      "access_policies": {}
    }
  }' | jq
Example Response
{
  "schema_version": "1.0",
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "description": "your-dataset-description",
  "type_prefix": null,
  "namespace": "your-namespace",
  "project": "string",
  "created_at": "2025-02-14T20:47:20.798490",
  "updated_at": "2025-02-14T20:47:20.798492",
  "custom_fields": {},
  "ownership": {
    "created_by": "your-email",
    "access_policies": {}
  },
  "name": "your-dataset-name",
  "version_id": "main",
  "version_tags": [],
  "format": "json",
  "files_url": "hf://datasets/{namespace}/{dataset_name}"
}
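
The same POST request can be prepared with only the Python standard library, which is handy where neither the NeMo SDK nor curl is available. The following is a minimal sketch, assuming the `/v1/datasets` endpoint and the response fields shown above; the function names `build_register_request` and `summarize_response` are hypothetical, and the optional `project`, `custom_fields`, and `ownership` fields are left out for brevity.

```python
import json
import urllib.request


def build_register_request(base_url, namespace, dataset_name, description=""):
    """Prepare (but do not send) the POST request that registers a dataset."""
    payload = {
        "name": dataset_name,
        "namespace": namespace,
        "description": description,
        "format": "json",
        "files_url": f"hf://datasets/{namespace}/{dataset_name}",
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/datasets",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Accept": "application/json", "Content-Type": "application/json"},
        method="POST",
    )


def summarize_response(body):
    """Pick the identifying fields out of a registration response body."""
    record = json.loads(body)
    return {key: record.get(key) for key in ("id", "namespace", "name", "files_url")}


# Sending the request needs a reachable NeMo Entity Store, so it is left commented out:
# request = build_register_request(ENTITY_STORE_BASE_URL, "default", "test-dataset")
# with urllib.request.urlopen(request) as response:
#     print(summarize_response(response.read()))
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is registered.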