Create Dataset
Upload a dataset to NeMo Data Store and register it to NeMo Entity Store.
Prerequisites
Before you can create a dataset, make sure that you have:
The Hugging Face CLI or SDK installed.
Your dataset files ready to upload, either locally or accessible from a remote location.
Access to a NeMo Data Store service URL.
Access to a NeMo Entity Store service URL.
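Before you start, it can help to confirm that both service URLs are available to your scripts. The following is a minimal sketch, assuming the URLs are supplied through the DATA_STORE_BASE_URL and ENTITY_STORE_BASE_URL environment variables used in the examples on this page. The helper name missing_service_urls is hypothetical and is not part of any NeMo SDK.

```python
import os

# Variable names follow the examples on this page; adjust them if your
# deployment supplies the service URLs differently (assumption).
REQUIRED_VARS = ("DATA_STORE_BASE_URL", "ENTITY_STORE_BASE_URL")


def missing_service_urls(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_service_urls()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("Both service URLs are set.")
```

This only checks that the variables are present; it does not verify that the services are reachable.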
To Create a Dataset
Upload your dataset to NeMo Data Store, then register it with NeMo Entity Store.
Upload Dataset Using Hugging Face CLI or SDK
Creating a dataset programmatically requires two steps: uploading and registration.
You can use either the Hugging Face (HF) CLI or the SDK to upload datasets for use in the NeMo microservices platform.
# Set the namespace and dataset name
NAMESPACE=<your_namespace> # Namespace that you create using NeMo Entity Store
DATASET_NAME=<your_dataset_name>
# Upload the dataset
huggingface-cli upload \
  --repo-type dataset "${NAMESPACE}/${DATASET_NAME}" \
  /your/local/file remote/filename
import os

from huggingface_hub import HfApi

# Configure microservice host URLs
DATA_STORE_BASE_URL = os.getenv("DATA_STORE_BASE_URL")

# Define entity details
NAMESPACE = os.getenv("NAMESPACE", "default")  # Namespace that you create using NeMo Entity Store
DATASET_NAME = "test-dataset"

# Provide HF token
HF_TOKEN = os.getenv("HF_TOKEN")

try:
    # Initialize Hugging Face API client
    # Note: A valid token is required for most operations
    hf_api = HfApi(endpoint=f"{DATA_STORE_BASE_URL}/v1/hf", token=HF_TOKEN)

    # Set the dataset repository details
    repo_id = f"{NAMESPACE}/{DATASET_NAME}"
    path_to_local_file = "./data/training_data.json"
    file_name = "training_data.json"

    # Upload the dataset
    # This will create the repository if it doesn't exist
    hf_api.upload_file(
        repo_type="dataset",
        repo_id=repo_id,
        revision="main",
        path_or_fileobj=path_to_local_file,
        path_in_repo=file_name,
    )
    print(f"Successfully uploaded dataset to {repo_id}")
except Exception as e:
    print(f"Error uploading dataset: {e}")
    raise
Register Dataset to NeMo Entity Store
Register the dataset with NeMo Entity Store so that other microservices can retrieve it. At a minimum, you must provide the name, namespace, and files_url (formatted as hf://datasets/{namespace}/{dataset_name}).
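As an illustration of the required fields, the sketch below assembles the files_url and a minimal registration payload from a namespace and dataset name. The helper names build_files_url and minimal_dataset_payload are hypothetical, and the check for "/" is only a sanity check assumed for this sketch, not a documented platform rule.

```python
def build_files_url(namespace: str, dataset_name: str) -> str:
    """Format the files_url as hf://datasets/{namespace}/{dataset_name}."""
    for label, value in (("namespace", namespace), ("dataset_name", dataset_name)):
        if not value:
            raise ValueError(f"{label} must not be empty")
        # Assumption for this sketch: a slash would make the path ambiguous.
        if "/" in value:
            raise ValueError(f"{label} must not contain '/'")
    return f"hf://datasets/{namespace}/{dataset_name}"


def minimal_dataset_payload(namespace: str, dataset_name: str) -> dict:
    """Return only the three fields the text lists as required."""
    return {
        "name": dataset_name,
        "namespace": namespace,
        "files_url": build_files_url(namespace, dataset_name),
    }


print(minimal_dataset_payload("default", "test-dataset"))
```

Building the URL from the same two values used for the upload keeps the registered location consistent with the repository that holds the files.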
Choose one of the following options for registering a dataset.
Set up a NeMoMicroservices client instance using the base URL of the NeMo Entity Store microservice and perform the task as follows.
import os

from nemo_microservices import NeMoMicroservices

client = NeMoMicroservices(
    base_url=os.environ["ENTITY_STORE_BASE_URL"]
)

response = client.datasets.create(
    name="your-dataset-name",
    namespace="your-namespace",  # Namespace that you create using NeMo Entity Store
    description="your-dataset-description",
    format="json",
    files_url="hf://datasets/your-namespace/your-dataset-name",
    project="your-project-name",
    custom_fields={},
    ownership={
        "created_by": "user@domain.com",
        "access_policies": {}
    }
)
print(response)
Make a POST request to the /v1/datasets endpoint.
export ENTITY_STORE_BASE_URL=<URL for NeMo Entity Store>
# "namespace" is the namespace that you create using NeMo Entity Store
curl -X POST "${ENTITY_STORE_BASE_URL}/v1/datasets" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "your-dataset-name",
    "namespace": "your-namespace",
    "description": "your-dataset-description",
    "format": "json",
    "files_url": "hf://datasets/your-namespace/your-dataset-name",
    "project": "entity-store-project",
    "custom_fields": {},
    "ownership": {
      "created_by": "user@domain.com",
      "access_policies": {}
    }
  }' | jq
Example Response
{
  "schema_version": "1.0",
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "description": "your-dataset-description",
  "type_prefix": null,
  "namespace": "your-namespace",
  "project": "string",
  "created_at": "2025-02-14T20:47:20.798490",
  "updated_at": "2025-02-14T20:47:20.798492",
  "custom_fields": {},
  "ownership": {
    "created_by": "your-email",
    "access_policies": {}
  },
  "name": "your-dataset-name",
  "version_id": "main",
  "version_tags": [],
  "format": "json",
  "files_url": "hf://datasets/{namespace}/{dataset_name}"
}
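To confirm that a registration took effect, you can compare the returned fields against the values you submitted. The sketch below is illustrative only: it assumes the response has the shape of the example response above, and check_registration is a hypothetical helper, not part of the NeMo SDK. The sample body is a trimmed-down stand-in, not real service output.

```python
import json


def check_registration(response, namespace, name):
    """Return a list of mismatches between a response and the submitted values."""
    expected = {
        "namespace": namespace,
        "name": name,
        "files_url": f"hf://datasets/{namespace}/{name}",
    }
    return [
        f"{key}: expected {want!r}, got {response.get(key)!r}"
        for key, want in expected.items()
        if response.get(key) != want
    ]


# Stand-in for a response body, using field names from the example response.
sample = json.loads("""
{
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "namespace": "your-namespace",
  "name": "your-dataset-name",
  "files_url": "hf://datasets/your-namespace/your-dataset-name"
}
""")

problems = check_registration(sample, "your-namespace", "your-dataset-name")
print(problems or "Registration matches the submitted values.")
```

An empty list means the name, namespace, and files_url in the response agree with what was sent.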