Create Dataset#
Create a dataset to use in compatible customization jobs.
Prerequisites#
Before you can create a dataset, make sure that you have:
- The Hugging Face CLI or SDK installed.
- Your dataset files ready to upload, either locally or accessible from a remote location.
- Access to a NeMo Data Store service URL.
- Access to a NeMo Entity Store service URL.
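For the examples that follow, it can help to capture the service URLs, token, and namespace as environment variables. The values below are placeholders, not real endpoints; substitute your own:

```shell
# Placeholder values -- replace with your actual service URLs, token, and namespace
export DATA_STORE_BASE_URL="http://data-store.example.com"
export ENTITY_STORE_BASE_URL="http://entity-store.example.com"
export HF_TOKEN="your-hf-token"
export NAMESPACE="team-docs"
```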
How to Create a Dataset#
Hugging Face#
Creating a dataset programmatically requires two steps: uploading the dataset files to NeMo Data Store, then registering the dataset with NeMo Entity Store.
Upload Dataset#
You can use either the Hugging Face CLI or the SDK to create datasets for use in the NeMo Microservices Platform.
```bash
# Point the Hugging Face CLI at the NeMo Data Store HF-compatible endpoint
export HF_ENDPOINT="<your_data_store_url>/v1/hf"

# Set the namespace and dataset name
NAMESPACE=<your_namespace>
DATASET_NAME=<your_dataset_name>

# Upload the dataset
huggingface-cli upload \
  --repo-type dataset \
  ${NAMESPACE}/${DATASET_NAME} \
  /your/local/file remote/filename
```
```python
import os

from huggingface_hub import HfApi

# Configure microservice host URLs
DATA_STORE_BASE_URL = os.getenv("DATA_STORE_BASE_URL")

# Define entity details
NAMESPACE = os.getenv("NAMESPACE", "default")
DATASET_NAME = "test-dataset"

# Provide HF token
HF_TOKEN = os.getenv("HF_TOKEN")

try:
    # Initialize Hugging Face API client
    # Note: A valid token is required for most operations
    hf_api = HfApi(endpoint=f"{DATA_STORE_BASE_URL}/v1/hf", token=HF_TOKEN)

    # Set the dataset repository details
    repo_id = f"{NAMESPACE}/{DATASET_NAME}"
    path_to_local_file = "./data/training_data.json"
    file_name = "training_data.json"

    # Upload the dataset
    # This will create the repository if it doesn't exist
    hf_api.upload_file(
        repo_type="dataset",
        repo_id=repo_id,
        revision="main",
        path_or_fileobj=path_to_local_file,
        path_in_repo=file_name,
    )
    print(f"Successfully uploaded dataset to {repo_id}")
except Exception as e:
    print(f"Error uploading dataset: {str(e)}")
    raise
```
Register Dataset#
Register the dataset with NeMo Entity Store so it can be retrieved by other microservices. You must provide at least the `name`, `namespace`, and `files_url` (formulated as `hf://datasets/{namespace}/{dataset_name}`).
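The `files_url` convention can be sketched as a small helper. The function name here is illustrative, not part of any NeMo SDK:

```python
def build_files_url(namespace: str, dataset_name: str) -> str:
    """Build the files_url that Entity Store uses to locate a dataset
    uploaded to NeMo Data Store (hypothetical helper)."""
    return f"hf://datasets/{namespace}/{dataset_name}"

print(build_files_url("team-docs", "documentation-test-dataset"))
# hf://datasets/team-docs/documentation-test-dataset
```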
Make a POST request to the `/v1/datasets` endpoint.

```bash
export ENTITY_STORE_BASE_URL=<URL for NeMo Entity Store>

curl -X POST "${ENTITY_STORE_BASE_URL}/v1/datasets" \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "documentation-test-dataset",
    "namespace": "team-docs",
    "description": "A dataset for documentation testing",
    "format": "json",
    "files_url": "hf://datasets/{namespace}/{dataset-name}",
    "project": "documentation-test-project",
    "custom_fields": {},
    "ownership": {
      "created_by": "user@nvidia.com",
      "access_policies": {}
    }
  }' | jq
```
```python
# Continuing from the HF SDK example
import os

import requests

# Configure the Entity Store host URL
ENTITY_STORE_BASE_URL = os.getenv("ENTITY_STORE_BASE_URL")

# Create dataset in entity store
try:
    entity_store_url = f"{ENTITY_STORE_BASE_URL}/v1/datasets"
    payload = {
        "name": DATASET_NAME,
        "namespace": NAMESPACE,
        "description": "Description of your dataset",
        "format": "json",  # or other format of your dataset
        "files_url": f"hf://datasets/{repo_id}",
        "project": "your-project-name",
        "custom_fields": {},
        "ownership": {
            "created_by": "user@domain.com",
            "access_policies": {}
        }
    }

    resp = requests.post(entity_store_url, json=payload)
    if resp.status_code in (200, 201):
        print(f"Successfully registered dataset: {resp.json()['id']}")
    elif resp.status_code == 409:
        print("Dataset already exists")
    elif resp.status_code == 422:
        print(f"Invalid request payload: {resp.json()}")
    else:
        print(f"Unexpected response: {resp.status_code}")
        resp.raise_for_status()
except requests.exceptions.RequestException as e:
    print(f"Error registering dataset: {str(e)}")
    raise
```
Verify that the dataset was created by reviewing the response.
Example Response

```json
{
  "schema_version": "1.0",
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "description": "A dataset for documentation testing",
  "type_prefix": null,
  "namespace": "team-docs",
  "project": "documentation-test-project",
  "created_at": "2025-02-14T20:47:20.798490",
  "updated_at": "2025-02-14T20:47:20.798492",
  "custom_fields": {},
  "ownership": {
    "created_by": "user@nvidia.com",
    "access_policies": {}
  },
  "name": "documentation-test-dataset",
  "version_id": "main",
  "version_tags": [],
  "format": "json",
  "files_url": "hf://datasets/{namespace}/{dataset-name}"
}
```
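As a quick sanity check, the response body can be parsed and its key fields inspected. A minimal sketch, using a truncated copy of the example response above:

```python
import json

# Truncated copy of the example response above (illustrative only)
response_body = '''
{
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "namespace": "team-docs",
  "name": "documentation-test-dataset",
  "format": "json",
  "files_url": "hf://datasets/{namespace}/{dataset-name}"
}
'''

dataset = json.loads(response_body)

# Entity Store dataset IDs carry a "dataset-" prefix in the example above
assert dataset["id"].startswith("dataset-")
print(f'{dataset["namespace"]}/{dataset["name"]}')
# team-docs/documentation-test-dataset
```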