Create Dataset#
Upload a dataset to NeMo Data Store and register it to NeMo Entity Store.
Prerequisites#
Before you can create a dataset, make sure that you have:
- The Hugging Face CLI or SDK installed. 
- Your dataset files ready to upload, either locally or accessible from a remote location. 
- Access to a NeMo Data Store service URL. 
- Access to a NeMo Entity Store service URL. 
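Before starting, it can help to confirm that both service URLs are available to your scripts. The following is a minimal sketch, assuming the URLs are supplied through the `DATA_STORE_BASE_URL` and `ENTITY_STORE_BASE_URL` environment variables used in the examples on this page. The helper name `missing_settings` is hypothetical and not part of any NeMo or Hugging Face package.

```python
import os

# Environment variable names used by the examples on this page.
REQUIRED_VARS = ("DATA_STORE_BASE_URL", "ENTITY_STORE_BASE_URL")


def missing_settings(env=None):
    """Return the required variable names that are unset or empty.

    `env` defaults to the process environment; pass a dict to check
    other settings sources (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All service URLs are set.")
```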
To Create a Dataset#
Upload your dataset files to NeMo Data Store, then register the dataset with NeMo Entity Store.
Upload Dataset Using Hugging Face CLI or SDK#
Creating a dataset programmatically requires two steps: uploading and registration.
You can use either the Hugging Face (HF) CLI or the SDK to upload datasets for use in the NeMo microservices platform.
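Using the Hugging Face CLI:
# Point the CLI at NeMo Data Store instead of the public Hugging Face Hub
# (assumes DATA_STORE_BASE_URL is set; same endpoint path as the SDK example)
export HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"
# Set the namespace and dataset name
NAMESPACE=<your_namespace> # Namespace that you create using NeMo Entity Store
DATASET_NAME=<your_dataset_name>
# Upload the dataset
huggingface-cli upload \
  --repo-type dataset "${NAMESPACE}/${DATASET_NAME}" \
  /your/local/file remote/filename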
Using the Hugging Face SDK:
import os

from huggingface_hub import HfApi

# Configure microservice host URLs
DATA_STORE_BASE_URL = os.getenv("DATA_STORE_BASE_URL")
# Define entity details
NAMESPACE = os.getenv("NAMESPACE", "default")  # Namespace that you create using NeMo Entity Store
DATASET_NAME = "test-dataset"
# Provide HF token
HF_TOKEN = os.getenv("HF_TOKEN")
try:
    # Initialize Hugging Face API client
    # Note: A valid token is required for most operations
    hf_api = HfApi(endpoint=f"{DATA_STORE_BASE_URL}/v1/hf", token=HF_TOKEN)
    # Set the dataset repository details
    repo_id = f"{NAMESPACE}/{DATASET_NAME}"
    path_to_local_file = "./data/training_data.json"
    file_name = "training_data.json"
    # Create the dataset repository if it does not already exist
    # (the standard huggingface_hub client does not create it during upload_file)
    hf_api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Upload the dataset file to the repository
    hf_api.upload_file(
        repo_type="dataset",
        repo_id=repo_id,
        revision="main",
        path_or_fileobj=path_to_local_file,
        path_in_repo=file_name,
    )
    print(f"Successfully uploaded dataset to {repo_id}")
except Exception as e:
    # Report the failure, then re-raise so callers can handle it
    print(f"Error uploading dataset: {e}")
    raise
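After uploading, you may want to confirm that the file is present before registering the dataset. The sketch below uses `HfApi.list_repo_files` from `huggingface_hub`. It assumes that the NeMo Data Store endpoint supports repository file listing, which this page does not state, and the helper name `file_is_uploaded` is hypothetical.

```python
def file_is_uploaded(api, repo_id, file_name):
    """Return True if file_name appears in the dataset repository listing.

    `api` is expected to be an HfApi instance pointed at NeMo Data Store.
    Assumption: the service supports listing repository files.
    """
    files = api.list_repo_files(repo_id=repo_id, repo_type="dataset")
    return file_name in files
```

With the variables from the SDK example above, the call would be `file_is_uploaded(hf_api, repo_id, file_name)`.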
Register Dataset to NeMo Entity Store#
Register the dataset with NeMo Entity Store so that other microservices can retrieve it. At minimum, you must provide the name, namespace, and files_url (formatted as hf://datasets/{namespace}/{dataset_name}).
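The `files_url` format can be illustrated with a small helper that builds the URL and the minimum set of registration fields from a namespace and dataset name. This is only a sketch: the function names `build_files_url` and `minimal_registration` are hypothetical and are not part of the NeMo SDK.

```python
def build_files_url(namespace, dataset_name):
    """Build a files_url of the form hf://datasets/{namespace}/{dataset_name}."""
    if not namespace or not dataset_name:
        raise ValueError("namespace and dataset_name must be non-empty")
    return f"hf://datasets/{namespace}/{dataset_name}"


def minimal_registration(namespace, dataset_name):
    """Return the three fields that registration requires at minimum."""
    return {
        "name": dataset_name,
        "namespace": namespace,
        "files_url": build_files_url(namespace, dataset_name),
    }


# Example with the namespace and dataset name from the upload example
print(minimal_registration("default", "test-dataset"))
```

Deriving `files_url` from the same two values used for the upload keeps the registered location consistent with the repository that holds the files.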
Choose one of the following options for registering a dataset.
Set up a NeMoMicroservices client instance using the base URL of the NeMo Entity Store microservice and perform the task as follows.
import os

from nemo_microservices import NeMoMicroservices
client = NeMoMicroservices(
    base_url=os.environ["ENTITY_STORE_BASE_URL"]
)
response = client.datasets.create(
    name="your-dataset-name",
    namespace="your-namespace", # Namespace that you create using NeMo Entity Store
    description="your-dataset-description",
    format="json",
    files_url="hf://datasets/your-namespace/your-dataset-name",
    project="your-project-name",
    custom_fields={},
    ownership={
        "created_by": "user@domain.com",
        "access_policies": {}
    }
)
print(response)
Make a POST request to the /v1/datasets endpoint.
export ENTITY_STORE_BASE_URL=<URL for NeMo Entity Store>
# "namespace" is the namespace that you create using NeMo Entity Store
curl -X POST "${ENTITY_STORE_BASE_URL}/v1/datasets" \
    -H 'Accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "name": "your-dataset-name",
    "namespace": "your-namespace",
    "description": "your-dataset-description",
    "format": "json",
    "files_url": "hf://datasets/your-namespace/your-dataset-name",
    "project": "entity-store-project",
    "custom_fields": {},
    "ownership": {
      "created_by": "user@domain.com",
      "access_policies": {}
    }
  }' | jq
Example Response
{
  "schema_version": "1.0",
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "description": "your-dataset-description",
  "type_prefix": null,
  "namespace": "your-namespace",
  "project": "string",
  "created_at": "2025-02-14T20:47:20.798490",
  "updated_at": "2025-02-14T20:47:20.798492",
  "custom_fields": {},
  "ownership": {
    "created_by": "your-email",
    "access_policies": {}
  },
  "name": "your-dataset-name",
  "version_id": "main",
  "version_tags": [],
  "format": "json",
  "files_url": "hf://datasets/{namespace}/{dataset_name}"
}
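A registration response like the one above can be checked programmatically before you pass the dataset to other services. The sketch below parses the JSON text and compares a few of the fields shown in the example response. The helper name `check_registration` and the choice of fields to check are assumptions made for illustration.

```python
import json


def check_registration(response_text, name, namespace):
    """Parse a registration response and report mismatches.

    Returns a tuple of (dataset id, list of problem descriptions).
    An empty problem list means the checked fields look as expected.
    """
    body = json.loads(response_text)
    problems = []
    if body.get("name") != name:
        problems.append(f"unexpected name: {body.get('name')!r}")
    if body.get("namespace") != namespace:
        problems.append(f"unexpected namespace: {body.get('namespace')!r}")
    if not str(body.get("files_url", "")).startswith("hf://datasets/"):
        problems.append("files_url does not start with hf://datasets/")
    return body.get("id"), problems


# Trimmed copy of the example response above
sample = """{
  "id": "dataset-81RSQp7FKX3rdBtKvF9Skn",
  "namespace": "your-namespace",
  "name": "your-dataset-name",
  "files_url": "hf://datasets/{namespace}/{dataset_name}"
}"""

dataset_id, problems = check_registration(sample, "your-dataset-name", "your-namespace")
print(dataset_id, problems)
```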
Next Steps#
Your dataset is now registered and ready to use. You can:
- Use in custom evaluations: Reference your dataset in RAG, Retriever, BFCL, or Agentic evaluation configurations using hf://datasets/{namespace}/{dataset_name}
- Use in model fine-tuning: Include it in customization jobs
- Manage dataset lifecycle: Update, list, or delete your datasets as needed 
For complete examples, see the dataset management tutorials or the end-to-end evaluation workflow.