Create Dataset Files#
Note
The time to complete this tutorial is approximately 15 minutes.
Learn how to manage dataset files using the NeMo Data Store APIs. In this tutorial, you’ll become familiar with:
- Creating a repository in the NeMo Data Store. 
- Uploading dataset files to the repository. 
Prerequisites#
Before you can manage dataset files, make sure that you have:
- The NeMo Data Store microservice running on your cluster. 
- Completed the Set Up Organizational Entities tutorial. 
Set Up Environment Variables#
Set the following environment variables to store the NeMo Data Store microservice base URL, the Hugging Face endpoint derived from it, and your Hugging Face token.
DATA_STORE_BASE_URL="<NeMo Data Store base URL>"
HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"
HF_TOKEN="your-hugging-face-token"
For example, if you continue to use the minikube cluster from Demo Cluster Setup on Minikube, you can use the following values for the environment variables:
DATA_STORE_BASE_URL="http://data-store.test"
HF_ENDPOINT="http://data-store.test/v1/hf"
HF_TOKEN="your-hugging-face-token"
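The client setup in the next section reads these values from the environment with python-dotenv. If they aren’t already in the .env file from the previous tutorial, one way to add them from Python is sketched below (the minikube example values are shown; substitute your own):
# Append the NeMo Data Store values to the .env file that load_dotenv() reads.
# Example minikube values shown; substitute your own.
with open(".env", "a") as env_file:
    env_file.write("DATA_STORE_BASE_URL=http://data-store.test\n")
    env_file.write("HF_ENDPOINT=http://data-store.test/v1/hf\n")
    env_file.write("HF_TOKEN=your-hugging-face-token\n")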
Set Up Hugging Face Client with NeMo Data Store#
To set up the Hugging Face client with the NeMo Data Store API endpoint, run the following code.
import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

# Load environment variables created in previous tutorial
load_dotenv()

hf_api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])
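Optionally, you can sanity-check that the client can reach the endpoint. A minimal sketch using the Hub client’s whoami() call; whether the NeMo Data Store implements this Hub endpoint is an assumption, so skip this check if the call fails:
# Optional sanity check: confirm the client can reach the endpoint.
# Assumes the NeMo Data Store implements the Hub's whoami endpoint.
print(hf_api.whoami())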
Create a Dataset#
In the previous tutorial, we created project and dataset entity objects in the NeMo Entity Store and stored them as environment variables, along with our chosen namespace. Now, we can create a dataset repository in the NeMo Data Store and upload our dataset files to it.
Create a Repo#
To create a dataset repository:
- Use the create_repo() method from the Hugging Face Hub API.
- Specify your DATASET_ID and set the repo type to dataset.
# Read the dataset ID stored in the Set Up Organizational Entities tutorial
DATASET_ID = os.environ["DATASET_ID"]

hf_api.create_repo(DATASET_ID, repo_type="dataset")
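If you rerun this tutorial, create_repo() raises an error because the repository already exists. The Hugging Face Hub client supports an exist_ok flag to make the call idempotent; an optional variant:
# Optional: pass exist_ok=True so reruns don't fail on an existing repository
hf_api.create_repo(DATASET_ID, repo_type="dataset", exist_ok=True)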
Upload Files#
To upload your dataset files to the repository:
- Create a function that handles the upload process:
  - Takes hf_api and dataset_id as parameters.
  - Includes error handling for upload failures.
- Upload each data folder using upload_folder():
  - Training data goes to the “training” directory.
  - Validation data goes to the “validation” directory.
  - Testing data goes to the “testing” directory.
def upload_dataset_files(hf_api, dataset_id):
    try:
        # Upload training data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/training",
            path_in_repo="training",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        
        # Upload validation data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/validation",
            path_in_repo="validation",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        
        # Upload testing data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/testing",
            path_in_repo="testing",
            repo_id=dataset_id,
            repo_type="dataset"
        )
    except Exception as e:
        print(f"Error uploading files: {e}")
# Execute the upload function
upload_dataset_files(hf_api, DATASET_ID)
Note
Ensure your local folders contain the appropriate data files before running this code. The upload process may take several minutes depending on your file sizes and network connection.
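After the upload completes, you can confirm the files landed in the repository by listing its contents. A quick check, assuming the NeMo Data Store supports the Hub’s file-listing endpoint:
# List the files now present in the dataset repository
for path in hf_api.list_repo_files(DATASET_ID, repo_type="dataset"):
    print(path)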
Next Steps#
Now that you’ve learned how to upload dataset files to the NeMo Data Store, you can use your registered datasets for:
- Model fine-tuning: Proceed to the customization tutorials 
- Custom evaluations: Use with RAG, Retriever, BFCL, or Agentic evaluation flows 
- End-to-end workflow: Follow the complete evaluation tutorial 
To learn more about the NeMo Data Store and NeMo Entity Store microservice APIs, refer to the API documentation.