Create Dataset Files#

Note

The time to complete this tutorial is approximately 15 minutes.

Learn how to manage dataset files using NeMo Data Store APIs. In this tutorial, you’ll become familiar with:

  • Creating a repository in the NeMo Data Store.

  • Uploading dataset files to the repository.

Prerequisites#

Before you can manage dataset files, make sure that you have:

  • Completed the Beginner Tutorial Prerequisites.

  • Created the project and dataset entities in the NeMo Entity Store from the previous tutorial, with their IDs and your chosen namespace stored as environment variables.

Set Up Environment Variables#

Set the following environment variables for storing the NeMo Data Store microservice base URL and your Hugging Face token.

DATA_STORE_BASE_URL="<NeMo Data Store base URL>"

HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"
HF_TOKEN="your-hugging-face-token"

For example, if you continue to use the minikube cluster from Beginner Tutorial Prerequisites, you can use the following values for the environment variables:

DATA_STORE_BASE_URL="http://data-store.test"

HF_ENDPOINT="http://data-store.test/v1/hf"
HF_TOKEN="your-hugging-face-token"
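The Python snippets in the following sections load these variables with `load_dotenv()`, so one option (a sketch; substitute the values for your own deployment and token) is to write them to a `.env` file in your working directory:

```shell
# Write the environment variables to a .env file so load_dotenv() can find them.
cat > .env <<'EOF'
DATA_STORE_BASE_URL="http://data-store.test"
HF_ENDPOINT="http://data-store.test/v1/hf"
HF_TOKEN="your-hugging-face-token"
EOF
```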

Set Up Hugging Face Client with NeMo Data Store#

To set up the Hugging Face client with the NeMo Data Store API endpoint, run the following code.

import os

from dotenv import load_dotenv
from huggingface_hub import HfApi

# Load environment variables created in the previous tutorial
load_dotenv()

hf_api = HfApi(endpoint=os.environ["HF_ENDPOINT"], token=os.environ["HF_TOKEN"])

Create a Dataset#

In the previous tutorial, we created project and dataset entity objects in the NeMo Entity Store and stored them as environment variables, along with our chosen namespace. Now, we can create a dataset repository in the NeMo Data Store and upload our dataset files to it.

Create a Repo#

To create a dataset repository:

  1. Use the create_repo() method from the Hugging Face Hub API.

  2. Specify your DATASET_ID and set the repo type to dataset.

import os

dataset_id = os.environ["DATASET_ID"]  # set in the previous tutorial
hf_api.create_repo(dataset_id, repo_type="dataset")

Upload Files#

To upload your dataset files to the repository:

  1. Create a function that handles the upload process:

    • Takes hf_api and dataset_id as parameters

    • Includes error handling for upload failures

  2. Upload each data folder using upload_folder():

    • Training data goes to the “training” directory

    • Validation data goes to the “validation” directory

    • Testing data goes to the “testing” directory

def upload_dataset_files(hf_api, dataset_id):
    try:
        # Upload training data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/training",
            path_in_repo="training",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        
        # Upload validation data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/validation",
            path_in_repo="validation",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        
        # Upload testing data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/testing",
            path_in_repo="testing",
            repo_id=dataset_id,
            repo_type="dataset"
        )
    except Exception as e:
        print(f"Error uploading files: {e}")

# Execute the upload function
upload_dataset_files(hf_api, dataset_id)
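After the upload completes, you can optionally sanity-check the repository contents. The following is a sketch: `list_repo_files()` is a standard `huggingface_hub` call, while `missing_splits()` is a hypothetical helper introduced here to flag any split directory that received no files.

```python
EXPECTED_SPLITS = ("training", "validation", "testing")

def missing_splits(repo_files, expected=EXPECTED_SPLITS):
    """Return the split directories that have no files in the repo listing."""
    return [d for d in expected if not any(f.startswith(d + "/") for f in repo_files)]

# Requires a reachable NeMo Data Store; uses hf_api and dataset_id from above.
# files = hf_api.list_repo_files(dataset_id, repo_type="dataset")
# print(missing_splits(files))  # an empty list means all three splits uploaded
```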

Note

Ensure your local folders contain the appropriate data files before running this code. The upload process may take several minutes depending on your file sizes and internet connection.
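If you want to exercise the upload flow before preparing real data, a sketch like the following creates minimal placeholder `.jsonl` files at the paths the upload function expects. The file names and record contents here are hypothetical; replace them with your actual dataset files.

```python
from pathlib import Path

def create_placeholder_data(base_dir="./tmp/sample_test_data"):
    """Create one tiny placeholder .jsonl file per split directory."""
    record = '{"prompt": "Hello", "completion": "Hi there"}\n'  # hypothetical sample record
    created = []
    for split in ("training", "validation", "testing"):
        split_dir = Path(base_dir) / split
        split_dir.mkdir(parents=True, exist_ok=True)
        path = split_dir / f"{split}.jsonl"
        path.write_text(record)
        created.append(str(path))
    return created

# create_placeholder_data()  # then run upload_dataset_files(hf_api, dataset_id)
```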

Next Steps#

Now that you’ve learned how to upload dataset files to the NeMo Data Store, you can proceed to the model fine-tuning tutorial with NeMo Customizer.

To learn more about the NeMo Data Store and NeMo Entity Store microservice APIs, refer to their API reference documentation.