Create Dataset Files#
Note
The time to complete this tutorial is approximately 15 minutes.
Learn how to manage dataset files using NeMo Data Store APIs. In this tutorial, you’ll become familiar with:
Creating a repository in the NeMo Data Store.
Uploading dataset files to the repository.
Prerequisites#
Before you can manage dataset files, make sure that you have:
The NeMo Data Store microservice running on your cluster.
Completed the Set Up Organizational Entities tutorial.
Set Up Environment Variables#
Set the following environment variables to store the NeMo Data Store microservice base URL and your Hugging Face token.
DATA_STORE_BASE_URL="<NeMo Data Store base URL>"
HF_ENDPOINT="${DATA_STORE_BASE_URL}/v1/hf"
HF_TOKEN="your-hugging-face-token"
For example, if you continue to use the minikube cluster from Beginner Tutorial Prerequisites, you can use the following values for the environment variables:
DATA_STORE_BASE_URL="http://data-store.test"
HF_ENDPOINT="http://data-store.test/v1/hf"
HF_TOKEN="your-hugging-face-token"
Set Up Hugging Face Client with NeMo Data Store#
To set up the Hugging Face client with the NeMo Data Store API endpoint, run the following code.
import os
from dotenv import load_dotenv
from huggingface_hub import HfApi

# Load environment variables created in the previous tutorial
load_dotenv()
# Read the endpoint and token from the environment
HF_ENDPOINT = os.environ["HF_ENDPOINT"]
HF_TOKEN = os.environ["HF_TOKEN"]
hf_api = HfApi(endpoint=HF_ENDPOINT, token=HF_TOKEN)
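Optionally, you can confirm that the client targets the NeMo Data Store rather than the public Hugging Face Hub by inspecting its endpoint:
# Optional sanity check: the client should point at the Data Store,
# not the public Hugging Face Hub.
print(hf_api.endpoint)  # for the minikube example: http://data-store.test/v1/hf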
Create a Dataset#
In the previous tutorial, we created project and dataset entity objects in the NeMo Entity Store and stored them as environment variables, along with our chosen namespace. Now, we can create a dataset repository in the NeMo Data Store and upload our dataset files to it.
Create a Repo#
To create a dataset repository:
Use the create_repo() method from the Hugging Face Hub API.
Specify your DATASET_ID and set the repo type to dataset.
# DATASET_ID was stored as an environment variable in the previous tutorial
DATASET_ID = os.environ["DATASET_ID"]
hf_api.create_repo(DATASET_ID, repo_type="dataset")
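If you rerun this tutorial, create_repo() raises an error when the repository already exists. The Hugging Face Hub API accepts an exist_ok flag that makes the call idempotent; a minimal sketch, assuming the Data Store honors this flag:
# Idempotent variant: exist_ok=True turns "repository already exists"
# into a no-op instead of an error, so reruns don't fail.
hf_api.create_repo(DATASET_ID, repo_type="dataset", exist_ok=True)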
Upload Files#
To upload your dataset files to the repository:
Create a function that handles the upload process:
Takes hf_api and dataset_id as parameters
Includes error handling for upload failures
Upload each data folder using upload_folder():
Training data goes to the “training” directory
Validation data goes to the “validation” directory
Testing data goes to the “testing” directory
def upload_dataset_files(hf_api, dataset_id):
    try:
        # Upload training data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/training",
            path_in_repo="training",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        # Upload validation data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/validation",
            path_in_repo="validation",
            repo_id=dataset_id,
            repo_type="dataset"
        )
        # Upload testing data
        hf_api.upload_folder(
            folder_path="./tmp/sample_test_data/testing",
            path_in_repo="testing",
            repo_id=dataset_id,
            repo_type="dataset"
        )
    except Exception as e:
        print(f"Error uploading files: {e}")

# Execute the upload function
upload_dataset_files(hf_api, DATASET_ID)
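To confirm that the upload succeeded, you can list the files now stored in the repository. This sketch assumes the Data Store supports the Hugging Face Hub file-listing endpoint:
# List the repository contents to verify the three splits uploaded.
files = hf_api.list_repo_files(DATASET_ID, repo_type="dataset")
print(files)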
Note
Ensure your local folders contain the appropriate data files before running this code. The upload process may take several minutes depending on your file sizes and internet connection.
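If you don’t yet have sample data locally, the following sketch creates the folder layout that upload_dataset_files() expects, with one placeholder JSONL record per split. The file names and record format here are illustrative assumptions only; use the data format that your fine-tuning workflow requires.
import json
from pathlib import Path

# Create the expected folder layout with a placeholder record per split.
# The prompt/completion format is illustrative, not a NeMo requirement.
base = Path("./tmp/sample_test_data")
for split in ("training", "validation", "testing"):
    split_dir = base / split
    split_dir.mkdir(parents=True, exist_ok=True)
    record = {"prompt": "Example input", "completion": "Example output"}
    (split_dir / f"{split}.jsonl").write_text(json.dumps(record) + "\n")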
Next Steps#
Now that you’ve learned how to upload dataset files to the NeMo Data Store, you can proceed to the model fine-tuning tutorial with NeMo Customizer.
To learn more about the NeMo Data Store and NeMo Entity Store microservice APIs, refer to the API reference documentation for each microservice.