Adding a Custom Dataset Loader#
Note
We recommend reading the Evaluating NeMo Agent Toolkit Workflows guide before proceeding with this detailed documentation.
NeMo Agent Toolkit provides built-in dataset loaders for common file formats (json, jsonl, csv, xls, parquet, and custom). In addition, the toolkit provides a plugin system to add custom dataset loaders for new file formats or data sources.
Summary#
This guide provides a step-by-step process to create and register a custom dataset loader with NeMo Agent Toolkit. A TSV (tab-separated values) dataset loader is used as an example to demonstrate the process.
Existing Dataset Loaders#
You can view the list of existing dataset loaders by running the following command:
nat info components -t dataset_loader
Extending NeMo Agent Toolkit with Custom Dataset Loaders#
To extend NeMo Agent Toolkit with custom dataset loaders, you need to create a dataset loader configuration class and a registration function, then register it with NeMo Agent Toolkit using the register_dataset_loader decorator.
Dataset Loader Configuration#
The dataset loader configuration defines the dataset type name and any format-specific parameters. This configuration is paired with a registration function that yields a DatasetLoaderInfo object containing the load function.
The following example shows how to define and register a custom dataset loader for TSV files:
# my_plugin/dataset_loader_register.py
import pandas as pd
from pydantic import Field

from nat.builder.builder import EvalBuilder
from nat.builder.dataset_loader import DatasetLoaderInfo
from nat.cli.register_workflow import register_dataset_loader
from nat.data_models.dataset_handler import EvalDatasetBaseConfig


class EvalDatasetTsvConfig(EvalDatasetBaseConfig, name="tsv"):
    """Configuration for TSV dataset loader."""

    separator: str = Field(default="\t", description="Column separator character.")


@register_dataset_loader(config_type=EvalDatasetTsvConfig)
async def register_tsv_dataset_loader(config: EvalDatasetTsvConfig, builder: EvalBuilder):
    """Register TSV dataset loader."""

    def load_tsv(file_path, **kwargs):
        return pd.read_csv(file_path, sep=config.separator, **kwargs)

    yield DatasetLoaderInfo(config=config, load_fn=load_tsv, description="TSV file dataset loader")
The EvalDatasetTsvConfig class extends EvalDatasetBaseConfig with the name="tsv" parameter, which sets the _type value used in YAML configuration files.
The register_tsv_dataset_loader function uses the @register_dataset_loader decorator to register the dataset loader with NeMo Agent Toolkit.
The function yields a DatasetLoaderInfo object, which binds the config, load function, and a human-readable description.
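The registration function yields its result instead of returning it. The following is a minimal sketch of why that shape is useful, using only the standard library. It assumes the framework drives such a function like an async context manager; that is an assumption about the internals, not something this guide states. LoaderInfo and register_demo_loader are hypothetical stand-ins, not toolkit classes or functions.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoaderInfo:
    """Hypothetical stand-in for DatasetLoaderInfo (not the toolkit class)."""

    load_fn: Callable[[str], list]
    description: str


events = []  # records the order in which setup, use, and cleanup happen


async def register_demo_loader(separator: str):
    """Mimics the shape of a registration function: set up, yield once, clean up."""
    events.append("setup")

    def load_rows(text: str) -> list:
        # Split each line on the configured separator.
        return [line.split(separator) for line in text.splitlines()]

    try:
        yield LoaderInfo(load_fn=load_rows, description="demo loader")
    finally:
        # Runs when the consumer leaves the "async with" block.
        events.append("cleanup")


async def main() -> list:
    # Assumption: the framework consumes the generator as a context manager.
    manager = asynccontextmanager(register_demo_loader)
    async with manager("\t") as info:
        events.append("use")
        return info.load_fn("a\tb\nc\td")


rows = asyncio.run(main())
print(rows)
print(events)
```

Code placed before the yield acts as setup, and code placed after it (or in a finally block) acts as cleanup, so a loader that needs a temporary resource can release it in the same function.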
Understanding DatasetLoaderInfo#
The DatasetLoaderInfo class contains the following fields:
config: The dataset loader configuration object (an instance of EvalDatasetBaseConfig or a subclass).
load_fn: A callable that takes a file path and optional keyword arguments and returns a pandas.DataFrame. This function is used by the evaluation framework to load the dataset.
description: A human-readable description of the dataset loader.
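The load_fn contract (a path plus optional keyword arguments in, a pandas.DataFrame out) can be checked without the toolkit. The sketch below mirrors the load_tsv function from the earlier example. The make_load_fn helper, the sample data, and the column names are hypothetical and exist only for illustration; nrows is a standard pandas.read_csv parameter used here to show keyword forwarding.

```python
import io

import pandas as pd


def make_load_fn(separator: str):
    """Build a load function bound to one separator, in the style of load_tsv."""

    def load_fn(file_path, **kwargs) -> pd.DataFrame:
        # Extra keyword arguments are passed straight through to pandas.
        return pd.read_csv(file_path, sep=separator, **kwargs)

    return load_fn


# Hypothetical sample data; StringIO stands in for a file on disk.
sample = "id\tquestion\tanswer\n1\tWhat is 2 + 2?\t4\n2\tCapital of France?\tParis\n"

load_tsv = make_load_fn("\t")
full = load_tsv(io.StringIO(sample))
first_row_only = load_tsv(io.StringIO(sample), nrows=1)

print(list(full.columns))
print(len(full), len(first_row_only))
```

Binding the separator in a closure keeps the load function's signature independent of the configuration class, which is the same choice the TSV example makes with config.separator.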
Importing for Registration#
To ensure the dataset loader is registered at runtime, import the registration function in your project’s register.py file – even if the function is not called directly.
# my_plugin/register.py
from .dataset_loader_register import register_tsv_dataset_loader
Entry Point#
Add an entry point in your pyproject.toml so that NeMo Agent Toolkit discovers the plugin automatically:
[project.entry-points.'nat.plugins']
my_plugin = "my_plugin.register"
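Entry points only become visible after the package is installed into the active environment (for example, in editable mode). As a quick check that does not depend on the toolkit, the standard library can list what is registered under a group. The helper name list_plugin_entry_points is hypothetical; the group name is taken from the pyproject.toml snippet above.

```python
from importlib.metadata import entry_points


def list_plugin_entry_points(group: str = "nat.plugins") -> list:
    """Return sorted (name, target) pairs for every entry point in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older interpreters return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return sorted((ep.name, ep.value) for ep in selected)


for name, target in list_plugin_entry_points():
    print(f"{name} -> {target}")
```

If my_plugin does not appear in the output, the package is probably not installed in the environment that runs nat, or the group name in pyproject.toml does not match.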
Display All Dataset Loaders#
To display all registered dataset loaders, run the following command:
nat info components -t dataset_loader
The output now includes the custom tsv dataset loader.
Using the Custom Dataset Loader#
Once registered, you can use the custom dataset loader in your evaluation configuration:
eval:
  general:
    dataset:
      _type: tsv
      file_path: <path to your file>
      separator: "\t"
The _type field specifies the dataset loader name. All fields defined in the configuration class are available as YAML keys.
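To try this configuration, file_path needs to point at a real tab-separated file. The following sketch writes a small one with the standard library. The column names (id, question, answer) and the write_sample_tsv helper are assumptions made for illustration; check the evaluation guide for the columns your workflow actually expects.

```python
import csv
import tempfile
from pathlib import Path


def write_sample_tsv(path: Path) -> Path:
    """Write a tiny tab-separated dataset; column names are illustrative only."""
    rows = [
        {"id": "1", "question": "What is 2 + 2?", "answer": "4"},
        {"id": "2", "question": "Capital of France?", "answer": "Paris"},
    ]
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["id", "question", "answer"], delimiter="\t"
        )
        writer.writeheader()
        writer.writerows(rows)
    return path


sample_path = write_sample_tsv(Path(tempfile.mkdtemp()) / "sample.tsv")
lines = sample_path.read_text(encoding="utf-8").splitlines()
print(sample_path)
print(lines[0].split("\t"))
```

The printed path can then be used as the file_path value in the dataset section.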
Running the Evaluation#
Run the evaluation using the standard command:
nat eval --config_file <path to file>
Built-in Dataset Loaders#
The following dataset loaders are included with NeMo Agent Toolkit:
| Type | Description | Load Function |
|---|---|---|
| json | JSON file dataset | |
| jsonl | JSON Lines file dataset | Custom JSONL reader |
| csv | CSV file dataset | |
| parquet | Parquet file dataset | |
| xls | Excel file dataset | |
| custom | Custom parser function | User-provided function via |
For more details on the built-in dataset formats and their configuration options, see the Using Datasets section in the evaluation guide.