Data Loading Concepts#
This guide covers the core concepts for loading and managing text data from local files in NVIDIA NeMo Curator.
DocumentDataset#
DocumentDataset is the foundation for handling large-scale text data processing in NeMo Curator. It is built on top of Dask DataFrames (dd.DataFrame) to enable distributed processing of local files.
| Feature | Description |
|---|---|
| Lazy Loading & Memory Management | Built on Dask DataFrames, so files are read lazily and processed partition by partition rather than loaded into memory all at once |
| GPU Acceleration | Optional cuDF backend (backend="cudf") for GPU-accelerated reading and processing |
| Robust Processing | Distributed, partition-based execution with controls such as persist() and repartition() for large-scale workloads |
Usage Examples
# Creating DocumentDataset from different sources
from nemo_curator.datasets import DocumentDataset
# Read JSONL files
dataset = DocumentDataset.read_json("data.jsonl")
# Read Parquet files with GPU acceleration
gpu_dataset = DocumentDataset.read_parquet(
    "data.parquet",
    backend="cudf"  # Enable GPU acceleration
)
# Read multiple files
dataset = DocumentDataset.read_json([
    "data1.jsonl",
    "data2.jsonl"
])
# Basic operations
print(f"Dataset size: {len(dataset)}")
sample_data = dataset.head(10) # Get first 10 rows
persisted = dataset.persist() # Persist in memory
repartitioned = dataset.repartition(npartitions=4) # Repartition
# Convert to pandas for local processing
pandas_df = dataset.to_pandas()
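DocumentDataset can also be created directly from an in-memory pandas DataFrame via from_pandas (the same method used in the multi-source example later on this page). A minimal sketch; the column names are illustrative:
# Create a DocumentDataset from an in-memory pandas DataFrame (illustrative columns)
import pandas as pd
df = pd.DataFrame({
    "id": ["doc-0", "doc-1"],
    "text": ["first document", "second document"]
})
small_dataset = DocumentDataset.from_pandas(df)
print(small_dataset.to_pandas().head())  # round-trip back to pandas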
ParallelDataset#
ParallelDataset extends DocumentDataset to handle parallel text data, particularly for machine translation and cross-lingual tasks.
| Feature | Description |
|---|---|
| Parallel Text Processing | Loads aligned source/target (bitext) files into a single dataset, keeping sentence pairs together |
| Quality Filters | Works with filters such as LengthRatioFilter to remove poorly aligned or low-quality sentence pairs |
| Output Formats | Exports back to bitext files via to_bitext(), alongside the standard DocumentDataset export options |
Usage Examples
# Loading parallel text files (single pair)
from nemo_curator.datasets import ParallelDataset
dataset = ParallelDataset.read_simple_bitext(
    src_input_files="data.en",
    tgt_input_files="data.de",
    src_lang="en",
    tgt_lang="de"
)
# Multiple file pairs
dataset = ParallelDataset.read_simple_bitext(
    src_input_files=["train.en", "dev.en"],
    tgt_input_files=["train.de", "dev.de"],
    src_lang="en",
    tgt_lang="de"
)
# Apply length ratio filter
from nemo_curator.filters import LengthRatioFilter
length_filter = LengthRatioFilter(max_ratio=3.0)
filtered_dataset = length_filter(dataset)
# Export processed data
dataset.to_bitext(
    output_file_dir="processed_data/",
    write_to_filename=True
)
Supported File Formats#
DocumentDataset supports multiple file formats for loading text data from local files:
JSON Lines format - Most commonly used format for text datasets in NeMo Curator.
# Single file
dataset = DocumentDataset.read_json("data.jsonl")
# Multiple files
dataset = DocumentDataset.read_json([
    "file1.jsonl",
    "file2.jsonl"
])
# Directory of files
dataset = DocumentDataset.read_json("data_directory/")
# Performance optimization with column selection
dataset = DocumentDataset.read_json(
    "data.jsonl",
    columns=["text", "id"]
)
Columnar format - Better performance for large datasets and GPU acceleration.
# Basic Parquet reading
dataset = DocumentDataset.read_parquet("data.parquet")
# GPU acceleration (recommended for production)
dataset = DocumentDataset.read_parquet(
    "data.parquet",
    backend="cudf"
)
# Column selection for better performance
dataset = DocumentDataset.read_parquet(
    "data.parquet",
    columns=["text", "metadata"]
)
Python serialization - For preserving complex data structures.
# Read pickle files
dataset = DocumentDataset.read_pickle("data.pkl")
# Multiple pickle files
dataset = DocumentDataset.read_pickle([
    "data1.pkl",
    "data2.pkl"
])
Custom formats - Extensible framework for specialized file readers.
# Custom file format
dataset = DocumentDataset.read_custom(
    input_files="custom_data.ext",
    file_type="ext",
    read_func_single_partition=my_custom_reader,
    backend="pandas"
)
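The reader callable above is user supplied. As a rough sketch (the exact keyword arguments NeMo Curator passes to read_func_single_partition are an assumption here; check the API reference), a reader might take a list of file paths and return one DataFrame per partition:
# Hypothetical custom reader: parses each ".ext" file into one row per line.
# The signature is an assumption; verify against the read_custom API reference.
import pandas as pd

def my_custom_reader(files, backend="pandas", **kwargs):
    records = []
    for path in files:
        with open(path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                records.append({"id": f"{path}-{i}", "text": line.rstrip("\n")})
    return pd.DataFrame(records)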
Data Export Options#
NeMo Curator provides flexible export options for processed datasets:
JSON Lines export - Human-readable format for text datasets.
# Basic export
dataset.to_json("output_directory/")
# Export with filename preservation
dataset.to_json(
    "output_directory/",
    write_to_filename=True,
    keep_filename_column=True
)
# Partitioned export
dataset.to_json(
    "output_directory/",
    partition_on="language"
)
Parquet export - Optimized columnar format for production workflows.
# Basic export
dataset.to_parquet("output_directory/")
# Export with partitioning
dataset.to_parquet(
    "output_directory/",
    partition_on="category"
)
# GPU-accelerated export
dataset.to_parquet(
    "output_directory/",
    backend="cudf"
)
Common Loading Patterns#
Loading from multiple sources - Combine data from different locations and formats.
# Combine multiple directories
dataset = DocumentDataset.read_json([
    "dataset_v1/",
    "dataset_v2/",
    "additional_data/"
])
# Mixing file types in one read call is not recommended; load each format
# separately and combine afterwards (or convert to a consistent format first)
import pandas as pd

jsonl_data = DocumentDataset.read_json("text_data.jsonl")
parquet_data = DocumentDataset.read_parquet("structured_data.parquet")
# Combine datasets after loading (to_pandas() materializes the data,
# so this pattern suits datasets that fit in host memory)
combined = DocumentDataset.from_pandas(
    pd.concat([jsonl_data.to_pandas(), parquet_data.to_pandas()])
)
Performance optimization - Maximize throughput and minimize memory usage.
# Optimize for GPU processing
dataset = DocumentDataset.read_parquet(
    "large_dataset.parquet",
    backend="cudf",
    columns=["text", "id"],  # Only load needed columns
    files_per_partition=4  # Optimize partition size
)
# Optimize memory usage
dataset = DocumentDataset.read_json(
    "data.jsonl",
    blocksize="512MB",  # Adjust based on available memory
    backend="pandas"
)
# Parallel loading with custom partition size
dataset = DocumentDataset.read_parquet(
    "data/",
    npartitions=16  # Match CPU/GPU count
)
Working with large datasets - Handle massive datasets efficiently.
# Efficient processing for large datasets
dataset = DocumentDataset.read_parquet("massive_dataset/")
# Persist in memory for repeated operations
dataset = dataset.persist()
# Repartition for optimal processing
dataset = dataset.repartition(npartitions=8)
# Process in chunks
for partition in dataset.to_delayed():
    # Process each partition separately
    result = partition.compute()
# Lazy evaluation for memory efficiency
dataset = dataset.map_partitions(
    lambda df: df.head(1000)  # Process only first 1000 rows per partition
)
Remote Data Acquisition#
For users who need to download and process data from remote sources, NeMo Curator provides a comprehensive data acquisition framework. This is covered in detail in Data Acquisition Concepts, which includes:
- DocumentDownloader, DocumentIterator, and DocumentExtractor components
- Built-in support for Common Crawl, ArXiv, Wikipedia, and custom sources
- Integration patterns with DocumentDataset
- Configuration and scaling strategies
The data acquisition process produces standard DocumentDataset objects that integrate seamlessly with the local file loading concepts covered on this page.
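For orientation, here is a hedged sketch of how downloaded data feeds into the patterns above, using download_common_crawl from nemo_curator.download. The snapshot identifiers and output paths are illustrative placeholders, and the current parameter list may differ; see Data Acquisition Concepts for details.
# Hedged sketch: acquire Common Crawl data, then reuse it as a DocumentDataset.
# Paths and snapshot identifiers are illustrative placeholders.
from nemo_curator.download import download_common_crawl

downloaded = download_common_crawl(
    "/output/common_crawl/",  # directory for the extracted JSONL output
    "2023-06",                # start snapshot (illustrative)
    "2023-10",                # end snapshot (illustrative)
)
# The result is a standard DocumentDataset, so the export options above apply
downloaded.to_parquet("/output/common_crawl_parquet/")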