Built-in Readers

PhysicsNeMo's datapipe readers provide a common, abstract interface for loading data from various sources into the datapipe framework. Because every reader implements the same API, users can write new dataset readers and plug them into existing datapipes.

Base Reader

Each reader class should inherit from the base Reader, described below. Subclasses must implement at least two methods. The first is _load_sample, which takes an integer index into the dataset and returns a dictionary of CPU tensors for that sample. You do not need to move the tensors to the GPU; that is handled automatically.

The second is __len__, which returns the number of samples in the dataset.

When constructing a reader, set pin_memory to True or False to control CPU memory pinning. Pinned memory enables faster, asynchronous data transfer from host to device, sometimes at the cost of higher CPU resource usage.

The Reader abstraction has configurable support for dataset metadata that is not passed through the preprocessing pipeline but can optionally be consumed in a training or inference loop. To control how metadata is fetched and returned, override the _get_sample_metadata method.
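The metadata hook can be pictured with a small standalone sketch. Everything here is an assumption made for illustration: ToyReader is a hypothetical stand-in written for this example, not the real Reader class, plain Python values replace tensors, and the default metadata of {"index": index} is a guess at what a default might look like. The sketch only shows how an overridden _get_sample_metadata travels next to the data rather than through it.

```python
from abc import ABC, abstractmethod
from typing import Any


class ToyReader(ABC):
    """Hypothetical stand-in for the Reader base class, for illustration only."""

    @abstractmethod
    def _load_sample(self, index: int) -> dict[str, Any]:
        """Return the fields of one sample."""

    @abstractmethod
    def __len__(self) -> int:
        """Return the number of samples."""

    def _get_sample_metadata(self, index: int) -> dict[str, Any]:
        # Default hook in this sketch (an assumption): report only the index.
        return {"index": index}

    def __getitem__(self, index: int) -> tuple[dict[str, Any], dict[str, Any]]:
        # Data and metadata are returned side by side, so metadata never
        # has to pass through the preprocessing pipeline.
        return self._load_sample(index), self._get_sample_metadata(index)


class RunReader(ToyReader):
    """Toy reader that attaches a run name to each sample's metadata."""

    def __init__(self, values: list[float], run_names: list[str]) -> None:
        self.values = values
        self.run_names = run_names

    def _load_sample(self, index: int) -> dict[str, Any]:
        return {"x": self.values[index]}

    def __len__(self) -> int:
        return len(self.values)

    def _get_sample_metadata(self, index: int) -> dict[str, Any]:
        # Extend the default metadata instead of replacing it.
        metadata = super()._get_sample_metadata(index)
        metadata["run_name"] = self.run_names[index]
        return metadata


reader = RunReader([0.5, 1.5], ["run_a", "run_b"])
data, metadata = reader[1]
```

In this sketch a training loop could log metadata["run_name"] for each sample while passing only data to the model.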

For some datasets, such as very high-resolution volumetric datasets that are downsampled at training time, the Reader classes provide a fast-path, read-only-what-you-need optimization called “coordinated_subsampling”. In essence, if your input and output fields each hold 1 billion points but you consume only 100,000 per training step, there is no reason to read the other 999.9 million points. However, the I/O selection must be coordinated so that every field takes the same subsample within a batch, and a new subsample is drawn each training iteration.
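The coordination requirement can be sketched without any PhysicsNeMo code. The snippet below is a minimal illustration, not the library's implementation: coordinated_indices, read_rows, and the list-backed fields are hypothetical names invented here, and Python's random.Random stands in for whatever RNG the readers use. It draws one index set per step and applies it to every target field, so inputs and outputs stay aligned while only the selected rows are touched.

```python
import random


def coordinated_indices(n_total: int, n_points: int, rng: random.Random) -> list[int]:
    """Draw one sorted set of unique row indices to share across fields."""
    return sorted(rng.sample(range(n_total), n_points))


def read_rows(field: list, indices: list[int]) -> list:
    """Stand-in for a partial read: touch only the requested rows."""
    return [field[i] for i in indices]


# Two toy fields with the same leading dimension (hypothetical data).
n_total = 1_000
coords = [(float(i), 0.0, 0.0) for i in range(n_total)]
values = [2.0 * i for i in range(n_total)]

rng = random.Random(0)

# One draw per step, reused for every target field.
step_indices = coordinated_indices(n_total, 10, rng)
sub_coords = read_rows(coords, step_indices)
sub_values = read_rows(values, step_indices)

# A later step draws again from the same RNG and gets a fresh index set.
next_indices = coordinated_indices(n_total, 10, rng)
```

Because both fields are read with the same step_indices, row k of sub_coords still corresponds to row k of sub_values. Drawing indices separately per field would break that pairing.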

class physicsnemo.datapipes.readers.base.Reader(
*,
pin_memory: bool = False,
include_index_in_metadata: bool = True,
coordinated_subsampling: dict[str, Any] | None = None,
)

Bases: ABC

Abstract base class for data readers.

Readers are intentionally simple and transactional:

  • Load data from a source (file, database, etc.)

  • Return (TensorDict, metadata_dict) tuples with CPU tensors

  • No threading, no prefetching, no device transfers

This design makes custom readers easy to implement. Users only need to:

  1. Implement _load_sample(index) to load raw data

  2. Implement __len__() to return dataset size

Device transfers are handled automatically by Dataset (if the device parameter is set). Threading and prefetching are handled by the DataLoader.

Examples

Custom reader implementation:

>>> class MyReader(Reader):
...     def __init__(self, path: str, **kwargs):
...         super().__init__(**kwargs)
...         self.data = load_my_data(path)
...
...     def _load_sample(self, index: int) -> dict[str, torch.Tensor]:
...         return {"x": torch.from_numpy(self.data[index])}
...
...     def __len__(self) -> int:
...         return len(self.data)

Subclasses must implement:

  • _load_sample(index: int) -> dict[str, torch.Tensor]

  • __len__() -> int

Optionally override:

  • _get_field_names() -> list[str]

  • _get_sample_metadata(index: int) -> dict[str, Any]

  • close()

close() -> None

Clean up resources (file handles, connections, etc.).

Override this in subclasses that hold open resources.

property field_names: list[str]

List of field names available in samples.

Returns:

Field names.

Return type:

list[str]

set_epoch(epoch: int) -> None

Reseed the reader’s RNG for a new epoch.

Override in subclasses that use randomness. The default implementation is a no-op.

Parameters:

epoch (int) – Current epoch number.

set_generator(generator: Generator) -> None

Assign a torch.Generator for reproducible random sampling.

Override in subclasses that use randomness (e.g. subsampling). The default implementation is a no-op.

Parameters:

generator (torch.Generator) – Generator to use for random draws.
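A subclass that subsamples randomly might use these hooks along the following lines. This is a sketch under stated assumptions: EpochSeededSampler is a hypothetical helper, random.Random stands in for torch.Generator so the example runs with the standard library alone, and deriving the epoch seed as base_seed + epoch is one possible scheme rather than the one the built-in readers necessarily use.

```python
import random


class EpochSeededSampler:
    """Hypothetical helper showing per-epoch reseeding for subsampling."""

    def __init__(self, base_seed: int = 0) -> None:
        self.base_seed = base_seed
        self.rng = random.Random(base_seed)

    def set_epoch(self, epoch: int) -> None:
        # Assumed scheme: derive the epoch seed from the base seed and epoch.
        self.rng = random.Random(self.base_seed + epoch)

    def draw(self, n_total: int, n_points: int) -> list[int]:
        # Unique row indices, sorted for sequential-friendly reads.
        return sorted(self.rng.sample(range(n_total), n_points))


sampler = EpochSeededSampler(base_seed=1234)

sampler.set_epoch(0)
epoch0 = sampler.draw(n_total=10_000, n_points=5)

sampler.set_epoch(1)
epoch1 = sampler.draw(n_total=10_000, n_points=5)

# Re-running epoch 0 reproduces the same indices.
sampler.set_epoch(0)
epoch0_again = sampler.draw(n_total=10_000, n_points=5)
```

Reseeding per epoch keeps runs reproducible while still giving each epoch a different subsample.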

Usage of readers

Readers are designed to be consumed by PhysicsNeMo Dataset objects, but they can also be used on their own. They support iteration and random-access indexing through __getitem__. Note that subclasses should not implement __getitem__ directly.

When accessed, each reader returns the sample data as a TensorDict, together with a metadata dictionary. The conversion from the dict returned by the user-implemented _load_sample to a TensorDict is automatic.

Readers handle I/O exclusively. If you are building a custom datapipe, it is strongly encouraged to implement transforms as separate operations, which enables GPU computation and composable, extensible pipelines.
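The separation between I/O and transforms can be illustrated with a small, library-independent sketch. The names compose, load_sample, scale_pressure, and add_mean are hypothetical and exist only for this example, and plain Python lists stand in for tensors. The PhysicsNeMo transform API itself is not shown here.

```python
from typing import Any, Callable

Sample = dict[str, Any]
Transform = Callable[[Sample], Sample]


def compose(*transforms: Transform) -> Transform:
    """Chain transforms left to right into a single callable."""

    def run(sample: Sample) -> Sample:
        for transform in transforms:
            sample = transform(sample)
        return sample

    return run


def load_sample(index: int) -> Sample:
    # Reader side: I/O only (a toy in-memory "dataset" here).
    return {"pressure": [float(index), float(index) + 2.0]}


def scale_pressure(sample: Sample) -> Sample:
    # Transform side: returns a new sample, leaving the input untouched.
    return {**sample, "pressure": [p * 0.5 for p in sample["pressure"]]}


def add_mean(sample: Sample) -> Sample:
    values = sample["pressure"]
    return {**sample, "pressure_mean": sum(values) / len(values)}


pipeline = compose(scale_pressure, add_mean)
result = pipeline(load_sample(2))
```

Keeping load_sample free of processing means the same reader can feed different pipelines, and each transform can be tested or reordered independently.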

Below are the current built-in readers for physicsnemo.

HDF5Reader

class physicsnemo.datapipes.readers.hdf5.HDF5Reader(
path: Path | str,
*,
fields: list[str] | None = None,
file_pattern: str = '*.h5',
index_key: str | None = None,
pin_memory: bool = False,
include_index_in_metadata: bool = True,
)

Bases: Reader

Read samples from HDF5 files.

Supports two modes:

  1. Single file with samples indexed along first dimension of datasets

  2. Directory of HDF5 files, one sample per file

Examples

Single file mode:

>>> # File structure: data.h5 with datasets "pressure" (N, 100), "velocity" (N, 100, 3)
>>> reader = HDF5Reader("data.h5", fields=["pressure", "velocity"])
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> data["pressure"].shape  # torch.Size([100])

Directory mode:

>>> # Directory with sample_0.h5, sample_1.h5, ...
>>> reader = HDF5Reader("data_dir/", file_pattern="sample_*.h5")
>>> data, metadata = reader[0]  # Loads all datasets from sample_0.h5

close() -> None

Close HDF5 file handle.

NumpyReader

class physicsnemo.datapipes.readers.numpy.NumpyReader(
path: str | Path,
*,
fields: list[str] | None = None,
default_values: dict[str, Tensor] | None = None,
file_pattern: str = '*.npz',
index_key: str | None = None,
pin_memory: bool = False,
include_index_in_metadata: bool = True,
coordinated_subsampling: dict[str, Any] | None = None,
)

Bases: Reader

Read samples from NumPy .npz files.

Supports two modes:

  1. Single .npz file: samples indexed along first dimension of each array

  2. Directory of .npz files: one sample per file

Examples

Single file mode:

>>> # data.npz with arrays "positions" (N, 100, 3), "features" (N, 100)
>>> reader = NumpyReader("data.npz", fields=["positions", "features"])
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> # Or load all arrays:
>>> reader = NumpyReader("data.npz")  # fields=None loads all

Directory mode:

>>> # Directory with sample_0.npz, sample_1.npz, ...
>>> reader = NumpyReader("data_dir/", file_pattern="sample_*.npz")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple

close() -> None

Close file handles.

property fields: list[str]

Fields that will be loaded (user-specified or all available).

set_epoch(epoch: int) -> None

Reseed the subsample RNG for a new epoch.

set_generator(generator: Generator) -> None

Assign a torch.Generator for reproducible subsampling.

ZarrReader

class physicsnemo.datapipes.readers.zarr.ZarrReader(
path: str | Path,
*,
fields: list[str] | None = None,
default_values: dict[str, Tensor] | None = None,
group_pattern: str = '*.zarr',
pin_memory: bool = False,
include_index_in_metadata: bool = True,
coordinated_subsampling: dict[str, Any] | None = None,
cache_stores: bool = True,
)

Bases: Reader

Read samples from Zarr groups.

Zarr is a chunked, compressed array format ideal for large scientific datasets. Each Zarr group in the directory represents one sample. Supports loading both arrays and attributes from Zarr groups.

Examples

Basic usage:

>>> # Directory with sample_0.zarr, sample_1.zarr, ...
>>> # Each contains arrays like "positions", "features", etc.
>>> reader = ZarrReader("data_dir/", group_pattern="sample_*.zarr")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple

Load only specific fields:

>>> reader = ZarrReader("data_dir/", fields=["positions", "velocity"])
>>> data, metadata = reader[0]

Load attributes from Zarr groups:

>>> # If the Zarr group has attributes like "timestep" or "scale_factor",
>>> # you can request them as fields:
>>> reader = ZarrReader("data_dir/", fields=["positions", "timestep", "scale_factor"])
>>> data, metadata = reader[0]  # data["timestep"] contains the attribute value

With coordinated subsampling for large arrays:

>>> reader = ZarrReader(
...     "data_dir/",
...     coordinated_subsampling={
...         "n_points": 50000,
...         "target_keys": ["volume_coords", "volume_fields"],
...     }
... )
>>> data, metadata = reader[0]

close() -> None

Close resources and cached zarr stores.

property fields: list[str]

Fields that will be loaded (user-specified or all available).

set_epoch(epoch: int) -> None

Reseed the subsample RNG for a new epoch.

set_generator(generator: Generator) -> None

Assign a torch.Generator for reproducible subsampling.

TensorStoreZarrReader

class physicsnemo.datapipes.readers.tensorstore_zarr.TensorStoreZarrReader(
path: str | Path,
*,
fields: list[str] | None = None,
default_values: dict[str, Tensor] | None = None,
group_pattern: str = '*.zarr',
cache_bytes_limit: int = 10000000,
data_copy_concurrency: int = 72,
file_io_concurrency: int = 72,
pin_memory: bool = False,
include_index_in_metadata: bool = True,
coordinated_subsampling: dict[str, Any] | None = None,
)

Bases: Reader

High-performance async reader for Zarr files using TensorStore.

This reader provides faster I/O than the standard ZarrReader through async operations, optimized caching, and concurrent data fetching. It’s particularly beneficial for large datasets on networked storage or cloud storage.

This is a drop-in replacement for ZarrReader with an identical interface. Each Zarr group in the directory represents one sample.

Examples

Basic usage:

>>> # Directory with sample_0.zarr, sample_1.zarr, ...
>>> reader = TensorStoreZarrReader("data_dir/", group_pattern="sample_*.zarr")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple

Load only specific fields:

>>> reader = TensorStoreZarrReader("data_dir/", fields=["positions", "velocity"])
>>> data, metadata = reader[0]

With coordinated subsampling for large arrays:

>>> reader = TensorStoreZarrReader(
...     "data_dir/",
...     coordinated_subsampling={
...         "n_points": 50000,
...         "target_keys": ["volume_coords", "volume_fields"],
...     }
... )
>>> data, metadata = reader[0]

Performance Tips:

  • Increase cache_bytes_limit for better performance on repeated access

  • Increase data_copy_concurrency and file_io_concurrency for parallel workloads

  • Use coordinated subsampling when reading subsets of large arrays

property fields: list[str]

Fields that will be loaded (user-specified or all available).

set_epoch(epoch: int) -> None

Reseed the subsample RNG for a new epoch.

set_generator(generator: Generator) -> None

Assign a torch.Generator for reproducible subsampling.

VTKReader

class physicsnemo.datapipes.readers.vtk.VTKReader(
path: str | Path,
*,
keys_to_read: list[str] | None = None,
exclude_patterns: list[str] | None = None,
pin_memory: bool = False,
include_index_in_metadata: bool = True,
)

Bases: Reader

Read samples from VTK format files (.stl, .vtp, .vtu).

This reader loads mesh data from directories where each subdirectory contains VTK files representing one sample. Supports STL (surface meshes), VTP (PolyData), and VTU (UnstructuredGrid) formats.

Requires PyVista to be installed. If PyVista is not available, attempting to instantiate this reader will raise an ImportError with installation instructions.

Examples

>>> # Directory structure:
>>> # data/
>>> #   sample_0/
>>> #     geometry.stl
>>> #     surface.vtp
>>> #   sample_1/
>>> #     geometry.stl
>>> #     surface.vtp
>>> #   ...
>>>
>>> reader = VTKReader(
...     "data/",
...     keys_to_read=["stl_coordinates", "stl_faces", "surface_normals"],
... )
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> print(data["stl_coordinates"].shape)  # (N, 3)

Available Keys:

From .stl files:

  • stl_coordinates: Vertex coordinates, shape \((N, 3)\)

  • stl_faces: Face indices (flattened), shape \((M*3,)\)

  • stl_centers: Face centers, shape \((M, 3)\)

  • surface_normals: Face normals, shape \((M, 3)\)

From .vtp files:

  • surface_mesh_centers: Cell centers

  • surface_normals: Cell normals

  • surface_mesh_sizes: Cell areas

  • Additional fields from the VTP file

Note

VTK files are typically small enough to fit in memory, so coordinated subsampling is not supported. Use transforms for downsampling if needed.