Built-in Readers#
PhysicsNeMo’s datapipe readers are an abstracted interface for enabling data
loading from various sources into the datapipe framework. By providing a common
interface and API to implement, users can easily implement new dataset readers
and plug them into existing datapipes.
Base Reader#
Each reader class should inherit from the base Reader, below. Users must
implement at minimum two methods. The first is _load_sample, which takes an
integer index into the dataset and returns a dictionary of CPU tensors for that
sample. You do not need to move the tensors to the GPU; that is handled
automatically. The second is __len__, which returns the length of the dataset.
At configuration, each reader or subclass should set pin_memory to true or
false to control CPU memory pinning. Pinned memory enables faster, asynchronous
data transfer from host to device, sometimes at the cost of higher CPU resource
usage.
The Reader abstraction has configurable support for dataset metadata that is
not passed through the preprocessing pipeline but can optionally be consumed
in a training or inference loop. To control exactly how metadata is fetched and
returned, override the _get_sample_metadata method.
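The contract described above can be sketched without PhysicsNeMo installed. The block below is a minimal illustration, not the library's implementation: SketchReader is a hypothetical stand-in for the base class, plain lists stand in for CPU tensors, and the "index" metadata key is an assumption about what include_index_in_metadata adds.

```python
from abc import ABC, abstractmethod
from typing import Any


class SketchReader(ABC):
    """Hypothetical stand-in for the Reader base class (not PhysicsNeMo code)."""

    def __init__(self, *, include_index_in_metadata: bool = True) -> None:
        self.include_index_in_metadata = include_index_in_metadata

    @abstractmethod
    def _load_sample(self, index: int) -> dict[str, Any]:
        """Return the raw fields for one sample."""

    @abstractmethod
    def __len__(self) -> int:
        """Return the number of samples."""

    def _get_sample_metadata(self, index: int) -> dict[str, Any]:
        # Default hook: no extra metadata. Subclasses override this.
        return {}

    def __getitem__(self, index: int) -> tuple[dict[str, Any], dict[str, Any]]:
        if not 0 <= index < len(self):
            raise IndexError(index)
        data = self._load_sample(index)
        metadata = dict(self._get_sample_metadata(index))
        if self.include_index_in_metadata:
            metadata["index"] = index  # assumed key name
        return data, metadata


class RunReader(SketchReader):
    """Toy reader over in-memory records with a metadata override."""

    def __init__(self, records: list[dict[str, Any]], **kwargs: Any) -> None:
        super().__init__(**kwargs)
        self.records = records

    def _load_sample(self, index: int) -> dict[str, Any]:
        return {"pressure": self.records[index]["pressure"]}

    def __len__(self) -> int:
        return len(self.records)

    def _get_sample_metadata(self, index: int) -> dict[str, Any]:
        # Metadata travels beside the data, not through the transforms.
        return {"source": self.records[index]["source"]}


reader = RunReader(
    [
        {"pressure": [1.0, 2.0], "source": "run_a.h5"},
        {"pressure": [3.0, 4.0], "source": "run_b.h5"},
    ]
)
data, metadata = reader[1]
```

In this sketch the subclass never defines __getitem__; it only fills in the three hooks, which mirrors the division of labor described for the real base class.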
For some datasets, such as very high resolution volumetric datasets that are
downsampled at training time, the Reader classes provide a fast-path,
only-read-what-you-need optimization called “coordinated_subsampling”. In
essence, if your input and output fields each have 1 billion points but you
consume only 100,000 per training step, there is no reason to read the
remaining 999.9 million points. However, the IO selection must be coordinated:
every field must take the same subsample within a batch, and a new subsample
must be drawn each training iteration.
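The coordination requirement can be illustrated with a small standard-library sketch. It is not the reader's implementation: coordinated_subsample is a hypothetical helper, lists stand in for arrays, and the selection happens in memory here, whereas the point of the real optimization is to hand the selected indices to the storage backend so that the unselected points are never read.

```python
import random
from typing import Any


def coordinated_subsample(
    sample: dict[str, list[Any]],
    n_points: int,
    target_keys: list[str],
    rng: random.Random,
) -> dict[str, list[Any]]:
    """Apply one shared random index set to every target field."""
    total = len(sample[target_keys[0]])
    if any(len(sample[key]) != total for key in target_keys):
        raise ValueError("target fields must share the same number of points")
    # Draw the indices once so row i of every field is the same physical point.
    indices = sorted(rng.sample(range(total), min(n_points, total)))
    out = dict(sample)
    for key in target_keys:
        out[key] = [sample[key][i] for i in indices]
    return out


rng = random.Random(0)
sample = {
    "volume_coords": [(float(i), 0.0, 0.0) for i in range(1000)],
    "volume_fields": [10.0 * i for i in range(1000)],
    "inlet_velocity": [30.0],  # not a target key, so left untouched
}
keys = ["volume_coords", "volume_fields"]
first = coordinated_subsample(sample, 5, keys, rng)
second = coordinated_subsample(sample, 5, keys, rng)  # fresh indices per call
```

Because both target fields are indexed with the same list, each coordinate still lines up with its field value after subsampling; drawing independently per field would silently break that pairing.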
- class physicsnemo.datapipes.readers.base.Reader(
- *,
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
- coordinated_subsampling: dict[str, Any] | None = None,
Bases: ABC
Abstract base class for data readers.
Readers are intentionally simple and transactional:
Load data from a source (file, database, etc.)
Return (TensorDict, metadata_dict) tuples with CPU tensors
No threading, no prefetching, no device transfers
This design makes custom readers easy to implement. Users only need to:
Implement _load_sample(index) to load raw data
Implement __len__() to return dataset size
Device transfers are handled automatically by Dataset (if device parameter set). Threading/prefetching is handled by the DataLoader.
Examples
Custom reader implementation:
>>> class MyReader(Reader):
...     def __init__(self, path: str, **kwargs):
...         super().__init__(**kwargs)
...         self.data = load_my_data(path)
...
...     def _load_sample(self, index: int) -> dict[str, torch.Tensor]:
...         return {"x": torch.from_numpy(self.data[index])}
...
...     def __len__(self) -> int:
...         return len(self.data)
Subclasses must implement:
_load_sample(index: int) -> dict[str, torch.Tensor]
__len__() -> int
Optionally override:
_get_field_names() -> list[str]
_get_sample_metadata(index: int) -> dict[str, Any]
close()
- close() -> None#
Clean up resources (file handles, connections, etc.).
Override this in subclasses that hold open resources.
- property field_names: list[str]#
List of field names available in samples.
- Returns:
Field names.
- Return type:
list[str]
Usage of readers#
Readers are designed to be consumed by physicsnemo Dataset objects, but they
can also be used on their own. They support iteration and random-access
indexing through __getitem__. Note that users should not implement __getitem__
directly.
Each reader returns a TensorDict of data, together with a metadata dictionary,
when accessed. The conversion from the dict returned by the user-implemented
_load_sample to a TensorDict is automatic.
Readers handle IO exclusively. If you are building a custom datapipe, it is
highly encouraged to implement transforms as separate operations. This enables
GPU computations and composable, extensible pipelines.
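One way to picture the separation of IO and transforms is a loader that only reads, followed by small composable functions. The sketch below uses plain Python lists and hypothetical helpers (load_sample, center, scale, compose); it illustrates the design idea only and does not use the PhysicsNeMo transform API.

```python
from typing import Callable

Sample = dict[str, list[float]]
Transform = Callable[[Sample], Sample]


def load_sample(index: int) -> Sample:
    """IO step only: stands in for a reader's _load_sample."""
    return {"pressure": [100.0 + index, 102.0 + index, 104.0 + index]}


def center(field: str) -> Transform:
    """Build a transform that subtracts the mean of one field."""

    def apply(sample: Sample) -> Sample:
        values = sample[field]
        mean = sum(values) / len(values)
        return {**sample, field: [v - mean for v in values]}

    return apply


def scale(field: str, factor: float) -> Transform:
    """Build a transform that multiplies one field by a constant."""

    def apply(sample: Sample) -> Sample:
        return {**sample, field: [v * factor for v in sample[field]]}

    return apply


def compose(*transforms: Transform) -> Transform:
    """Chain transforms left to right into a single callable."""

    def apply(sample: Sample) -> Sample:
        for transform in transforms:
            sample = transform(sample)
        return sample

    return apply


pipeline = compose(center("pressure"), scale("pressure", 0.5))
result = pipeline(load_sample(0))
```

Keeping the loader free of preprocessing means the same pipeline can be reused with a different reader, and each transform can be tested or swapped independently.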
Below are the current built-in readers for physicsnemo.
HDF5Reader#
- class physicsnemo.datapipes.readers.hdf5.HDF5Reader(
- path: Path | str,
- *,
- fields: list[str] | None = None,
- file_pattern: str = '*.h5',
- index_key: str | None = None,
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
Bases: Reader
Read samples from HDF5 files.
Supports two modes:
Single file with samples indexed along first dimension of datasets
Directory of HDF5 files, one sample per file
Examples
Single file mode:
>>> # File structure: data.h5 with datasets "pressure" (N, 100), "velocity" (N, 100, 3)
>>> reader = HDF5Reader("data.h5", fields=["pressure", "velocity"])
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> data["pressure"].shape  # torch.Size([100])
Directory mode:
>>> # Directory with sample_0.h5, sample_1.h5, ...
>>> reader = HDF5Reader("data_dir/", file_pattern="sample_*.h5")
>>> data, metadata = reader[0]  # Loads all datasets from sample_0.h5
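Single-file mode, where samples are indexed along the first dimension, can be pictured with nested lists in place of HDF5 datasets. This is a rough sketch under that substitution: sample_from_stacked and num_samples are hypothetical helpers, and whether the actual reader validates that all datasets share the same leading size is not stated here.

```python
from typing import Any


def sample_from_stacked(datasets: dict[str, list[Any]], index: int) -> dict[str, Any]:
    """Take entry `index` along the first axis of every stacked dataset."""
    return {name: stacked[index] for name, stacked in datasets.items()}


def num_samples(datasets: dict[str, list[Any]]) -> int:
    """All datasets must agree on the size of the leading (sample) axis."""
    lengths = {len(stacked) for stacked in datasets.values()}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent sample counts: {sorted(lengths)}")
    return lengths.pop()


# Toy stand-in for data.h5: N = 4 samples with 5 points each.
datasets = {
    "pressure": [[float(n)] * 5 for n in range(4)],              # (4, 5)
    "velocity": [[[float(n), 0.0, 0.0]] * 5 for n in range(4)],  # (4, 5, 3)
}
sample = sample_from_stacked(datasets, 2)
```

Indexing drops the leading axis, which is why a dataset stored as (N, 100) yields a per-sample field of shape (100) in the example above.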
NumpyReader#
- class physicsnemo.datapipes.readers.numpy.NumpyReader(
- path: str | Path,
- *,
- fields: list[str] | None = None,
- default_values: dict[str, Tensor] | None = None,
- file_pattern: str = '*.npz',
- index_key: str | None = None,
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
- coordinated_subsampling: dict[str, Any] | None = None,
Bases: Reader
Read samples from NumPy .npz files.
Supports two modes:
1. Single .npz file: samples indexed along first dimension of each array
2. Directory of .npz files: one sample per file
- Example (single .npz):
>>> # data.npz with arrays "positions" (N, 100, 3), "features" (N, 100)
>>> reader = NumpyReader("data.npz", fields=["positions", "features"])
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> # Or load all arrays:
>>> reader = NumpyReader("data.npz")  # fields=None loads all
- Example (directory):
>>> # Directory with sample_0.npz, sample_1.npz, ...
>>> reader = NumpyReader("data_dir/", file_pattern="sample_*.npz")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
- property fields: list[str]#
Fields that will be loaded (user-specified or all available).
ZarrReader#
- class physicsnemo.datapipes.readers.zarr.ZarrReader(
- path: str | Path,
- *,
- fields: list[str] | None = None,
- default_values: dict[str, Tensor] | None = None,
- group_pattern: str = '*.zarr',
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
- coordinated_subsampling: dict[str, Any] | None = None,
- cache_stores: bool = True,
Bases: Reader
Read samples from Zarr groups.
Zarr is a chunked, compressed array format ideal for large scientific datasets. Each Zarr group in the directory represents one sample. Supports loading both arrays and attributes from Zarr groups.
Examples
Basic usage:
>>> # Directory with sample_0.zarr, sample_1.zarr, ...
>>> # Each contains arrays like "positions", "features", etc.
>>> reader = ZarrReader("data_dir/", group_pattern="sample_*.zarr")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
Load only specific fields:
>>> reader = ZarrReader("data_dir/", fields=["positions", "velocity"])
>>> data, metadata = reader[0]
Load attributes from Zarr groups:
>>> # If the Zarr group has attributes like "timestep" or "scale_factor",
>>> # you can request them as fields:
>>> reader = ZarrReader("data_dir/", fields=["positions", "timestep", "scale_factor"])
>>> data, metadata = reader[0]  # data["timestep"] contains the attribute value
With coordinated subsampling for large arrays:
>>> reader = ZarrReader(
...     "data_dir/",
...     coordinated_subsampling={
...         "n_points": 50000,
...         "target_keys": ["volume_coords", "volume_fields"],
...     }
... )
>>> data, metadata = reader[0]
- property fields: list[str]#
Fields that will be loaded (user-specified or all available).
TensorStoreZarrReader#
- class physicsnemo.datapipes.readers.tensorstore_zarr.TensorStoreZarrReader(
- path: str | Path,
- *,
- fields: list[str] | None = None,
- default_values: dict[str, Tensor] | None = None,
- group_pattern: str = '*.zarr',
- cache_bytes_limit: int = 10000000,
- data_copy_concurrency: int = 72,
- file_io_concurrency: int = 72,
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
- coordinated_subsampling: dict[str, Any] | None = None,
Bases: Reader
High-performance async reader for Zarr files using TensorStore.
This reader provides faster I/O than the standard ZarrReader through async operations, optimized caching, and concurrent data fetching. It’s particularly beneficial for large datasets on networked storage or cloud storage.
This is a drop-in replacement for ZarrReader with identical interface. Each Zarr group in the directory represents one sample.
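The benefit of concurrent fetching on slow storage can be illustrated with the standard library alone. The sketch below is only an analogy: read_field simulates a slow read with a sleep, and a thread pool stands in for whatever TensorStore does internally. The max_workers argument here is not the reader's file_io_concurrency parameter, although it plays a loosely similar role.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_field(name: str) -> tuple[str, list[float]]:
    """Simulate one slow storage read (for example, a networked file system)."""
    time.sleep(0.05)
    return name, [float(len(name))] * 4


def read_fields_concurrently(names: list[str], max_workers: int) -> dict[str, list[float]]:
    """Issue all reads at once; each worker thread waits on its own IO."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so field order is preserved.
        return dict(pool.map(read_field, names))


fields = read_fields_concurrently(["positions", "velocity", "pressure"], max_workers=3)
```

With three workers the simulated reads overlap, so the total wait is close to one read rather than three; with a single worker they would run back to back.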
Examples
Basic usage:
>>> # Directory with sample_0.zarr, sample_1.zarr, ...
>>> reader = TensorStoreZarrReader("data_dir/", group_pattern="sample_*.zarr")
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
Load only specific fields:
>>> reader = TensorStoreZarrReader("data_dir/", fields=["positions", "velocity"])
>>> data, metadata = reader[0]
With coordinated subsampling for large arrays:
>>> reader = TensorStoreZarrReader(
...     "data_dir/",
...     coordinated_subsampling={
...         "n_points": 50000,
...         "target_keys": ["volume_coords", "volume_fields"],
...     }
... )
>>> data, metadata = reader[0]
- Performance Tips:
Increase cache_bytes_limit for better performance on repeated access
Increase data_copy_concurrency and file_io_concurrency for parallel workloads
Use coordinated subsampling when reading subsets of large arrays
- property fields: list[str]#
Fields that will be loaded (user-specified or all available).
VTKReader#
- class physicsnemo.datapipes.readers.vtk.VTKReader(
- path: str | Path,
- *,
- keys_to_read: list[str] | None = None,
- exclude_patterns: list[str] | None = None,
- pin_memory: bool = False,
- include_index_in_metadata: bool = True,
Bases: Reader
Read samples from VTK format files (.stl, .vtp, .vtu).
This reader loads mesh data from directories where each subdirectory contains VTK files representing one sample. Supports STL (surface meshes), VTP (PolyData), and VTU (UnstructuredGrid) formats.
Requires PyVista to be installed. If PyVista is not available, attempting to instantiate this reader will raise an ImportError with installation instructions.
Examples
>>> # Directory structure:
>>> # data/
>>> #   sample_0/
>>> #     geometry.stl
>>> #     surface.vtp
>>> #   sample_1/
>>> #     geometry.stl
>>> #     surface.vtp
>>> #   ...
>>>
>>> reader = VTKReader(
...     "data/",
...     keys_to_read=["stl_coordinates", "stl_faces", "surface_normals"],
... )
>>> data, metadata = reader[0]  # Returns (TensorDict, dict) tuple
>>> print(data["stl_coordinates"].shape)  # (N, 3)
- Available Keys:
- From .stl files:
stl_coordinates: Vertex coordinates, shape \((N, 3)\)
stl_faces: Face indices (flattened), shape \((M*3,)\)
stl_centers: Face centers, shape \((M, 3)\)
surface_normals: Face normals, shape \((M, 3)\)
- From .vtp files:
surface_mesh_centers: Cell centers
surface_normals: Cell normals
surface_mesh_sizes: Cell areas
Additional fields from the VTP file
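Because stl_faces is flattened to shape \((M*3,)\), downstream code usually needs to regroup it into per-face index triples before using it with stl_coordinates. The sketch below assumes triangular faces stored as consecutive triples and computes centroids as one plausible definition of face centers. Both are assumptions, and unflatten_faces and face_centers are hypothetical helpers that operate on plain lists rather than tensors.

```python
Point = tuple[float, float, float]


def unflatten_faces(flat_faces: list[int]) -> list[tuple[int, ...]]:
    """Regroup a flat (M*3,) index list into M vertex-index triples."""
    if len(flat_faces) % 3 != 0:
        raise ValueError("flattened triangle list length must be a multiple of 3")
    return [tuple(flat_faces[i : i + 3]) for i in range(0, len(flat_faces), 3)]


def face_centers(coords: list[Point], faces: list[tuple[int, ...]]) -> list[Point]:
    """Centroid of each triangle: the mean of its three vertices."""
    centers = []
    for a, b, c in faces:
        centers.append(
            tuple((coords[a][k] + coords[b][k] + coords[c][k]) / 3.0 for k in range(3))
        )
    return centers


# Two triangles covering a 3 x 3 square in the z = 0 plane.
stl_coordinates = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0), (3.0, 3.0, 0.0)]
stl_faces = [0, 1, 2, 1, 3, 2]  # flattened, shape (M*3,) with M = 2

faces = unflatten_faces(stl_faces)
centers = face_centers(stl_coordinates, faces)
```

With tensors, the same regrouping is typically a single reshape of the flat index array to (M, 3).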
Note
VTK files are typically small enough to fit in memory, so coordinated subsampling is not supported. Use transforms for downsampling if needed.