CAE Datapipes#
The CAE Datapipes are v1 datapipes for specific external aerodynamics datasets. They are maintained, but not all are under active development.
The MeshDataPipe uses VTK to read CFD mesh and simulation data, and DALI for data loading and preprocessing. The MeshDataPipe is used in the DataCenter example.
- class physicsnemo.datapipes.cae.mesh_datapipe.MeshDaliExternalSource(
- data_paths: Iterable[str],
- file_format: str,
- variables: List[str],
- num_samples: int,
- batch_size: int = 1,
- shuffle: bool = True,
- process_rank: int = 0,
- world_size: int = 1,
- cache_data: bool = False,
Bases: object
DALI source for lazy loading of mesh data, with optional caching.
- Parameters:
data_paths (Iterable[str]) – Directory where data is stored
num_samples (int) – Total number of training samples
batch_size (int, optional) – Batch size, by default 1
shuffle (bool, optional) – Shuffle dataset, by default True
process_rank (int, optional) – Rank ID of local process, by default 0
world_size (int, optional) – Number of training processes, by default 1
cache_data (bool, optional) – Whether to cache the data in memory for faster access in subsequent epochs, by default False
Note
For more information about DALI external source operator: https://docs.nvidia.com/deeplearning/dali/archives/dali_1_13_0/user-guide/docs/examples/general/data_loading/parallel_external_source.html
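The external-source callable pattern can be illustrated without DALI installed. The sketch below mimics the lazy-loading-with-caching behavior using a plain integer index in place of DALI's SampleInfo object, and random arrays in place of VTK file reads; all names here are illustrative, not the library's implementation:

```python
import numpy as np

class LazyMeshSource:
    """Illustrative stand-in for a DALI external-source callable.

    Lazily "loads" one sample per call and optionally caches it,
    mirroring the caching behavior described above. File reading is
    faked with seeded random arrays; a real source would read VTK
    files here instead.
    """

    def __init__(self, num_samples: int, num_points: int = 8, cache_data: bool = False):
        self.num_samples = num_samples
        self.num_points = num_points
        self.cache_data = cache_data
        self._cache = {}

    def __call__(self, sample_idx: int) -> np.ndarray:
        # DALI passes a SampleInfo object; a bare index is used here.
        if self.cache_data and sample_idx in self._cache:
            return self._cache[sample_idx]
        rng = np.random.default_rng(sample_idx)  # deterministic stand-in for file I/O
        sample = rng.standard_normal((self.num_points, 3)).astype(np.float32)
        if self.cache_data:
            self._cache[sample_idx] = sample
        return sample

source = LazyMeshSource(num_samples=4, cache_data=True)
first = source(0)
again = source(0)  # second call is served from the cache
```

With `cache_data=True`, repeated epochs over the same indices skip the (expensive) read entirely.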
- class physicsnemo.datapipes.cae.mesh_datapipe.MeshDatapipe(
- data_dir: str,
- variables: List[str],
- num_variables: int,
- file_format: str = 'vtp',
- stats_dir: str | None = None,
- batch_size: int = 1,
- num_samples: int = 1,
- shuffle: bool = True,
- num_workers: int = 1,
- device: str | device = 'cuda',
- process_rank: int = 0,
- world_size: int = 1,
- cache_data: bool = False,
- parallel: bool = True,
Bases: Datapipe
DALI data pipeline for mesh data.
- Parameters:
data_dir (str) – Directory where the mesh data is stored
variables (List[str]) – Ordered list of variables to be loaded from the files
num_variables (int) – Number of variables to be loaded from the files
file_format (str, optional) – File format of the data, by default “vtp”. Supported formats: “vtp”, “vtu”, “cgns”
stats_dir (Union[str, None], optional) – Directory where statistics are stored, by default None. If provided, the statistics are used to normalize the attributes
batch_size (int, optional) – Batch size, by default 1
num_samples (int, optional) – Total number of samples to use, by default 1
shuffle (bool, optional) – Shuffle dataset, by default True
num_workers (int, optional) – Number of workers, by default 1
device (Union[str, torch.device], optional) – Device for DALI pipeline to run on, by default cuda
process_rank (int, optional) – Rank ID of local process, by default 0
world_size (int, optional) – Number of training processes, by default 1
cache_data (bool, optional) – Whether to cache the data in memory for faster access in subsequent epochs, by default False
parallel (bool, optional) – If True, the external_source node runs the source in Python worker processes started by DALI, by default True
- load_statistics() → None[source]#
Loads statistics from pre-computed numpy files
The statistics files should be named global_means.npy and global_std.npy, each with a shape of [1, C], and located in stats_dir.
- Raises:
IOError – If mean or std numpy files are not found
AssertionError – If loaded numpy arrays are not of correct size
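The expected statistics files can be produced with NumPy. A minimal sketch, assuming plain z-score normalization with the documented file names and [1, C] shapes (the normalization formula itself is an assumption, not taken from the source):

```python
import os
import tempfile
import numpy as np

# Fake per-node attribute data: N points, C variables.
data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1000, 3))

stats_dir = tempfile.mkdtemp()
mu = data.mean(axis=0, keepdims=True)  # shape [1, C], as expected by load_statistics
sd = data.std(axis=0, keepdims=True)   # shape [1, C]
np.save(os.path.join(stats_dir, "global_means.npy"), mu)
np.save(os.path.join(stats_dir, "global_std.npy"), sd)

# Normalization as it would be applied when stats_dir is provided:
loaded_mu = np.load(os.path.join(stats_dir, "global_means.npy"))
loaded_sd = np.load(os.path.join(stats_dir, "global_std.npy"))
normalized = (data - loaded_mu) / loaded_sd
```

Arrays of any other shape would trip the AssertionError documented above.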
- class physicsnemo.datapipes.cae.mesh_datapipe.MetaData(
- name: str = 'MeshDatapipe',
- auto_device: bool = True,
- cuda_graphs: bool = True,
- ddp_sharding: bool = True,
Bases: DatapipeMetaData
The DoMINO DataPipe reads the DrivAerML dataset, and other datasets, for the DoMINO model for external aerodynamics. The expected format of inputs can be achieved using PhysicsNeMo-Curator.
This code provides the datapipe for reading the processed npy files, generating multi-res grids, calculating signed distance fields, sampling random points in the volume and on surface, normalizing fields and returning the output tensors as a dictionary.
This datapipe also non-dimensionalizes the fields, so the order of the variables is fixed: velocity, pressure, and turbulent viscosity for volume variables; pressure and wall shear stress for surface variables. Parameters such as variable names, domain resolution, and sampling size are configurable in config.yaml.
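A hypothetical config.yaml fragment showing how such parameters might be laid out; the key names and paths below are illustrative and may not match the exact schema shipped with the example:

```yaml
# Hypothetical config.yaml fragment; key names and paths are illustrative.
data_path: /data/drivaer_ml/processed
phase: train
volume_variables: [UMean, pMean, nutMean]    # velocity, pressure, turbulent viscosity
surface_variables: [pMean, wallShearStress]  # pressure, wall shear stress
grid_resolution: [256, 96, 64]
volume_points_sample: 1024
surface_points_sample: 1024
geom_points_sample: 300000
scaling_type: min_max_scaling
```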
- class physicsnemo.datapipes.cae.domino_datapipe.BoundingBox(*args, **kwargs)[source]#
Bases: Protocol
Type definition for the required format of bounding box dimensions.
- class physicsnemo.datapipes.cae.domino_datapipe.CachedDoMINODataset(
- data_path: str | Path,
- phase: Literal['train', 'val', 'test'] = 'train',
- sampling: bool = False,
- volume_points_sample: int | None = None,
- surface_points_sample: int | None = None,
- geom_points_sample: int | None = None,
- model_type=None,
- deterministic_seed=False,
- surface_sampling_algorithm='area_weighted',
Bases: Dataset
Dataset for reading cached DoMINO data files, with optional resampling. Acts as a drop-in replacement for DoMINODataPipe.
- class physicsnemo.datapipes.cae.domino_datapipe.DoMINODataConfig(
- data_path: Path | None,
- phase: Literal['train', 'val', 'test'],
- surface_variables: Sequence | None = ('pMean', 'wallShearStress'),
- surface_points_sample: int = 1024,
- num_surface_neighbors: int = 11,
- surface_sampling_algorithm: Literal['area_weighted', 'random'] = 'area_weighted',
- surface_factors: Sequence | None = None,
- bounding_box_dims_surf: BoundingBox | Sequence | None = None,
- volume_variables: Sequence | None = ('UMean', 'pMean'),
- volume_points_sample: int = 1024,
- volume_sample_from_disk: bool = False,
- volume_factors: Sequence | None = None,
- bounding_box_dims: BoundingBox | Sequence | None = None,
- grid_resolution: Sequence = (256, 96, 64),
- normalize_coordinates: bool = False,
- sample_in_bbox: bool = False,
- sampling: bool = False,
- geom_points_sample: int = 300000,
- scaling_type: Literal['min_max_scaling', 'mean_std_scaling'] | None = None,
- compute_scaling_factors: bool = False,
- caching: bool = False,
- deterministic: bool = False,
- gpu_preprocessing: bool = True,
- gpu_output: bool = True,
- shard_grid: bool = False,
- shard_points: bool = False,
Bases: object
Configuration for DoMINO dataset processing pipeline.
- data_path#
Path to the dataset to load.
- Type:
pathlib.Path | None
- phase#
Which phase of data to load (“train”, “val”, or “test”).
- Type:
Literal[‘train’, ‘val’, ‘test’]
- surface_variables#
(Surface specific) Names of surface variables.
- Type:
Sequence | None
- surface_points_sample#
(Surface specific) Number of surface points to sample per batch.
- Type:
int
- num_surface_neighbors#
(Surface specific) Number of surface neighbors to consider for nearest neighbors approach.
- Type:
int
- surface_sampling_algorithm#
(Surface specific) Algorithm to use for surface sampling (“area_weighted” or “random”).
- Type:
str
- surface_factors#
(Surface specific) Non-dimensionalization factors for surface variables. If set and scaling_type is min_max_scaling, surface_fields are rescaled to the min/max set here; if mean_std_scaling, to the mean and std set here.
- Type:
Sequence | None
- bounding_box_dims_surf#
(Surface specific) Dimensions of bounding box. Must be an object with min/max attributes that are arraylike.
- Type:
physicsnemo.datapipes.cae.domino_datapipe.BoundingBox | Sequence | None
- volume_variables#
(Volume specific) Names of volume variables.
- Type:
Sequence | None
- volume_points_sample#
(Volume specific) Number of volume points to sample per batch.
- Type:
int
- volume_sample_from_disk#
(Volume specific) If the volume data is in a shuffled state on disk, read contiguous chunks of the data rather than the entire volume data. This greatly accelerates IO on bandwidth-limited systems or when the volumetric data is very large.
- Type:
bool
- volume_factors#
(Volume specific) Non-dimensionalization factors for volume variables scaling. If set and scaling_type is min_max_scaling, volume_fields are rescaled to the min/max set here; if mean_std_scaling, to the mean and std set here.
- Type:
Sequence | None
- bounding_box_dims#
(Volume specific) Dimensions of bounding box. Must be an object with min/max attributes that are arraylike.
- Type:
physicsnemo.datapipes.cae.domino_datapipe.BoundingBox | Sequence | None
- grid_resolution#
Resolution of the latent grid.
- Type:
Sequence
- normalize_coordinates#
Whether to normalize coordinates based on min/max values. For surfaces, uses s_min/s_max, defined from the surface bounding box if defined, otherwise from the min/max of the stl_vertices. For volumes, uses c_min/c_max, defined from the volume bounding_box if defined, otherwise 1.5x s_min/s_max (except that c_min[2] = s_min[2] in that case).
- Type:
bool
- sample_in_bbox#
Whether to sample points in a specified bounding box. Uses the same min/max points as coordinate normalization. Only performed if compute_scaling_factors is false.
- Type:
bool
- sampling#
Whether to downsample the full resolution mesh to fit in GPU memory. Surface and volume sampling points are configured separately as surface.points_sample and volume.points_sample.
- Type:
bool
- geom_points_sample#
Number of STL points sampled per batch. Independent of volume.points_sample and surface.points_sample.
- Type:
int
- scaling_type#
Scaling type for volume variables. If used, will rescale the volume_fields and surface_fields outputs. Requires volume.factor and surface.factor to be set.
- Type:
Literal[‘min_max_scaling’, ‘mean_std_scaling’] | None
- compute_scaling_factors#
Whether to compute scaling factors. Not available when caching. Many preprocessing steps are disabled while computing scaling factors.
- Type:
bool
- caching#
Whether this is for caching or serving.
- Type:
bool
- deterministic#
Whether to use a deterministic seed for sampling and random numbers.
- Type:
bool
- gpu_preprocessing#
Whether to do preprocessing on the GPU (False for CPU).
- Type:
bool
- gpu_output#
Whether to return output on the GPU as cupy arrays. If False, returns numpy arrays. You might choose gpu_preprocessing=True and gpu_output=False if caching.
- Type:
bool
- shard_grid#
Whether to shard the grid across GPUs for domain parallelism. Applies to the surf_grid and similar tensors.
- Type:
bool
- shard_points#
Whether to shard the points across GPUs for domain parallelism. Applies to the volume_fields/surface_fields and similar tensors.
- Type:
bool
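The two scaling_type modes can be illustrated with small NumPy helpers. The factor layout assumed here (row 0 = max or mean, row 1 = min or std) and the exact formulas are illustrative conventions, not taken from the source:

```python
import numpy as np

# Assumed factor layout (an illustrative convention): factors[0] holds
# per-variable max (or mean), factors[1] holds per-variable min (or std).

def min_max_scale(fields, factors):
    """Rescale each variable to [0, 1] using the stored min/max."""
    f_max, f_min = factors[0], factors[1]
    return (fields - f_min) / (f_max - f_min)

def mean_std_scale(fields, factors):
    """Standardize each variable using the stored mean/std."""
    mean, std = factors[0], factors[1]
    return (fields - mean) / std

# Two variables sampled at three points:
fields = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
mm_factors = np.stack([fields.max(axis=0), fields.min(axis=0)])
scaled = min_max_scale(fields, mm_factors)
```

Unscaling inverts the same formulas, which is why the factors must be kept alongside any trained model.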
- class physicsnemo.datapipes.cae.domino_datapipe.DoMINODataPipe(
- input_path,
- model_type: Literal['surface', 'volume', 'combined'],
- pin_memory: bool = False,
- **data_config_overrides,
Bases: Dataset
Datapipe for DoMINO.
Leverages a dataset for the actual reading of the data, and this object is responsible for preprocessing the data.
- compute_stl_scaling_and_surface_grids() → tuple[Tensor, Tensor, Tensor][source]#
Compute the min and max for the defining mesh.
If the user supplies a bounding box, it is used; otherwise, an error is raised.
The returned min/max and grid are used for surface data.
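A minimal sketch of how such a min/max and regular latent grid could be computed from mesh vertices, assuming an axis-aligned bounding box and a linspace-based grid (illustrative only, not the library's implementation; the resolution is reduced from the documented default of (256, 96, 64)):

```python
import numpy as np

rng = np.random.default_rng(0)
stl_vertices = rng.uniform(-1.0, 3.0, size=(5000, 3))  # stand-in mesh vertices

# Min/max of the defining mesh (or a user-supplied bounding box).
s_min = stl_vertices.min(axis=0)
s_max = stl_vertices.max(axis=0)

# A regular latent grid spanning the bounding box.
nx, ny, nz = 8, 6, 4
axes = [np.linspace(s_min[d], s_max[d], n) for d, n in enumerate((nx, ny, nz))]
surf_grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (nx, ny, nz, 3)
```

The grid's corner points coincide with the bounding-box extremes, so downstream interpolation covers the full surface extent.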
- compute_volume_scaling_and_grids() → tuple[Tensor, Tensor, Tensor][source]#
Compute the min and max and grid for volume data.
If the user supplies a bounding box, it is used; otherwise, an error is raised.
- downsample_geometry(stl_vertices) → Tensor[source]#
Downsample the geometry to the desired number of points.
- Parameters:
stl_vertices – The vertices of the surface.
- process_volume(
- c_min: Tensor,
- c_max: Tensor,
- volume_coordinates: Tensor,
- volume_grid: Tensor,
- center_of_mass: Tensor,
- stl_vertices: Tensor,
- stl_indices: Tensor,
- volume_fields: Tensor | None,
Preprocess the volume data.
First, if configured, we reject points not in the volume bounding box.
Next, if sampling is enabled, we sample the volume points and apply that sampling to the ground truth too, if it’s present.
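The bounding-box rejection and sampling steps can be sketched in NumPy; the box limits, array shapes, and sample count here are illustrative, not the pipeline's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
volume_coordinates = rng.uniform(-2.0, 2.0, size=(10_000, 3))
volume_fields = rng.standard_normal((10_000, 4))  # ground truth, if present

# Reject points outside the (hypothetical) volume bounding box.
c_min = np.array([-1.0, -1.0, -1.0])
c_max = np.array([1.0, 1.0, 1.0])
inside = np.all((volume_coordinates >= c_min) & (volume_coordinates <= c_max), axis=1)
volume_coordinates = volume_coordinates[inside]
volume_fields = volume_fields[inside]

# Sample a fixed number of the remaining points, applying the same
# selection to the ground-truth fields.
n_sample = 1024
idx = rng.choice(volume_coordinates.shape[0], size=n_sample, replace=False)
sampled_coords = volume_coordinates[idx]
sampled_fields = volume_fields[idx]
```

Using one index array for both coordinates and fields keeps inputs and targets aligned after sampling.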
- scale_model_targets(
- fields: Tensor,
- factors: Tensor,
Scale the model targets based on the configured scaling factors.
- physicsnemo.datapipes.cae.domino_datapipe.compute_scaling_factors(
- cfg: DictConfig,
- input_path: str,
- target_keys: list[str],
- max_samples=20,
Using the dataset at the path, compute the mean, std, min, and max of the target keys.
- Parameters:
cfg – Hydra configuration object containing all parameters
input_path – Path to the dataset to load.
target_keys – List of keys to compute the mean, std, min, and max of.
use_cache – (deprecated) This argument has no effect.
The Transolver DataPipe reads the same inputs as the DoMINO DataPipe, but produces outputs for the Transolver and GeoTransolver models for external aerodynamics.
This code provides the datapipe for reading the processed npy files, generating multi-res grids, calculating signed distance fields, sampling random points in the volume and on surface, normalizing fields and returning the output tensors as a dictionary.
This datapipe also non-dimensionalizes the fields, so the order of the variables is fixed: velocity, pressure, and turbulent viscosity for volume variables; pressure and wall shear stress for surface variables. Parameters such as variable names, domain resolution, and sampling size are configurable in config.yaml.
- class physicsnemo.datapipes.cae.transolver_datapipe.TransolverDataConfig(
- data_path: Path | None,
- model_type: Literal['surface', 'volume', 'combined'] = 'surface',
- resolution: int = 200000,
- include_normals: bool = True,
- include_sdf: bool = True,
- include_geometry: bool = False,
- geometry_sampling: int = 300000,
- scaling_type: Literal['min_max_scaling', 'mean_std_scaling'] | None = None,
- surface_factors: Tensor | None = None,
- volume_factors: Tensor | None = None,
- translational_invariance: bool = False,
- reference_origin: Tensor | None = None,
- scale_invariance: bool = False,
- reference_scale: list[float] | None = None,
- broadcast_global_features: bool = True,
- volume_sample_from_disk: bool = True,
- return_mesh_features: bool = False,
Bases: object
Configuration for Transolver data processing pipeline.
Attributes:
- data_path#
Path to the dataset to load.
- Type:
pathlib.Path | None
- model_type#
Type of the model (“surface”, “volume”, or “combined”).
- Type:
Literal[‘surface’, ‘volume’, ‘combined’]
- resolution#
Resolution of the sampled data, per batch.
- Type:
int
- include_normals#
Whether to include surface normals in embeddings.
- Type:
bool
- include_sdf#
Whether to include signed distance fields in embeddings.
- Type:
bool
- translational_invariance#
Enable translational adjustment using center of mass.
- Type:
bool
- reference_origin#
Origin for translational invariance, defaults to the center of mass.
- Type:
torch.Tensor | None
- broadcast_global_features#
Whether to apply global features across all points.
- Type:
bool
- volume_sample_from_disk#
Whether to sample volume points directly from disk, reading contiguous chunks rather than the full volume data.
- Type:
bool
- return_mesh_features#
Whether to return the mesh areas and normals for the surface data. Used to compute force coefficients. Transformations are applied to the mesh coordinates.
- Type:
bool
- class physicsnemo.datapipes.cae.transolver_datapipe.TransolverDataPipe(
- input_path,
- model_type: Literal['surface', 'volume'],
- pin_memory: bool = False,
- **data_config_overrides,
Bases: Dataset
Base Datapipe for Transolver.
Leverages a dataset for the actual reading of the data, and this object is responsible for preprocessing the data.
- process_data(data_dict)[source]#
Preprocess the data. We have slight differences between surface and volume data processing, mostly revolving around the keys that represent the inputs.
- For surface data, we use the mesh coordinates and normals as the embeddings.
Normals are always normalized to unit length and represent a relative direction.
Coordinates can be shifted to the center of mass, and then the whole coordinate system can be aligned to the preferred direction.
SDF is identically 0 for surface data.
Optionally, if scale invariance is enabled, the coordinates are scaled by the (possibly rotated) scale factor.
- For volume data, we still use the volume coordinates.
Normals are approximated as the direction between the volume point and the closest mesh point, normalized to unit length.
SDF is not zero for volume data.
To make the calculations consistent and easy to follow:
- First, get the coordinates (volume_mesh_centers or surface_mesh_centers, usually), which is a configuration choice.
- Second, get the STL information. We need the “stl_vertices” and “stl_indices” to compute an SDF. We downsample “stl_coordinates” to potentially encode a geometry tensor, which is optional.
- Then, impose the optional symmetries:
  - Translation invariance: for every “position-like” tensor, subtract off the reference_origin if translation invariance is enabled.
  - Scale invariance: for every position-like tensor, multiply by the reference scale.
  - Rotation invariance: normals are rotated, points are rotated. Rotation requires not just a reference vector (in the config) but also a vector unique to this example, taken from the data, that we rotate to.
- After that, the rest is simple:
Spatial encodings are the point locations + normal vectors (optional) + SDF (optional). If the normals aren’t provided, we derive them from the center of mass (without SDF) or the closest SDF point (with SDF).
Geometry encoding (if used) is the STL coordinates, downsampled.
Parameter encodings are straightforward vectors / reference values.
The downstream applications can take the embeddings and the features as needed.
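The translation- and scale-invariance steps described above amount to simple tensor arithmetic on every position-like tensor. A minimal NumPy sketch, assuming the center of mass as the reference origin and an isotropic 1/extent reference scale (both are assumptions for illustration, not the pipeline's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 4.0, size=(100, 3))  # a "position-like" tensor
normals = rng.standard_normal((100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# Translation invariance: subtract the reference origin (here, the
# center of mass) from every position-like tensor. Normals are
# directions, not positions, so they are left untouched.
reference_origin = coords.mean(axis=0)
coords = coords - reference_origin

# Scale invariance: multiply every position-like tensor by the
# reference scale (assumed isotropic 1 / extent here).
reference_scale = 1.0 / (coords.max() - coords.min())
coords = coords * reference_scale
```

After both steps the coordinates are centered at the origin with unit extent, while the unit normals are unchanged.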
- process_geometry(
- data_dict,
- center_of_mass: Tensor | None = None,
- scale_factor: Tensor | None = None,
Process the geometry data.
- scale_model_targets(
- fields: Tensor,
- factors: Tensor,
Scale the model targets based on the configured scaling factors.
- set_dataset(dataset: Iterable) → None[source]#
Pass a dataset to the datapipe to enable iterating over both in one pass.
- unscale_model_targets(
- fields: Tensor | None = None,
- air_density: Tensor | None = None,
- stream_velocity: Tensor | None = None,
- factor_type: Literal['surface', 'volume', 'auto'] = 'auto',
Unscale the model outputs based on the configured scaling factors.
The unscaling is included here to make it a consistent interface regardless of the scaling factors and type used.
- physicsnemo.datapipes.cae.transolver_datapipe.poisson_sample_indices_fixed(N: int, k: int, device=None)[source]#
This function is a nearly uniform sampler of indices for when the number of indices to sample is very large. It is useful when the number of indices exceeds 2^24 and torch.multinomial cannot be used. Unlike randperm, there is no need to materialize and shuffle the entire tensor of indices.
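A stratified sampler conveys the idea: split the index range into k equal strata and draw one index per stratum, giving k distinct, nearly uniformly spread indices without materializing a length-N permutation. This is a sketch of the concept, not the library's implementation:

```python
import numpy as np

def stratified_sample_indices(N: int, k: int, seed=None) -> np.ndarray:
    """Sample k distinct indices from range(N), nearly uniformly.

    Splits [0, N) into k equal strata and draws one index per stratum,
    so memory use is O(k) rather than O(N). Assumes N >= k.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, N, k + 1)
    lo = edges[:-1].astype(np.int64)                    # stratum lower bounds
    hi = np.maximum(edges[1:].astype(np.int64), lo + 1)  # stratum upper bounds
    # One uniform draw inside each stratum:
    return lo + (rng.random(k) * (hi - lo)).astype(np.int64)

# 4096 indices out of 2**25 candidates, without a 2**25-long permutation.
idx = stratified_sample_indices(2**25, 4096, seed=0)
```

Because strata are disjoint, the result comes back sorted and duplicate-free, which is often convenient for chunked reads from disk.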