CAE Datapipes#
The CAE Datapipes are v1 datapipes for specific external aerodynamics datasets. They are maintained, but not all are under active development.
The MeshDataPipe uses VTK to read CFD mesh and simulation data, and DALI for data loading and preprocessing. The MeshDataPipe is used in the DataCenter example.
- class physicsnemo.datapipes.cae.mesh_datapipe.MeshDaliExternalSource(
- data_paths: Iterable[str],
- file_format: str,
- variables: List[str],
- num_samples: int,
- batch_size: int = 1,
- shuffle: bool = True,
- process_rank: int = 0,
- world_size: int = 1,
- cache_data: bool = False,
Bases: object
DALI source for lazy loading of mesh data, with optional caching.
- Parameters:
data_paths (Iterable[str]) – Directory where data is stored
num_samples (int) – Total number of training samples
batch_size (int, optional) – Batch size, by default 1
shuffle (bool, optional) – Shuffle dataset, by default True
process_rank (int, optional) – Rank ID of local process, by default 0
world_size (int, optional) – Number of training processes, by default 1
cache_data (bool, optional) – Whether to cache the data in memory for faster access in subsequent epochs, by default False
Note
For more information about DALI external source operator: https://docs.nvidia.com/deeplearning/dali/archives/dali_1_13_0/user-guide/docs/examples/general/data_loading/parallel_external_source.html
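The external-source callable pattern can be illustrated without DALI installed. The sketch below mimics the lazy-loading-with-caching behavior using a plain integer index in place of DALI's SampleInfo object, and random arrays in place of VTK file reads; all names here are illustrative, not the library's implementation:

```python
import numpy as np

class LazyMeshSource:
    """Illustrative stand-in for a DALI external-source callable.

    Lazily "loads" one sample per call and optionally caches it,
    mirroring the caching behavior described above. File reading is
    faked with seeded random arrays; a real source would read VTK
    files here instead.
    """

    def __init__(self, num_samples: int, num_points: int = 8, cache_data: bool = False):
        self.num_samples = num_samples
        self.num_points = num_points
        self.cache_data = cache_data
        self._cache = {}

    def __call__(self, sample_idx: int) -> np.ndarray:
        # DALI passes a SampleInfo object; a bare index is used here.
        if self.cache_data and sample_idx in self._cache:
            return self._cache[sample_idx]
        rng = np.random.default_rng(sample_idx)  # deterministic stand-in for file I/O
        sample = rng.standard_normal((self.num_points, 3)).astype(np.float32)
        if self.cache_data:
            self._cache[sample_idx] = sample
        return sample

source = LazyMeshSource(num_samples=4, cache_data=True)
first = source(0)
again = source(0)  # second call is served from the cache
```

With `cache_data=True`, repeated epochs over the same indices skip the (expensive) read entirely.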
- class physicsnemo.datapipes.cae.mesh_datapipe.MeshDatapipe(
- data_dir: str,
- variables: List[str],
- num_variables: int,
- file_format: str = 'vtp',
- stats_dir: str | None = None,
- batch_size: int = 1,
- num_samples: int = 1,
- shuffle: bool = True,
- num_workers: int = 1,
- device: str | device = 'cuda',
- process_rank: int = 0,
- world_size: int = 1,
- cache_data: bool = False,
- parallel: bool = True,
Bases: Datapipe
DALI data pipeline for mesh data.
- Parameters:
data_dir (str) – Directory where the mesh data is stored
variables (List[str]) – Ordered list of variables to be loaded from the files
num_variables (int) – Number of variables to be loaded from the files
file_format (str, optional) – File format of the data, by default “vtp”. Supported formats: “vtp”, “vtu”, “cgns”
stats_dir (Union[str, None], optional) – Directory where statistics are stored, by default None. If provided, the statistics are used to normalize the attributes
batch_size (int, optional) – Batch size, by default 1
num_samples (int, optional) – Total number of samples to use, by default 1
shuffle (bool, optional) – Shuffle dataset, by default True
num_workers (int, optional) – Number of workers, by default 1
device (Union[str, torch.device], optional) – Device for DALI pipeline to run on, by default cuda
process_rank (int, optional) – Rank ID of local process, by default 0
world_size (int, optional) – Number of training processes, by default 1
cache_data (bool, optional) – Whether to cache the data in memory for faster access in subsequent epochs, by default False
parallel (bool, optional) – If True, the external_source node runs the source in Python worker processes started by DALI, by default True
- load_statistics() → None[source]#
Loads statistics from pre-computed numpy files
The statistics files should be named global_means.npy and global_std.npy, each with a shape of [1, C], and located in stats_dir.
- Raises:
IOError – If mean or std numpy files are not found
AssertionError – If loaded numpy arrays are not of correct size
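The expected statistics files can be produced with NumPy. A minimal sketch, assuming plain z-score normalization with the documented file names and [1, C] shapes (the normalization formula itself is an assumption, not taken from the source):

```python
import os
import tempfile
import numpy as np

# Fake per-node attribute data: N points, C variables.
data = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(1000, 3))

stats_dir = tempfile.mkdtemp()
mu = data.mean(axis=0, keepdims=True)  # shape [1, C], as expected by load_statistics
sd = data.std(axis=0, keepdims=True)   # shape [1, C]
np.save(os.path.join(stats_dir, "global_means.npy"), mu)
np.save(os.path.join(stats_dir, "global_std.npy"), sd)

# Normalization as it would be applied when stats_dir is provided:
loaded_mu = np.load(os.path.join(stats_dir, "global_means.npy"))
loaded_sd = np.load(os.path.join(stats_dir, "global_std.npy"))
normalized = (data - loaded_mu) / loaded_sd
```

Arrays of any other shape would trip the AssertionError documented above.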
- class physicsnemo.datapipes.cae.mesh_datapipe.MetaData(
- name: str = 'MeshDatapipe',
- auto_device: bool = True,
- cuda_graphs: bool = True,
- ddp_sharding: bool = True,
Bases: DatapipeMetaData
The DoMINO DataPipe reads the DrivAerML dataset, and other datasets, for the DoMINO model for external aerodynamics. The expected format of inputs can be achieved using PhysicsNeMo-Curator.
This code provides the datapipe for reading the processed npy files, generating multi-res grids, calculating signed distance fields, sampling random points in the volume and on surface, normalizing fields and returning the output tensors as a dictionary.
This datapipe also non-dimensionalizes the fields, so the order of the variables is fixed: velocity, pressure, and turbulent viscosity for volume variables; pressure and wall shear stress for surface variables. Parameters such as variable names, domain resolution, and sampling size are configurable in config.yaml.
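A hypothetical config.yaml fragment showing how such parameters might be laid out; the key names and paths below are illustrative and may not match the exact schema shipped with the example:

```yaml
# Hypothetical config.yaml fragment; key names and paths are illustrative.
data_path: /data/drivaer_ml/processed
phase: train
volume_variables: [UMean, pMean, nutMean]    # velocity, pressure, turbulent viscosity
surface_variables: [pMean, wallShearStress]  # pressure, wall shear stress
grid_resolution: [256, 96, 64]
volume_points_sample: 1024
surface_points_sample: 1024
geom_points_sample: 300000
scaling_type: min_max_scaling
```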
- class physicsnemo.datapipes.cae.domino_datapipe.BoundingBox(*args, **kwargs)[source]#
Bases: Protocol
Type definition for the required format of bounding box dimensions.
- class physicsnemo.datapipes.cae.domino_datapipe.CachedDoMINODataset(
- data_path: str | Path,
- phase: Literal['train', 'val', 'test'] = 'train',
- sampling: bool = False,
- volume_points_sample: int | None = None,
- surface_points_sample: int | None = None,
- geom_points_sample: int | None = None,
- model_type=None,
- deterministic_seed=False,
- surface_sampling_algorithm='area_weighted',
Bases: Dataset
Dataset for reading cached DoMINO data files, with optional resampling. Acts as a drop-in replacement for DoMINODataPipe.
- class physicsnemo.datapipes.cae.domino_datapipe.DoMINODataConfig(
- data_path: Path | None,
- phase: Literal['train', 'val', 'test'],
- surface_variables: Sequence | None = ('pMean', 'wallShearStress'),
- surface_points_sample: int = 1024,
- num_surface_neighbors: int = 11,
- surface_sampling_algorithm: Literal['area_weighted', 'random'] = 'area_weighted',
- surface_factors: Sequence | None = None,
- bounding_box_dims_surf: BoundingBox | Sequence | None = None,
- volume_variables: Sequence | None = ('UMean', 'pMean'),
- volume_points_sample: int = 1024,
- volume_sample_from_disk: bool = False,
- volume_factors: Sequence | None = None,
- bounding_box_dims: BoundingBox | Sequence | None = None,
- grid_resolution: Sequence = (256, 96, 64),
- normalize_coordinates: bool = False,
- sample_in_bbox: bool = False,
- sampling: bool = False,
- geom_points_sample: int = 300000,
- scaling_type: Literal['min_max_scaling', 'mean_std_scaling'] | None = None,
- compute_scaling_factors: bool = False,
- caching: bool = False,
- deterministic: bool = False,
- gpu_preprocessing: bool = True,
- gpu_output: bool = True,
- shard_grid: bool = False,
- shard_points: bool = False,
Bases: object
Configuration for DoMINO dataset processing pipeline.
- data_path#
Path to the dataset to load.
- Type:
pathlib.Path | None
- phase#
Which phase of data to load (“train”, “val”, or “test”).
- Type:
Literal[‘train’, ‘val’, ‘test’]
- surface_variables#
(Surface specific) Names of surface variables.
- Type:
Sequence | None
- surface_points_sample#
(Surface specific) Number of surface points to sample per batch.
- Type:
int
- num_surface_neighbors#
(Surface specific) Number of surface neighbors to consider for nearest neighbors approach.
- Type:
int
- surface_sampling_algorithm#
(Surface specific) Algorithm to use for surface sampling (“area_weighted” or “random”).
- Type:
str
- surface_factors#
(Surface specific) Non-dimensionalization factors for surface variables. If set and scaling_type is min_max_scaling, surface_fields are rescaled to the min/max set here; if mean_std_scaling, to the mean and std set here.
- Type:
Sequence | None
- bounding_box_dims_surf#
(Surface specific) Dimensions of bounding box. Must be an object with min/max attributes that are arraylike.
- Type:
physicsnemo.datapipes.cae.domino_datapipe.BoundingBox | Sequence | None
- volume_variables#
(Volume specific) Names of volume variables.
- Type:
Sequence | None
- volume_points_sample#
(Volume specific) Number of volume points to sample per batch.
- Type:
int
- volume_sample_from_disk#
(Volume specific) If the volume data is in a shuffled state on disk, read contiguous chunks of the data rather than the entire volume data. This greatly accelerates IO on bandwidth-limited systems or when the volumetric data is very large.
- Type:
bool
- volume_factors#
(Volume specific) Non-dimensionalization factors for volume variables scaling. If set and scaling_type is min_max_scaling, volume_fields are rescaled to the min/max set here; if mean_std_scaling, to the mean and std set here.
- Type:
Sequence | None
- bounding_box_dims#
(Volume specific) Dimensions of bounding box. Must be an object with min/max attributes that are arraylike.
- Type:
physicsnemo.datapipes.cae.domino_datapipe.BoundingBox | Sequence | None
- grid_resolution#
Resolution of the latent grid.
- Type:
Sequence
- normalize_coordinates#
Whether to normalize coordinates based on min/max values. For surfaces, uses s_min/s_max, defined from the surface bounding box if defined, otherwise from the min/max of the stl_vertices. For volumes, uses c_min/c_max, defined from the volume bounding_box if defined, otherwise 1.5x s_min/s_max (except that c_min[2] = s_min[2] in that case).
- Type:
bool
- sample_in_bbox#
Whether to sample points in a specified bounding box. Uses the same min/max points as coordinate normalization. Only performed if compute_scaling_factors is false.
- Type:
bool
- sampling#
Whether to downsample the full resolution mesh to fit in GPU memory. Surface and volume sampling points are configured separately as surface.points_sample and volume.points_sample.
- Type:
bool
- geom_points_sample#
Number of STL points sampled per batch. Independent of volume.points_sample and surface.points_sample.
- Type:
int
- scaling_type#
Scaling type for volume variables. If used, will rescale the volume_fields and surface_fields outputs. Requires volume.factor and surface.factor to be set.
- Type:
Literal[‘min_max_scaling’, ‘mean_std_scaling’] | None
- compute_scaling_factors#
Whether to compute scaling factors. Not available when caching. Many preprocessing steps are disabled while computing scaling factors.
- Type:
bool
- caching#
Whether this is for caching or serving.
- Type:
bool
- deterministic#
Whether to use a deterministic seed for sampling and random numbers.
- Type:
bool
- gpu_preprocessing#
Whether to do preprocessing on the GPU (False for CPU).
- Type:
bool
- gpu_output#
Whether to return output on the GPU as cupy arrays. If False, returns numpy arrays. You might choose gpu_preprocessing=True and gpu_output=False if caching.
- Type:
bool
- shard_grid#
Whether to shard the grid across GPUs for domain parallelism. Applies to the surf_grid and similar tensors.
- Type:
bool
- shard_points#
Whether to shard the points across GPUs for domain parallelism. Applies to the volume_fields/surface_fields and similar tensors.
- Type:
bool
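The two scaling_type modes can be illustrated with small NumPy helpers. The factor layout assumed here (row 0 = max or mean, row 1 = min or std) and the exact formulas are illustrative conventions, not taken from the source:

```python
import numpy as np

# Assumed factor layout (an illustrative convention): factors[0] holds
# per-variable max (or mean), factors[1] holds per-variable min (or std).

def min_max_scale(fields, factors):
    """Rescale each variable to [0, 1] using the stored min/max."""
    f_max, f_min = factors[0], factors[1]
    return (fields - f_min) / (f_max - f_min)

def mean_std_scale(fields, factors):
    """Standardize each variable using the stored mean/std."""
    mean, std = factors[0], factors[1]
    return (fields - mean) / std

# Two variables sampled at three points:
fields = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
mm_factors = np.stack([fields.max(axis=0), fields.min(axis=0)])
scaled = min_max_scale(fields, mm_factors)
```

Unscaling inverts the same formulas, which is why the factors must be kept alongside any trained model.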
- class physicsnemo.datapipes.cae.domino_datapipe.DoMINODataPipe(
- input_path,
- model_type: Literal['surface', 'volume', 'combined'],
- pin_memory: bool = False,
- **data_config_overrides,
Bases: Dataset
Datapipe for DoMINO.
Leverages a dataset for the actual reading of the data, and this object is responsible for preprocessing the data.
- compute_stl_scaling_and_surface_grids() → tuple[Tensor, Tensor, Tensor][source]#
Compute the min and max for the defining mesh.
If the user supplies a bounding box, it is used; otherwise, an error is raised.
The returned min/max and grid are used for surface data.
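A minimal sketch of how such a min/max and regular latent grid could be computed from mesh vertices, assuming an axis-aligned bounding box and a linspace-based grid (illustrative only, not the library's implementation; the resolution is reduced from the documented default of (256, 96, 64)):

```python
import numpy as np

rng = np.random.default_rng(0)
stl_vertices = rng.uniform(-1.0, 3.0, size=(5000, 3))  # stand-in mesh vertices

# Min/max of the defining mesh (or a user-supplied bounding box).
s_min = stl_vertices.min(axis=0)
s_max = stl_vertices.max(axis=0)

# A regular latent grid spanning the bounding box.
nx, ny, nz = 8, 6, 4
axes = [np.linspace(s_min[d], s_max[d], n) for d, n in enumerate((nx, ny, nz))]
surf_grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)  # (nx, ny, nz, 3)
```

The grid's corner points coincide with the bounding-box extremes, so downstream interpolation covers the full surface extent.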
- compute_volume_scaling_and_grids() → tuple[Tensor, Tensor, Tensor][source]#
Compute the min and max and grid for volume data.
If the user supplies a bounding box, it is used; otherwise, an error is raised.
- downsample_geometry(stl_vertices) → Tensor[source]#
Downsample the geometry to the desired number of points.
- Parameters:
stl_vertices – The vertices of the surface.
- process_volume(
- c_min: Tensor,
- c_max: Tensor,
- volume_coordinates: Tensor,
- volume_grid: Tensor,
- center_of_mass: Tensor,
- stl_vertices: Tensor,
- stl_indices: Tensor,
- volume_fields: Tensor | None,
Preprocess the volume data.
First, if configured, we reject points not in the volume bounding box.
Next, if sampling is enabled, we sample the volume points and apply that sampling to the ground truth too, if it’s present.
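The bounding-box rejection and sampling steps can be sketched in NumPy; the box limits, array shapes, and sample count here are illustrative, not the pipeline's actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
volume_coordinates = rng.uniform(-2.0, 2.0, size=(10_000, 3))
volume_fields = rng.standard_normal((10_000, 4))  # ground truth, if present

# Reject points outside the (hypothetical) volume bounding box.
c_min = np.array([-1.0, -1.0, -1.0])
c_max = np.array([1.0, 1.0, 1.0])
inside = np.all((volume_coordinates >= c_min) & (volume_coordinates <= c_max), axis=1)
volume_coordinates = volume_coordinates[inside]
volume_fields = volume_fields[inside]

# Sample a fixed number of the remaining points, applying the same
# selection to the ground-truth fields.
n_sample = 1024
idx = rng.choice(volume_coordinates.shape[0], size=n_sample, replace=False)
sampled_coords = volume_coordinates[idx]
sampled_fields = volume_fields[idx]
```

Using one index array for both coordinates and fields keeps inputs and targets aligned after sampling.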
- scale_model_targets(
- fields: Tensor,
- factors: Tensor,
Scale the model targets based on the configured scaling factors.
- physicsnemo.datapipes.cae.domino_datapipe.compute_scaling_factors(
- cfg: DictConfig,
- input_path: str,
- target_keys: list[str],
- max_samples=20,
Using the dataset at the path, compute the mean, std, min, and max of the target keys.
- Parameters:
cfg – Hydra configuration object containing all parameters
input_path – Path to the dataset to load.
target_keys – List of keys to compute the mean, std, min, and max of.
use_cache – (deprecated) This argument has no effect.
The Transolver DataPipe reads the same inputs as the DoMINO DataPipe, but produces outputs for the Transolver and GeoTransolver models for external aerodynamics.
This code provides the datapipe for reading the processed npy files, generating multi-res grids, calculating signed distance fields, sampling random points in the volume and on surface, normalizing fields and returning the output tensors as a dictionary.
This datapipe also non-dimensionalizes the fields, so the order of the variables is fixed: velocity, pressure, and turbulent viscosity for volume variables; pressure and wall shear stress for surface variables. Parameters such as variable names, domain resolution, and sampling size are configurable in config.yaml.
- class physicsnemo.datapipes.cae.transolver_datapipe.TransolverDataConfig(
- data_path: Path | None,
- model_type: Literal['surface', 'volume', 'combined'] = 'surface',
- resolution: int = 200000,
- include_normals: bool = True,
- include_sdf: bool = True,
- include_geometry: bool = False,
- geometry_sampling: int = 300000,
- scaling_type: Literal['min_max_scaling', 'mean_std_scaling'] | None = None,
- surface_factors: Tensor | None = None,
- volume_factors: Tensor | None = None,
- translational_invariance: bool = False,
- reference_origin: Tensor | None = None,
- scale_invariance: bool = False,
- reference_scale: list[float] | None = None,
- broadcast_global_features: bool = True,
- volume_sample_from_disk: bool = True,
- return_mesh_features: bool = False,
Bases: object
Configuration for Transolver data processing pipeline.
Attributes:
- data_path#
Path to the dataset to load.
- Type:
pathlib.Path | None
- model_type#
Type of the model (“surface”, “volume”, or “combined”).
- Type:
Literal[‘surface’, ‘volume’, ‘combined’]
- resolution#
Resolution of the sampled data, per batch.
- Type:
int
- include_normals#
Whether to include surface normals in embeddings.
- Type:
bool
- include_sdf#
Whether to include signed distance fields in embeddings.
- Type:
bool
- translational_invariance#
Enable translational adjustment using center of mass.
- Type:
bool
- reference_origin#
Origin for translational invariance, defaults to the center of mass.
- Type:
torch.Tensor | None
- broadcast_global_features#
Whether to apply global features across all points.
- Type:
bool
- volume_sample_from_disk#
Whether to sample volume points directly from disk, reading contiguous chunks rather than the full volume data.
- Type:
bool
- return_mesh_features#
Whether to return the mesh areas and normals for the surface data. Used to compute force coefficients. Transformations are applied to the mesh coordinates.
- Type:
bool
- class physicsnemo.datapipes.cae.transolver_datapipe.TransolverDataPipe(
- input_path,
- model_type: Literal['surface', 'volume'],
- pin_memory: bool = False,
- **data_config_overrides,
Bases: Dataset
Base Datapipe for Transolver.
Leverages a dataset for the actual reading of the data, and this object is responsible for preprocessing the data.
- process_data(data_dict)[source]#
Preprocess the data. We have slight differences between surface and volume data processing, mostly revolving around the keys that represent the inputs.
- For surface data, we use the mesh coordinates and normals as the embeddings.
Normals are always normalized to unit length and represent a relative direction.
Coordinates can be shifted to the center of mass, and then the whole coordinate system can be aligned to the preferred direction.
SDF is identically 0 for surface data.
Optionally, if scale invariance is enabled, the coordinates are scaled by the (possibly rotated) scale factor.
- For volume data, we still use the volume coordinates.
Normals are approximated as the direction between the volume point and the closest mesh point, normalized to unit length.
SDF is not zero for volume data.
To make the calculations consistent and easy to follow:
- First, get the coordinates (volume_mesh_centers or surface_mesh_centers, usually), which is a configuration choice.
- Second, get the STL information. We need the “stl_vertices” and “stl_indices” to compute an SDF. We downsample “stl_coordinates” to potentially encode a geometry tensor, which is optional.
- Then, impose the optional symmetries:
  - Translation invariance: for every “position-like” tensor, subtract off the reference_origin if translation invariance is enabled.
  - Scale invariance: for every position-like tensor, multiply by the reference scale.
  - Rotation invariance: normals are rotated, points are rotated. Rotation requires not just a reference vector (in the config) but also a vector unique to this example, taken from the data, that we rotate to.
- After that, the rest is simple:
Spatial encodings are the point locations + normal vectors (optional) + SDF (optional). If the normals aren’t provided, we derive them from the center of mass (without SDF) or the closest SDF point (with SDF).
Geometry encoding (if used) is the STL coordinates, downsampled.
Parameter encodings are straightforward vectors / reference values.
The downstream applications can take the embeddings and the features as needed.
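The translation- and scale-invariance steps described above amount to simple tensor arithmetic on every position-like tensor. A minimal NumPy sketch, assuming the center of mass as the reference origin and an isotropic 1/extent reference scale (both are assumptions for illustration, not the pipeline's defaults):

```python
import numpy as np

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 4.0, size=(100, 3))  # a "position-like" tensor
normals = rng.standard_normal((100, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

# Translation invariance: subtract the reference origin (here, the
# center of mass) from every position-like tensor. Normals are
# directions, not positions, so they are left untouched.
reference_origin = coords.mean(axis=0)
coords = coords - reference_origin

# Scale invariance: multiply every position-like tensor by the
# reference scale (assumed isotropic 1 / extent here).
reference_scale = 1.0 / (coords.max() - coords.min())
coords = coords * reference_scale
```

After both steps the coordinates are centered at the origin with unit extent, while the unit normals are unchanged.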
- process_geometry(
- data_dict,
- center_of_mass: Tensor | None = None,
- scale_factor: Tensor | None = None,
Process the geometry data.
- scale_model_targets(
- fields: Tensor,
- factors: Tensor,
Scale the model targets based on the configured scaling factors.
- set_dataset(dataset: Iterable) → None[source]#
Pass a dataset to the datapipe to enable iterating over both in one pass.
- unscale_model_targets(
- fields: Tensor | None = None,
- air_density: Tensor | None = None,
- stream_velocity: Tensor | None = None,
- factor_type: Literal['surface', 'volume', 'auto'] = 'auto',
Unscale the model outputs based on the configured scaling factors.
The unscaling is included here to make it a consistent interface regardless of the scaling factors and type used.
- physicsnemo.datapipes.cae.transolver_datapipe.poisson_sample_indices_fixed(N: int, k: int, device=None)[source]#
This function is a nearly uniform sampler of indices for when the number of indices to sample is very large. It is useful when the number of indices exceeds 2^24 and torch.multinomial cannot be used. Unlike randperm, there is no need to materialize and shuffle the entire tensor of indices.
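A stratified sampler conveys the idea: split the index range into k equal strata and draw one index per stratum, giving k distinct, nearly uniformly spread indices without materializing a length-N permutation. This is a sketch of the concept, not the library's implementation:

```python
import numpy as np

def stratified_sample_indices(N: int, k: int, seed=None) -> np.ndarray:
    """Sample k distinct indices from range(N), nearly uniformly.

    Splits [0, N) into k equal strata and draws one index per stratum,
    so memory use is O(k) rather than O(N). Assumes N >= k.
    """
    rng = np.random.default_rng(seed)
    edges = np.linspace(0, N, k + 1)
    lo = edges[:-1].astype(np.int64)                    # stratum lower bounds
    hi = np.maximum(edges[1:].astype(np.int64), lo + 1)  # stratum upper bounds
    # One uniform draw inside each stratum:
    return lo + (rng.random(k) * (hi - lo)).astype(np.int64)

# 4096 indices out of 2**25 candidates, without a 2**25-long permutation.
idx = stratified_sample_indices(2**25, 4096, seed=0)
```

Because strata are disjoint, the result comes back sorted and duplicate-free, which is often convenient for chunked reads from disk.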