Active Learning#
PhysicsNeMo provides a flexible and extensible framework for building active learning workflows for scientific machine learning applications. In this section, we provide some motivation and background on active learning, discuss the abstraction provided by PhysicsNeMo, and then give some concrete examples of how to compose an active learning workflow.
Introduction#
The premise of active learning is to arrive at a desired model performance with a minimal number of training samples: the inspiration comes from teaching pedagogy, where students actively ask for information that will clarify their own internal understanding/knowledge. In the context of machine learning, active learning is semantically the same: a model (referred to as a “learner”) is trained on some initial training data, and afterwards is used to select data points, based on some heuristic, that will maximally improve model performance when the model is fine-tuned on the new data. This process continues until the model converges to some desired performance threshold.
For domain-specific applications, active learning has shown promise in computational fluid dynamics (CFD) and physics-based simulations. Rygiel et al. [1] demonstrate active learning for deep learning-based hemodynamic parameter estimation, and similar workflows can be implemented using the tools provided in PhysicsNeMo.
PhysicsNeMo Active Learning Abstraction#
In PhysicsNeMo, active learning comprises phases that are run sequentially in a core loop:
Training/fine-tuning: A “learner” or surrogate model is initially trained on available data, and in subsequent active learning iterations, is fine-tuned with the new data appended on the original dataset.
Querying: One or more strategies that encode some heuristics for what new data is most informative for the learner. Examples of this include uncertainty-based methods, where a model quantifies its own uncertainty over a pool of unlabeled data, and the most uncertain data points are selected for labeling.
Labeling: A method of obtaining ground truth (labels) for new data points, pipelined from the querying stage. This may entail running an expensive (external) solver, simulator, or workflow, or even acquiring experimental data.
Generally speaking, this is sufficient for most research workflows. For production, where observability is critical for “business logic”, we define an additional phase called metrology, which allows users to quantify active learning success in a way that extends beyond simple validation metrics, e.g. running a surrogate model through a solver/simulator/workflow to assess model performance and stability. The metrology phase is optional and is run after training/fine-tuning.
The primary pattern for constructing active learning components is structural
sub-typing: the module physicsnemo.active_learning.protocols contains
Python Protocol classes that describe the expected interface, which for many intents
and purposes can be directly subclassed as you would any other Python (abstract)
base class. The benefit of using typing.Protocol rather than abc.ABC
is that inheritance is not required—as long as objects provide the expected
methods and attributes at runtime, users can freely adapt and extend existing
code to use with PhysicsNeMo’s active learning.
To implement the phases of the core loop, we generally refer to components as
“strategies”, i.e. classes that encode how and what to do for a particular
part of active learning. The base protocol for this is
physicsnemo.active_learning.protocols.ActiveLearningProtocol,
which provides common aspects for all strategies such as communication with
the active learning orchestrator, logging, and so on.
- class physicsnemo.active_learning.protocols.ActiveLearningProtocol(*args: Any, **kwargs: Any)[source]#
Bases: Protocol
This protocol acts as a basis for all active learning protocols. This ensures that all protocols have some common interface, for example the ability to attach() to another object for scope management.
- __protocol_name__#
The name of the protocol. This is primarily used for repr and str f-strings. This should be defined by concrete implementations.
- Type:
str
- _args#
A dictionary of arguments that were used to instantiate the protocol. This is used for serialization and deserialization of the protocol, and follows the same pattern as the _args attribute of physicsnemo.Module.
- Type:
dict[str, Any]
- attach(self, other: object) -> None[source]#
This method is used to attach the current object to another, allowing the protocol to access the attached object’s scope. The use case for this is to allow a protocol access to the driver’s scope to access dataset, model, etc. as needed. This needs to be implemented by concrete implementations.
- is_attached: bool
Whether the current object is attached to another object. This is left abstract, as it depends on how attach() is implemented.
- logger: Logger
The logger for this protocol. This is used to log information about the protocol’s progress.
- _setup_logger(self) -> None[source]#
This method is used to set up the logger for the protocol. The default implementation is to configure the logger similarly to how physicsnemo loggers are configured.
See also
QueryStrategy: Query strategy protocol (child)
LabelStrategy: Label strategy protocol (child)
MetrologyStrategy: Metrology strategy protocol (child)
DriverProtocol: Main orchestrator that uses these protocols
- attach(other: object) -> None[source]#
This method is used to attach another object to the current protocol, allowing the attached object to access the scope of this protocol. The primary reason for this is to allow the protocol to access things like the dataset, the learner model, etc. as needed.
Example use cases would be for a query strategy to access the unlabeled_pool; for a metrology strategy to access the validation_pool; and for any strategy to be able to access the surrogate/learner model. This method can be as simple as setting self.driver = other, but is left abstract in case there are other potential use cases where multiple protocols could share information.
- Parameters:
other (object) – The object to attach to.
- property checkpoint_dir: Path#
Utility property for strategies to conveniently access the checkpoint directory.
This is useful for (de)serializing data tied to checkpointing.
- Returns:
The checkpoint directory, which includes the active learning step index.
- Return type:
Path
- Raises:
RuntimeError – If the strategy is not attached to a driver yet.
- property is_attached: bool#
Property to check if the current object is already attached.
This is left abstract, as it depends on how attach is implemented.
- Returns:
True if the current object is attached, False otherwise.
- Return type:
bool
- property logger: Logger#
Property to access the logger for this protocol.
If the logger has not been configured yet, the property will call the _setup_logger method to configure it.
- Returns:
The logger for this protocol.
- Return type:
Logger
- property strategy_dir: Path#
Returns the directory that the underlying strategy can use to persist data.
Depending on the strategy abstraction, further nesting may be required (e.g. active learning step index, phase, etc.).
- Returns:
The directory where the strategy will persist its records.
- Return type:
Path
- Raises:
RuntimeError – If the strategy is not attached to a driver yet.
Strategies then branch from this base protocol into specialized protocols for querying, labeling, and metrology.
Model Weight Updates#
These three strategies are functionally similar, and so their inheritance is relatively straightforward. When it comes to model training and inference, however, we abstract out the logic into more atomic protocols:
physicsnemo.active_learning.protocols.TrainingProtocol: Defines the per-step logic for training a model. Effectively, a function that accepts a model and some data, and returns a loss tensor that is backward-ready.
physicsnemo.active_learning.protocols.ValidationProtocol: Defines the per-step logic for validating a model. Mirrors the training protocol, but does not expect a loss tensor to be returned.
physicsnemo.active_learning.protocols.InferenceProtocol: Defines the per-step logic for model inference outside of training and validation. This may involve additional operations that are not necessary for training or validation, e.g. disabling gradient computation and not expecting ground truth values.
The first two protocols are then composed together to form a training loop;
the physicsnemo.active_learning.protocols.TrainingLoop protocol.
This protocol defines a functional interface for the epoch-level logic
for model training and fine-tuning. However, because the typical training
loop is relatively generalized, we provide a default implementation in the
form of physicsnemo.active_learning.loop.DefaultTrainingLoop that
implements the TrainingLoop protocol—the end-user only needs to provide
the per-step functions (i.e. concrete TrainingProtocol and ValidationProtocol
implementations) to use with the default training loop.
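To make this concrete, the sketch below shows what per-step functions might look like for a simple regression learner. The exact call signatures are defined by the TrainingProtocol and ValidationProtocol classes; the batch field names (coords, target) and the loss choices here are illustrative assumptions rather than part of the PhysicsNeMo API. Concrete implementations along these lines are what you would hand to DefaultTrainingLoop instead of writing an epoch loop yourself.
import torch
import torch.nn.functional as F


def train_step(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> torch.Tensor:
    """Per-step training logic: return a loss tensor that is backward-ready."""
    prediction = model(batch["coords"])
    return F.mse_loss(prediction, batch["target"])


def validation_step(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> None:
    """Per-step validation logic: compute metrics, but do not return a loss."""
    with torch.no_grad():
        prediction = model(batch["coords"])
        error = F.l1_loss(prediction, batch["target"])
    # Accumulate or log ``error`` however your workflow prefers.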
Interphase communication#
There are two primary mechanisms for interphase communication: through the use
of queues, and by “attaching” a strategy to the active learning driver. The former
is used primarily to pass data between querying and labeling strategies, i.e. the
querying phase enqueues data for the labeling process to consume, and the labeled data
is subsequently enqueued to be appended to the training pool. To this end, we expect a highly generic
queue interface described by
physicsnemo.active_learning.protocols.AbstractQueue; nominally this
just needs to be a first-in-first-out (FIFO) queue, but generality is important
to allow multiprocessing (e.g. torch.multiprocessing) or even more advanced
queues for multi-node and/or asynchronous workflows. In the simplest case, a
queue.Queue will suffice.
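For example, a single-process workflow can rely entirely on the standard library; the sketch below shows the plain queue and, as a comment, how a multiprocessing-capable queue could be swapped in under the same FIFO interface.
from queue import Queue

# A plain FIFO queue satisfies the minimal put()/get() interface for single-process runs.
query_queue: Queue = Queue()

# For labeling in worker processes, a queue with the same FIFO semantics could be
# swapped in, e.g.:
#   import torch.multiprocessing as mp
#   query_queue = mp.Queue()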
The other mechanism operates by attaching a strategy to the active learning driver, which allows the two classes to share scope. This is useful for sharing state, as well as accessing attributes like data pools, models, etc. as needed. This will be further discussed later.
Active Learning Driver#
Once concrete implementations of the strategies outlined above are defined,
all of the necessary components are nominally available to construct an
active learning workflow. Orchestration of the workflow is defined by the
physicsnemo.active_learning.protocols.DriverProtocol class,
which primarily defines the attributes expected for active learning. Since
there is a significant amount of boilerplate, we also provide a default
implementation in physicsnemo.active_learning.driver.Driver,
which implements the DriverProtocol and can be used out of the box
with flexibility for customization as required.
The Driver class serves as the central orchestrator for the active learning
process, managing the execution of all phases and coordinating communication
between components. At its core, the driver maintains references to:
Configuration objects: DriverConfig for infrastructure settings (batch size, logging, distributed training), StrategiesConfig for active learning strategies, and TrainingConfig for training components.
The learner model: Either a physicsnemo.Module or any model implementing the protocol physicsnemo.active_learning.protocols.LearnerProtocol.
Data pools: Training, validation, and unlabeled data pools used throughout the workflow.
Queues: FIFO queues for passing data between query and labeling phases.
The driver executes active learning iterations through the active_learning_step
method, which orchestrates the four phases in sequence: training, metrology, query,
and labeling. Each phase can be selectively enabled or disabled through configuration
flags (skip_training, skip_metrology, skip_labeling), allowing you to
customize the workflow for your specific use case. For example, you might disable
training to perform pure inference and querying, or disable labeling for exploratory
analysis.
Key features of the driver include:
Distributed training support: Automatic integration with PhysicsNeMo’s DistributedManager for multi-GPU and multi-node training. The driver handles model wrapping with DistributedDataParallel, distributed sampling, and synchronization barriers between phases.
Checkpointing: Comprehensive checkpoint management that saves active learning state (configurations, queues, step index, phase), model weights, and training state (optimizer, scheduler). Checkpoints can be saved at configurable intervals and allow seamless resumption of experiments from any phase.
Flexible training: Support for both initial training and fine-tuning with separate epoch limits and learning rates. The driver can optionally reset optimizer states between active learning iterations for consistent fine-tuning.
Logging: Specialized logging that automatically includes active learning context (step index, phase) in all log messages for better observability.
The driver provides both run and active_learning_step methods. Use run
to execute the complete active learning loop until the maximum number of steps is
reached, or call active_learning_step directly for more granular control over
individual iterations. The driver can also be called directly (driver()) as
syntactic sugar for driver.run().
For checkpoint resumption, use the load_checkpoint class method to reconstruct
a driver instance from a saved checkpoint directory. This method handles loading
all configurations, model weights, optimizer state, and queue contents, allowing
you to resume experiments exactly where they left off.
Configuration of the Driver behavior is handled by a set of config schemas
implemented as dataclasses in physicsnemo.active_learning.config, such as
DriverConfig, StrategiesConfig, and TrainingConfig.
These config schemas are passed into the Driver constructor to change aspects
such as the batch size, number of active learning steps, how optimization is
performed (with what optimizer and learning rate scheduler), checkpointing,
and so on.
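A skeletal assembly might look like the sketch below. The ellipses stand in for the actual dataclass fields and constructor arguments, which are defined by the config schemas and the Driver class itself; consult physicsnemo.active_learning.config and the bundled examples for the exact names.
from physicsnemo.active_learning.config import (
    DriverConfig,
    StrategiesConfig,
    TrainingConfig,
)
from physicsnemo.active_learning.driver import Driver

# The placeholders (...) below stand for the real dataclass fields and arguments.
driver_config = DriverConfig(...)          # infrastructure: batch size, logging, checkpointing
strategies_config = StrategiesConfig(...)  # query/label/metrology strategy instances
training_config = TrainingConfig(...)      # optimizer, scheduler, epoch limits

driver = Driver(...)  # learner model, data pools, queues, and the configs above

driver.run()                     # run the full loop, or
# driver.active_learning_step()  # execute a single iteration for finer control
# driver = Driver.load_checkpoint(...)  # resume from a saved checkpoint directory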
Example Usage#
We have curated a number of examples to both motivate the use of active learning
as well as to demonstrate how to construct and execute workflows with PhysicsNeMo.
Examples can be found in physicsnemo/examples/active_learning, and we will
cover the high-level concepts here.
Two Moons Classification#
This is a classic data science problem, and is useful from a pedagogical perspective
to motivate and demonstrate the use and analysis of active learning; the code for
this example can be found in physicsnemo/examples/active_learning/moons. The
context behind this example is binary classification of points in 2D space that
come from two distinct distributions—i.e. the “two moons”. The active learning
use case here is somewhat artificial, but serves to communicate the basic concepts,
the workflow, and how to understand the results.
The high level description of the code contained in the example is as follows:
We define a simple MLP classifier that takes in 2D coordinates and outputs logits for a binary classification task. The model subclasses physicsnemo.Module to take advantage of checkpointing.
We define a query strategy that selects points that are most uncertain, i.e. have predictions closest to 0.5. This is implemented by the ClassifierUQQuery class in moon_strategies.py.
We define a labeling strategy that labels the points selected by the query strategy. This is implemented by the DummyLabelStrategy class in moon_strategies.py, and is so called because there is no real “labeling” process required: the points are already labeled, and the strategy simply “releases” them to the training pool.
We define a metrology strategy that measures the precision, recall, and F1 score of the model. This simple example does not perfectly justify metrology, but it serves to show the phase being used to track active learning progress. The F1Metrology strategy is implemented in moon_strategies.py.
The script to run is moon_example.py, and shows how all of the components are put together to form the full workflow.
To execute the example, run the following after installing PhysicsNeMo:
python moon_example.py
This will create an active_learning_logs/<run_id>/ directory containing:
Model checkpoints: .mdlus files saved according to checkpoint_interval
Driver logs: driver_log.json tracking the active learning process
Metrology outputs: f1_metrology.json with precision/recall/F1 scores over iterations
The main thing to monitor in this experiment is the f1_metrology.json output,
which is a product of the F1Metrology strategy: here, we compute precision/recall/F1 values, which can then be plotted against the number of active learning steps
to show how the model performance improves as more data points are added to the
training set—the F1 score should improve as the model nominally balances between
precision and recall. We encourage the reader to alter the configuration to see how
results change, as well as to implement a random sampling query strategy to use as a
baseline comparison; a minimal sketch of such a baseline is given after the note below.
Note
Given that the problem is simple, the uncertainty-based query strategy may not improve substantially over random sampling as it may simply require more data points, as opposed to maximum information content per data sample. Active learning is still a highly experimental field, and success depends heavily on problem scope, data availability, model capacity/learning dynamics, and so on.
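As a starting point for that baseline comparison, the sketch below shows a random sampling query strategy following the same structural pattern described in the Querying Strategies section further down; it assumes the unlabeled pool supports len() and integer indexing.
import random

from physicsnemo.active_learning.protocols import AbstractQueue, QueryStrategy


class RandomQueryStrategy(QueryStrategy):
    """Baseline strategy: query a uniformly random subset of the unlabeled pool."""

    __protocol_name__ = "RandomQuery"

    def __init__(self, max_samples: int, seed: int = 42):
        self.max_samples = max_samples
        self.rng = random.Random(seed)
        self.driver = None

    def attach(self, driver):
        self.driver = driver

    @property
    def is_attached(self) -> bool:
        return self.driver is not None

    def sample(self, query_queue: AbstractQueue, *args, **kwargs):
        pool = self.driver.unlabeled_pool
        num_to_draw = min(self.max_samples, len(pool))
        # Draw indices without replacement and enqueue the corresponding samples.
        for idx in self.rng.sample(range(len(pool)), k=num_to_draw):
            query_queue.put(pool[idx])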
Customizing Active Learning#
Having shown some concrete examples, we now discuss how end-users can develop their own active learning workflows beyond composing the provided examples. Aside from computation/execution, we anticipate the two most relevant components to be querying and labeling strategies.
See also
API Documentation: PhysicsNeMo Active Learning
Querying Strategies#
A query strategy defines the heuristics for selecting which data points from the
unlabeled pool should be labeled next. The goal is to identify samples that will
maximally improve model performance when added to the training set. PhysicsNeMo
provides the physicsnemo.active_learning.protocols.QueryStrategy protocol
to standardize this interface.
To implement a custom query strategy, your class should provide:
A max_samples attribute specifying how many samples to query per iteration
A sample method that selects data points and enqueues them for labeling
The attach method to access the driver’s scope (model, unlabeled pool, etc.)
An is_attached property to verify attachment status
The core logic resides in the sample method, which receives a queue to populate
with selected data points. This method typically:
Accesses the unlabeled pool via the attached driver (self.driver.unlabeled_pool)
Performs inference on unlabeled samples using the current model
Computes selection criteria (uncertainty, residuals, diversity, etc.)
Selects the top max_samples candidates based on the criteria
Enqueues selected samples to the query_queue
Here’s a minimal example implementing an uncertainty-based query strategy:
from physicsnemo.active_learning.protocols import QueryStrategy, AbstractQueue
import torch
from torch.utils.data import DataLoader
class UncertaintyQueryStrategy(QueryStrategy):
"""Select samples where model predictions are most uncertain."""
__protocol_name__ = "UncertaintyQuery"
def __init__(self, max_samples: int, batch_size: int = 32):
self.max_samples = max_samples
self.batch_size = batch_size
self.driver = None
def attach(self, driver):
"""Attach to the active learning driver."""
# this will be called by ``Driver.attach_strategies``
self.driver = driver
@property
def is_attached(self) -> bool:
"""Check if attached to a driver."""
# this is mainly used for exception handling
return self.driver is not None
def sample(self, query_queue: AbstractQueue, *args, **kwargs):
"""Select most uncertain samples from unlabeled pool."""
# Create dataloader for unlabeled pool; this isn't necessarily
# best practice, but transparent for this example
unlabeled_loader = DataLoader(
self.driver.unlabeled_pool,
batch_size=self.batch_size,
shuffle=False
)
# Compute uncertainty scores for all unlabeled samples
uncertainties = []
indices = []
self.driver.learner.eval()
# this model uses entropy from classification, but it can also be
# based on query-by-committee or other forms of uncertainty
# quantification
with torch.no_grad():
for idx, batch in enumerate(unlabeled_loader):
# Run inference
predictions = self.driver.learner(batch)
# Compute uncertainty (e.g., entropy for classification)
probs = torch.softmax(predictions, dim=-1)
entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=-1)
uncertainties.append(entropy)
indices.extend(range(idx * self.batch_size,
idx * self.batch_size + len(batch)))
# Concatenate all uncertainties
all_uncertainties = torch.cat(uncertainties)
# Select top-k most uncertain samples
top_k_indices = torch.topk(all_uncertainties,
k=min(self.max_samples, len(all_uncertainties))).indices
# Enqueue selected samples
for idx in top_k_indices:
sample = self.driver.unlabeled_pool[indices[idx]]
query_queue.put(sample)
self.logger.info(f"Queried {len(top_k_indices)} samples for labeling")
This example demonstrates the key components: accessing the driver’s resources, performing inference to compute selection criteria, and enqueuing the most informative samples. The strategy can be as simple or sophisticated as needed—physics-informed residuals, ensemble disagreement, and diversity-based sampling are all valid approaches that follow this same pattern.
Tip
The Driver abstraction is designed to facilitate a list of query strategies,
with the idea that for production use-cases, multiple heuristics could be combined
to increase sampling efficiency. The recommended approach is to modularize query
strategies into separate query classes, and combine them within Driver as
needed.
Labeling Strategies#
A label strategy defines how to obtain ground truth labels for data points selected by query strategies. While query strategies identify which samples need labels, label strategies define how to generate those labels. In scientific machine learning applications, this often involves calling external solvers, simulators, or computational workflows that may be expensive and time-consuming.
PhysicsNeMo provides the physicsnemo.active_learning.protocols.LabelStrategy
protocol to standardize this interface. To implement a custom label strategy, your class
should provide:
A label method that consumes queued samples and produces labeled data
The attach method to access the driver’s scope if needed
An is_attached property to verify attachment status
A __is_external_process__ attribute indicating whether external processes are used
A __provides_fields__ attribute indicating which fields of a data structure the label strategy provides
The core logic resides in the label method, which receives two queues: one
containing data to be labeled (populated by query strategies), and another for
enqueuing newly labeled data back to the training pool. This method typically:
Dequeues samples from the queue_to_label
Processes each sample to obtain ground truth labels (via simulation, solver, etc.)
Enqueues labeled samples to the serialize_queue for integration into training data
Tip
When calling external processes, consider implementing retry logic for transient failures, or some way to gracefully handle fail states so as not to interrupt the automated active learning workflow. You can consider serializing these samples for manual retrying.
Here’s an example implementing a label strategy that calls an external CFD solver, and packs the result into a dictionary:
import subprocess
import json
from pathlib import Path
from physicsnemo.active_learning.protocols import LabelStrategy, AbstractQueue
class CFDSolverLabelStrategy(LabelStrategy):
"""Label samples by running an external CFD solver."""
__protocol_name__ = "CFDSolverLabel"
__is_external_process__ = True
__provides_fields__ = {"pressure", "velocity"}
def __init__(
self,
solver_executable: str,
working_dir: Path,
timeout: int = 3600,
):
"""
Initialize the CFD solver label strategy.
Parameters
----------
solver_executable : str
Path to the external solver executable
working_dir : Path
Directory for solver input/output files
timeout : int
Maximum time (seconds) to wait for solver completion
"""
self.solver_executable = solver_executable
self.working_dir = Path(working_dir)
self.timeout = timeout
self.driver = None
# Ensure working directory exists
self.working_dir.mkdir(parents=True, exist_ok=True)
def label(
self,
queue_to_label: AbstractQueue,
serialize_queue: AbstractQueue,
) -> None:
"""
Label queued samples by running external CFD solver.
For each sample in the queue, this method:
1. Writes solver input files with sample parameters
2. Invokes the external solver via subprocess
3. Parses solver output to extract labels
4. Enqueues labeled data for training integration
Parameters
----------
queue_to_label : AbstractQueue
Queue containing unlabeled samples from query strategies
serialize_queue : AbstractQueue
Queue for enqueuing labeled samples
"""
sample_idx = 0
while not queue_to_label.empty():
sample = queue_to_label.get()
# Create unique directory for this sample
sample_dir = self.working_dir / f"sample_{sample_idx}"
sample_dir.mkdir(exist_ok=True)
# Write solver input files (format depends on your solver)
input_file = sample_dir / "input.json"
with open(input_file, "w") as f:
json.dump(sample, f)
# Invoke external solver
try:
result = subprocess.run(
[
self.solver_executable,
"--input", str(input_file),
"--output", str(sample_dir / "output.json"),
],
cwd=sample_dir,
timeout=self.timeout,
capture_output=True,
text=True,
check=True,
)
self.logger.info(
f"Solver completed for sample {sample_idx} "
f"(return code: {result.returncode})"
)
except subprocess.TimeoutExpired:
self.logger.error(
f"Solver timeout for sample {sample_idx} "
f"after {self.timeout}s"
)
continue
except subprocess.CalledProcessError as e:
self.logger.error(
f"Solver failed for sample {sample_idx}: {e.stderr}"
)
continue
# Parse solver output and create labeled sample
output_file = sample_dir / "output.json"
if output_file.exists():
with open(output_file) as f:
solver_output = json.load(f)
# Combine input sample with solver-generated labels
labeled_sample = {
**sample,
"pressure": solver_output["pressure"],
"velocity": solver_output["velocity"],
}
# Enqueue for integration into training pool
serialize_queue.put(labeled_sample)
sample_idx += 1
def attach(self, driver) -> None:
"""Attach to the active learning driver."""
self.driver = driver
@property
def is_attached(self) -> bool:
"""Check if attached to a driver."""
return self.driver is not None
This example makes the following assumptions, which may not hold for your case:
The solver executable is on the PATH and is invoked relatively simply with a few command line arguments. If a Python API is available, it is recommended to use API bindings instead of subprocess.
The solver output is a JSON file that can be parsed easily. We recommend modularizing the parsing logic into a separate function or class.
Warning
For computationally expensive solvers, consider parallelizing the labeling process
across multiple workers or compute nodes. While the queue-based interface is designed
so that labeling could be parallelized, currently the default Driver
implementation does not support parallel or asynchronous labeling and we welcome
users to recommend use-cases for consideration.
Practical Considerations for Scientific ML#
This section discusses practical considerations when developing active learning workflows for scientific computing applications.
Baseline Workflow Development#
When developing a new active learning workflow, establishing a baseline is critical for validating infrastructure correctness before introducing complexity. A common pattern is to use pre-computed datasets with a pass-through label strategy:
from queue import Queue

from physicsnemo.active_learning.protocols import LabelStrategy


class DummyLabelStrategy(LabelStrategy):
    """Pass-through labeling for pre-computed datasets."""
__is_external_process__ = False
def label(self, queue_to_label: Queue, serialize_queue: Queue) -> None:
"""Transfer samples from query queue to serialize queue."""
while not queue_to_label.empty():
sample = queue_to_label.get()
serialize_queue.put(sample)
This approach enables validation of the active learning loop, checkpointing, and data management without the complexity of external solver integration.
Query Strategy Selection#
The choice of query strategy involves trade-offs between computational resources, querying speed, and domain knowledge requirements. A few common approaches are listed below:
Random Sampling
Random query strategies serve as essential baselines for comparing more sophisticated approaches. In many scenarios, particularly during early training phases when models are learning basic patterns, random sampling can perform comparably to more complex methods while requiring minimal computational overhead.
Uncertainty Quantification
Ensemble methods quantify model uncertainty by measuring prediction variance. The model can be designed to output prediction uncertainties, or a deep ensemble (a committee of models) can be trained to obtain ensemble predictions. This approach offers:
Uncertainty estimates useful beyond active learning
If training a deep ensemble, training cost that scales linearly with ensemble size
Note
Selecting only the highest uncertainty samples may lead to biased training sets. Consider hybrid strategies that combine uncertainty-based selection with random sampling (e.g., 60% high uncertainty, 40% random) to maintain distribution coverage.
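A minimal sketch of such a hybrid selection is shown below; the function name, the 60/40 split, and the assumption that uncertainty scores are available as a 1D tensor are all illustrative.
import torch


def hybrid_selection(
    uncertainties: torch.Tensor,
    max_samples: int,
    uncertain_fraction: float = 0.6,
) -> torch.Tensor:
    """Return indices mixing top-uncertainty picks with uniformly random picks."""
    n_uncertain = int(max_samples * uncertain_fraction)
    n_random = max_samples - n_uncertain
    # Highest-uncertainty candidates.
    top = torch.topk(uncertainties, k=n_uncertain).indices
    # Random candidates drawn from the remainder of the pool.
    remaining = torch.ones_like(uncertainties, dtype=torch.bool)
    remaining[top] = False
    candidates = torch.nonzero(remaining).flatten()
    random_picks = candidates[torch.randperm(len(candidates))[:n_random]]
    return torch.cat([top, random_picks])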
Physics Residuals
Physics-based query strategies select samples where model predictions exhibit the highest violation of governing equations. This approach offers:
Explicit identification of physics-violating samples
Domain knowledge incorporation into sample selection
The physics-informed approach is suitable when memory constraints limit deep ensemble training and when mesh data and residual computation infrastructure are available.
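As an illustration, the sketch below scores a candidate set of collocation points by the continuity (incompressibility) residual of a 2D velocity prediction; the assumed model input/output shapes and the choice of governing equation are problem-specific assumptions.
import torch


def continuity_residual_score(model: torch.nn.Module, coords: torch.Tensor) -> torch.Tensor:
    """Mean |div u| of the predicted velocity field at the given 2D points.

    Assumes ``model`` maps (N, 2) coordinates to (N, 2) velocity components (u, v).
    """
    coords = coords.clone().requires_grad_(True)
    velocity = model(coords)
    diagonal_grads = []
    for component in range(velocity.shape[-1]):
        grad = torch.autograd.grad(
            velocity[:, component].sum(), coords, retain_graph=True
        )[0]
        # Keep du/dx for the u component and dv/dy for the v component.
        diagonal_grads.append(grad[:, component])
    divergence = torch.stack(diagonal_grads, dim=-1).sum(dim=-1)
    # Larger violation of div u = 0 means higher query priority.
    return divergence.abs().mean()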
Solver Integration for Labeling#
Production active learning workflows typically require integration with external solvers to generate labels for queried samples. A systematic integration approach involves several stages:
Development Workflow
Initial development should use pre-computed datasets with the pass-through labeling strategy described earlier. This validates query strategies and training infrastructure before introducing solver integration complexity.
Directory Management
Solver integrations should use isolated directories for each sample to prevent conflicts and enable parallel execution in future extensions:
class SolverLabelStrategy(LabelStrategy):
    def __init__(self, solver_path: str, working_dir: Path):
        self.solver_path = solver_path
        self.working_dir = Path(working_dir)
        self.working_dir.mkdir(parents=True, exist_ok=True)

    def label(self, queue_to_label: Queue, serialize_queue: Queue) -> None:
        sample_id = 0
        while not queue_to_label.empty():
            sample = queue_to_label.get()
            # Create isolated directory for this sample
            sample_dir = self.working_dir / f"sample_{sample_id:06d}"
            sample_dir.mkdir(exist_ok=True)
            # Run solver and parse results
            labeled_data = self._run_solver(sample, sample_dir)
            serialize_queue.put(labeled_data)
            sample_id += 1
Solver Parameterization
Solver configuration is typically passed via command-line arguments, though the specific interface varies by solver. A representative pattern for subprocess-based invocation:
def _run_solver(self, sample: dict, sample_dir: Path) -> dict:
"""Execute solver and return labeled data."""
# Serialize sample parameters to solver input format
input_file = sample_dir / "input.json"
with open(input_file, "w") as f:
json.dump({
"geometry": sample["geometry_params"],
"flow_conditions": sample["boundary_conditions"],
}, f)
# Construct solver command with parameterized arguments
output_file = sample_dir / "output.json"
cmd = [
self.solver_path,
"--input", str(input_file),
"--output", str(output_file),
"--mesh-size", "fine",
"--convergence-tol", "1e-6",
]
# Execute solver with timeout
result = subprocess.run(
cmd,
cwd=sample_dir,
capture_output=True,
text=True,
timeout=3600,
)
if result.returncode != 0:
self.logger.error(f"Solver failed: {result.stderr}")
raise RuntimeError("Solver execution failed")
# Parse and structure solver output
with open(output_file) as f:
solver_results = json.load(f)
return {
**sample,
"pressure": solver_results["p"],
"velocity": solver_results["U"],
}
The command-line arguments (--input, --output, solver-specific flags) should
be adapted to match the specific solver’s interface.
A good example is the AirFRANS dataset (Extrality/NACA_simulation), where the OpenFOAM configuration is wrapped with Python.
Production Deployment Considerations#
Several operational aspects become important when deploying active learning workflows in production environments.
Monitoring and Metrics
Comprehensive monitoring extends beyond validation loss to include:
Per-field error metrics (e.g., separate errors for velocity, pressure, temperature)
Sample efficiency curves relating performance to dataset size
Query strategy effectiveness indicators
Solver performance statistics (execution time, convergence behavior, failure rates)
These metrics enable informed decisions about workflow termination and strategy effectiveness.
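For example, per-field errors could be computed inside a metrology strategy and appended to its records; the sketch below assumes predictions and targets are dictionaries keyed by field name, which is an illustrative convention rather than a PhysicsNeMo requirement.
import torch


def per_field_relative_l2(
    prediction: dict[str, torch.Tensor], target: dict[str, torch.Tensor]
) -> dict[str, float]:
    """Relative L2 error per output field (e.g. velocity, pressure, temperature)."""
    errors = {}
    for field, true_value in target.items():
        difference = torch.linalg.vector_norm(prediction[field] - true_value)
        scale = torch.linalg.vector_norm(true_value).clamp_min(1e-12)
        errors[field] = (difference / scale).item()
    return errors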
Learning Rate Schedules
Active learning requires distinct learning rate schedules for initial training and subsequent fine-tuning phases. A representative configuration:
Initial training (step 0): CosineAnnealingLR from 1e-3 to 1e-6 over 100 epochs
Fine-tuning (steps 1+): ExponentialLR from 5e-4 to 5e-6 over 10 epochs
The lower initial learning rate and shorter schedule for fine-tuning help prevent catastrophic forgetting of previously learned patterns.
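In PyTorch terms, the schedules above could be constructed as in the sketch below; how they are then wired into TrainingConfig depends on the config schema, and the placeholder model and optimizer choice are illustrative.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, ExponentialLR

model = torch.nn.Linear(2, 2)  # placeholder model for illustration

# Initial training (step 0): cosine decay from 1e-3 to 1e-6 over 100 epochs.
initial_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
initial_scheduler = CosineAnnealingLR(initial_optimizer, T_max=100, eta_min=1e-6)

# Fine-tuning (steps 1+): exponential decay from 5e-4 to 5e-6 over 10 epochs,
# i.e. gamma = (5e-6 / 5e-4) ** (1 / 10) ~= 0.63.
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
finetune_scheduler = ExponentialLR(finetune_optimizer, gamma=(5e-6 / 5e-4) ** (1 / 10))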
Termination Criteria
Active learning workflows should terminate when additional data provides diminishing returns. Common termination indicators include:
Validation error plateaus over consecutive iterations
Marginal performance improvements no longer justify solver costs
Exhaustion of unlabeled data pool or computational budget
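A simple plateau check, evaluated after each active learning step, can encode the first of these indicators; the patience and tolerance values below are illustrative.
def has_plateaued(
    validation_errors: list[float], patience: int = 3, min_delta: float = 1e-4
) -> bool:
    """True if the best validation error has not improved by ``min_delta``
    within the last ``patience`` active learning steps."""
    if len(validation_errors) <= patience:
        return False
    best_before = min(validation_errors[:-patience])
    best_recent = min(validation_errors[-patience:])
    return (best_before - best_recent) < min_delta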