Active Learning#
PhysicsNeMo provides a flexible and extensible framework for building active learning workflows for scientific machine learning applications. In this section, we provide some motivation and background on active learning, discuss the abstraction provided by PhysicsNeMo, and then give some concrete examples of how to compose an active learning workflow.
Introduction#
The premise of active learning is to arrive at a desired model performance with a minimal number of training samples: the inspiration comes from teaching pedagogy, where students actively ask for information that will clarify their own internal understanding/knowledge. In the context of machine learning, active learning is semantically the same: a model (referred to as a “learner”) is trained on some initial training data, and afterwards is used to select data points, based on some heuristic, that will maximally improve model performance when the model is fine-tuned on the new data. This process continues until the model converges to some desired performance threshold.
For domain-specific applications, active learning has shown promise in computational fluid dynamics (CFD) and physics-based simulations. Rygiel et al. [1] demonstrate active learning for deep learning-based hemodynamic parameter estimation, and similar workflows can be implemented using the tools provided in PhysicsNeMo.
PhysicsNeMo Active Learning Abstraction#
In PhysicsNeMo, active learning comprises phases that are run sequentially in a core loop:
Training/fine-tuning: A “learner” or surrogate model is initially trained on available data, and in subsequent active learning iterations, is fine-tuned with the new data appended on the original dataset.
Querying: One or more strategies that encode some heuristics for what new data is most informative for the learner. Examples of this include uncertainty-based methods, where a model quantifies its own uncertainty over a pool of unlabeled data, and the most uncertain data points are selected for labeling.
Labeling: A method of obtaining ground truth (labels) for new data points, pipelined from the querying stage. This may entail running an expensive (external) solver, simulator, or workflow, or even acquiring experimental data.
Generally speaking, this is sufficient for most research workflows. For production, where observability is critical for “business logic”, we define an additional phase called metrology, which allows users to quantify active learning success in a way that extends beyond simple validation metrics, e.g. running a surrogate model through a solver/simulator/workflow to assess model performance and stability. The metrology phase is optional and is run after training/fine-tuning.
The primary pattern for constructing active learning components is structural
sub-typing: the module physicsnemo.active_learning.protocols contains
Python Protocol classes that describe the expected interface, which for many intents
and purposes can be directly subclassed as you would any other Python (abstract)
base class. The benefit of using typing.Protocol rather than abc.ABC
is that inheritance is not required—as long as objects provide the expected
methods and attributes at runtime, users can freely adapt and extend existing
code to use with PhysicsNeMo’s active learning.
To implement the phases of the core loop, we generally refer to components as
“strategies”, i.e. classes that encode how and what to do for a particular
part of active learning. The base protocol for this is
physicsnemo.active_learning.protocols.ActiveLearningProtocol,
which provides common aspects for all strategies such as communication with
the active learning orchestrator, logging, and so on.
- class physicsnemo.active_learning.protocols.ActiveLearningProtocol(*args: Any, **kwargs: Any)[source]#
Bases: Protocol
This protocol acts as a basis for all active learning protocols. This ensures that all protocols have some common interface, for example the ability to attach() to another object for scope management.
- __protocol_name__#
The name of the protocol. This is primarily used for repr and str f-strings. This should be defined by concrete implementations.
- Type:
str
- _args#
A dictionary of arguments that were used to instantiate the protocol. This is used for serialization and deserialization of the protocol, and follows the same pattern as the _args attribute of physicsnemo.Module.
- Type:
dict[str, Any]
- attach(self, other: object) -> None[source]#
This method is used to attach the current object to another, allowing the protocol to access the attached object’s scope. The use case for this is to allow a protocol access to the driver’s scope to access dataset, model, etc. as needed. This needs to be implemented by concrete implementations.
- is_attached: bool
Whether the current object is attached to another object. This is left abstract, as it depends on how attach() is implemented.
- logger: Logger
The logger for this protocol. This is used to log information about the protocol’s progress.
- _setup_logger(self) -> None[source]#
This method is used to set up the logger for the protocol. The default implementation is to configure the logger similarly to how physicsnemo loggers are configured.
See also
QueryStrategy: Query strategy protocol (child)
LabelStrategy: Label strategy protocol (child)
MetrologyStrategy: Metrology strategy protocol (child)
DriverProtocol: Main orchestrator that uses these protocols
- attach(other: object) -> None[source]#
This method is used to attach another object to the current protocol, allowing the attached object to access the scope of this protocol. The primary reason for this is to allow the protocol to access things like the dataset, the learner model, etc. as needed.
Example use cases would be for a query strategy to access the unlabeled_pool; for a metrology strategy to access the validation_pool; and for any strategy to be able to access the surrogate/learner model. This method can be as simple as setting self.driver = other, but is left abstract in case there are other potential use cases where multiple protocols could share information.
- Parameters:
other (object) – The object to attach to.
- property checkpoint_dir: Path#
Utility property for strategies to conveniently access the checkpoint directory.
This is useful for (de)serializing data tied to checkpointing.
- Returns:
The checkpoint directory, which includes the active learning step index.
- Return type:
Path
- Raises:
RuntimeError – If the strategy is not attached to a driver yet.
- property is_attached: bool#
Property to check if the current object is already attached.
This is left abstract, as it depends on how attach is implemented.
- Returns:
True if the current object is attached, False otherwise.
- Return type:
bool
- property logger: Logger#
Property to access the logger for this protocol.
If the logger has not been configured yet, the property will call the _setup_logger method to configure it.
- Returns:
The logger for this protocol.
- Return type:
Logger
- property strategy_dir: Path#
Returns the directory that the underlying strategy can use to persist data.
Depending on the strategy abstraction, further nesting may be required (e.g. active learning step index, phase, etc.).
- Returns:
The directory where the strategy will persist its records.
- Return type:
Path
- Raises:
RuntimeError – If the strategy is not attached to a driver yet.
Strategies then branch from this base protocol into specialized protocols for querying, labeling, and metrology.
Model Weight Updates#
These three strategies are functionally similar, and so their inheritance is relatively straightforward. When it comes to model training and inference, however, we abstract out the logic into more atomic protocols:
physicsnemo.active_learning.protocols.TrainingProtocol: Defines the per-step logic for training a model. Effectively, a function that accepts a model and some data, and returns a loss tensor that is backward-ready.
physicsnemo.active_learning.protocols.ValidationProtocol: Defines the per-step logic for validating a model. Mirrors the training protocol, but does not expect a loss tensor to be returned.
physicsnemo.active_learning.protocols.InferenceProtocol: Defines the per-step logic for model inference outside of training and validation. This may involve additional operations that are not necessary for training or validation, e.g. disabling gradient computation and not expecting ground truth values.
The first two protocols are then composed together to form a training loop;
the physicsnemo.active_learning.protocols.TrainingLoop protocol.
This protocol defines a functional interface for the epoch-level logic
for model training and fine-tuning. However, because the typical training
loop is relatively generalized, we provide a default implementation in the
form of physicsnemo.active_learning.loop.DefaultTrainingLoop that
implements the TrainingLoop protocol—the end-user only needs to provide
the per-step functions (i.e. concrete TrainingProtocol and ValidationProtocol
implementations) to use with the default training loop.
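To make this concrete, the sketch below shows what per-step functions might look like for a simple regression learner. The exact call signatures are defined by the TrainingProtocol and ValidationProtocol classes; the batch field names (coords, target) and the loss choices here are illustrative assumptions rather than part of the PhysicsNeMo API. Concrete implementations along these lines are what you would hand to DefaultTrainingLoop instead of writing an epoch loop yourself.
import torch
import torch.nn.functional as F


def train_step(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> torch.Tensor:
    """Per-step training logic: return a loss tensor that is backward-ready."""
    prediction = model(batch["coords"])
    return F.mse_loss(prediction, batch["target"])


def validation_step(model: torch.nn.Module, batch: dict[str, torch.Tensor]) -> None:
    """Per-step validation logic: compute metrics, but do not return a loss."""
    with torch.no_grad():
        prediction = model(batch["coords"])
        error = F.l1_loss(prediction, batch["target"])
    # Accumulate or log ``error`` however your workflow prefers.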
Interphase communication#
There are two primary mechanisms for interphase communication: through the use
of queues, and by “attaching” a strategy to the active learning driver. The former
is used primarily to pass data between querying and labeling strategies, i.e. the
querying phase enqueues data for the labeling process to consume, and the labeled data
is subsequently enqueued to be appended to the training pool. To this end, we expect a highly generic
queue interface described by
physicsnemo.active_learning.protocols.AbstractQueue; nominally this
just needs to be a first-in-first-out (FIFO) queue, but generality is important
to allow multiprocessing (e.g. torch.multiprocessing) or even more advanced
queues for multi-node and/or asynchronous workflows. In the simplest case, a
queue.Queue will suffice.
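For example, a single-process workflow can rely entirely on the standard library; the sketch below shows the plain queue and, as a comment, how a multiprocessing-capable queue could be swapped in under the same FIFO interface.
from queue import Queue

# A plain FIFO queue satisfies the minimal put()/get() interface for single-process runs.
query_queue: Queue = Queue()

# For labeling in worker processes, a queue with the same FIFO semantics could be
# swapped in, e.g.:
#   import torch.multiprocessing as mp
#   query_queue = mp.Queue()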
The other mechanism operates by attaching a strategy to the active learning driver, which allows the two classes to share scope. This is useful for sharing state, as well as accessing attributes like data pools, models, etc. as needed. This will be further discussed later.
Active Learning Driver#
Once concrete implementations of the strategies outlined above are defined,
all of the necessary components are nominally available to construct an
active learning workflow. Orchestration of the workflow is defined by the
physicsnemo.active_learning.protocols.DriverProtocol class,
which primarily defines the attributes expected for active learning. Since
there is a significant amount of boilerplate, we also provide a default
implementation in physicsnemo.active_learning.driver.Driver,
which implements the DriverProtocol and can be used out of the box
with flexibility for customization as required.
The Driver class serves as the central orchestrator for the active learning
process, managing the execution of all phases and coordinating communication
between components. At its core, the driver maintains references to:
Configuration objects: DriverConfig for infrastructure settings (batch size, logging, distributed training), StrategiesConfig for active learning strategies, and TrainingConfig for training components.
The learner model: Either a physicsnemo.Module or any model implementing the protocol physicsnemo.active_learning.protocols.LearnerProtocol.
Data pools: Training, validation, and unlabeled data pools used throughout the workflow.
Queues: FIFO queues for passing data between query and labeling phases.
The driver executes active learning iterations through the active_learning_step
method, which orchestrates the four phases in sequence: training, metrology, query,
and labeling. Each phase can be selectively enabled or disabled through configuration
flags (skip_training, skip_metrology, skip_labeling), allowing you to
customize the workflow for your specific use case. For example, you might disable
training to perform pure inference and querying, or disable labeling for exploratory
analysis.
Key features of the driver include:
Distributed training support: Automatic integration with PhysicsNeMo’s DistributedManager for multi-GPU and multi-node training. The driver handles model wrapping with DistributedDataParallel, distributed sampling, and synchronization barriers between phases.
Checkpointing: Comprehensive checkpoint management that saves active learning state (configurations, queues, step index, phase), model weights, and training state (optimizer, scheduler). Checkpoints can be saved at configurable intervals and allow seamless resumption of experiments from any phase.
Flexible training: Support for both initial training and fine-tuning with separate epoch limits and learning rates. The driver can optionally reset optimizer states between active learning iterations for consistent fine-tuning.
Logging: Specialized logging that automatically includes active learning context (step index, phase) in all log messages for better observability.
The driver provides both run and active_learning_step methods. Use run
to execute the complete active learning loop until the maximum number of steps is
reached, or call active_learning_step directly for more granular control over
individual iterations. The driver can also be called directly (driver()) as
syntactic sugar for driver.run().
For checkpoint resumption, use the load_checkpoint class method to reconstruct
a driver instance from a saved checkpoint directory. This method handles loading
all configurations, model weights, optimizer state, and queue contents, allowing
you to resume experiments exactly where they left off.
Configuration of the Driver behavior is handled by a set of config schemas
implemented as dataclasses in physicsnemo.active_learning.config, such as
DriverConfig, StrategiesConfig, and TrainingConfig.
These config schemas are passed into the Driver constructor to change aspects
such as the batch size, number of active learning steps, how optimization is
performed (with what optimizer and learning rate scheduler), checkpointing,
and so on.
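A skeletal assembly might look like the sketch below. The ellipses stand in for the actual dataclass fields and constructor arguments, which are defined by the config schemas and the Driver class itself; consult physicsnemo.active_learning.config and the bundled examples for the exact names.
from physicsnemo.active_learning.config import (
    DriverConfig,
    StrategiesConfig,
    TrainingConfig,
)
from physicsnemo.active_learning.driver import Driver

# The placeholders (...) below stand for the real dataclass fields and arguments.
driver_config = DriverConfig(...)          # infrastructure: batch size, logging, checkpointing
strategies_config = StrategiesConfig(...)  # query/label/metrology strategy instances
training_config = TrainingConfig(...)      # optimizer, scheduler, epoch limits

driver = Driver(...)  # learner model, data pools, queues, and the configs above

driver.run()                     # run the full loop, or
# driver.active_learning_step()  # execute a single iteration for finer control
# driver = Driver.load_checkpoint(...)  # resume from a saved checkpoint directory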
Example Usage#
We have curated a number of examples to both motivate the use of active learning
as well as to demonstrate how to construct and execute workflows with PhysicsNeMo.
Examples can be found in physicsnemo/examples/active_learning, and we will
cover the high-level concepts here.
Two Moons Classification#
This is a classic data science problem, and is useful from a pedagogical perspective
to motivate and demonstrate the use and analysis of active learning; the code for
this example can be found in physicsnemo/examples/active_learning/moons. The
context behind this example is binary classification of points in 2D space that
come from two distinct distributions—i.e. the “two moons”. The active learning
use case here is somewhat artificial, but serves to communicate the basic concepts,
the workflow, and how to understand the results.
The high level description of the code contained in the example is as follows:
We define a simple MLP classifier that takes in 2D coordinates and outputs logits for a binary classification task. The model subclasses physicsnemo.Module to take advantage of checkpointing.
We define a query strategy that selects points that are most uncertain, i.e. have predictions closest to 0.5. This is implemented by the ClassifierUQQuery class in moon_strategies.py.
We define a labeling strategy that labels the points selected by the query strategy. This is implemented by the DummyLabelStrategy class in moon_strategies.py, and is so called because there is no real “labeling” process required: the points are already labeled, and the strategy simply “releases” them to the training pool.
We define a metrology strategy that measures the precision, recall, and F1 score of the model. This simple example does not perfectly justify metrology, but it serves to show the phase being used to track active learning progress. The F1Metrology strategy is implemented in moon_strategies.py.
The script to run is moon_example.py, and shows how all of the components are put together to form the full workflow.
To execute the example, run the following after installing PhysicsNeMo:
python moon_example.py
This will create an active_learning_logs/<run_id>/ directory containing:
Model checkpoints: .mdlus files saved according to checkpoint_interval
Driver logs: driver_log.json tracking the active learning process
Metrology outputs: f1_metrology.json with precision/recall/F1 scores over iterations
The main thing to monitor in this experiment is the f1_metrology.json output,
which is a product of the F1Metrology strategy: here, we compute precision/recall/F1 values, which can then be plotted against the number of active learning steps
to show how the model performance improves as more data points are added to the
training set—the F1 score should improve as the model nominally balances between
precision and recall. We encourage the reader to alter the configuration to see how
results change, as well as to implement a random sampling query strategy to use as a
baseline comparison; a minimal sketch of such a baseline is given after the note below.
Note
Given that the problem is simple, the uncertainty-based query strategy may not improve substantially over random sampling as it may simply require more data points, as opposed to maximum information content per data sample. Active learning is still a highly experimental field, and success depends heavily on problem scope, data availability, model capacity/learning dynamics, and so on.
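As a starting point for that baseline comparison, the sketch below shows a random sampling query strategy following the same structural pattern described in the Querying Strategies section further down; it assumes the unlabeled pool supports len() and integer indexing.
import random

from physicsnemo.active_learning.protocols import AbstractQueue, QueryStrategy


class RandomQueryStrategy(QueryStrategy):
    """Baseline strategy: query a uniformly random subset of the unlabeled pool."""

    __protocol_name__ = "RandomQuery"

    def __init__(self, max_samples: int, seed: int = 42):
        self.max_samples = max_samples
        self.rng = random.Random(seed)
        self.driver = None

    def attach(self, driver):
        self.driver = driver

    @property
    def is_attached(self) -> bool:
        return self.driver is not None

    def sample(self, query_queue: AbstractQueue, *args, **kwargs):
        pool = self.driver.unlabeled_pool
        num_to_draw = min(self.max_samples, len(pool))
        # Draw indices without replacement and enqueue the corresponding samples.
        for idx in self.rng.sample(range(len(pool)), k=num_to_draw):
            query_queue.put(pool[idx])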
Customizing Active Learning#
Having shown some concrete examples, we now discuss how end-users can develop their own active learning workflows beyond composing the provided examples. Aside from computation/execution, we anticipate the two most relevant components to be querying and labeling strategies.
See also
API Documentation: PhysicsNeMo Active Learning
Querying Strategies#
A query strategy defines the heuristics for selecting which data points from the
unlabeled pool should be labeled next. The goal is to identify samples that will
maximally improve model performance when added to the training set. PhysicsNeMo
provides the physicsnemo.active_learning.protocols.QueryStrategy protocol
to standardize this interface.
To implement a custom query strategy, your class should provide:
A max_samples attribute specifying how many samples to query per iteration
A sample method that selects data points and enqueues them for labeling
The attach method to access the driver’s scope (model, unlabeled pool, etc.)
An is_attached property to verify attachment status
The core logic resides in the sample method, which receives a queue to populate
with selected data points. This method typically:
Accesses the unlabeled pool via the attached driver (self.driver.unlabeled_pool)
Performs inference on unlabeled samples using the current model
Computes selection criteria (uncertainty, residuals, diversity, etc.)
Selects the top max_samples candidates based on the criteria
Enqueues selected samples to the query_queue
Here’s a minimal example implementing an uncertainty-based query strategy:
from physicsnemo.active_learning.protocols import QueryStrategy, AbstractQueue
import torch
from torch.utils.data import DataLoader
class UncertaintyQueryStrategy(QueryStrategy):
"""Select samples where model predictions are most uncertain."""
__protocol_name__ = "UncertaintyQuery"
def __init__(self, max_samples: int, batch_size: int = 32):
self.max_samples = max_samples
self.batch_size = batch_size
self.driver = None
def attach(self, driver):
"""Attach to the active learning driver."""
# this will be called by ``Driver.attach_strategies``
self.driver = driver
@property
def is_attached(self) -> bool:
"""Check if attached to a driver."""
# this is mainly used for exception handling
return self.driver is not None
def sample(self, query_queue: AbstractQueue, *args, **kwargs):
"""Select most uncertain samples from unlabeled pool."""
# Create dataloader for unlabeled pool; this isn't necessarily
# best practice, but transparent for this example
unlabeled_loader = DataLoader(
self.driver.unlabeled_pool,
batch_size=self.batch_size,
shuffle=False
)
# Compute uncertainty scores for all unlabeled samples
uncertainties = []
indices = []
self.driver.learner.eval()
# this model uses entropy from classification, but it can also be
# based on query-by-committee or other forms of uncertainty
# quantification
with torch.no_grad():
for idx, batch in enumerate(unlabeled_loader):
# Run inference
predictions = self.driver.learner(batch)
# Compute uncertainty (e.g., entropy for classification)
probs = torch.softmax(predictions, dim=-1)
entropy = -torch.sum(probs * torch.log(probs + 1e-10), dim=-1)
uncertainties.append(entropy)
indices.extend(range(idx * self.batch_size,
idx * self.batch_size + len(batch)))
# Concatenate all uncertainties
all_uncertainties = torch.cat(uncertainties)
# Select top-k most uncertain samples
top_k_indices = torch.topk(all_uncertainties,
k=min(self.max_samples, len(all_uncertainties))).indices
# Enqueue selected samples
for idx in top_k_indices:
sample = self.driver.unlabeled_pool[indices[idx]]
query_queue.put(sample)
self.logger.info(f"Queried {len(top_k_indices)} samples for labeling")
This example demonstrates the key components: accessing the driver’s resources, performing inference to compute selection criteria, and enqueuing the most informative samples. The strategy can be as simple or sophisticated as needed—physics-informed residuals, ensemble disagreement, and diversity-based sampling are all valid approaches that follow this same pattern.
Tip
The Driver abstraction is designed to facilitate a list of query strategies,
with the idea that for production use-cases, multiple heuristics could be combined
to increase sampling efficiency. The recommended approach is to modularize query
strategies into separate query classes, and combine them within Driver as
needed.
Labeling Strategies#
A label strategy defines how to obtain ground truth labels for data points selected by query strategies. While query strategies identify which samples need labels, label strategies define how to generate those labels. In scientific machine learning applications, this often involves calling external solvers, simulators, or computational workflows that may be expensive and time-consuming.
PhysicsNeMo provides the physicsnemo.active_learning.protocols.LabelStrategy
protocol to standardize this interface. To implement a custom label strategy, your class
should provide:
A label method that consumes queued samples and produces labeled data
The attach method to access the driver’s scope if needed
An is_attached property to verify attachment status
A __is_external_process__ attribute indicating whether external processes are used
A __provides_fields__ attribute indicating which fields of a data structure the label strategy provides
The core logic resides in the label method, which receives two queues: one
containing data to be labeled (populated by query strategies), and another for
enqueuing newly labeled data back to the training pool. This method typically:
Dequeues samples from the queue_to_label
Processes each sample to obtain ground truth labels (via simulation, solver, etc.)
Enqueues labeled samples to the serialize_queue for integration into training data
Tip
When calling external processes, consider implementing retry logic for transient failures, or some way to gracefully handle fail states so as not to interrupt the automated active learning workflow. You can consider serializing these samples for manual retrying.
Here’s an example implementing a label strategy that calls an external CFD solver, and packs the result into a dictionary:
import subprocess
import json
from pathlib import Path
from physicsnemo.active_learning.protocols import LabelStrategy, AbstractQueue
class CFDSolverLabelStrategy(LabelStrategy):
"""Label samples by running an external CFD solver."""
__protocol_name__ = "CFDSolverLabel"
__is_external_process__ = True
__provides_fields__ = {"pressure", "velocity"}
def __init__(
self,
solver_executable: str,
working_dir: Path,
timeout: int = 3600,
):
"""
Initialize the CFD solver label strategy.
Parameters
----------
solver_executable : str
Path to the external solver executable
working_dir : Path
Directory for solver input/output files
timeout : int
Maximum time (seconds) to wait for solver completion
"""
self.solver_executable = solver_executable
self.working_dir = Path(working_dir)
self.timeout = timeout
self.driver = None
# Ensure working directory exists
self.working_dir.mkdir(parents=True, exist_ok=True)
def label(
self,
queue_to_label: AbstractQueue,
serialize_queue: AbstractQueue,
) -> None:
"""
Label queued samples by running external CFD solver.
For each sample in the queue, this method:
1. Writes solver input files with sample parameters
2. Invokes the external solver via subprocess
3. Parses solver output to extract labels
4. Enqueues labeled data for training integration
Parameters
----------
queue_to_label : AbstractQueue
Queue containing unlabeled samples from query strategies
serialize_queue : AbstractQueue
Queue for enqueuing labeled samples
"""
sample_idx = 0
while not queue_to_label.empty():
sample = queue_to_label.get()
# Create unique directory for this sample
sample_dir = self.working_dir / f"sample_{sample_idx}"
sample_dir.mkdir(exist_ok=True)
# Write solver input files (format depends on your solver)
input_file = sample_dir / "input.json"
with open(input_file, "w") as f:
json.dump(sample, f)
# Invoke external solver
try:
result = subprocess.run(
[
self.solver_executable,
"--input", str(input_file),
"--output", str(sample_dir / "output.json"),
],
cwd=sample_dir,
timeout=self.timeout,
capture_output=True,
text=True,
check=True,
)
self.logger.info(
f"Solver completed for sample {sample_idx} "
f"(return code: {result.returncode})"
)
except subprocess.TimeoutExpired:
self.logger.error(
f"Solver timeout for sample {sample_idx} "
f"after {self.timeout}s"
)
continue
except subprocess.CalledProcessError as e:
self.logger.error(
f"Solver failed for sample {sample_idx}: {e.stderr}"
)
continue
# Parse solver output and create labeled sample
output_file = sample_dir / "output.json"
if output_file.exists():
with open(output_file) as f:
solver_output = json.load(f)
# Combine input sample with solver-generated labels
labeled_sample = {
**sample,
"pressure": solver_output["pressure"],
"velocity": solver_output["velocity"],
}
# Enqueue for integration into training pool
serialize_queue.put(labeled_sample)
sample_idx += 1
def attach(self, driver) -> None:
"""Attach to the active learning driver."""
self.driver = driver
@property
def is_attached(self) -> bool:
"""Check if attached to a driver."""
return self.driver is not None
This example makes the following assumptions, which may not hold for your case:
The solver executable is on the PATH and is invoked relatively simply with a few command line arguments. If a Python API is available, it is recommended to use API bindings instead of subprocess.
The solver output is a JSON file that can be parsed easily. We recommend modularizing the parsing logic into a separate function or class.
Warning
For computationally expensive solvers, consider parallelizing the labeling process
across multiple workers or compute nodes. While the queue-based interface is designed
so that labeling could be parallelized, currently the default Driver
implementation does not support parallel or asynchronous labeling and we welcome
users to recommend use-cases for consideration.
Practical Considerations for Scientific ML#
This section discusses practical considerations when developing active learning workflows for scientific computing applications.
Baseline Workflow Development#
When developing a new active learning workflow, establishing a baseline is critical for validating infrastructure correctness before introducing complexity. A common pattern is to use pre-computed datasets with a pass-through label strategy:
from queue import Queue

from physicsnemo.active_learning.protocols import LabelStrategy


class DummyLabelStrategy(LabelStrategy):
    """Pass-through labeling for pre-computed datasets."""
__is_external_process__ = False
def label(self, queue_to_label: Queue, serialize_queue: Queue) -> None:
"""Transfer samples from query queue to serialize queue."""
while not queue_to_label.empty():
sample = queue_to_label.get()
serialize_queue.put(sample)
This approach enables validation of the active learning loop, checkpointing, and data management without the complexity of external solver integration.
Query Strategy Selection#
The choice of query strategy involves trade-offs between computational resources, querying speed, and domain knowledge requirements. A few common approaches are listed below:
Random Sampling
Random query strategies serve as essential baselines for comparing more sophisticated approaches. In many scenarios, particularly during early training phases when models are learning basic patterns, random sampling can perform comparably to more complex methods while requiring minimal computational overhead.
Uncertainty Quantification
Ensemble methods quantify model uncertainty by measuring prediction variance. The model can be designed to output prediction uncertainties, or a deep ensemble (a committee of models) can be trained to obtain ensemble predictions. This approach offers:
Uncertainty estimates useful beyond active learning
If training a deep ensemble, training cost that scales linearly with ensemble size
Note
Selecting only the highest uncertainty samples may lead to biased training sets. Consider hybrid strategies that combine uncertainty-based selection with random sampling (e.g., 60% high uncertainty, 40% random) to maintain distribution coverage.
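A minimal sketch of such a hybrid selection is shown below; the function name, the 60/40 split, and the assumption that uncertainty scores are available as a 1D tensor are all illustrative.
import torch


def hybrid_selection(
    uncertainties: torch.Tensor,
    max_samples: int,
    uncertain_fraction: float = 0.6,
) -> torch.Tensor:
    """Return indices mixing top-uncertainty picks with uniformly random picks."""
    n_uncertain = int(max_samples * uncertain_fraction)
    n_random = max_samples - n_uncertain
    # Highest-uncertainty candidates.
    top = torch.topk(uncertainties, k=n_uncertain).indices
    # Random candidates drawn from the remainder of the pool.
    remaining = torch.ones_like(uncertainties, dtype=torch.bool)
    remaining[top] = False
    candidates = torch.nonzero(remaining).flatten()
    random_picks = candidates[torch.randperm(len(candidates))[:n_random]]
    return torch.cat([top, random_picks])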
Physics Residuals
Physics-based query strategies select samples where model predictions exhibit the highest violation of governing equations. This approach offers:
Explicit identification of physics-violating samples
Domain knowledge incorporation into sample selection
The physics-informed approach is suitable when memory constraints limit deep ensemble training and when mesh data and residual computation infrastructure are available.
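As an illustration, the sketch below scores a candidate set of collocation points by the continuity (incompressibility) residual of a 2D velocity prediction; the assumed model input/output shapes and the choice of governing equation are problem-specific assumptions.
import torch


def continuity_residual_score(model: torch.nn.Module, coords: torch.Tensor) -> torch.Tensor:
    """Mean |div u| of the predicted velocity field at the given 2D points.

    Assumes ``model`` maps (N, 2) coordinates to (N, 2) velocity components (u, v).
    """
    coords = coords.clone().requires_grad_(True)
    velocity = model(coords)
    diagonal_grads = []
    for component in range(velocity.shape[-1]):
        grad = torch.autograd.grad(
            velocity[:, component].sum(), coords, retain_graph=True
        )[0]
        # Keep du/dx for the u component and dv/dy for the v component.
        diagonal_grads.append(grad[:, component])
    divergence = torch.stack(diagonal_grads, dim=-1).sum(dim=-1)
    # Larger violation of div u = 0 means higher query priority.
    return divergence.abs().mean()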
Solver Integration for Labeling#
Production active learning workflows typically require integration with external solvers to generate labels for queried samples. A systematic integration approach involves several stages:
Development Workflow
Initial development should use pre-computed datasets with the pass-through labeling strategy described earlier. This validates query strategies and training infrastructure before introducing solver integration complexity.
Directory Management
Solver integrations should use isolated directories for each sample to prevent conflicts and enable parallel execution in future extensions:
class SolverLabelStrategy(LabelStrategy):
    def __init__(self, solver_path: str, working_dir: Path):
        self.solver_path = solver_path
        self.working_dir = Path(working_dir)
        self.working_dir.mkdir(parents=True, exist_ok=True)

    def label(self, queue_to_label: Queue, serialize_queue: Queue) -> None:
        sample_id = 0
        while not queue_to_label.empty():
            sample = queue_to_label.get()
            # Create isolated directory for this sample
            sample_dir = self.working_dir / f"sample_{sample_id:06d}"
            sample_dir.mkdir(exist_ok=True)
            # Run solver and parse results
            labeled_data = self._run_solver(sample, sample_dir)
            serialize_queue.put(labeled_data)
            sample_id += 1
Solver Parameterization
Solver configuration is typically passed via command-line arguments, though the specific interface varies by solver. A representative pattern for subprocess-based invocation:
def _run_solver(self, sample: dict, sample_dir: Path) -> dict:
"""Execute solver and return labeled data."""
# Serialize sample parameters to solver input format
input_file = sample_dir / "input.json"
with open(input_file, "w") as f:
json.dump({
"geometry": sample["geometry_params"],
"flow_conditions": sample["boundary_conditions"],
}, f)
# Construct solver command with parameterized arguments
output_file = sample_dir / "output.json"
cmd = [
self.solver_path,
"--input", str(input_file),
"--output", str(output_file),
"--mesh-size", "fine",
"--convergence-tol", "1e-6",
]
# Execute solver with timeout
result = subprocess.run(
cmd,
cwd=sample_dir,
capture_output=True,
text=True,
timeout=3600,
)
if result.returncode != 0:
self.logger.error(f"Solver failed: {result.stderr}")
raise RuntimeError("Solver execution failed")
# Parse and structure solver output
with open(output_file) as f:
solver_results = json.load(f)
return {
**sample,
"pressure": solver_results["p"],
"velocity": solver_results["U"],
}
The command-line arguments (--input, --output, solver-specific flags) should
be adapted to match the specific solver’s interface.
A good example is the AirFRANS dataset (Extrality/NACA_simulation), where the OpenFOAM configuration is wrapped with Python.
Production Deployment Considerations#
Several operational aspects become important when deploying active learning workflows in production environments.
Monitoring and Metrics
Comprehensive monitoring extends beyond validation loss to include:
Per-field error metrics (e.g., separate errors for velocity, pressure, temperature)
Sample efficiency curves relating performance to dataset size
Query strategy effectiveness indicators
Solver performance statistics (execution time, convergence behavior, failure rates)
These metrics enable informed decisions about workflow termination and strategy effectiveness.
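For example, per-field errors could be computed inside a metrology strategy and appended to its records; the sketch below assumes predictions and targets are dictionaries keyed by field name, which is an illustrative convention rather than a PhysicsNeMo requirement.
import torch


def per_field_relative_l2(
    prediction: dict[str, torch.Tensor], target: dict[str, torch.Tensor]
) -> dict[str, float]:
    """Relative L2 error per output field (e.g. velocity, pressure, temperature)."""
    errors = {}
    for field, true_value in target.items():
        difference = torch.linalg.vector_norm(prediction[field] - true_value)
        scale = torch.linalg.vector_norm(true_value).clamp_min(1e-12)
        errors[field] = (difference / scale).item()
    return errors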
Learning Rate Schedules
Active learning requires distinct learning rate schedules for initial training and subsequent fine-tuning phases. A representative configuration:
Initial training (step 0): CosineAnnealingLR from 1e-3 to 1e-6 over 100 epochs
Fine-tuning (steps 1+): ExponentialLR from 5e-4 to 5e-6 over 10 epochs
The lower initial learning rate and shorter schedule for fine-tuning help prevent catastrophic forgetting of previously learned patterns.
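In PyTorch terms, the schedules above could be constructed as in the sketch below; how they are then wired into TrainingConfig depends on the config schema, and the placeholder model and optimizer choice are illustrative.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, ExponentialLR

model = torch.nn.Linear(2, 2)  # placeholder model for illustration

# Initial training (step 0): cosine decay from 1e-3 to 1e-6 over 100 epochs.
initial_optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
initial_scheduler = CosineAnnealingLR(initial_optimizer, T_max=100, eta_min=1e-6)

# Fine-tuning (steps 1+): exponential decay from 5e-4 to 5e-6 over 10 epochs,
# i.e. gamma = (5e-6 / 5e-4) ** (1 / 10) ~= 0.63.
finetune_optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
finetune_scheduler = ExponentialLR(finetune_optimizer, gamma=(5e-6 / 5e-4) ** (1 / 10))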
Termination Criteria
Active learning workflows should terminate when additional data provides diminishing returns. Common termination indicators include:
Validation error plateaus over consecutive iterations
Marginal performance improvements no longer justify solver costs
Exhaustion of unlabeled data pool or computational budget
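A simple plateau check, evaluated after each active learning step, can encode the first of these indicators; the patience and tolerance values below are illustrative.
def has_plateaued(
    validation_errors: list[float], patience: int = 3, min_delta: float = 1e-4
) -> bool:
    """True if the best validation error has not improved by ``min_delta``
    within the last ``patience`` active learning steps."""
    if len(validation_errors) <= patience:
        return False
    best_before = min(validation_errors[:-patience])
    best_recent = min(validation_errors[-patience:])
    return (best_before - best_recent) < min_delta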