# Advanced Schemes and Tools¶

## Adaptive Activation Functions¶

In training of neural networks, in addition to leaning the linear transformations, one can also learn the nonlinear transformations to potentially improve the convergence as well as the accuracy of the model by using the global adaptive activation functions proposed by 3. Global adaptive activations consist of a single trainable parameter that is multiplied by the input to the activations in order to modify the slope of activations. Therefore, a nonlinear transformation at layer \(\ell\) will take the following form

where \(\mathcal{N}^\ell\) is the nonlinear transformation at layer \(\ell\), \(H^{\ell-1}\) is the output of the hidden layer \(\ell-1\), \(\theta\) is the set of model weights and biases, \(a\) is the global adaptive activation parameter, \(\sigma\) is the activation function, and \(\mathcal{L}^{\ell}\) is the linear transformation at layer \(\ell\). Similar to the network weights and biases, the global adaptive activation parameter \(a\) is also a trainable parameter, and these trainable parameters are optimized by

## Self-scalable tanh (Stan) activation function¶

As a variant of adaptive activation functions, 15 proposed self-scalable tanh (Stan) activation function, which allows normal flow of gradients to compute the required derivatives and also enable systematic scaling of the input-output mapping. Specifically, for \(i\) th neuron in \(k\) th layer ( \(k = 1, 2, \dots , D − 1, i = 1, 2, \dots, N_k\) ), Stan is defined as

where \(\beta_k^i\) is the neuron-wise trainable parameter initialized by 1.

An example for Stan activation is provided by `examples/helmholtz/helmholtz_stan.py`

.
As shown in Fig. 14, one can see that Stan activation
yields faster convergence and better validation accuracy.

## Sobolev (Gradient Enhanced) Training¶

Sobolev or gradient-enhanced training of physics-informed neural networks, proposed in
Son et al. 5 and Yu et al. 6, leverages the derivative information of the PDE
residuals in the training of a neural network solver. Specifically, in standard training of the neural network solvers,
we enforce a proper norm of the PDE residual to be zero, while in Sobolev or gradient-enhanced training, one can
take the first or even higher-order derivatives of the PDE residuals w.r.t. the spatial inputs and set a proper norm
of those residual
derivatives to zero as well. It has been reported in the reference papers 5 6 that
Sobolev or gradient-enhanced training can potentially give better training accuracies when compared to the standard
training of a neural network solver. However, care must be taken
when choosing the relative weights of the additional loss terms with respect to the standard PDE residual and
boundary condition loss terms or otherwise, Sobolev or gradient-enhanced training may even adversely affect the
training convergence and accuracy of a neural network solver. Additionally, Sobolev or gradient-enhanced training
increases the training time as the differentiation order will be increased and thus extra backpropagation
will be required. An example for Sobolev or gradient-enhanced training for navier-stokes equations
can be found at `examples/annular_ring/annular_ring_gradient_enhanced/annular_ring_gradient_enhanced.py`

## Importance Sampling¶

Suppose our problem is to find the optimal parameters \(\mathbf{\theta^*}\) such that the Monte Carlo approximation of the integral loss is minimized

where \(f\) is a uniform probability density function. In importance sampling, the sampling points are drawn from an alternative sampling distribution \(q\) such that the estimation variance of the integral loss is reduced, that is

Modulus offers point cloud importance sampling for improved convergence
and accuracy, as originally proposed in
12. In this scheme, the training
points are updated adaptively based on a sampling measure \(q\) for
a more accurate unbiased approximation of the loss, compared to uniform
sampling. Details on the importance sampling implementation in Modulus
are presented in `examples/ldc/ldc_2d_importance_sampling.py`

script.
Fig. 15 shows a comparison between
the uniform and importance sampling validation error results for the
annular ring example, showing better accuracy when importance sampling
is used. Here in this example, the training points are sampled according
to a distribution proportional to the 2-norm of the velocity
derivatives. The sampling probability computed at iteration 100K is also
shown in Fig. 16.

## Quasi-Random Sampling¶

Training points in Modulus are generated according to a uniform
distribution by default. An alternative to uniform sampling is the
quasi-random sampling, which provides the means to generate training
points with a low level of discrepancy across the domain. Among the
popular low discrepancy sequences are the Halton sequences
13, the Sobol sequences, and the
Hammersley sets, out of which the Halton sequences are adopted in
Modulus. A snapshot of a batch of training points generated using uniform
sampling and Halton sequences for the annular ring example is shown in
Fig. 17. Halton
sequences for sample generation can be enabled by setting `quasirandom=True`

during the
constraint definition.
A case study on the use of Halton
sequences to solve a conjugate heat transfer example is also presented
in tutorial FPGA Heat Sink with Laminar Flow.

## Exact Boundary Condition Imposition¶

The standard neural network solvers impose boundary conditions in a soft form, by incorporating boundary conditions as constraints in form of additional loss terms in the loss function. In this form, the boundary conditions are not exactly satisfied. The work 4 introduced a new approach to exactly impose the boundary conditions in neural network solvers. For this, they introduced a geometry aware solution ansatz for the neural network solver that consists of an Approximate Distance Function (ADF) \(\phi (\mathbf{x})\) to the boundaries of the domain using the theory of R-functions. First, we will look into how this ADF is computed, and next, we will discuss the formation of the solution ansatz based on the type of the boundary conditions. 11

Let \(D \subset \mathbb{R}^d\) denote the computational domain with boundary \(\partial D\). The exact distance is the shortest distance between any point \(\mathbf{x} \in \mathbb{R}^d\) to the domain boundaries \(\partial D\), and therefore, is zero on \(\partial D\). The exact distance function is not second or higher-order differentiable, and thus, one can use the ADF function \(\phi (\mathbf{x})\) instead.

The exact boundary condition imposition in Modulus is currently limited to 2D geometries only. Let \(\partial D \in \mathbb{R}^2\) be a boundary composed of \(n\) line segments and curves \(D_i\), and \(\phi_i\) denote the ADF to each curve or line segment such that \(\phi_1 \cup \phi_2 \cup ... \cup \phi_n =\phi\). The properties of an ADF function are as follows: (1) For any point \(\mathbf{x}\) on \(\partial D\), \(\phi(x)=0\), and (2) \(\phi(x)\) is normalized to the \(m\)-th order, i.e., its derivative w.r.t the unit inward normal vector is one and second to \(m\)-th order derivatives are zero for all the points on \(\partial D\).

The elementary properties of R-functions, including R-disjunction (union), R-conjunction (intersection), and R-negation, can be used for constructing a composite ADF, \(\phi (\mathbf{x})\), to the boundary \(\partial D\) , when ADFs \(\phi_i(\mathbf{x})\), to the partitions of \(\partial D\) are known. Once the ADFs, \(\phi_i(\mathbf{x})\) to all the partitions of \(\partial D\) are calculated, we can calculate the ADF to \(\partial D\) using the R-equivalence operation. When \(\partial D\) is composed of \(n\) pieces, \(\partial D_i\), then the ADF \(\phi\) that is normalized up to order \(m\) is given by

Next, we will see how the individual ADFs \(\phi_i\) for line segments and arcs are calculated. For more details, please refer to the reference paper 4. The ADF for a infinite line passing through two pints \(\mathbf{x}_1 \equiv (x_1,y_1)\) and \(\mathbf{x}_2 \equiv (x_2,y_2)\) is calculated as

where \(L\) is the distance between the two points. Similarly ADF for a circle of radius \(R\) and center located at \(\mathbf{x}_c \equiv (x_c, y_c)\) is given by

In order to calculate the ADF for line segments and arcs, one has to use trimming functions 4. Let us consider a line segment of length \(L\) with end points \(\mathbf{x}_1 \equiv (x_1,y_1)\) and \(\mathbf{x}_2 \equiv (x_2,y_2)\), midpoint \(\mathbf{x}_c=(\frac{x_1+x_2}{2},\frac{y_1+y_2}{2})\) and length \(L = ||\mathbf{x}_2-\mathbf{x}_1||\). Then ADF for the line segment \(\phi(\mathbf{x})\) can be calculated as follows.

Note that here, \(f\) is the ADF for an infinite line and \(t\) is the trimming function which is the ADF for a circle. In other words, the ADF for a line segment is obtained by trimming an infinite line by a circle. Similarly, one can obtain the ADF for an arc by using the above equations and by setting \(f\) to the circle ADF and \(t\) to the ADF for an infinite line segment.

Now that we understand how to form the ADFs for line segments and arcs, let us discuss how we can form the solution ansatz using ADFs such that the boundary conditions are exactly satisfied. For Dirichlet boundary condition, if \(u=g\) is prescribed on \(\partial D\), then the solution ansatz is given by

where \(u_{sol}\) is the approximate solution, and \(u_{net}\) is the neural network output. To see how the solution ansatz if formed for Neumann, Robin, and mixed boundary conditions, please refer to the reference paper 4.

When different inhomogeneous essential boundary conditions are imposed on distinct subsets of \(\partial D\), we can use transfinite interpolation to calculate the \(g\) function, which represents the boundary condition function for the entire boundary \(\partial D\). The transfinite interpolation function can be written as

where weights \(w_i\) add up to one, and interpolates \(g_i\) on the set \(\partial D_i\). \(\mu_i \geq 1\) is a constant controlling the nature of interpolation. \(\epsilon\) is a small number to prevent division by zero. This boundary value function, \(g(\mathbf{x})\), can be used in the solution ansatz for Dirichlet boundary conditions to calculate the final solution with the exact imposition of BC.

The exact imposition of boundary conditions as proposed in the reference paper 4, however, has certain challenges especially when solving PDEs consisting of second or higher-order derivatives. Approximate distance functions constructed using the theory of R function are not normalized at the joining points of lines and arcs, and therefore, the second and higher-order derivatives are not defined at these points, and can take extremely large values close to those points which can affect the convergence behavior of the neural network. The solution represented in the reference paper 4 is not to sample the collocation points in close proximity of these points. We found, however, that this can adversely affect the convergence and final accuracy of the solution. As an alternative. we propose to use the first order formulation of the PDEs by change of variables. By treating the first order derivatives of the quantities of interest as new variables, we can rewrite the second order PDEs as a series of first order PDEs with additional compatibility equations that appear as additional terms in the loss function. For instance, let us consider the Helmholtz equation which takes the following form

where \(k\) is the wave number and \(f\) is the source term. One can define new variables \(u_x\), \(u_y\), \(u_z\), that represent, respectively, derivatives of the solution with respect to \(x\), \(y\), and \(z\) coordinates, and rewrite the Helmholtz equation as a set of first-order equations in the following form:

Using this form, the output of the neural network will now include \(u_x\), \(u_y\), \(u_z\)
in addition to \(u\), but this in effect reduces the order of differentiation by one. As a couple of examples,
first-order implementation
of the Helmholtz and Navier-Stokes equations are available at `examples/helmholtz/pdes/helmholtz_first_order.py`

and `examples/annular_ring/annular_ring_hardBC/pdes/navier_stokes_first_order.py`

, respectively.

An advantage of using the first-order formulation of PDEs is the potential speed-up in training iterations as extra backpropagations for computing the second-order derivatives are not performed anymore. Additionally, this formulation enables the use of Automatic Mixed Precision (AMP), which is currently not suitable to be used for problems with second and higher-order derivatives. Use of AMP can further accelerate the training.

The figure below shows a comparison of interior validation accuracy between a baseline model (soft BC
imposition and second-order PDE) and a model trained with hard BC imposition and first-order
PDEs. It is evident that the hard BC approach reduces the validation accuracy by about
one order of magnitude compared to the baseline model. Additionally, the boundary validation error for
the model trained with hard BC imposition is exactly zero unlike the baseline model. These examples
are available at `examples/helmholtz`

.

Using AMP, training of the model with exact BC imposition is 25% faster compared to the training of the baseline model.

Another example for solving the Navier-Stokes equations in the first-order form and with exact BC
imposition can be found in `examples/annular_ring/annular_ring_hardBC`

. The boundary conditions
in this example consist of the following: Prescribed parabolic inlet velocity on the left wall,
zero pressure on the right wall, and no-slip BC on the top/bottom walls and the inner circle.
The figure below shows the solution for the annular ring example with hard BC imposition.

## Causal training¶

Suppose that we have a time-dependent system of PDEs taking the following general form

where \(u\) describes the unknown latent solution that is governed
by the PDE system and \(\mathcal{N}\) is a possibly nonlinear differential operator.
As demonstrated in 14, continuous-time PINNs models can violate temporal causality, and hence are
susceptible to converge towards erroneous solutions for transient problems.
*Causal training* 14 aims to address this fundamental limitation and a key source of error
by reformulating the PDE residual loss to account explicitly for physical
causality during model training. To introduce it, we split the time domain \([0, T]\)
into \(N\) chunks \(\{ [t_i, t_{i+1}] \}_{i=0}^{N-1}\) and define the PDE residual loss over the \(i\)-th chunk

with \(\{t_j, x_j\} \subset [t_{i-1}, t_i] \times \Omega\).

Then the total causal loss is given by

where

Note that \(w_i\) is inversely exponentially proportional to the magnitude of the cumulative residual loss from the previous chunks. As a consequence, \(\mathcal{L}_i(\mathbf{\theta})\) will not be minimized unless all previous residuals decrease to some small value such that \(w_i\) is large enough. This simple algorithm enforces a PINN model to learn the PDE solution gradually, respecting the inherent causal structure of its dynamic evolution.

Implementation Details on causal training in Modulus are presented in script
`examples/wave_equation/wave_1d_causal.py`

. Fig. 20 presents a comparison of the validation error between
the baseline and causal training. It can be observed that causal training yields much better predictive
accuracy up to one order of magnitude.

It is worth noting that causal training scheme can be seamlessly combined with the moving time-window and different
network architectures in Modulus. For instance, the script `examples/taylor_green/taylor_green_causal.py`

illustrates
how to combine the causal loss function with the time-marching strategy for solving a complex transient Navier-Stokes problem.

## Learning Rate Annealing¶

The predominant approach in the training of PINNs is to represent the initial/boundary constraints as additive penalty terms to the loss function. This is usually done by multiplying a parameter \(\lambda\) to each of these terms to balance out the contribution of each term to the overall loss. However, tuning these parameters manually is not straightforward, and also requires treating these parameters as constants. The idea behind the learning rate annealing, as proposed in 1, is an automated and adaptive rule for dynamic tuning of these parameters during the training. Let us assume the loss function for a steady state problem takes the following form

where the superscript \((i)\) represents the training iteration index. Then, at each training iteration, the learning rate annealing scheme 1 computes the ratio between the gradient statistics for the PDE loss term and the boundary term, as follows

Finally, the annealing parameter \(\lambda^{(i)}\) is computed using an exponential moving average as follows

where \(\alpha\) is the exponential moving average decay.

## Homoscedastic Task Uncertainty for Loss Weighting¶

In 2, the authors have proposed to use a Gaussian likelihood with homoscedastic task uncertainty as the training loss in multi-task learning applications. In this scheme, the loss function takes the following form

where \(T\) is the total number of tasks (or residual and initial/boundary condition loss terms). Minimizing this loss is equivalent to maximizing the log Gaussian likelihood with homoscedastic uncertainty 2, and the uncertainty terms \(\sigma\) serve as adaptive wrights for different loss terms. Fig. 21 presents a comparison between the uncertainty loss weighting and no loss weighting for the annular ring example, showing that uncertainty loss weighting improves the training convergence and accuracy in this example. For details on this scheme, please refer to 2.

## SoftAdapt¶

Softadapt is a simple loss balancing algorithm that dynamically tunes the loss weights throughout the training. It measures the relative training progress for each loss term by measuring the ratio of the loss value at each iteration to its value at the previous iteration, and the loss weights are determined using these relative progress measurements passed through a softmax transformation, as follows

Here, \(w_j(i)\) is the weight for the loss term \(j\) at iteration \(i\), \(L_j(i)\) is the value for the loss term \(j\) at iteration \(i\), and \(n_{loss}\) is the number of loss terms. We have observed that this softmax transformation can easily cause overflow. Thus, we modify the softadapt equation using a softmax trick, as follows

where \(\mu(i) = \max \left(L_j(i)/L_j(i-1) \right)\), and \(\epsilon\) is a small number to prevent division by zero.

## Relative Loss Balancing with Random Lookback (ReLoBRaLo)¶

Relative Loss Balancing with Random Lookback (ReLoBRaLo) 8 is a modified version of the Softadapt, which adopts a moving average for loss weights and also a random lookback mechanism. The loss weights at each iteration are calculated as follows

Here, \(w_j(i)\) is the weight for the loss term \(j\) at iteration \(i\), \(\alpha\) is the moving average parameter, \(\beta\) is a Bernoulli random variable with an expected value close to 1. \(\hat{w}_j^{(i;i')}\) takes the following form

where \(n_{loss}\) is the number of loss terms, \(L_j(i)\) is the value for the loss term \(j\) at iteration \(i\), and \(\tau\) is called temperature 8. With very large values for temperature , loss weights tend to take similar values, while a value of zero for this parameter converts the softmax to an argmax function 8. Similar to the modified version of softadapt, we modify the equation for \(\hat{w}_j^{(i;i')}\) to prevent overflow and division by zero, as follows

where \(\mu(i) = \max \left(L_j(i)/L_j(i') \right)\), and \(\epsilon\) is a small number.

## GradNorm¶

One of the most popular loss balancing algorithms in computer vision and multi-task learning is GradNorm 9. In this algorithm, an additional loss term is minimized throughout the training that encourages the gradient norms for different loss terms to take similar relative magnitudes, such that the network is trained for different loss terms at similar rates. the loss weights are dynamically tuned throughout the training based on the relative training rates of different losses, as follows

Here, \(L_{gn}\) is the GradNorm loss. \(W\) is the subset of the neural network weights that is used in GradNorm loss, which is typically the weights for the last layer of the network in order to save on training costs. \(G_w^{(j)}(i) = || \nabla_W w_j(i) L_j(i)||_2\) is the \(L_2\) norm of the gradient of the weighted loss term \(j\) with respect to the weights \(W\) at iteration math:i. \(\bar{G}_W(i)=E [G_w^{(j)}(i)]\) is the average gradient norm across all training losses at iteration \(i\). Also, \(r_j(i)=\tilde{L}_j(i)/E[\tilde{L}_j(i)]\) is the relative inverse training rate corresponding to the loss term \(j\), where \(\tilde{L}_j(i)=L_j(i)/L_j(0)\) measures the inverse training rate. \(\alpha\) is a hyperparameter that defines the strength of training rate balancing 9.

When taking the gradients of the GradNorm loss \(L_{gn}(i, w_j(i))\), the reference gradient norm \(\bar{G}_W(i) \times \left[ r_j(i) \right]^\alpha\) is treated as a constant, and the gradnorm loss is minimized by differentiating only with respect to the loss weights \(w_j\). Finally, after each training iteration, the weights \(w_j\) are normalized such that \(\Sigma_j w_j(i)=n_{loss}\), where \(n_{loss}\) is the number of loss terms excluding the GradNorm loss. For more details on the GradNorm algorithm, please refer to the reference paper 9.

In the GradNorm algorithm, it is observed that in some cases the weights \(w_j\) can take negative values and that will adversely affect the training convergence of the neural network solver. To prevent this, in the Modulus implementation of the GradNorm, we use an exponential transformation of the trainable weight parameters to weigh the loss terms.

In the reference paper, GradNorm has shown to be effectively improving the accuracy and reducing overfitting for various network architectures and in both classification and regression tasks. Here, we have observed that GradNorm can also be effective in loss balancing of neural network solvers. In particular, we have tested the performance of GradNorm on the annular ring example by assigning a very small initial weight to the momentum loss terms and keeping the other loss weights intact compared to the base case. This is to evaluate whether it can recover appropriate loss weights throughout the training by starting from this poor initial loss weighting. Validation results are shown in the figure below. The blue line shows the base case, red shows the case where momentum equation are weighted by \(1e-4\) and no loss balancing algorithm is used, and orange shows the same case but with GradNorm used for loss balancing. It is evident that failure to balance the weight of the loss terms appropriately in this test case will result in convergence failure, and that GradNorm can effectively accomplish this.

## ResNorm¶

Residual Normalization (ResNorm) is a Modulus loss balancing scheme developed in collaboration with the National Energy Technology Laboratory (NETL). In this algorithm, which is a simplified variation of GradNorm, an additional loss term is minimized during training that encourages the individual losses to take similar relative magnitudes. The loss weights are dynamically tuned throughout the training based on the relative training rates of different losses, as follows:

Here, \(L_{rn}\) is the ResNorm loss, \(L_w^{(j)}(i)=w_j(i) L_j(i)\) is the weighted loss term \(j\) at iteration \(i\), and \(\bar{L}(i)=E [L_j(i)]\) is the average loss value across all training losses at iteration \(i\). Also, \(r_j(i)=\tilde{L}_j(i)/E[\tilde{L}_j(i)]\) is the relative inverse training rate corresponding to the loss term \(j\), where \(\tilde{L}_j(i)=L_j(i)/L_j(0)\) measures the inverse training rate. \(\alpha\) is a hyperparameter that defines the strength of training rate balancing.

Similar to GradNorm, when taking the gradients of the ResNorm loss \(L_{rn}(i, w_j(i))\) with respect to the loss weights \(w_j(i)\), the term \(\bar{L}_(i) \times \left[ r_j(i) \right]^\alpha\) is treated as a constant. Finally, after each training iteration, the weights \(w_j(i)\) are normalized such that \(\Sigma_j w_j(i)=n_{loss}\), where \(n_{loss}\) is the number of loss terms excluding the ResNorm loss. Notice that unlike GradNorm, ResNorm does not require computing gradients with respect to model parameters and thus, ResNorm can be computationally more efficient compared to GradNorm. Again, similar to the implementation of GradNorm, to prevent the loss weights from taking negative values, we use an exponential transformation of the trainable weight parameters to weigh the loss terms.

We test the performance of ResNorm on the annular ring example by assigning a very small initial weight to the momentum loss terms and keeping the other loss weights intact compared to the base case. This is to evaluate whether it can recover appropriate loss weights throughout the training by starting from this poor initial loss weighting. Validation results are shown in the figure below. It is evident that ResNorm can effectively find a good balance between the loss terms and provide reasonable convergence, while the baseline case without loss balancing fails to converge.

## Neural Tangent Kernel (NTK)¶

Neural Tangent Kernel (NTK) approach can be used to automatically assign weights to different loss terms. In the NTK perspective, the weight of each loss term should be proportional to the magnitude of NTK, so that every loss term will converge uniformly. Assume the total loss \(\mathcal{L}(\boldsymbol{\theta})\) is defined by

where

are the boundary loss and PDE residual loss, respectively. And the \(r\) is the PDE residual. Let \(\mathbf{J}_r\) and \(\mathbf{J}_b\) be the Jacobian of \(\mathcal{L}_r\) and \(\mathcal{L}_b\), respectively. The the NTK of them are defined as

According to 10, the weights are given by

where \(Tr(\cdot)\) is the trace operator.

We now assign the weights by NTK. The idea of NTK is, for each loss term, its convergence rate is indicated by its eigenvalues of NTK.
So, we reweight the loss terms by their eigenvalues such that each term has basically same convergence rate. For more details, please
refer 10. In Modulus, NTK can be computed automatically and weights can be assigned on the fly. The script `examples/helmholtz/helmholtz_ntk.py`

shows the NTK implementation
for a helmholtz problem. The Fig. 24 shows the results before NTK weighting. We observe that the maximum error is 0.04. Using NTK weights, this error is reduced to 0.006 as shown
in the Fig. 25.

## Selective Equations Term Suppression (Equation terms attention)¶

Selective Equations Term Suppression (SETS) is a feature developed in collaboration with National Energy Technology Laboratory (NETL).
For several PDEs, the terms in physical equations have different scales in time and magnitude (sometimes also known as stiff PDEs).
For such PDEs, the loss equation can appear to be minimized despite poor treatment of the smaller terms.
To tackle this, one can create multiple instances of the same PDE and freeze certain terms (freezing is achieved by stopping the gradient calls on the term using PyTorch’s `.detach()`

in the backend).
During the optimization process, this forces the optimizer to use the value from former iteration for the frozen terms.
Thus, the optimizer minimizes each term in the PDE and efficiently reduces the equation residual.
This prevents any one term in the PDE dominating the loss gradients (attention to every term).
Creating multiple instances with different frozen term in each allows the overall representation of the physics to remain same.

However, creating multiple instances of the same equation (with different frozen terms) also creates multiple loss terms, each of which can be weighted differently. This scheme can be coupled with other loss balancing algorithms like ResNorm, etc. to come up with the optimal task weights for these different instances.

An example of creating multiple instances of equations using Modulus APIs is provided in the script `examples/annular_ring_equation_instancing/annular_ring.py`

.
Although the incompressible navier stokes equations used in this example is not the best test for the feature (because the system of PDEs does not
exhibit any stiffness), creating multiple instances of the momentum equations with the advection and diffusion terms frozen separately, provides
improvement over the baseline. The effectiveness of this scheme is primarily observed more for a stiff system of PDEs with
large scale differences in the different terms.

References

- 1(1,2)
Wang, Sifan, Yujun Teng, and Paris Perdikaris. “Understanding and mitigating gradient flow pathologies in physics-informed neural networks.” SIAM Journal on Scientific Computing 43.5 (2021): A3055-A3081.

- 2(1,2,3)
Kendall, Alex, Yarin Gal, and Roberto Cipolla. “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

- 3
Jagtap, Ameya D., Kenji Kawaguchi, and George Em Karniadakis. “Adaptive activation functions accelerate convergence in deep and physics-informed neural networks.” Journal of Computational Physics 404 (2020): 109136.

- 4(1,2,3,4,5,6)
Sukumar, N., and Ankit Srivastava. “Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks.” Computer Methods in Applied Mechanics and Engineering 389 (2022): 114333.

- 5(1,2)
Son, Hwijae, Jin Woo Jang, Woo Jin Han, and Hyung Ju Hwang. “Sobolev training for the neural network solutions of pdes.” arXiv preprint arXiv:2101.08932 (2021).

- 6(1,2)
Yu, Jeremy, Lu Lu, Xuhui Meng, and George Em Karniadakis. “Gradient-enhanced physics-informed neural networks for forward and inverse PDE problems.” arXiv preprint arXiv:2111.02801 (2021).

- 7
Heydari, A. Ali, Craig A. Thompson, and Asif Mehmood. “Softadapt: Techniques for adaptive loss weighting of neural networks with multi-part loss functions.” arXiv preprint arXiv:1912.12355 (2019).

- 8(1,2,3)
Bischof, Rafael, and Michael Kraus. “Multi-objective loss balancing for physics-informed deep learning.” arXiv preprint arXiv:2110.09813 (2021).

- 9(1,2,3)
Chen, Zhao, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. “Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks.” In International Conference on Machine Learning, pp. 794-803. PMLR, 2018.

- 10(1,2)
Wang, S., Yu, X. and Perdikaris, P., 2022. When and why PINNs fail to train: A neural tangent kernel perspective. Journal of Computational Physics, 449, p.110768.

- 11
The contributors to the work on hard BC imposition using the theory of R-functions and the first-order formulation of the PDEs are: M. A. Nabian, R. Gladstone, H. Meidani, N. Sukumar, A. Srivastava.

- 12
Nabian, Mohammad Amin, Rini Jasmine Gladstone, and Hadi Meidani. “Efficient training of physics‐informed neural networks via importance sampling.” Computer‐Aided Civil and Infrastructure Engineering 36.8 (2021): 962-977.

- 13
Halton, John H. “On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals.” Numerische Mathematik 2.1 (1960): 84-90.

- 14(1,2)
Wang, Sifan, Sankaran, Shyam, and Perdikaris, Paris. Respecting causality is all you need for training physics-informed neural networks. arXiv preprint arXiv:2203.07404, 2022.

- 15
Gnanasambandam, Raghav and Shen, Bo and Chung, Jihoon and Yue, Xubo and others. Self-scalable Tanh (Stan): Faster Convergence and Better Generalization in Physics-informed Neural Networks. arXiv preprint arXiv:2204.12589, 2022.