Advanced Schemes and Tools#

Adaptive Activation Functions#

In training of neural networks, in addition to learning the linear transformations, one can also learn the nonlinear transformations to potentially improve the convergence as well as the accuracy of the model by using the global adaptive activation functions proposed by [3]. Global adaptive activations consist of a single trainable parameter that is multiplied by the input to the activations in order to modify the slope of activations. Therefore, a nonlinear transformation at layer \(\ell\) will take the following form

\[\mathcal{N}^\ell \left(H^{\ell-1}; \theta, a \right) = \sigma\left(a \mathcal{L}^{\ell} \left(H^{\ell-1}\right) \right),\]

where \(\mathcal{N}^\ell\) is the nonlinear transformation at layer \(\ell\), \(H^{\ell-1}\) is the output of the hidden layer \(\ell-1\), \(\theta\) is the set of model weights and biases, \(a\) is the global adaptive activation parameter, \(\sigma\) is the activation function, and \(\mathcal{L}^{\ell}\) is the linear transformation at layer \(\ell\). Similar to the network weights and biases, the global adaptive activation parameter \(a\) is also a trainable parameter, and these trainable parameters are optimized by

\[\theta^*, a^* = \underset{{\theta,a}}{\operatorname{argmin}} L(\theta,a).\]

Self-scalable tanh (Stan) activation function#

As a variant of adaptive activation functions, [14] proposed the self-scalable tanh (Stan) activation function, which allows normal flow of gradients to compute the required derivatives and also enables systematic scaling of the input-output mapping. Specifically, for the \(i\)-th neuron in the \(k\)-th layer ( \(k = 1, 2, \dots , D - 1, i = 1, 2, \dots, N_k\) ), Stan is defined as

\[\sigma_k^i(x) = \text{tanh}(x) + \beta_k^i x \cdot \text{tanh}(x)\]

where \(\beta_k^i\) is the neuron-wise trainable parameter, initialized to 1.
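As a minimal sketch of the definition above, the scalar function below evaluates Stan for a single neuron in plain Python. The function name `stan` and the default `beta=1.0` are illustrative assumptions; in an actual model \(\beta_k^i\) would be a trainable parameter managed by the training framework, which is not shown here.

```python
import math

def stan(x: float, beta: float = 1.0) -> float:
    """Self-scalable tanh for one neuron: tanh(x) + beta * x * tanh(x)."""
    t = math.tanh(x)
    return t + beta * x * t

# With beta = 0 the activation reduces to the plain tanh.
values = [stan(x) for x in (-2.0, 0.0, 2.0)]
```

Unlike tanh, the extra term \(\beta x \tanh(x)\) is unbounded, so the output range can scale with the input.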

An example for Stan activation is provided by examples/helmholtz/helmholtz_stan.py. As shown in Fig. 52, one can see that Stan activation yields faster convergence and better validation accuracy.

Fig. 52 A comparison of the validation error results between default SiLU (red) and Stan (blue) activations for the Helmholtz example.#

Sobolev (Gradient Enhanced) Training#

Sobolev or gradient-enhanced training of physics-informed neural networks, proposed in Son et al. [5] and Yu et al. [6], leverages the derivative information of the PDE residuals in the training of a neural network solver. Specifically, in standard training of neural network solvers, we drive a suitable norm of the PDE residual to zero, while in Sobolev or gradient-enhanced training, one also takes the first or even higher-order derivatives of the PDE residuals w.r.t. the spatial inputs and drives a suitable norm of those residual derivatives to zero as well. The reference papers [5] [6] report that Sobolev or gradient-enhanced training can potentially give better training accuracy than standard training of a neural network solver. However, care must be taken when choosing the relative weights of the additional loss terms with respect to the standard PDE residual and boundary condition loss terms; otherwise, Sobolev or gradient-enhanced training may adversely affect the training convergence and accuracy of the neural network solver. Additionally, Sobolev or gradient-enhanced training increases the training time, because the differentiation order is increased and extra backpropagation is required. An example of Sobolev or gradient-enhanced training for the Navier-Stokes equations can be found at examples/annular_ring/annular_ring_gradient_enhanced/annular_ring_gradient_enhanced.py.
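The structure of a gradient-enhanced loss can be sketched in one dimension as follows. This is only an illustration under stated assumptions: the function name `gradient_enhanced_loss`, the weight `grad_weight`, and the use of a central finite difference in place of automatic differentiation are all hypothetical choices and do not reflect the PhysicsNeMo Sym implementation.

```python
def gradient_enhanced_loss(residual, points, grad_weight=0.1, h=1e-4):
    """Mean squared residual plus a weighted mean squared residual derivative."""
    n = len(points)
    residual_term = sum(residual(x) ** 2 for x in points) / n
    # A central finite difference stands in for automatic differentiation here.
    gradient_term = sum(
        ((residual(x + h) - residual(x - h)) / (2.0 * h)) ** 2 for x in points
    ) / n
    return residual_term + grad_weight * gradient_term

# Toy residual r(x) = x, so dr/dx = 1 everywhere.
loss = gradient_enhanced_loss(lambda x: x, [0.0, 1.0, 2.0], grad_weight=0.5)
```

The parameter `grad_weight` plays the role of the relative weight discussed above; a poor choice can let the derivative term dominate the residual term.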

Importance Sampling#

Suppose our problem is to find the optimal parameters \(\mathbf{\theta^*}\) such that the Monte Carlo approximation of the integral loss is minimized

\[\begin{aligned} \mathbf{\theta^*} &= \underset{{ \mathbf{\theta} }}{\operatorname{argmin}} \ \mathbb{E}_f \left[ \ell (\mathbf{\theta}) \right] \\ & \approx \underset{{ \mathbf{\theta} }}{\operatorname{argmin}} \ \frac{1}{N} \sum_{i=1}^{N} \ell (\mathbf{\theta}; \mathbf{x_i} ), \quad \mathbf{x_i} \sim {f}(\mathbf{x}), \end{aligned}\]

where \(f\) is a uniform probability density function. In importance sampling, the sampling points are drawn from an alternative sampling distribution \(q\) such that the estimation variance of the integral loss is reduced, that is

\[\mathbf{\theta^*} \approx \underset{{ \mathbf{\theta} }}{\operatorname{argmin}} \ \frac{1}{N} \sum_{i=1}^{N} \frac{f(\mathbf{x_i})}{q(\mathbf{x_i})} \ell (\mathbf{\theta}; \mathbf{x_i} ), \quad \mathbf{x_i} \sim q(\mathbf{x}).\]
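The reweighted estimator can be illustrated with a one-dimensional toy problem. The sketch below assumes a uniform density \(f\) on \([0, 1]\) and a hypothetical sampling density \(q(x) = 2x\), sampled by inverting its CDF; the function name and the toy loss are illustrative only.

```python
import math
import random

def importance_sampling_estimate(loss, n, seed=0):
    """Estimate E_f[loss] on [0, 1] with f uniform, drawing samples from q(x) = 2x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = 1.0 - rng.random()                  # u in (0, 1], avoids x = 0
        x = math.sqrt(u)                        # inverse-CDF sample from q(x) = 2x
        total += (1.0 / (2.0 * x)) * loss(x)    # weight f(x) / q(x) = 1 / (2x)
    return total / n

# Toy loss x^2, whose exact mean under the uniform density is 1/3.
estimate = importance_sampling_estimate(lambda x: x * x, n=20000)
```

Because each sample is weighted by \(f/q\), the estimate remains unbiased even though the points are no longer drawn uniformly.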

PhysicsNeMo Sym offers point cloud importance sampling for improved convergence and accuracy, as originally proposed in [11]. In this scheme, the training points are updated adaptively based on a sampling measure \(q\) for a more accurate unbiased approximation of the loss, compared to uniform sampling. Details on the importance sampling implementation in PhysicsNeMo Sym are presented in the script examples/ldc/ldc_2d_importance_sampling.py. Fig. 53 shows a comparison between the uniform and importance sampling validation error results for the annular ring example, showing better accuracy when importance sampling is used. In this example, the training points are sampled according to a distribution proportional to the 2-norm of the velocity derivatives. The sampling probability computed at iteration 100K is shown in Fig. 54.

Fig. 53 A comparison between the uniform and importance sampling validation error results for the annular ring example.#

Fig. 54 A visualization of the training point sampling probability at iteration 100K for the annular ring example.#

Quasi-Random Sampling#

Training points in PhysicsNeMo Sym are generated according to a uniform distribution by default. An alternative to uniform sampling is quasi-random sampling, which generates training points with low discrepancy across the domain. Popular low-discrepancy sequences include the Halton sequences [12], the Sobol sequences, and the Hammersley sets; of these, the Halton sequences are adopted in PhysicsNeMo Sym. A snapshot of a batch of training points generated using uniform sampling and Halton sequences for the annular ring example is shown in Fig. 55. Halton sequences for sample generation can be enabled by setting quasirandom=True during the constraint definition. A case study on the use of Halton sequences to solve a conjugate heat transfer example is also presented in the tutorial FPGA Heat Sink with Laminar Flow.
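For intuition, a Halton sequence can be generated with the radical-inverse construction sketched below. This is a textbook version written for illustration, not the generator used inside PhysicsNeMo Sym; the function names and the choice of bases 2 and 3 are assumptions.

```python
def radical_inverse(index: int, base: int) -> float:
    """Mirror the base-`base` digits of `index` about the radix point."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += fraction * (index % base)
        index //= base
        fraction /= base
    return result

def halton_points(n: int, bases=(2, 3)):
    """First n Halton points in the unit hypercube, one coprime base per dimension."""
    return [tuple(radical_inverse(i, b) for b in bases) for i in range(1, n + 1)]

points = halton_points(4)
```

Successive points fill the gaps left by earlier ones, which is what gives the sequence its low discrepancy compared with independent uniform samples.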

Fig. 55 A snapshot of a batch of training points generated using uniform sampling (top) and Halton sequences (bottom) for the annular ring example.#

Exact Boundary Condition Imposition#

Standard neural network solvers impose boundary conditions in a soft form, by incorporating them as constraints in the form of additional terms in the loss function. In this form, the boundary conditions are not exactly satisfied. The work [4] introduced a new approach to impose the boundary conditions exactly in neural network solvers. For this, they introduced a geometry-aware solution ansatz for the neural network solver that uses an Approximate Distance Function (ADF) \(\phi (\mathbf{x})\) to the boundaries of the domain, constructed using the theory of R-functions. First, we will look into how this ADF is computed, and next, we will discuss the formation of the solution ansatz based on the type of the boundary conditions. [10]

Let \(D \subset \mathbb{R}^d\) denote the computational domain with boundary \(\partial D\). The exact distance is the shortest distance from any point \(\mathbf{x} \in \mathbb{R}^d\) to the domain boundary \(\partial D\), and is therefore zero on \(\partial D\). The exact distance function is not second- or higher-order differentiable, and thus one can use the ADF \(\phi (\mathbf{x})\) instead.

The exact boundary condition imposition in PhysicsNeMo Sym is currently limited to 2D geometries only. Let \(\partial D \subset \mathbb{R}^2\) be a boundary composed of \(n\) line segments and curves \(\partial D_i\), and let \(\phi_i\) denote the ADF to each curve or line segment such that \(\phi_1 \cup \phi_2 \cup ... \cup \phi_n =\phi\). The properties of an ADF are as follows: (1) for any point \(\mathbf{x}\) on \(\partial D\), \(\phi(\mathbf{x})=0\), and (2) \(\phi(\mathbf{x})\) is normalized to the \(m\)-th order, i.e., its derivative w.r.t. the unit inward normal vector is one and its second- to \(m\)-th-order derivatives are zero at all points on \(\partial D\).

The elementary properties of R-functions, including R-disjunction (union), R-conjunction (intersection), and R-negation, can be used for constructing a composite ADF \(\phi (\mathbf{x})\) to the boundary \(\partial D\) when the ADFs \(\phi_i(\mathbf{x})\) to the partitions of \(\partial D\) are known. Once the ADFs \(\phi_i(\mathbf{x})\) to all the partitions of \(\partial D\) are calculated, we can calculate the ADF to \(\partial D\) using the R-equivalence operation. When \(\partial D\) is composed of \(n\) pieces \(\partial D_i\), the ADF \(\phi\) that is normalized up to order \(m\) is given by

\[\phi(\phi_1,...,\phi_n):=\phi_1 \sim \dots \sim \phi_n=\frac{1}{\sqrt[m]{\frac{1}{(\phi_1)^m}+\frac{1}{(\phi_2)^m}+...+\frac{1}{(\phi_n)^m}}}.\]

Next, we will see how the individual ADFs \(\phi_i\) for line segments and arcs are calculated. For more details, please refer to the reference paper [4]. The ADF for an infinite line passing through two points \(\mathbf{x}_1 \equiv (x_1,y_1)\) and \(\mathbf{x}_2 \equiv (x_2,y_2)\) is calculated as

\[\phi_l(\mathbf{x}; \mathbf{x}_1, \mathbf{x}_2) = \frac{(x-x_1)(y_2-y_1)-(y-y_1)(x_2-x_1)}{L},\]

where \(L\) is the distance between the two points. Similarly, the ADF for a circle of radius \(R\) with center located at \(\mathbf{x}_c \equiv (x_c, y_c)\) is given by

\[\phi_c(\mathbf{x}; R, \mathbf{x}_c) = \frac{R^2-(\mathbf{x}-\mathbf{x}_c) \cdot (\mathbf{x}-\mathbf{x}_c)}{2R}.\]

In order to calculate the ADF for line segments and arcs, one has to use trimming functions [4]. Let us consider a line segment with end points \(\mathbf{x}_1 \equiv (x_1,y_1)\) and \(\mathbf{x}_2 \equiv (x_2,y_2)\), midpoint \(\mathbf{x}_c=(\frac{x_1+x_2}{2},\frac{y_1+y_2}{2})\), and length \(L = ||\mathbf{x}_2-\mathbf{x}_1||\). Then the ADF for the line segment, \(\phi(\mathbf{x})\), can be calculated as follows.

\[f = \phi_l(\mathbf{x}; \mathbf{x}_1, \mathbf{x}_2),\]

\[t = \phi_c(\mathbf{x}; R=\frac{L}{2}, \mathbf{x}_c=\frac{\mathbf{x}_1 + \mathbf{x}_2}{2}),\]

\[\Phi = \sqrt{t^2 + f^4},\]

\[\phi(\mathbf{x}) = \sqrt{f^2+(\frac{\Phi - t}{2})^2}.\]

Note that here, \(f\) is the ADF for an infinite line and \(t\) is the trimming function, which is the ADF for a circle. In other words, the ADF for a line segment is obtained by trimming an infinite line by a circle. Similarly, one can obtain the ADF for an arc by using the above equations and by setting \(f\) to the circle ADF and \(t\) to the ADF for an infinite line.
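The ADF construction for a line segment and the R-equivalence composition can be sketched in plain Python as follows. The function names are hypothetical and the code is written only to mirror the formulas from [4] quoted in this section, with the infinite-line ADF taken as the signed perpendicular distance \(\left[(x-x_1)(y_2-y_1)-(y-y_1)(x_2-x_1)\right]/L\); it is not the PhysicsNeMo Sym implementation.

```python
import math

def adf_line(x, y, x1, y1, x2, y2):
    """Signed distance to the infinite line through (x1, y1) and (x2, y2)."""
    length = math.hypot(x2 - x1, y2 - y1)
    return ((x - x1) * (y2 - y1) - (y - y1) * (x2 - x1)) / length

def adf_circle(x, y, radius, xc, yc):
    """ADF for a circle; positive inside, zero on the circle."""
    return (radius ** 2 - ((x - xc) ** 2 + (y - yc) ** 2)) / (2.0 * radius)

def adf_segment(x, y, x1, y1, x2, y2):
    """ADF for a line segment: the infinite line trimmed by a circle."""
    length = math.hypot(x2 - x1, y2 - y1)
    f = adf_line(x, y, x1, y1, x2, y2)
    t = adf_circle(x, y, length / 2.0, (x1 + x2) / 2.0, (y1 + y2) / 2.0)
    big_phi = math.sqrt(t ** 2 + f ** 4)
    return math.sqrt(f ** 2 + ((big_phi - t) / 2.0) ** 2)

def r_equivalence(phis, m=1):
    """Compose ADFs of boundary pieces into one ADF normalized to order m."""
    if any(p == 0.0 for p in phis):
        return 0.0  # the point lies on one of the pieces
    return 1.0 / sum(p ** (-m) for p in phis) ** (1.0 / m)

# ADF of the unit segment from (0, 0) to (1, 0), evaluated just above its midpoint.
phi_near = adf_segment(0.5, 0.1, 0.0, 0.0, 1.0, 0.0)
```

Near the interior of the segment the ADF closely tracks the exact distance, while beyond the end points it stays positive rather than vanishing along the extended line.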

Now that we understand how to form the ADFs for line segments and arcs, let us discuss how we can form the solution ansatz using ADFs such that the boundary conditions are exactly satisfied. For Dirichlet boundary condition, if \(u=g\) is prescribed on \(\partial D\), then the solution ansatz is given by

\[u_{sol} = g + \phi u_{net},\]

where \(u_{sol}\) is the approximate solution, and \(u_{net}\) is the neural network output. To see how the solution ansatz is formed for Neumann, Robin, and mixed boundary conditions, please refer to the reference paper [4].

When different inhomogeneous essential boundary conditions are imposed on distinct subsets of \(\partial D\), we can use transfinite interpolation to calculate the \(g\) function, which represents the boundary condition function for the entire boundary \(\partial D\). The transfinite interpolation function can be written as

\[g(\mathbf{x}) = \sum_{i=1}^{M} w_i(\mathbf{x})g_i(\mathbf{x}),\]

\[w_i(\mathbf{x}) = \frac{\phi_i^{-\mu_i}}{\sum_{j=1}^{M}\phi_j^{-\mu_j}} = \frac{\prod_{j=1;j \neq i}^{M} \phi_j^{\mu_j}}{\sum_{k=1}^{M}\prod_{j=1;j \neq k}^{M} \phi_j^{\mu_j} + \epsilon},\]

where the weights \(w_i\) add up to one, and the resulting \(g\) interpolates \(g_i\) on the set \(\partial D_i\). The constant \(\mu_i \geq 1\) controls the nature of the interpolation, and \(\epsilon\) is a small number that prevents division by zero. This boundary value function, \(g(\mathbf{x})\), can be used in the solution ansatz for Dirichlet boundary conditions to calculate the final solution with the exact imposition of BC.
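A minimal sketch of the transfinite interpolation and the Dirichlet ansatz is given below. It evaluates the weights in the product form with positive exponents, which is algebraically equivalent to \(\phi_i^{-\mu_i} / \sum_j \phi_j^{-\mu_j}\) but stays finite when some \(\phi_i\) is zero. All function names and the default value of `eps` are assumptions made for illustration.

```python
def transfinite_weights(phis, mus, eps=1e-12):
    """Weights w_i built from products of the other pieces' ADFs."""
    products = []
    for i in range(len(phis)):
        p = 1.0
        for j, (phi, mu) in enumerate(zip(phis, mus)):
            if j != i:
                p *= phi ** mu
        products.append(p)
    denom = sum(products) + eps  # eps guards against division by zero
    return [p / denom for p in products]

def boundary_value(phis, g_values, mus):
    """Interpolated boundary function g = sum_i w_i * g_i at one point."""
    weights = transfinite_weights(phis, mus)
    return sum(w * g for w, g in zip(weights, g_values))

def dirichlet_ansatz(g, phi, u_net):
    """Solution ansatz u_sol = g + phi * u_net; equals g wherever phi = 0."""
    return g + phi * u_net

# A point on boundary piece 1 (phi_1 = 0), at distance 0.5 from piece 2.
w = transfinite_weights([0.0, 0.5], [1, 1])
g = boundary_value([0.0, 0.5], [3.0, 7.0], [1, 1])
```

At a point on piece \(i\), the weight \(w_i\) is essentially one and the others vanish, so \(g\) reproduces the prescribed value \(g_i\) there.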

The exact imposition of boundary conditions as proposed in the reference paper [4], however, has certain challenges, especially when solving PDEs that contain second- or higher-order derivatives. Approximate distance functions constructed using the theory of R-functions are not normalized at the joining points of lines and arcs. Therefore, the second- and higher-order derivatives are not defined at these points and can take extremely large values close to them, which can affect the convergence behavior of the neural network. The solution presented in the reference paper [4] is to avoid sampling collocation points in close proximity to these points. We found, however, that this can adversely affect the convergence and final accuracy of the solution. As an alternative, we propose to use the first-order formulation of the PDEs through a change of variables. By treating the first-order derivatives of the quantities of interest as new variables, we can rewrite the second-order PDEs as a series of first-order PDEs with additional compatibility equations that appear as additional terms in the loss function. For instance, let us consider the Helmholtz equation, which takes the following form

\[k^2 u + \frac{\partial ^2 u}{\partial x ^2} + \frac{\partial ^2 u}{\partial y ^2} + \frac{\partial ^2 u}{\partial z ^2} = f,\]

where \(k\) is the wave number and \(f\) is the source term. One can define new variables \(u_x\), \(u_y\), \(u_z\), that represent, respectively, derivatives of the solution with respect to \(x\), \(y\), and \(z\) coordinates, and rewrite the Helmholtz equation as a set of first-order equations in the following form:

\[k^2 u + \frac{\partial u_x}{\partial x } + \frac{\partial u_y}{\partial y} + \frac{\partial u_z}{\partial z} = f,\]

\[u_x = \frac{\partial u}{\partial x },\]

\[u_y = \frac{\partial u}{\partial y },\]

\[u_z = \frac{\partial u}{\partial z }.\]

Using this form, the output of the neural network will now include \(u_x\), \(u_y\), \(u_z\) in addition to \(u\), but this in effect reduces the order of differentiation by one. As examples, first-order implementations of the Helmholtz and Navier-Stokes equations are available at examples/helmholtz/pdes/helmholtz_first_order.py and examples/annular_ring/annular_ring_hardBC/pdes/navier_stokes_first_order.py, respectively.

An advantage of using the first-order formulation of PDEs is the potential speed-up in training iterations as extra backpropagations for computing the second-order derivatives are not performed anymore. Additionally, this formulation enables the use of Automatic Mixed Precision (AMP), which is currently not suitable to be used for problems with second and higher-order derivatives. Use of AMP can further accelerate the training.

The figure below shows a comparison of interior validation accuracy between a baseline model (soft BC imposition and second-order PDE) and a model trained with hard BC imposition and first-order PDEs. It is evident that the hard BC approach reduces the validation error by about one order of magnitude compared to the baseline model. Additionally, the boundary validation error for the model trained with hard BC imposition is exactly zero, unlike the baseline model. These examples are available at examples/helmholtz.

Fig. 56 Interior validation accuracy for models trained with soft BC (orange) and hard BC (blue) imposition for the Helmholtz example.#

Using AMP, training of the model with exact BC imposition is 25% faster compared to the training of the baseline model.

Another example for solving the Navier-Stokes equations in the first-order form and with exact BC imposition can be found in examples/annular_ring/annular_ring_hardBC. The boundary conditions in this example consist of the following: Prescribed parabolic inlet velocity on the left wall, zero pressure on the right wall, and no-slip BC on the top/bottom walls and the inner circle. The figure below shows the solution for the annular ring example with hard BC imposition.

Fig. 57 Solution for the annular ring example obtained using hard BC imposition.#

Causal training#

Suppose that we have a time-dependent system of PDEs taking the following general form

\[u_t + \mathcal{N}[u] = 0, \quad t \in [0, T], x \in \Omega , \tag{13}\]

where \(u\) describes the unknown latent solution that is governed by the PDE system and \(\mathcal{N}\) is a possibly nonlinear differential operator. As demonstrated in [13], continuous-time PINN models can violate temporal causality, and hence are susceptible to converging towards erroneous solutions for transient problems. Causal training [13] aims to address this fundamental limitation and key source of error by reformulating the PDE residual loss to account explicitly for physical causality during model training. To introduce it, we split the time domain \([0, T]\) into \(N\) chunks \(\{ [t_{i-1}, t_i] \}_{i=1}^{N}\) and define the PDE residual loss over the \(i\)-th chunk

\[\mathcal{L}_i(\mathbf{\theta}) = \sum_j \left| \frac{\partial u_{\mathbf{\theta}}}{\partial t}(t_j, x_j) + \mathcal{N}[u_{\mathbf{\theta}}](t_j, x_j) \right|^2\]

with \(\{t_j, x_j\} \subset [t_{i-1}, t_i] \times \Omega\).

Then the total causal loss is given by

\[\mathcal{L}_r(\mathbf{\theta}) = \sum_{i=1}^N w_i \mathcal{L}_i(\mathbf{\theta}).\]

where

\[w_i = \exp\left(-\epsilon \sum_{k=1}^{i-1} \mathcal{L}_k(\mathbf{\theta})\right), \quad \text{for } i=2,3, \dots, N.\]

Note that \(w_i\) decays exponentially with the cumulative residual loss from the previous chunks, at a rate set by the positive parameter \(\epsilon\). As a consequence, \(\mathcal{L}_i(\mathbf{\theta})\) will not be minimized unless all previous residuals decrease to some small value such that \(w_i\) is large enough. This simple algorithm forces a PINN model to learn the PDE solution gradually, respecting the inherent causal structure of its dynamic evolution.
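The causal weights can be computed with a running sum over the per-chunk losses, as in the sketch below. It assumes the first chunk receives a weight of one (the empty-sum case of the formula); the function names and the sample loss values are illustrative only.

```python
import math

def causal_weights(chunk_losses, epsilon=1.0):
    """w_i = exp(-epsilon * sum of losses over all earlier chunks)."""
    weights, cumulative = [], 0.0
    for loss in chunk_losses:
        weights.append(math.exp(-epsilon * cumulative))
        cumulative += loss
    return weights

def causal_loss(chunk_losses, epsilon=1.0):
    """Total causal residual loss: sum_i w_i * L_i."""
    weights = causal_weights(chunk_losses, epsilon)
    return sum(w * l for w, l in zip(weights, chunk_losses))

# Residual losses on three consecutive time chunks.
weights = causal_weights([0.5, 0.2, 0.1], epsilon=2.0)
```

Later chunks receive small weights until the residuals on earlier chunks have been reduced, which is the gradual, time-ordered learning described above.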

Implementation details on causal training in PhysicsNeMo Sym are presented in the script examples/wave_equation/wave_1d_causal.py. Fig. 58 presents a comparison of the validation error between the baseline and causal training. It can be observed that causal training reduces the validation error by up to one order of magnitude.

Fig. 58 Interior validation accuracy for models trained with (blue) and without (red) the causal loss function for the 1D wave equation example.#

It is worth noting that the causal training scheme can be seamlessly combined with the moving time-window approach and with different network architectures in PhysicsNeMo Sym. For instance, the script examples/taylor_green/taylor_green_causal.py illustrates how to combine the causal loss function with the time-marching strategy for solving a complex transient Navier-Stokes problem.

Learning Rate Annealing#

The predominant approach in the training of PINNs is to represent the initial/boundary constraints as additive penalty terms in the loss function. This is usually done by multiplying each of these terms by a parameter \(\lambda\) to balance its contribution to the overall loss. However, tuning these parameters manually is not straightforward, and also requires treating these parameters as constants. The idea behind learning rate annealing, as proposed in [1], is an automated and adaptive rule for dynamic tuning of these parameters during the training. Let us assume the loss function for a steady state problem takes the following form

\[ L(\theta) = L_{residual}(\theta) + \lambda^{(i)} L_{BC}(\theta), \tag{14}\]

where the superscript \((i)\) represents the training iteration index. Then, at each training iteration, the learning rate annealing scheme [1] computes the ratio between the gradient statistics for the PDE loss term and the boundary term, as follows

\[\bar{\lambda}^{(i)} = \frac{\max\left(\left|\nabla_{\theta}L_{residual}\left(\theta^{(i)}\right)\right|\right)}{\operatorname{mean} \left(\left|\nabla_{\theta}L_{BC}\left(\theta^{(i)}\right)\right|\right)}.\]

Finally, the annealing parameter \(\lambda^{(i)}\) is computed using an exponential moving average as follows

\[\lambda^{(i)} = \alpha \bar{\lambda}^{(i)} + (1-\alpha) \lambda^{(i-1)},\]

where \(\alpha\) is the exponential moving average decay.
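One annealing step can be sketched as follows, given the gradients of the two loss terms flattened into plain lists. The function name `annealing_update`, the default `alpha`, and the sample gradient values are hypothetical; computing the gradients themselves is left to the training framework.

```python
def annealing_update(grad_residual, grad_bc, lam_prev, alpha=0.1):
    """One learning rate annealing step for the boundary loss weight."""
    max_residual_grad = max(abs(g) for g in grad_residual)
    mean_bc_grad = sum(abs(g) for g in grad_bc) / len(grad_bc)
    lam_hat = max_residual_grad / mean_bc_grad
    # Exponential moving average between the new ratio and the previous weight.
    return alpha * lam_hat + (1.0 - alpha) * lam_prev

# Residual gradients are roughly ten times larger than boundary gradients here.
lam = annealing_update([0.5, -2.0, 1.0], [0.1, -0.3, 0.2], lam_prev=1.0)
```

When the residual gradients dominate, the ratio is large and the boundary weight grows, pulling the two terms toward comparable gradient magnitudes.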

Homoscedastic Task Uncertainty for Loss Weighting#

In [2], the authors have proposed to use a Gaussian likelihood with homoscedastic task uncertainty as the training loss in multi-task learning applications. In this scheme, the loss function takes the following form

\[ L(\theta) = \sum_{j=1}^T \frac{1}{2\sigma_j^2} L_j(\theta) + \log \prod_{j=1}^T \sigma_j, \tag{15}\]

where \(T\) is the total number of tasks (or residual and initial/boundary condition loss terms). Minimizing this loss is equivalent to maximizing the log Gaussian likelihood with homoscedastic uncertainty [2], and the uncertainty terms \(\sigma\) serve as adaptive weights for different loss terms. Fig. 59 presents a comparison between the uncertainty loss weighting and no loss weighting for the annular ring example, showing that uncertainty loss weighting improves the training convergence and accuracy in this example. For details on this scheme, please refer to [2].
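The uncertainty-weighted loss can be evaluated as in the sketch below. Parameterizing each task by \(\log \sigma_j\) rather than \(\sigma_j\) is an assumption made here to keep \(\sigma_j\) positive; the function name is illustrative.

```python
import math

def uncertainty_weighted_loss(losses, log_sigmas):
    """Sum over tasks of L_j / (2 * sigma_j^2) + log(sigma_j)."""
    total = 0.0
    for loss, log_sigma in zip(losses, log_sigmas):
        total += 0.5 * math.exp(-2.0 * log_sigma) * loss + log_sigma
    return total

# With sigma_j = 1 for every task, each loss is simply halved.
value = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
```

For a fixed task loss \(L_j\), the expression is minimized at \(\sigma_j^2 = L_j\), so tasks with larger losses are automatically down-weighted while the \(\log \sigma_j\) term keeps \(\sigma_j\) from growing without bound.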

Fig. 59 A comparison between the uncertainty loss weighting vs. no loss weighting for the annular ring example.#

SoftAdapt#

SoftAdapt is a simple loss balancing algorithm that dynamically tunes the loss weights throughout the training. It measures the relative training progress for each loss term as the ratio of the loss value at each iteration to its value at the previous iteration, and the loss weights are determined by passing these relative progress measurements through a softmax transformation, as follows

\[w_j(i) = \frac{\exp \left( \frac{L_j(i)}{L_j(i-1)} \right)}{\sum_{k=1}^{n_{loss}} \exp \left( \frac{L_k(i)}{L_k(i-1)} \right)}.\]

Here, \(w_j(i)\) is the weight for the loss term \(j\) at iteration \(i\), \(L_j(i)\) is the value for the loss term \(j\) at iteration \(i\), and \(n_{loss}\) is the number of loss terms. We have observed that this softmax transformation can easily cause overflow. Thus, we modify the SoftAdapt equation using a softmax trick, as follows

\[w_j(i) = \frac{\exp \left( \frac{L_j(i)}{L_j(i-1) + \epsilon} - \mu(i) \right)}{\sum_{k=1}^{n_{loss}} \exp \left( \frac{L_k(i)}{L_k(i-1)+\epsilon} - \mu(i) \right)},\]

where \(\mu(i) = \max_j \left(L_j(i)/L_j(i-1) \right)\), and \(\epsilon\) is a small number to prevent division by zero.
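The shifted softmax can be sketched in plain Python as follows. In this sketch the maximum \(\mu\) is taken over the same \(\epsilon\)-regularized ratios that enter the exponentials, which is an assumption; the function name and default `eps` are also illustrative.

```python
import math

def softadapt_weights(losses, prev_losses, eps=1e-8):
    """Softmax of loss ratios, shifted by the largest ratio to avoid overflow."""
    ratios = [l / (p + eps) for l, p in zip(losses, prev_losses)]
    mu = max(ratios)
    exps = [math.exp(r - mu) for r in ratios]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]

# A ratio of order 1e9 would overflow exp() without the shift.
weights = softadapt_weights([1e6, 1.0], [1e-3, 1.0])
```

Subtracting the same constant from every exponent leaves the softmax unchanged, so the shift only affects numerical stability and not the resulting weights.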

Relative Loss Balancing with Random Lookback (ReLoBRaLo)#

Relative Loss Balancing with Random Lookback (ReLoBRaLo) [7] is a modified version of SoftAdapt that adopts a moving average for the loss weights as well as a random lookback mechanism. The loss weights at each iteration are calculated as follows

\[w_j(i) = \alpha \left( \beta w_j(i-1) + (1-\beta) \hat{w}_j^{(i;0)} \right) + (1-\alpha) \hat{w}_j^{(i;i-1)}.\]

Here, \(w_j(i)\) is the weight for the loss term \(j\) at iteration \(i\), \(\alpha\) is the moving average parameter, \(\beta\) is a Bernoulli random variable with an expected value close to 1. \(\hat{w}_j^{(i;i')}\) takes the following form

\[\hat{w}_j^{(i;i')} = \frac{n_{loss} \exp \left( \frac{L_j(i)}{\tau L_j(i')} \right)}{\sum_{k=1}^{n_{loss}} \exp \left( \frac{L_k(i)}{\tau L_k(i')}\right)},\]

where \(n_{loss}\) is the number of loss terms, \(L_j(i)\) is the value for the loss term \(j\) at iteration \(i\), and \(\tau\) is called the temperature [7]. With very large values for the temperature, loss weights tend to take similar values, while a value of zero for this parameter converts the softmax to an argmax function [7]. Similar to the modified version of SoftAdapt, we modify the equation for \(\hat{w}_j^{(i;i')}\) to prevent overflow and division by zero, as follows

\[\hat{w}_j^{(i;i')} = \frac{n_{loss} \exp \left( \frac{L_j(i)}{\tau L_j(i') + \epsilon} - \mu(i) \right)}{\sum_{k=1}^{n_{loss}} \exp \left( \frac{L_k(i)}{\tau L_k(i') + \epsilon} - \mu(i) \right)},\]

where \(\mu(i) = \max_j \left(L_j(i)/L_j(i') \right)\), and \(\epsilon\) is a small number.
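A single ReLoBRaLo update can be sketched as below. The function names, the default hyperparameters, and the use of `random.Random` to draw the Bernoulli variable \(\beta\) are assumptions for illustration; as in the SoftAdapt sketch, the shift \(\mu\) is taken over the regularized ratios.

```python
import math
import random

def relative_weights(losses, ref_losses, tau=1.0, eps=1e-8):
    """Scaled softmax of L_j(i) / (tau * L_j(i')), shifted to avoid overflow."""
    ratios = [l / (tau * r + eps) for l, r in zip(losses, ref_losses)]
    mu = max(ratios)
    exps = [math.exp(r - mu) for r in ratios]
    total = sum(exps)
    return [len(losses) * e / total for e in exps]

def relobralo_update(prev_weights, losses, prev_losses, init_losses, rng,
                     alpha=0.95, beta_mean=0.99, tau=1.0):
    """Moving average of the weights with a random lookback to iteration 0."""
    beta = 1.0 if rng.random() < beta_mean else 0.0  # Bernoulli draw
    w_init = relative_weights(losses, init_losses, tau)   # lookback to i' = 0
    w_prev = relative_weights(losses, prev_losses, tau)   # lookback to i' = i - 1
    return [
        alpha * (beta * w + (1.0 - beta) * w0) + (1.0 - alpha) * wp
        for w, w0, wp in zip(prev_weights, w_init, w_prev)
    ]

new_weights = relobralo_update(
    [1.0, 1.0], losses=[0.5, 2.0], prev_losses=[0.6, 1.5],
    init_losses=[1.0, 1.0], rng=random.Random(0),
)
```

Because each \(\hat{w}\) vector sums to \(n_{loss}\), the updated weights also sum to \(n_{loss}\) whenever the previous weights do.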

GradNorm#

One of the most popular loss balancing algorithms in computer vision and multi-task learning is GradNorm [8]. In this algorithm, an additional loss term is minimized throughout the training that encourages the gradient norms for different loss terms to take similar relative magnitudes, such that the network is trained for different loss terms at similar rates. The loss weights are dynamically tuned throughout the training based on the relative training rates of different losses, as follows

\[L_{gn}(i, w_j(i)) = \sum_j \left| G_W^{(j)}(i) - \bar{G}_W(i) \times \left[ r_j(i) \right]^\alpha \right|_1.\]

Here, \(L_{gn}\) is the GradNorm loss. \(W\) is the subset of the neural network weights that is used in the GradNorm loss, which is typically the weights of the last layer of the network in order to save on training costs. \(G_W^{(j)}(i) = || \nabla_W w_j(i) L_j(i)||_2\) is the \(L_2\) norm of the gradient of the weighted loss term \(j\) with respect to the weights \(W\) at iteration \(i\). \(\bar{G}_W(i)=E [G_W^{(j)}(i)]\) is the average gradient norm across all training losses at iteration \(i\). Also, \(r_j(i)=\tilde{L}_j(i)/E[\tilde{L}_j(i)]\) is the relative inverse training rate corresponding to the loss term \(j\), where \(\tilde{L}_j(i)=L_j(i)/L_j(0)\) measures the inverse training rate. \(\alpha\) is a hyperparameter that defines the strength of training rate balancing [8].

When taking the gradients of the GradNorm loss \(L_{gn}(i, w_j(i))\), the reference gradient norm \(\bar{G}_W(i) \times \left[ r_j(i) \right]^\alpha\) is treated as a constant, and the GradNorm loss is minimized by differentiating only with respect to the loss weights \(w_j\). Finally, after each training iteration, the weights \(w_j\) are normalized such that \(\Sigma_j w_j(i)=n_{loss}\), where \(n_{loss}\) is the number of loss terms excluding the GradNorm loss. For more details on the GradNorm algorithm, please refer to the reference paper [8].

In the GradNorm algorithm, it is observed that in some cases the weights \(w_j\) can take negative values and that will adversely affect the training convergence of the neural network solver. To prevent this, in the PhysicsNeMo Sym implementation of the GradNorm, we use an exponential transformation of the trainable weight parameters to weigh the loss terms.

In the reference paper, GradNorm has been shown to effectively improve accuracy and reduce overfitting for various network architectures, in both classification and regression tasks. Here, we have observed that GradNorm can also be effective in loss balancing of neural network solvers. In particular, we have tested the performance of GradNorm on the annular ring example by assigning a very small initial weight to the momentum loss terms and keeping the other loss weights intact compared to the base case. This is to evaluate whether it can recover appropriate loss weights throughout the training by starting from this poor initial loss weighting. Validation results are shown in the figure below. The blue line shows the base case, red shows the case where the momentum equations are weighted by \(10^{-4}\) and no loss balancing algorithm is used, and orange shows the same case but with GradNorm used for loss balancing. It is evident that failure to balance the weights of the loss terms appropriately in this test case results in convergence failure, and that GradNorm can effectively recover a suitable balance.

Fig. 60 GradNorm performance for loss balancing in the annular example. Blue: base case, red: momentum losses multiplied by 1e-4 and no loss balancing is used, orange: momentum losses multiplied by 1e-4 and GradNorm is used.#

ResNorm#

Residual Normalization (ResNorm) is a PhysicsNeMo Sym loss balancing scheme developed in collaboration with the National Energy Technology Laboratory (NETL). In this algorithm, which is a simplified variation of GradNorm, an additional loss term is minimized during training that encourages the individual losses to take similar relative magnitudes. The loss weights are dynamically tuned throughout the training based on the relative training rates of different losses, as follows:

\[L_{rn}(i, w_j(i)) = \sum_j \left| L_w^{(j)}(i) - \bar{L}(i) \times \left[ r_j(i) \right]^\alpha \right|_1.\]

Here, \(L_{rn}\) is the ResNorm loss, \(L_w^{(j)}(i)=w_j(i) L_j(i)\) is the weighted loss term \(j\) at iteration \(i\), and \(\bar{L}(i)=E [L_j(i)]\) is the average loss value across all training losses at iteration \(i\). Also, \(r_j(i)=\tilde{L}_j(i)/E[\tilde{L}_j(i)]\) is the relative inverse training rate corresponding to the loss term \(j\), where \(\tilde{L}_j(i)=L_j(i)/L_j(0)\) measures the inverse training rate. \(\alpha\) is a hyperparameter that defines the strength of training rate balancing.

Similar to GradNorm, when taking the gradients of the ResNorm loss \(L_{rn}(i, w_j(i))\) with respect to the loss weights \(w_j(i)\), the term \(\bar{L}(i) \times \left[ r_j(i) \right]^\alpha\) is treated as a constant. Finally, after each training iteration, the weights \(w_j(i)\) are normalized such that \(\Sigma_j w_j(i)=n_{loss}\), where \(n_{loss}\) is the number of loss terms excluding the ResNorm loss. Notice that unlike GradNorm, ResNorm does not require computing gradients with respect to model parameters, and thus ResNorm can be computationally more efficient than GradNorm. Again, similar to the implementation of GradNorm, to prevent the loss weights from taking negative values, we use an exponential transformation of the trainable weight parameters to weigh the loss terms.
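The ResNorm loss value and the weight renormalization can be sketched as follows. This only evaluates the formulas above for given loss values; the gradient step on the weights and the exponential transformation are omitted, and the function names are hypothetical.

```python
def resnorm_loss(weights, losses, init_losses, alpha=1.0):
    """Sum over j of |w_j * L_j - mean(L) * r_j^alpha|."""
    n = len(losses)
    mean_loss = sum(losses) / n
    # Inverse training rates relative to the initial loss values.
    rates = [l / l0 for l, l0 in zip(losses, init_losses)]
    mean_rate = sum(rates) / n
    # Targets are treated as constants when differentiating w.r.t. the weights.
    targets = [mean_loss * (r / mean_rate) ** alpha for r in rates]
    return sum(abs(w * l - t) for w, l, t in zip(weights, losses, targets))

def renormalize(weights):
    """Rescale the weights so that they sum to the number of loss terms."""
    n, total = len(weights), sum(weights)
    return [n * w / total for w in weights]

value = resnorm_loss([1.0, 1.0], losses=[2.0, 1.0], init_losses=[2.0, 1.0])
```

Only loss values enter the computation, which reflects why no gradients with respect to the model parameters are needed.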

We test the performance of ResNorm on the annular ring example by assigning a very small initial weight to the momentum loss terms and keeping the other loss weights intact compared to the base case. This is to evaluate whether it can recover appropriate loss weights throughout the training by starting from this poor initial loss weighting. Validation results are shown in the figure below. It is evident that ResNorm can effectively find a good balance between the loss terms and provide reasonable convergence, while the baseline case without loss balancing fails to converge.

Fig. 61 ResNorm performance for loss balancing in the annular example. The following are plotted: Base line (orange), momentum losses multiplied by 1e-4 (light blue) and momentum losses multiplied by 1e-4 with ResNorm (dark blue).#

Neural Tangent Kernel (NTK)#

The Neural Tangent Kernel (NTK) approach can be used to automatically assign weights to different loss terms. From the NTK perspective, the weight of each loss term should be set according to the magnitude of its NTK, so that all loss terms converge at a similar rate. Assume the total loss \(\mathcal{L}(\boldsymbol{\theta})\) is defined by

\[\mathcal{L}(\boldsymbol{\theta}) = \mathcal{L}_b(\boldsymbol{\theta}) + \mathcal{L}_r(\boldsymbol{\theta}),\]

where

\[\mathcal{L}_b(\boldsymbol{\theta}) = \sum_{i=1}^{N_b}|u(x_b^i,\boldsymbol{\theta})-g(x_b^i)|^2,\]

\[\mathcal{L}_r(\boldsymbol{\theta}) = \sum_{i=1}^{N_r}|r(x_r^i,\boldsymbol{\theta})|^2\]

are the boundary loss and PDE residual loss, respectively, and \(r\) is the PDE residual. Let \(\mathbf{J}_r\) and \(\mathbf{J}_b\) be the Jacobians of \(\mathcal{L}_r\) and \(\mathcal{L}_b\), respectively. Their NTKs are then defined as

\[\mathbf{K}_{bb}=\mathbf{J}_b\mathbf{J}_b^T\qquad \mathbf{K}_{rr}=\mathbf{J}_r\mathbf{J}_r^T\]

According to [9], the weights are given by

\[\lambda_b = \frac{Tr(\mathbf{K}_{bb})+Tr(\mathbf{K}_{rr})}{Tr(\mathbf{K}_{bb})}\quad \lambda_r = \frac{Tr(\mathbf{K}_{bb})+Tr(\mathbf{K}_{rr})}{Tr(\mathbf{K}_{rr})},\]

where \(Tr(\cdot)\) is the trace operator.
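Since \(Tr(\mathbf{J}\mathbf{J}^T)\) equals the sum of the squared entries of \(\mathbf{J}\), the weights can be computed without forming the kernels explicitly. The sketch below assumes the Jacobians are given as nested lists (one row per training point, one column per parameter); the function names and sample values are illustrative.

```python
def ntk_trace(jacobian):
    """Trace of J J^T, computed as the sum of squared entries of J."""
    return sum(v * v for row in jacobian for v in row)

def ntk_weights(jac_b, jac_r):
    """Return (lambda_b, lambda_r) from the traces of K_bb and K_rr."""
    tr_b, tr_r = ntk_trace(jac_b), ntk_trace(jac_r)
    total = tr_b + tr_r
    return total / tr_b, total / tr_r

# Toy Jacobians: the residual term has the larger kernel trace.
lambda_b, lambda_r = ntk_weights(
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
)
```

The term with the smaller trace receives the larger weight, which compensates for its slower convergence.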

The idea behind NTK weighting is that the convergence rate of each loss term is indicated by the eigenvalues of its NTK. We therefore reweight the loss terms by their eigenvalues so that each term has roughly the same convergence rate. For more details, please refer to [9]. In PhysicsNeMo Sym, the NTK can be computed automatically and weights can be assigned on the fly. The script examples/helmholtz/helmholtz_ntk.py shows the NTK implementation for a Helmholtz problem. Fig. 62 shows the results before NTK weighting, where the maximum error is 0.04. Using NTK weights, this error is reduced to 0.006, as shown in Fig. 63.

Fig. 62 Helmholtz problem without NTK weights#

Selective Equations Term Suppression (Equation terms attention)#

Selective Equations Term Suppression (SETS) is a feature developed in collaboration with the National Energy Technology Laboratory (NETL). For several PDEs, the terms in the physical equations have different scales in time and magnitude (such PDEs are sometimes also known as stiff PDEs). For such PDEs, the loss can appear to be minimized despite poor treatment of the smaller terms. To tackle this, one can create multiple instances of the same PDE and freeze certain terms (freezing is achieved by stopping the gradient calls on the term using PyTorch’s .detach() in the backend). During the optimization process, this forces the optimizer to use the value from the former iteration for the frozen terms. Thus, the optimizer minimizes each term in the PDE and efficiently reduces the equation residual. This prevents any one term in the PDE from dominating the loss gradients (attention to every term). Creating multiple instances with a different frozen term in each allows the overall representation of the physics to remain the same.

However, creating multiple instances of the same equation (with different frozen terms) also creates multiple loss terms, each of which can be weighted differently. This scheme can be coupled with other loss balancing algorithms like ResNorm, etc. to come up with the optimal task weights for these different instances.

An example of creating multiple instances of equations using PhysicsNeMo Sym APIs is provided in the script examples/annular_ring_equation_instancing/annular_ring.py. Although the incompressible Navier-Stokes equations used in this example are not the best test for the feature (because the system of PDEs does not exhibit any stiffness), creating multiple instances of the momentum equations with the advection and diffusion terms frozen separately provides an improvement over the baseline. The scheme is expected to be most effective for stiff systems of PDEs with large scale differences between the terms.

Fig. 64 Equation instancing for annular ring example. Base line (orange), equation instancing (one instance with diffusion terms frozen and other with advection terms frozen) (gray).#

References