emerging_optimizers.psgd#

emerging_optimizers.psgd.procrustes_step.procrustes_step(Q, max_step_size=0.125, eps=1e-08, order=2)[source]#

One step of an online solver for the orthogonal Procrustes problem.

The orthogonal Procrustes problem is \(\min_U \| U Q - I \|_F\) s.t. \(U^H U = I\) by rotating Q as \(\exp(a R) Q\), where \(R = Q^H - Q\) is the generator and \(\|a R\| < 1\).

If using 2nd order expansion, max_step_size should be less than \(1/4\) as we only expand \(\exp(a R)\) to its 2nd order term. If using 3rd order expansion, max_step_size should be less than \(5/8\). This method is an expansion of a Lie algebra parametrized rotation that uses a simple approximate line search to find the optimal step size, from Xi-Lin Li.

Parameters:

Q (Tensor) – Tensor of shape (n, n), general square matrix to orthogonalize.
max_step_size (float) – Maximum step size for the line search. Default is 1/8. (0.125)
eps (float) – Small number for numerical stability.
order (Literal[2, 3]) – Order of the Taylor expansion. Must be 2 or 3. Default is 2.

Return type:

Tensor

emerging_optimizers.psgd.psgd_kron_contractions.apply_kronecker_factors(Q_list, X)[source]#

Apply all Kronecker factors once to tensor \(X\), each to its corresponding dimension.

This applies each \(Q\) factor once, for example in 2D case: \(Q_1 X Q_2^T\).

Parameters:

Q_list (List[Tensor]) – List of \(Q\) (the upper-triangular Kronecker factors), each of shape (d_i, d_i) or (d_i,).
X (Tensor) – Tensor of shape (d_0, d_1, …, d_N).

Returns:

Tensor of shape (d_0, d_1, …, d_N).

Return type:

Tensor

emerging_optimizers.psgd.psgd_kron_contractions.apply_preconditioner(Q_list, X)[source]#

Apply the full PSGD preconditioner to X.

This is the full Kronecker product of PSGD’s kronecker factors Q^T Q, applied to X.

\(P X = (Q_1^T Q_1) X (Q_2^T Q_2)\)

This applies each factor followed by its transpose for the full preconditioner effect.

Parameters:

Q_list (List[Tensor]) – List of \(Q\) (the Kronecker factors), each of shape (d_i, d_i) or (d_i,).
X (Tensor) – Tensor of shape (d_0, d_1, …, d_N).

Returns:

Tensor of shape (d_0, d_1, …, d_N).

Return type:

Tensor

emerging_optimizers.psgd.psgd_kron_contractions.partial_contraction(G1, G2, axis)[source]#

Compute the partial contraction of G1 and G2 along axis axis. This is the contraction of the two tensors, but with all axes except axis contracted.

Parameters:

G1 (Tensor) – Tensor of shape (d_0, d_1, …, d_{axis-1}, d_{axis}, d_{axis+1}, …, d_N)
G2 (Tensor) – Tensor of shape (d_0, d_1, …, d_{axis-1}, d_{axis}, d_{axis+1}, …, d_N)
axis (int) – int, the axis to contract along

Returns:

Tensor of shape (d_{axis}, d_{axis})

Return type:

Tensor

emerging_optimizers.psgd.psgd_utils.norm_lower_bound_skew(A, k=32, half_iters=2, eps=1e-08)[source]#

A cheap lower bound on the spectral norm (largest eigenvalue) of skew-symmetric matrix.

Note: For skew-symmetric matrices, all diagonal entries are zero and \(A^T = -A\). From Xi-Lin Li.

Parameters:

A (Tensor) – Tensor of shape \((n, n)\), skew-symmetric.
k (int) – Dimension of the subspace. Suggested values: 128 for bfloat16, 32 for float32, 4 for float64.
half_iters (int) – Half of the number of subspace iterations.
eps (float) – Small number for numerical stability.

Returns:

A scalar Tensor giving a lower bound on \(\|A\|_2\).

Return type:

Tensor

emerging_optimizers.psgd.psgd_utils.norm_lower_bound_spd(A, k=32, half_iters=2, eps=1e-08)[source]#

A cheap lower bound for the spectral norm of a symmetric positive definite matrix.

Parameters:

A (Tensor) – Tensor of shape \((n, n)\), symmetric positive definite.
k (int) – Dimension of the subspace.
half_iters (int) – Half of the number of subspace iterations.
eps (float) – Small number for numerical stability.

Returns:

A scalar giving a lower bound on \(\\|A\\|_2\).

Return type:

Tensor

emerging_optimizers.psgd.psgd_utils.uniformize_q_in_place(Q_list)[source]#

Balance the dynamic ranges of kronecker factors in place to prevent numerical underflow or overflow.

Each tensor in Q_list is rescaled so that its maximum absolute entry becomes the geometric mean of all factors original maxima. This preserves the overall product of norms (and thus the scale of the Kronecker product) while avoiding numerical underflow or overflow when factors have widely differing magnitudes.

Given tensors \(Q_1, Q_2, \ldots, Q_n\):

Compute max-absolute norms: \(\|Q_i\|_\infty = \max(|Q_i|)\) for \(i = 1, \ldots, n\)
Compute geometric mean: \(g = \left(\prod_{i=1}^{n} \|Q_i\|_\infty \right)^{1/n}\)
Rescale each tensor: \(Q_i \leftarrow Q_i \cdot \frac{g}{\|Q_i\|_\infty}\)

This ensures \(\|Q_i\|_\infty = g\) for all \(i\), while preserving the norm of the Kronecker product \(Q_1 \otimes Q_2 \otimes \cdots \otimes Q_n\).

Parameters:: Q_list (List[Tensor]) – List of Q (e.g. the Kronecker factors), each tensor will be modified in place.
Returns:: None
Return type:: None