NVIDIA Clara Train 3.1

# ai4med.libs.optimizers package

## Submodules

class NovoGrad(`learning_rate=1.0, beta1=0.95, beta2=0.98, epsilon=1e-08, weight_decay=0.0, grad_averaging=False, use_locking=False, name='NovoGrad'`)

Bases: tensorflow.python.training.momentum.MomentumOptimizer

Optimizer that implements SGD with layer-wise normalized gradients, where each layer's gradient is normalized by sqrt(ema(sqr(grads))), similar to Adam but with one second-moment scalar per layer.

Second moment = ema of the layer-wise squared gradient:

v_t <- beta2*v_{t-1} + (1-beta2)*(g_t)^2

First moment has two modes:

1. Momentum of gradients normalized by sqrt(v_t + epsilon):

m_t <- beta1*m_{t-1} + lr_t * [g_t/sqrt(v_t+epsilon)]

2. Adam-like ema of normalized gradients (grad_averaging=True):

m_t <- beta1*m_{t-1} + lr_t * [(1-beta1)*(g_t/sqrt(v_t+epsilon))]

If weight decay is enabled, the wd term is added after the gradient is rescaled by 1/sqrt(v_t+epsilon):

m_t <- beta1*m_{t-1} + lr_t * [g_t/sqrt(v_t+epsilon) + wd*w_{t-1}]

Weight update:

w_t <- w_{t-1} - m_t
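
The update rule above can be sketched in NumPy. This is an illustrative single-layer implementation of the listed equations, not the Clara Train code; for simplicity, `v` is initialized to zero here and bias correction is omitted:

```python
import numpy as np

def novograd_step(w, g, m, v, lr=1.0, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.0, grad_averaging=False):
    """One NovoGrad update for a single layer (NumPy sketch)."""
    # Second moment: ema of the layer-wise squared gradient norm.
    v = beta2 * v + (1 - beta2) * np.sum(g * g)
    # Normalize the gradient, then add the (decoupled) weight-decay term.
    d = g / np.sqrt(v + eps) + weight_decay * w
    # grad_averaging=True gives the Adam-like ema of normalized gradients.
    if grad_averaging:
        d = (1 - beta1) * d
    # First moment, with the learning rate folded in.
    m = beta1 * m + lr * d
    # Weight update.
    w = w - m
    return w, m, v
```

Because `lr` is folded into `m`, the weight update is simply `w - m`, matching the equations above.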

Parameters
• learning_rate – A Tensor or a floating point value. The learning rate.

• beta1 – A Tensor or a float, used in the ema for momentum. Default = 0.95.

• beta2 – A Tensor or a float, used in the ema for gradient norms. Default = 0.98.

• epsilon – A float. Default = 1e-8.

• weight_decay – A Tensor or a float. Default = 0.0.

• grad_averaging – Switch between Momentum and SAG-style averaging. Default = False.

• use_locking – If True, use locks for update operations.

• name – Optional name prefix for the ops created when applying gradients. Defaults to "NovoGrad".

• use_nesterov – If True, use Nesterov Momentum.


apply_gradients(`grads_and_vars, global_step=None, name=None`)

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters

• grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

• global_step – Optional Variable to increment by one after the variables have been updated.

• name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns

An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises
• TypeError – If grads_and_vars is malformed.

• ValueError – If none of the variables have gradients.

• RuntimeError – If you should use _distributed_apply() instead.
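
The contract above can be illustrated with a toy pure-Python stand-in. This is not the TensorFlow implementation; the plain-SGD update and the mutable-list `global_step` here are illustrative only:

```python
def apply_gradients(grads_and_vars, variables, global_step=None, lr=0.1):
    """Toy stand-in for the apply_gradients contract: apply each
    (gradient, variable) pair, then increment global_step if given."""
    # TypeError if grads_and_vars is malformed.
    for pair in grads_and_vars:
        if not (isinstance(pair, tuple) and len(pair) == 2):
            raise TypeError("grads_and_vars must be (gradient, variable) pairs")
    # ValueError if none of the variables have gradients.
    if all(g is None for g, _ in grads_and_vars):
        raise ValueError("No gradients provided for any variable")
    # Apply the update (the real optimizer applies NovoGrad's rule).
    for g, name in grads_and_vars:
        if g is not None:
            variables[name] -= lr * g
    # global_step is incremented after the variables have been updated.
    if global_step is not None:
        global_step[0] += 1
    return variables, global_step
```

For example, applying `[(0.5, "w")]` to `{"w": 1.0}` with the default `lr=0.1` updates `w` and bumps the step counter by one.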