NVIDIA Clara Train 3.1

ai4med.libs.optimizers package

class NovoGrad(learning_rate=1.0, beta1=0.95, beta2=0.98, epsilon=1e-08, weight_decay=0.0, grad_averaging=False, use_locking=False, name='NovoGrad')

Bases: tensorflow.python.training.momentum.MomentumOptimizer

Optimizer that implements SGD with layer-wise normalized gradients, where normalization is done by sqrt(ema(sqr(grads))), similar to Adam.

Second moment = ema of the layer-wise squared gradient norm:

v_t <- beta2*v_{t-1} + (1-beta2)*(g_t)^2

First moment has two modes:

  1. momentum on grads normalized by sqrt(v_t+epsilon):

     m_t <- beta1*m_{t-1} + lr_t * [g_t/sqrt(v_t+epsilon)]

  2. similar to Adam, ema of the normalized grads:

     m_t <- beta1*m_{t-1} + lr_t * [(1-beta1)*(g_t/sqrt(v_t+epsilon))]

If weight decay is used, a wd term is added after the grads are rescaled by 1/sqrt(v_t+epsilon):

m_t <- beta1*m_{t-1} + lr_t * [g_t/sqrt(v_t+epsilon) + wd*w_{t-1}]

Weight update (lr_t is already folded into m_t above):

w_t <- w_{t-1} - m_t
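
As a reading aid, here is a minimal NumPy sketch of one such update for a single layer, assuming the formulas above; the function and variable names are illustrative and the moment initialization is simplified, so this is not the ai4med implementation itself:

    import numpy as np

    def novograd_step(w, g, m, v, lr=1.0, beta1=0.95, beta2=0.98,
                      epsilon=1e-8, weight_decay=0.0, grad_averaging=False):
        """Illustrative single NovoGrad step for one layer (w: weights, g: gradient)."""
        # Second moment: ema of this layer's squared gradient norm (one scalar per layer)
        v = beta2 * v + (1.0 - beta2) * np.sum(g ** 2)
        # Rescale the gradient by 1/sqrt(v_t+epsilon), then add the weight-decay term
        g_hat = g / np.sqrt(v + epsilon) + weight_decay * w
        if grad_averaging:
            # assumed to correspond to the Adam-like mode (2) above
            g_hat = (1.0 - beta1) * g_hat
        # First moment: momentum on the rescaled gradient, with lr_t folded in
        m = beta1 * m + lr * g_hat
        # Weight update; lr_t is already inside m_t per the formulas above
        w = w - m
        return w, m, v

Note that v is a single scalar per layer (the ema of that layer's squared gradient norm), which is the layer-wise normalization the class description refers to.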

Parameters
  • learning_rate – A Tensor or a floating point value. The learning rate.

  • beta1 – A Tensor or a float, used in ema for momentum. Default = 0.95.

  • beta2 – A Tensor or a float, used in ema for grad norms. Default = 0.98.

  • epsilon – a float. Default = 1e-8.

  • weight_decay – A Tensor or a float. Default = 0.0.

  • grad_averaging – switch between Momentum and SAG. Default = False.

  • use_locking – If True use locks for update operations.

  • name – Optional, name prefix for the ops created when applying gradients. Defaults to “NovoGrad”.

  • use_nesterov – If True use Nesterov Momentum.
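
A hedged usage sketch in TF1-style graph mode follows; the exact import path for NovoGrad within ai4med.libs.optimizers is an assumption, and the toy variable and loss exist only to make the snippet self-contained:

    import tensorflow as tf

    # Assumed import path; adjust to how NovoGrad is exposed in your installation.
    from ai4med.libs.optimizers import NovoGrad

    tf.compat.v1.disable_eager_execution()

    # Toy variable and quadratic loss, purely for illustration.
    w = tf.compat.v1.get_variable("w", shape=[10],
                                  initializer=tf.compat.v1.zeros_initializer())
    loss = tf.reduce_sum(tf.square(w - 1.0))

    global_step = tf.compat.v1.train.get_or_create_global_step()
    optimizer = NovoGrad(learning_rate=0.01, beta1=0.95, beta2=0.98,
                         weight_decay=0.001, grad_averaging=False)

    # minimize() chains compute_gradients() and apply_gradients().
    train_op = optimizer.minimize(loss, global_step=global_step)

    with tf.compat.v1.Session() as sess:
        sess.run(tf.compat.v1.global_variables_initializer())
        for _ in range(100):
            sess.run(train_op)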


apply_gradients(grads_and_vars, global_step=None, name=None)

Apply gradients to variables.

This is the second part of minimize(). It returns an Operation that applies gradients.

Parameters
  • grads_and_vars – List of (gradient, variable) pairs as returned by compute_gradients().

  • global_step – Optional Variable to increment by one after the variables have been updated.

  • name – Optional name for the returned operation. Defaults to the name passed to the Optimizer constructor.

Returns

An Operation that applies the specified gradients. If global_step was not None, that operation also increments global_step.

Raises
  • TypeError – If grads_and_vars is malformed.

  • ValueError – If none of the variables have gradients.

  • RuntimeError – If you should use _distributed_apply() instead.
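
For reference, a short sketch (same assumptions as the earlier snippet: TF1-style graph mode, assumed import path, toy loss) of the two-step pattern that apply_gradients() completes:

    import tensorflow as tf
    from ai4med.libs.optimizers import NovoGrad  # assumed import path

    tf.compat.v1.disable_eager_execution()

    w = tf.compat.v1.get_variable("w", shape=[4],
                                  initializer=tf.compat.v1.ones_initializer())
    loss = tf.reduce_mean(tf.square(w))

    global_step = tf.compat.v1.train.get_or_create_global_step()
    optimizer = NovoGrad(learning_rate=0.01)

    # First part of minimize(): build the list of (gradient, variable) pairs.
    grads_and_vars = optimizer.compute_gradients(loss)
    # Gradients could be inspected or clipped here before being applied.

    # Second part: apply the gradients and increment global_step.
    train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

Splitting the call this way is useful when the gradients need to be processed (for example, clipped) between the two steps; otherwise minimize() covers both.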
