Optimizers

Common interfaces

class AbstractOptimizer

Base type for all optimizers.

class AbstractLearningRateScheduler

Base type for all learning rate schedulers.

class AbstractMomentumScheduler

Base type for all momentum schedulers.

class OptimizationState
batch_size

The size of the mini-batch used in stochastic training.

curr_epoch

The current epoch count. Epoch 0 means no training has happened yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.

curr_batch

The current mini-batch count. The batch count is reset at the beginning of every epoch. A batch count of 0 means the start of an epoch, with no mini-batch seen yet; during the first mini-batch, the mini-batch count will be 1.

curr_iter

The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count is not reset in each epoch, so it tracks the total number of mini-batches seen so far.

get_learning_rate(scheduler, state)
Parameters:
  • scheduler (AbstractLearningRateScheduler) – a learning rate scheduler.
  • state (OptimizationState) – the current state about epoch, mini-batch and iteration count.
Returns:

the current learning rate.
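For example, a scheduler is queried with the current optimization state. The sketch below assumes an OptimizationState(batch_size) convenience constructor and a LearningRate.Fixed(learning_rate) constructor; both are used here purely for illustration:

    using MXNet

    # a state describing a hypothetical point during training
    state = mx.OptimizationState(128)    # batch_size = 128 (assumed constructor)
    state.curr_epoch = 3
    state.curr_iter  = 1500

    # a fixed scheduler ignores the counters and always returns the same rate
    scheduler = mx.LearningRate.Fixed(0.05)
    lr = mx.get_learning_rate(scheduler, state)   # 0.05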

class LearningRate.Fixed

A fixed learning rate scheduler that always returns the same learning rate.

class LearningRate.Exp

\(\eta_t = \eta_0\gamma^t\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.

class LearningRate.Inv

\(\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.
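A quick numeric illustration of the two decay rules above, in plain Julia (independent of the scheduler types themselves):

    η0, γ = 0.1, 0.9

    # LearningRate.Exp: η_t = η0 * γ^t
    [η0 * γ^t for t in 0:3]                     # 0.1, 0.09, 0.081, 0.0729

    # LearningRate.Inv with power = 0.5: η_t = η0 * (1 + γ*t)^(-power)
    power = 0.5
    [η0 * (1 + γ*t)^(-power) for t in 0:3]      # 0.1, ≈0.0725, ≈0.0598, ≈0.0520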

get_momentum(scheduler, state)
Parameters:
  • scheduler (AbstractMomentumScheduler) – the momentum scheduler.
  • state (OptimizationState) – the state about current epoch, mini-batch and iteration count.
Returns:

the current momentum.

class Momentum.Null

The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

class Momentum.Fixed

A fixed momentum scheduler that always returns the same momentum value.
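Querying a momentum scheduler works the same way as querying a learning rate scheduler. The constructor arguments below (Momentum.Null() taking no argument, Momentum.Fixed taking the momentum value, and OptimizationState taking the batch size) are assumptions for illustration:

    state = mx.OptimizationState(128)

    mx.get_momentum(mx.Momentum.Null(), state)       # always 0.0
    mx.get_momentum(mx.Momentum.Fixed(0.9), state)   # always 0.9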

get_updater(optimizer)
Parameters:optimizer (AbstractOptimizer) – the underlying optimizer.

A utility function to create an updater function that uses its closure to store all the states needed for each weight.
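A hedged usage sketch: the returned closure caches a per-index optimizer state (for example a momentum buffer) the first time each weight index is seen, then applies the update. The argument order of the returned updater and the mx.zeros/mx.ones helpers are assumptions here:

    opt     = mx.SGD(lr=0.01, momentum=0.9)
    updater = mx.get_updater(opt)

    weight = mx.zeros(10)   # assumed NDArray constructors, for illustration only
    grad   = mx.ones(10)

    # assumed call convention: updater(index, gradient, weight);
    # index 1 gets its own cached state, reused on every later call with index 1
    updater(1, grad, weight)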

Built-in optimizers

class AbstractOptimizerOptions

Base class for all optimizer options.

normalized_gradient(opts, state, weight, grad)
Parameters:
  • opts (AbstractOptimizerOptions) – options for the optimizer; should contain the fields grad_scale, grad_clip and weight_decay.
  • state (OptimizationState) – the current optimization state.
  • weight (NDArray) – the trainable weights.
  • grad (NDArray) – the original gradient of the weights.

Get the properly normalized gradient (re-scaled and clipped if necessary).
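Conceptually, the normalization combines gradient re-scaling, optional clipping, and weight decay folded into the gradient. A minimal plain-Julia sketch of that logic (an illustration of the idea, not the library implementation itself):

    function sketch_normalized_gradient(grad_scale, grad_clip, weight_decay, weight, grad)
        g = grad_scale .* grad                    # re-scale, e.g. by 1 / batch_size
        if grad_clip > 0
            g = clamp.(g, -grad_clip, grad_clip)  # clip into [-grad_clip, grad_clip]
        end
        g .+= weight_decay .* weight              # l2 weight decay as a gradient term
        return g
    end

    sketch_normalized_gradient(1/128, 5.0, 1e-4, randn(10), randn(10))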

class SGD

Stochastic gradient descent optimizer.

SGD(; kwargs...)
Parameters:
  • lr (Real) – default 0.01, learning rate.
  • lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • momentum (Real) – default 0.0, the momentum.
  • momentum_scheduler (AbstractMomentumScheduler) – default nothing, a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
  • grad_clip (Real) – default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
  • weight_decay (Real) – default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
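A typical construction is shown below; the commented training call only sketches how the optimizer would be handed to a model, where mx.fit, the n_epoch keyword, model and train_provider are assumptions for illustration:

    optimizer = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.0001, grad_clip=5.0)

    # hypothetical training call, assuming a FeedForward `model` and a data
    # provider `train_provider` already exist:
    # mx.fit(model, optimizer, train_provider, n_epoch=10)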

class ADAM

The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].

ADAM(; kwargs...)
Parameters:
  • lr (Real) – default 0.001, learning rate.
  • lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • beta1 (Real) – default 0.9.
  • beta2 (Real) – default 0.999.
  • epsilon (Real) – default 1e-8.
  • grad_clip (Real) – default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay (Real) – default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
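ADAM is constructed the same way as SGD; a decaying learning rate can be attached via lr_scheduler. The positional/keyword form of the LearningRate.Exp constructor shown here is an assumption:

    optimizer = mx.ADAM(lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)

    # with an exponentially decaying learning rate instead of the fixed lr
    optimizer = mx.ADAM(lr_scheduler=mx.LearningRate.Exp(0.001, gamma=0.95))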