Optimizers

Common interfaces

class AbstractOptimizer

Base type for all optimizers.

class AbstractLearningRateScheduler

Base type for all learning rate schedulers.

class AbstractMomentumScheduler

Base type for all momentum schedulers.

class OptimizationState
batch_size

The size of the mini-batch used in stochastic training.

curr_epoch

The current epoch count. Epoch 0 means no training has happened yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.

curr_batch

The current mini-batch count. The batch count is reset at the beginning of every epoch. A batch count of 0 means the start of an epoch, with no mini-batch seen yet; during the first mini-batch, the mini-batch count will be 1.

curr_iter

The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count is not reset in each epoch, so it tracks the total number of mini-batches seen so far.

get_learning_rate(scheduler, state)
Parameters:
  • scheduler (AbstractLearningRateScheduler) – a learning rate scheduler.
  • state (OptimizationState) – the current state about epoch, mini-batch and iteration count.
Returns:

the current learning rate.
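For example, a scheduler is queried with the current optimization state. The sketch below assumes an OptimizationState(batch_size) convenience constructor and a LearningRate.Fixed(learning_rate) constructor; both are used here purely for illustration:

    using MXNet

    # a state describing a hypothetical point during training
    state = mx.OptimizationState(128)    # batch_size = 128 (assumed constructor)
    state.curr_epoch = 3
    state.curr_iter  = 1500

    # a fixed scheduler ignores the counters and always returns the same rate
    scheduler = mx.LearningRate.Fixed(0.05)
    lr = mx.get_learning_rate(scheduler, state)   # 0.05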

class LearningRate.Fixed

A fixed learning rate scheduler that always returns the same learning rate.

class LearningRate.Exp

\(\eta_t = \eta_0\gamma^t\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.

class LearningRate.Inv

\(\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.
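A quick numeric illustration of the two decay rules above, in plain Julia (independent of the scheduler types themselves):

    η0, γ = 0.1, 0.9

    # LearningRate.Exp: η_t = η0 * γ^t
    [η0 * γ^t for t in 0:3]                     # 0.1, 0.09, 0.081, 0.0729

    # LearningRate.Inv with power = 0.5: η_t = η0 * (1 + γ*t)^(-power)
    power = 0.5
    [η0 * (1 + γ*t)^(-power) for t in 0:3]      # 0.1, ≈0.0725, ≈0.0598, ≈0.0520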

get_momentum(scheduler, state)
Parameters:
  • scheduler (AbstractMomentumScheduler) – the momentum scheduler.
  • state (OptimizationState) – the state about current epoch, mini-batch and iteration count.
Returns:

the current momentum.

class Momentum.Null

The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.

class Momentum.Fixed

A fixed momentum scheduler that always returns the same momentum value.
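Querying a momentum scheduler works the same way as querying a learning rate scheduler. The constructor arguments below (Momentum.Null() taking no argument, Momentum.Fixed taking the momentum value, and OptimizationState taking the batch size) are assumptions for illustration:

    state = mx.OptimizationState(128)

    mx.get_momentum(mx.Momentum.Null(), state)       # always 0.0
    mx.get_momentum(mx.Momentum.Fixed(0.9), state)   # always 0.9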

get_updater(optimizer)
Parameters:optimizer (AbstractOptimizer) – the underlying optimizer.

A utility function to create an updater function that uses its closure to store all the states needed for each weight.
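A hedged usage sketch: the returned closure caches a per-index optimizer state (for example a momentum buffer) the first time each weight index is seen, then applies the update. The argument order of the returned updater and the mx.zeros/mx.ones helpers are assumptions here:

    opt     = mx.SGD(lr=0.01, momentum=0.9)
    updater = mx.get_updater(opt)

    weight = mx.zeros(10)   # assumed NDArray constructors, for illustration only
    grad   = mx.ones(10)

    # assumed call convention: updater(index, gradient, weight);
    # index 1 gets its own cached state, reused on every later call with index 1
    updater(1, grad, weight)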

Built-in optimizers

class AbstractOptimizerOptions

Base class for all optimizer options.

normalized_gradient(opts, state, weight, grad)
Parameters:
  • opts (AbstractOptimizerOptions) – options for the optimizer; should contain the fields grad_scale, grad_clip and weight_decay.
  • state (OptimizationState) – the current optimization state.
  • weight (NDArray) – the trainable weights.
  • grad (NDArray) – the original gradient of the weights.

Get the properly normalized gradient (re-scaled and clipped if necessary).
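Conceptually, the normalization combines gradient re-scaling, optional clipping, and weight decay folded into the gradient. A minimal plain-Julia sketch of that logic (an illustration of the idea, not the library implementation itself):

    function sketch_normalized_gradient(grad_scale, grad_clip, weight_decay, weight, grad)
        g = grad_scale .* grad                    # re-scale, e.g. by 1 / batch_size
        if grad_clip > 0
            g = clamp.(g, -grad_clip, grad_clip)  # clip into [-grad_clip, grad_clip]
        end
        g .+= weight_decay .* weight              # l2 weight decay as a gradient term
        return g
    end

    sketch_normalized_gradient(1/128, 5.0, 1e-4, randn(10), randn(10))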

class SGD

Stochastic gradient descent optimizer.

SGD(; kwargs...)
Parameters:
  • lr (Real) – default 0.01, learning rate.
  • lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • momentum (Real) – default 0.0, the momentum.
  • momentum_scheduler (AbstractMomentumScheduler) – default nothing, a dynamic momentum scheduler. If set, will overwrite the momentum parameter.
  • grad_clip (Real) – default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
  • weight_decay (Real) – default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
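A typical construction is shown below; the commented training call only sketches how the optimizer would be handed to a model, where mx.fit, the n_epoch keyword, model and train_provider are assumptions for illustration:

    optimizer = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.0001, grad_clip=5.0)

    # hypothetical training call, assuming a FeedForward `model` and a data
    # provider `train_provider` already exist:
    # mx.fit(model, optimizer, train_provider, n_epoch=10)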

class ADAM

The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].

ADAM(; kwargs...)
Parameters:
  • lr (Real) – default 0.001, learning rate.
  • lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, will overwrite the lr parameter.
  • beta1 (Real) – default 0.9.
  • beta2 (Real) – default 0.999.
  • epsilon (Real) – default 1e-8.
  • grad_clip (Real) – default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
  • weight_decay (Real) – default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
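ADAM is constructed the same way as SGD; a decaying learning rate can be attached via lr_scheduler. The positional/keyword form of the LearningRate.Exp constructor shown here is an assumption:

    optimizer = mx.ADAM(lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)

    # with an exponentially decaying learning rate instead of the fixed lr
    optimizer = mx.ADAM(lr_scheduler=mx.LearningRate.Exp(0.001, gamma=0.95))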