Optimizers¶
Common interfaces¶
-
class
AbstractOptimizer¶ Base type for all optimizers.
-
class
AbstractLearningRateScheduler¶ Base type for all learning rate schedulers.
-
class
AbstractMomentumScheduler¶ Base type for all momentum schedulers.
-
class
OptimizationState¶ -
batch_size¶ The size of the mini-batch used in stochastic training.
-
curr_epoch¶ The current epoch count. Epoch 0 means no training has happened yet. During the first pass through the data, the epoch count is 1; during the second pass, it is 2, and so on.
-
curr_batch¶ The current mini-batch count. The batch count is reset during every epoch. The batch count 0 means the beginning of each epoch, with no mini-batch seen yet. During the first mini-batch, the mini-batch count will be 1.
-
curr_iter¶ The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count is not reset at each epoch, so it tracks the total number of mini-batches seen so far.
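The relationship between the three counters can be seen in a small stand-alone simulation. This is a sketch in plain Julia, not library code: the CounterState type and the simulate function are hypothetical names introduced for illustration, and the loop only mimics the counting rules described above.

```julia
# Hypothetical stand-in for the counters described above (not the library type).
mutable struct CounterState
    batch_size::Int
    curr_epoch::Int
    curr_batch::Int
    curr_iter::Int
end

# Run `n_epoch` passes of `n_batch` mini-batches each and record
# (curr_epoch, curr_batch, curr_iter) while each mini-batch is processed.
function simulate(n_epoch::Int, n_batch::Int; batch_size::Int=32)
    state = CounterState(batch_size, 0, 0, 0)   # epoch 0: no training yet
    history = Tuple{Int,Int,Int}[]
    for _ in 1:n_epoch
        state.curr_epoch += 1
        state.curr_batch = 0                    # reset at the start of each epoch
        for _ in 1:n_batch
            state.curr_batch += 1
            state.curr_iter += 1                # never reset
            push!(history, (state.curr_epoch, state.curr_batch, state.curr_iter))
        end
    end
    return history
end

history = simulate(2, 3)
```

With two epochs of three mini-batches each, the first recorded triple is (1, 1, 1), the first mini-batch of the second epoch gives (2, 1, 4), and the last triple is (2, 3, 6).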
-
-
get_learning_rate(scheduler, state)¶ Parameters: - scheduler (AbstractLearningRateScheduler) – a learning rate scheduler.
- state (OptimizationState) – the current epoch, mini-batch and iteration counts.
Returns: the current learning rate.
-
class
LearningRate.Fixed¶ The fixed learning rate scheduler always returns the same learning rate.
-
class
LearningRate.Exp¶ \(\eta_t = \eta_0\gamma^t\). Here \(t\) is the epoch count, or the iteration count if
decay_on_iteration is set to true.
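As a quick numerical check of the exponential formula, the following plain-Julia sketch evaluates it for a few values of \(t\). The function name exp_learning_rate is hypothetical and is not part of the library.

```julia
# Exponential decay: eta_t = eta_0 * gamma^t (hypothetical helper, plain Julia).
exp_learning_rate(eta0::Real, gamma::Real, t::Integer) = eta0 * gamma^t

# Here t plays the role of the epoch count; pass the iteration count instead
# to mimic decaying once per mini-batch.
rates = [exp_learning_rate(0.1, 0.5, t) for t in 0:3]
```

With \(\eta_0 = 0.1\) and \(\gamma = 0.5\), the rate halves at every step: approximately 0.1, 0.05, 0.025 and 0.0125.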
-
class
LearningRate.Inv¶ \(\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}\). Here \(t\) is the epoch count, or the iteration count if
decay_on_iteration is set to true.
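The inverse-decay formula can be checked the same way. This is a minimal sketch in plain Julia; inv_learning_rate is a hypothetical name, not a library function.

```julia
# Inverse decay: eta_t = eta_0 * (1 + gamma * t)^(-power) (hypothetical helper).
# float(...) keeps the base a floating-point number so a negative exponent is valid
# even when gamma and power are integers.
inv_learning_rate(eta0::Real, gamma::Real, power::Real, t::Integer) =
    eta0 * float(1 + gamma * t)^(-power)

lr_start = inv_learning_rate(0.1, 0.5, 1.0, 0)   # t = 0: no decay yet
lr_later = inv_learning_rate(0.1, 0.5, 1.0, 2)   # t = 2: base is 1 + 0.5 * 2 = 2
```

At \(t = 0\) the rate equals \(\eta_0\); with \(\gamma = 0.5\) and a power of 1, it has dropped to half of \(\eta_0\) at \(t = 2\).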
-
get_momentum(scheduler, state)¶ Parameters: - scheduler (AbstractMomentumScheduler) – the momentum scheduler.
- state (OptimizationState) – the current epoch, mini-batch and iteration counts.
Returns: the current momentum.
-
class
Momentum.Null¶ The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate momentum should not be used.
-
class
Momentum.Fixed¶ Fixed momentum scheduler always returns the same value.
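The scheduler interface described above lends itself to one method per scheduler type. The sketch below mirrors that pattern in plain Julia; all names in it (MomentumSketch, NullMomentum, FixedMomentum, momentum_of) are hypothetical stand-ins rather than the library's own definitions.

```julia
# Hypothetical stand-in for an abstract momentum scheduler type.
abstract type MomentumSketch end

struct NullMomentum <: MomentumSketch end   # signals "do not use momentum"

struct FixedMomentum <: MomentumSketch
    momentum::Float64
end

# One method per scheduler type; `state` is accepted but unused by these two.
momentum_of(::NullMomentum, state) = 0.0
momentum_of(s::FixedMomentum, state) = s.momentum

null_value  = momentum_of(NullMomentum(), nothing)
fixed_value = momentum_of(FixedMomentum(0.9), nothing)
```

A scheduler that varies over time would add another subtype and a method that reads the counters from the state argument.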
-
get_updater(optimizer)¶ Parameters: optimizer (AbstractOptimizer) – the underlying optimizer. A utility function that creates an updater function, which uses its closure to store all the state needed for each weight.
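The closure technique mentioned here can be illustrated without the library. In the sketch below, make_updater is a hypothetical function that returns an updater keeping one momentum buffer per weight index in a captured dictionary. Plain vectors stand in for NDArray, and the update rule is an assumption chosen for illustration.

```julia
# Hypothetical sketch: build an updater whose per-weight state lives in a closure.
function make_updater(lr::Float64, momentum::Float64)
    velocities = Dict{Int,Vector{Float64}}()    # captured by `updater` below
    function updater(index::Int, grad::Vector{Float64}, weight::Vector{Float64})
        # Create the buffer for this weight on first use, then reuse it.
        v = get!(() -> zeros(length(weight)), velocities, index)
        v .= momentum .* v .- lr .* grad
        weight .+= v                            # update the weight in place
        return weight
    end
    return updater
end

apply_update = make_updater(0.1, 0.9)
w = [1.0, 2.0]
apply_update(1, [1.0, 1.0], w)   # buffer starts at zero, so the step is -0.1
apply_update(1, [1.0, 1.0], w)   # buffer is reused, so the step grows to -0.19
```

After the two calls, w is approximately [0.71, 1.71]. The second step is larger than the first because the buffer created during the first call was remembered by the closure.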
Built-in optimizers¶
-
class
AbstractOptimizerOptions¶ Base class for all optimizer options.
-
normalized_gradient(opts, state, weight, grad)¶ Parameters: - opts (AbstractOptimizerOptions) – options for the optimizer, should contain the fields grad_scale, grad_clip and weight_decay.
- state (OptimizationState) – the current optimization state.
- weight (NDArray) – the trainable weights.
- grad (NDArray) – the original gradient of the weights.
Get the properly normalized gradient (re-scaled and clipped if necessary).
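A possible reading of "re-scaled and clipped if necessary" is sketched below on plain Julia vectors. The function name normalized_gradient_sketch is hypothetical, and the order of operations (scale, then clip, then add the weight-decay term) is an assumption made for illustration, not a statement about the library's implementation.

```julia
# Hypothetical sketch on plain vectors; the order of the three steps is an assumption.
function normalized_gradient_sketch(grad::Vector{Float64}, weight::Vector{Float64};
                                    grad_scale::Float64=1.0,
                                    grad_clip::Float64=0.0,
                                    weight_decay::Float64=0.0)
    g = grad_scale .* grad                       # re-scale
    if grad_clip > 0
        g = clamp.(g, -grad_clip, grad_clip)     # clip into [-grad_clip, grad_clip]
    end
    return g .+ weight_decay .* weight           # add the L2 (weight decay) term
end

g = normalized_gradient_sketch([4.0, -4.0], [1.0, 1.0];
                               grad_scale=0.5, grad_clip=1.0, weight_decay=0.1)
```

In this example the scaled gradient [2.0, -2.0] is clipped to [1.0, -1.0], and the decay term adds 0.1 to each component, giving approximately [1.1, -0.9].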
-
class
SGD¶ Stochastic gradient descent optimizer.
-
SGD(; kwargs...)¶ Parameters: - lr (Real) – default 0.01, learning rate.
- lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, it overrides the lr parameter.
- momentum (Real) – default 0.0, the momentum.
- momentum_scheduler (AbstractMomentumScheduler) – default nothing, a dynamic momentum scheduler. If set, it overrides the momentum parameter.
- grad_clip (Real) – default 0, if positive, will clip the gradient into the bounded range [-grad_clip, grad_clip].
- weight_decay (Real) – default 0.0001, weight decay is equivalent to adding a global l2 regularizer to the parameters.
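For reference, a common way to write the weight-decay equivalence and the momentum update is sketched here, ignoring re-scaling and clipping. The symbols \(w\) (weights), \(v\) (momentum buffer), \(g\) (gradient with decay), \(\mu\) (momentum), \(\lambda\) (weight_decay) and \(L\) (loss) are notation introduced for this illustration, and the exact form used by the implementation is an assumption.

\[ g_t = \nabla L(w_t) + \lambda w_t = \nabla \left( L(w_t) + \tfrac{\lambda}{2} \lVert w_t \rVert_2^2 \right) \]

\[ v_{t+1} = \mu v_t - \eta_t g_t, \qquad w_{t+1} = w_t + v_{t+1} \]

With \(\mu = 0\) this reduces to the plain update \(w_{t+1} = w_t - \eta_t g_t\).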
-
-
class
ADAM¶ The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
-
ADAM(; kwargs...)¶ Parameters: - lr (Real) – default 0.001, learning rate.
- lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, it overrides the lr parameter.
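- beta1 (Real) – default 0.9, exponential decay rate for the first-moment estimate.
- beta2 (Real) – default 0.999, exponential decay rate for the second-moment estimate.
- epsilon (Real) – default 1e-8, small constant added to the denominator for numerical stability.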
- grad_clip (Real) – default 0, if positive, will clip the gradient into the range [-grad_clip, grad_clip].
- weight_decay (Real) – default 0.00001, weight decay is equivalent to adding a global l2 regularizer for all the parameters.
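To show how lr, beta1, beta2 and epsilon interact, here is a minimal sketch of one update step following Algorithm 1 of the cited paper, written on plain Julia vectors. AdamSketchState and adam_step! are hypothetical names, and the library's own implementation may organize the computation differently.

```julia
# Hypothetical per-weight state: first and second moment estimates and a step counter.
mutable struct AdamSketchState
    m::Vector{Float64}
    v::Vector{Float64}
    t::Int
end

AdamSketchState(n::Int) = AdamSketchState(zeros(n), zeros(n), 0)

# One Adam step with bias correction, following the paper's Algorithm 1.
function adam_step!(w::Vector{Float64}, s::AdamSketchState, g::Vector{Float64};
                    lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
    s.t += 1
    s.m .= beta1 .* s.m .+ (1 - beta1) .* g          # first-moment estimate
    s.v .= beta2 .* s.v .+ (1 - beta2) .* g .^ 2     # second-moment estimate
    m_hat = s.m ./ (1 - beta1^(s.t))                 # bias-corrected moments
    v_hat = s.v ./ (1 - beta2^(s.t))
    w .-= lr .* m_hat ./ (sqrt.(v_hat) .+ epsilon)   # update the weights in place
    return w
end

w = [0.0, 0.0]
state = AdamSketchState(2)
adam_step!(w, state, [1.0, -2.0])
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly lr regardless of the gradient's magnitude: w ends up near [-0.001, 0.001].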
-