Optimizers¶
Common interfaces¶
class AbstractOptimizer¶
Base type for all optimizers.
class AbstractLearningRateScheduler¶
Base type for all learning rate schedulers.
class AbstractMomentumScheduler¶
Base type for all momentum schedulers.
class OptimizationState¶
- batch_size¶
  The size of the mini-batch used in stochastic training.
- curr_epoch¶
  The current epoch count. Epoch 0 means no training yet; during the first pass through the data, the epoch count will be 1; during the second pass, it will be 2, and so on.
- curr_batch¶
  The current mini-batch count. The batch count is reset at the start of every epoch. A batch count of 0 means the beginning of an epoch, with no mini-batch seen yet; during the first mini-batch, the count will be 1.
- curr_iter¶
  The current iteration count. One iteration corresponds to one mini-batch, but unlike the mini-batch count, the iteration count does not reset in each epoch, so it tracks the total number of mini-batches seen so far.
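Conceptually, the state is just a record of these counters. A minimal sketch of such a struct (the field types and the zero-initializing constructor are assumptions for illustration, not the library's actual definition):

```julia
# Sketch of OptimizationState; field names follow the docs above,
# the Int types and the constructor are assumptions for illustration.
mutable struct OptimizationState
  batch_size :: Int   # size of each mini-batch
  curr_epoch :: Int   # 0 before training, 1 during the first pass, ...
  curr_batch :: Int   # reset to 0 at the start of every epoch
  curr_iter  :: Int   # never reset; total mini-batches seen so far
end
OptimizationState(batch_size::Int) = OptimizationState(batch_size, 0, 0, 0)
```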
get_learning_rate(scheduler, state)¶
Parameters:
- scheduler (AbstractLearningRateScheduler) – a learning rate scheduler.
- state (OptimizationState) – the current state of the epoch, mini-batch and iteration counts.
Returns: the current learning rate.
class LearningRate.Fixed¶
A fixed learning rate scheduler, which always returns the same learning rate.
class LearningRate.Exp¶
\(\eta_t = \eta_0 \gamma^t\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.
class LearningRate.Inv¶
\(\eta_t = \eta_0 (1 + \gamma t)^{-\mathrm{power}}\). Here \(t\) is the epoch count, or the iteration count if decay_on_iteration is set to true.
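Both decay schedules are straightforward to compute from the formulas above; the following sketch re-implements them for illustration (the struct layouts and names are assumptions, not the library's actual definitions):

```julia
# Illustrative re-implementation of the Exp and Inv schedules above.
# Struct layouts and names are assumptions for this sketch; `t` stands
# for the epoch count (or iteration count, if decaying per iteration).
struct ExpSchedule
  eta0  :: Float64  # initial learning rate η₀
  gamma :: Float64  # decay factor γ, typically slightly below 1
end
learning_rate(s::ExpSchedule, t::Int) = s.eta0 * s.gamma^t  # ηₜ = η₀γᵗ

struct InvSchedule
  eta0  :: Float64
  gamma :: Float64
  power :: Float64
end
learning_rate(s::InvSchedule, t::Int) = s.eta0 * (1 + s.gamma*t)^(-s.power)
```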
get_momentum(scheduler, state)¶
Parameters:
- scheduler (AbstractMomentumScheduler) – the momentum scheduler.
- state (OptimizationState) – the current state of the epoch, mini-batch and iteration counts.
Returns: the current momentum.
class Momentum.Null¶
The null momentum scheduler always returns 0 for momentum. It is also used to explicitly indicate that momentum should not be used.
class Momentum.Fixed¶
A fixed momentum scheduler, which always returns the same value.
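As a sketch, the two schedulers amount to the following (type names here are illustrative assumptions, not the library's definitions; only the signature of get_momentum follows the docs above):

```julia
# Sketch of the two momentum schedulers; definitions assumed for illustration.
abstract type AbstractMomentumScheduler end
struct NullMomentum  <: AbstractMomentumScheduler end
struct FixedMomentum <: AbstractMomentumScheduler
  coef :: Float64
end
get_momentum(::NullMomentum, state)  = 0.0     # momentum disabled
get_momentum(s::FixedMomentum, state) = s.coef # constant momentum
```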
get_updater(optimizer)¶
Parameters:
- optimizer (AbstractOptimizer) – the underlying optimizer.
A utility function to create an updater function that uses its closure to store all the state needed for each weight.
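The closure is what lets a single update function keep per-weight state such as momentum buffers. A minimal sketch of the pattern (illustrative only; the update rule shown is plain SGD with momentum, not the library's implementation):

```julia
# Sketch of the updater-closure pattern: per-weight state lives in a Dict
# captured by the returned function. Plain SGD with momentum, for illustration.
function make_updater(lr::Float64, momentum::Float64)
  states = Dict{Int,Vector{Float64}}()  # one momentum buffer per weight id
  function updater!(id::Int, weight::Vector{Float64}, grad::Vector{Float64})
    v = get!(states, id, zeros(length(weight)))  # lazily create the buffer
    @. v = momentum * v - lr * grad              # update the velocity
    @. weight += v                               # apply the step in place
    return weight
  end
  return updater!
end
```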
Built-in optimizers¶
class AbstractOptimizerOptions¶
Base class for all optimizer options.
normalized_gradient(opts, state, weight, grad)¶
Parameters:
- opts (AbstractOptimizerOptions) – options for the optimizer; should contain the fields grad_scale, grad_clip and weight_decay.
- state (OptimizationState) – the current optimization state.
- weight (NDArray) – the trainable weights.
- grad (NDArray) – the original gradient of the weights.
Get the properly normalized gradient (re-scaled and clipped if necessary).
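Based on the option fields listed above, the normalization presumably amounts to re-scaling, optional clipping, and adding the weight-decay term; a sketch of that computation (an assumption for illustration, not the library's exact code):

```julia
# Sketch of gradient normalization: scale, clip, then add weight decay.
# Based on the option fields named above; the exact library code may differ.
function normalized_gradient_sketch(grad_scale, grad_clip, weight_decay,
                                    weight::Vector{Float64}, grad::Vector{Float64})
  g = grad_scale .* grad                   # re-scale (e.g. by 1/batch_size)
  if grad_clip > 0
    g = clamp.(g, -grad_clip, grad_clip)   # clip into [-grad_clip, grad_clip]
  end
  g .+= weight_decay .* weight             # L2 regularization term
  return g
end
```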
class SGD¶
Stochastic gradient descent optimizer.
SGD(; kwargs...)¶
Parameters:
- lr (Real) – default 0.01, the learning rate.
- lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, it will overwrite the lr parameter.
- momentum (Real) – default 0.0, the momentum.
- momentum_scheduler (AbstractMomentumScheduler) – default nothing, a dynamic momentum scheduler. If set, it will overwrite the momentum parameter.
- grad_clip (Real) – default 0; if positive, the gradient will be clipped into the bounded range [-grad_clip, grad_clip].
- weight_decay (Real) – default 0.0001; weight decay is equivalent to adding a global L2 regularizer to the parameters.
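A usage sketch, assuming the package is loaded with the conventional mx module prefix; the keyword names follow the documentation above:

```julia
# Usage sketch; assumes `using MXNet` with the `mx` module prefix, as is
# conventional for MXNet.jl. Keyword names follow the docs above.
opt = mx.SGD(lr=0.01, momentum=0.9, weight_decay=0.0001)
```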
class ADAM¶
The solver described in Diederik Kingma, Jimmy Ba: Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
ADAM(; kwargs...)¶
Parameters:
- lr (Real) – default 0.001, the learning rate.
- lr_scheduler (AbstractLearningRateScheduler) – default nothing, a dynamic learning rate scheduler. If set, it will overwrite the lr parameter.
- beta1 (Real) – default 0.9.
- beta2 (Real) – default 0.999.
- epsilon (Real) – default 1e-8.
- grad_clip (Real) – default 0; if positive, the gradient will be clipped into the range [-grad_clip, grad_clip].
- weight_decay (Real) – default 0.00001; weight decay is equivalent to adding a global L2 regularizer for all the parameters.
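For reference, one Adam step from the cited paper, written with the hyperparameters documented above (a sketch of the paper's algorithm, not the library's internal code):

```julia
# Sketch of one Adam step (Kingma & Ba, arXiv:1412.6980) using the
# hyperparameters documented above; not the library's internal code.
function adam_step!(w, g, m, v, t; lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8)
  @. m = beta1 * m + (1 - beta1) * g          # first-moment estimate
  @. v = beta2 * v + (1 - beta2) * g^2        # second-moment estimate
  mhat = m ./ (1 - beta1^t)                   # bias correction, step t ≥ 1
  vhat = v ./ (1 - beta2^t)
  @. w -= lr * mhat / (sqrt(vhat) + epsilon)  # parameter update in place
  return w
end
```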