Optimizers
The optim module provides optimization algorithms for training neural networks.
Base Optimizer
Standard Optimizers
SGD
SGD optimizer implementations.
- class fit.optim.sgd.SGD(parameters: List[Tensor], lr: float = 0.01, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False)
Bases: object
Stochastic Gradient Descent optimizer with optional momentum.
- __init__(parameters: List[Tensor], lr: float = 0.01, momentum: float = 0.0, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False)
Initialize SGD optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
momentum – Momentum factor (0 for no momentum)
dampening – Dampening for momentum
weight_decay – L2 penalty coefficient
nesterov – Whether to use Nesterov momentum
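How these parameters interact is easiest to see in a single update. The following is a minimal sketch on a scalar parameter, assuming the convention used by PyTorch's SGD (buffer seeded with the first gradient, dampening scaling the new gradient term). Whether fit.optim.sgd.SGD follows exactly this convention is an assumption, and `sgd_step` is a hypothetical helper, not part of the library.

```python
from typing import Optional, Tuple


def sgd_step(
    param: float,
    grad: float,
    buf: Optional[float],
    lr: float = 0.01,
    momentum: float = 0.0,
    dampening: float = 0.0,
    weight_decay: float = 0.0,
    nesterov: bool = False,
) -> Tuple[float, Optional[float]]:
    """One SGD update on a scalar; returns (new_param, new_momentum_buffer)."""
    # L2 penalty: fold weight_decay * param into the gradient.
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    if momentum != 0.0:
        if buf is None:
            # Assumed convention: the first step seeds the buffer with the gradient.
            buf = grad
        else:
            buf = momentum * buf + (1.0 - dampening) * grad
        # Nesterov adds a look-ahead term; plain momentum uses the buffer itself.
        grad = grad + momentum * buf if nesterov else buf
    return param - lr * grad, buf


# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, buf = 1.0, None
for _ in range(100):
    x, buf = sgd_step(x, 2.0 * x, buf, lr=0.1, momentum=0.9)
```

With momentum=0.0 the buffer is never created and the update reduces to plain gradient descent.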
- class fit.optim.sgd.SGDMomentum(parameters: List[Tensor], lr: float = 0.01, momentum: float = 0.9, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False)
Bases: object
SGD with momentum optimizer (legacy, kept for compatibility).
Note: The main SGD class now supports momentum directly.
- __init__(parameters: List[Tensor], lr: float = 0.01, momentum: float = 0.9, dampening: float = 0.0, weight_decay: float = 0.0, nesterov: bool = False)
Initialize SGD with momentum optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
momentum – Momentum factor
dampening – Dampening for momentum
weight_decay – L2 penalty coefficient
nesterov – Whether to use Nesterov momentum
Adam
Adam optimizer implementation.
- class fit.optim.adam.Adam(parameters: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Bases: object
Adam optimizer with bias correction.
Combines the advantages of AdaGrad and RMSProp. Maintains moving averages of both gradients and squared gradients.
- __init__(parameters: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Initialize Adam optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
betas – Coefficients for computing running averages (beta1, beta2)
eps – Small constant for numerical stability
weight_decay – L2 penalty coefficient
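The moving averages and bias correction mentioned above can be sketched for a single scalar parameter. This follows the standard Adam update rule; `adam_step` is a hypothetical helper, and the placement of eps outside the square root is an assumption about this implementation.

```python
import math
from typing import Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 0.001,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One Adam update on a scalar; t is the 1-based step count."""
    beta1, beta2 = betas
    # L2 penalty: fold weight_decay * param into the gradient.
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# First step from x = 1.0 with gradient 2.0: the step size is close to lr,
# regardless of the gradient's magnitude.
x, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```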
- class fit.optim.adam.AdamW(parameters: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01)
Bases: object
AdamW optimizer with decoupled weight decay.
Implements the weight decay fix from “Decoupled Weight Decay Regularization”.
- __init__(parameters: List[Tensor], lr: float = 0.001, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.01)
Initialize AdamW optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
betas – Coefficients for computing running averages (beta1, beta2)
eps – Small constant for numerical stability
weight_decay – Weight decay coefficient (decoupled)
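The difference between an L2 penalty and decoupled weight decay can be illustrated with one function and a flag. This is a sketch under the standard formulations from the Adam and AdamW papers; `adam_family_step` and its `decoupled` flag are hypothetical and not part of fit.optim.

```python
import math
from typing import Tuple


def adam_family_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 0.001,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.01,
    decoupled: bool = True,
) -> Tuple[float, float, float]:
    """One Adam (decoupled=False) or AdamW (decoupled=True) update on a scalar."""
    beta1, beta2 = betas
    if decoupled:
        # AdamW: shrink the weight directly; the gradient is left untouched.
        param = param - lr * weight_decay * param
    else:
        # Adam with L2 penalty: the decay term passes through the adaptive scaling.
        grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# With a zero loss gradient, only the decay term moves the weight.
w_adamw, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decoupled=True)
w_adam, _, _ = adam_family_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decoupled=False)
```

In the coupled case the small decay gradient is normalized by its own magnitude, so the weight moves by roughly lr; in the decoupled case it shrinks by lr * weight_decay * param only.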
- class fit.optim.adam.Adamax(parameters: List[Tensor], lr: float = 0.002, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Bases: object
Adamax optimizer (variant of Adam based on the infinity norm).
- __init__(parameters: List[Tensor], lr: float = 0.002, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Initialize Adamax optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
betas – Coefficients for computing running averages (beta1, beta2)
eps – Small constant for numerical stability
weight_decay – L2 penalty coefficient
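What "based on the infinity norm" means in practice: the second-moment average of Adam is replaced by a decayed running maximum of gradient magnitudes. The sketch below follows the Adamax rule from the Adam paper; adding eps inside the max (as PyTorch does) is an assumption about this implementation, and `adamax_step` is a hypothetical helper.

```python
from typing import Tuple


def adamax_step(
    param: float,
    grad: float,
    m: float,
    u: float,
    t: int,
    lr: float = 0.002,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One Adamax update on a scalar; t is the 1-based step count."""
    beta1, beta2 = betas
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    # First moment: same moving average as Adam.
    m = beta1 * m + (1.0 - beta1) * grad
    # Infinity norm: decayed running maximum of |grad| (eps keeps u positive).
    u = max(beta2 * u, abs(grad) + eps)
    # Only the first moment needs bias correction.
    step_size = lr / (1.0 - beta1 ** t)
    return param - step_size * m / u, m, u


# A large gradient followed by a small one: u keeps tracking the large one.
x, m, u = adamax_step(1.0, 2.0, m=0.0, u=0.0, t=1)
x2, m2, u2 = adamax_step(x, 0.01, m=m, u=u, t=2)
```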
- class fit.optim.adam.NAdam(parameters: List[Tensor], lr: float = 0.002, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Bases: object
NAdam optimizer (Nesterov-accelerated Adam).
- __init__(parameters: List[Tensor], lr: float = 0.002, betas: tuple = (0.9, 0.999), eps: float = 1e-08, weight_decay: float = 0.0)
Initialize NAdam optimizer.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate
betas – Coefficients for computing running averages (beta1, beta2)
eps – Small constant for numerical stability
weight_decay – L2 penalty coefficient
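The Nesterov acceleration can be sketched as a look-ahead applied to Adam's first moment. The code below uses the commonly cited simplified form of NAdam, without the momentum-decay schedule that some libraries add; whether fit.optim.adam.NAdam uses this form or a scheduled one is an assumption, and `nadam_step` is a hypothetical helper.

```python
import math
from typing import Tuple


def nadam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 0.002,
    betas: Tuple[float, float] = (0.9, 0.999),
    eps: float = 1e-8,
    weight_decay: float = 0.0,
) -> Tuple[float, float, float]:
    """One simplified NAdam update on a scalar; t is the 1-based step count."""
    beta1, beta2 = betas
    if weight_decay != 0.0:
        grad = grad + weight_decay * param
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Look-ahead: combine the next step's corrected momentum with the
    # bias-corrected current gradient.
    m_hat = (beta1 * m / (1.0 - beta1 ** (t + 1))
             + (1.0 - beta1) * grad / (1.0 - beta1 ** t))
    v_hat = v / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# First step from x = 1.0 with gradient 2.0.
x, m, v = nadam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

On the first step the look-ahead term makes the move larger than Adam's would be at the same learning rate.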
RMSprop
Advanced Optimizers
SAM (Sharpness-Aware Minimization)
Enhanced Sharpness-Aware Minimization (SAM) optimizer implementation.
SAM seeks parameters that lie in neighborhoods having uniformly low loss values, rather than parameters that only have a low loss value themselves, which tends to generalize better than standard optimizers.
This implementation includes:
- Adaptive sharpness
- Efficient computation
- Memory optimization
- Support for all base optimizers
Paper: https://arxiv.org/abs/2010.01412
- class fit.optim.experimental.sam.SAM(parameters: List[Tensor], base_optimizer: Any, rho: float = 0.05, epsilon: float = 1e-12, adaptive: bool = False, auto_clip: bool = True)
Bases: object
Enhanced implementation of Sharpness-Aware Minimization for improved generalization.
SAM finds parameters that lie in neighborhoods having uniformly low loss values, making models more robust to perturbations and improving generalization performance. This is particularly effective for:
- Training models that need to generalize well from limited data
- Improving robustness to adversarial examples
- Solving problems like XOR where sharp minima cause overfitting
- Parameters:
parameters – List of parameters to optimize
base_optimizer – Underlying optimizer (SGD, Adam, etc.) for parameter updates
rho – Size of the neighborhood to sample for sharpness (default: 0.05)
epsilon – Small constant for numerical stability (default: 1e-12)
adaptive – Whether to use adaptive SAM which adjusts influence per parameter (default: False)
auto_clip – Whether to automatically clip gradients during perturbation (default: True)
- __init__(parameters: List[Tensor], base_optimizer: Any, rho: float = 0.05, epsilon: float = 1e-12, adaptive: bool = False, auto_clip: bool = True)
- first_step(zero_grad: bool = False)
First step of SAM: Compute gradient at current position, perturb weights in the direction of steepest ascent, and save original weights.
- Parameters:
zero_grad – Whether to zero gradients after computing perturbation
- second_step(zero_grad: bool = False)
Second step of SAM: Restore original weights and apply the actual update.
- Parameters:
zero_grad – Whether to zero gradients after the update
- step(closure: Callable | None = None)
Single step that combines both SAM steps (for compatibility).
Note: This requires the closure to compute the loss function. For most use cases, use first_step() and second_step() separately.
- property lr
Get learning rate from base optimizer.
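The first_step / second_step sequence documented above can be sketched on a toy problem without any library code. This is a minimal illustration of the non-adaptive SAM update from the paper, with plain gradient descent standing in for the base optimizer; `sam_update` and `quadratic_grad` are hypothetical names, and gradient clipping (auto_clip) is not modeled.

```python
import math
from typing import Callable, List


def sam_update(
    params: List[float],
    grad_fn: Callable[[List[float]], List[float]],
    lr: float = 0.1,
    rho: float = 0.05,
    epsilon: float = 1e-12,
) -> List[float]:
    """One SAM update: perturb uphill, re-evaluate the gradient, step from the original point."""
    # First step: gradient at the current weights, then climb to the
    # approximate worst point within a ball of radius rho.
    grads = grad_fn(params)
    norm = math.sqrt(sum(g * g for g in grads))
    scale = rho / (norm + epsilon)
    perturbed = [p + scale * g for p, g in zip(params, grads)]
    # Second step: gradient at the perturbed weights, applied to the
    # ORIGINAL weights (here with plain gradient descent as the base update).
    sharp_grads = grad_fn(perturbed)
    return [p - lr * g for p, g in zip(params, sharp_grads)]


def quadratic_grad(w: List[float]) -> List[float]:
    """Gradient of the toy loss f(w) = sum(w_i ** 2)."""
    return [2.0 * wi for wi in w]


new_params = sam_update([1.0, 0.0], quadratic_grad)
```

Each update needs two gradient evaluations, which is why the API exposes two separate steps: the caller runs a forward and backward pass before each one.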
- class fit.optim.experimental.sam.AdaptiveSAM(parameters: List[Tensor], base_optimizer: Any, **kwargs)
Bases: SAM
Adaptive SAM that automatically adjusts the perturbation size.
This variant adjusts the perturbation based on the parameter magnitudes, often leading to better performance on diverse problems.
Lion
Lion optimizer implementation.
The Lion (Evolved Sign Momentum) optimizer uses sign-based updates. It keeps only a momentum buffer per parameter, so it requires less memory than optimizers like Adam that also track a second moment, while often achieving better performance.
Paper: https://arxiv.org/abs/2302.06675
- class fit.optim.experimental.lion.Lion(parameters, lr=0.0001, betas=(0.9, 0.99), weight_decay=0.0)
Bases: object
Lion optimizer (Evolved Sign Momentum).
Uses sign-based updates which require less memory than Adam while often achieving better performance. Because every update has unit magnitude per element, Lion typically requires a 3-10x smaller learning rate than Adam, which is consistent with the default of 1e-4 here.
- Parameters:
parameters – List of parameters to optimize
lr – Learning rate (default: 1e-4)
betas – Coefficients for computing running averages (default: (0.9, 0.99))
weight_decay – Weight decay coefficient (default: 0.0)
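The sign-based update can be sketched for a single scalar parameter. This follows the update rule from the Lion paper, with weight decay applied in decoupled form; whether the fit implementation applies weight decay the same way is an assumption, and `lion_step` is a hypothetical helper.

```python
from typing import Tuple


def _sign(x: float) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(
    param: float,
    grad: float,
    m: float,
    lr: float = 1e-4,
    betas: Tuple[float, float] = (0.9, 0.99),
    weight_decay: float = 0.0,
) -> Tuple[float, float]:
    """One Lion update on a scalar; returns (new_param, new_momentum)."""
    beta1, beta2 = betas
    # The update direction is the sign of an interpolation between the
    # momentum and the current gradient, so its magnitude is always 0 or 1.
    direction = _sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay is added to the direction before scaling by lr.
    param = param - lr * (direction + weight_decay * param)
    # The momentum buffer (the only optimizer state) uses the second coefficient.
    m = beta2 * m + (1.0 - beta2) * grad
    return param, m


# Gradients of very different magnitude produce the same step size.
x_big, m_big = lion_step(1.0, 2.0, m=0.0)
x_small, m_small = lion_step(1.0, 0.001, m=0.0)
```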