Action Items:
- Generate a deep dive with NotebookLM for the different papers.
- Reproduce the paper as an exploratory Shiny app.
- Extend this for MCMC algorithms.
- Code a Kalman filter optimizer
- Code a Bayesian smoother optimizer
- Consider these in the NPB framework: an optimizer that keeps track of first-, second-, and third-order terms for the top Hessian dimensions. Rather than knowing the full Hessian, we track only a few dimensions and use them for a local quadratic approximation, giving the fastest learning locally but switching to ergodic exploration when we get stuck in a local minimum (see the sketch after this list). Another idea is trust regions: oscillations teach us about regions where SGD gets stuck but will get unstuck once it can wiggle up far enough.
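A minimal sketch of the first idea, using plain NumPy and a toy quadratic loss: estimate the top Hessian direction with power iteration over finite-difference Hessian-vector products, take a Newton-like (quadratic) step along it, and take ordinary gradient steps elsewhere. The function names, hyperparameters, and toy loss here are all my own illustration, not taken from any of the papers.

```python
import numpy as np

def hvp(grad_fn, w, v, eps=1e-5):
    """Finite-difference Hessian-vector product: H(w) @ v ~ (g(w + eps*v) - g(w - eps*v)) / (2*eps)."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

def top_hessian_direction(grad_fn, w, iters=20):
    """Power iteration on the Hessian to estimate its top eigenvector and eigenvalue (the sharpness)."""
    v = np.random.default_rng(0).normal(size=w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        v = hv / (np.linalg.norm(hv) + 1e-12)
    sharpness = v @ hvp(grad_fn, w, v)
    return v, sharpness

def gd_with_quadratic_top_dim(grad_fn, w, lr=0.1, steps=100):
    """Gradient descent (full-batch here for simplicity) everywhere except along the top
    Hessian direction, where we take a Newton-like step g_parallel / sharpness
    (a local quadratic approximation along that one dimension)."""
    for _ in range(steps):
        g = grad_fn(w)
        v, s = top_hessian_direction(grad_fn, w)
        g_par = (g @ v) * v               # gradient component along the sharp direction
        g_perp = g - g_par                # everything else
        w = w - g_par / max(s, 1e-8) - lr * g_perp
    return w

# Toy ill-conditioned quadratic: steep in one dimension, shallow in the other.
H = np.diag([50.0, 1.0])
grad = lambda w: H @ w
print(gd_with_quadratic_top_dim(grad, np.array([1.0, 1.0])))  # both coordinates end up near zero
```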
Ever since I saw Hinton's description of Stochastic Gradient Descent (SGD), RMSProp, and Momentum, and of problems like units getting saturated and "falling off the manifold", in his Neural Networks for Machine Learning course, I have been fascinated by optimization algorithms. I became obsessed with diving deeper into the math behind these algorithms and with how to make them better.
Similar and related issues keep coming up when training RL agents and when working with advanced probabilistic models in the Bayesian paradigm, which have complex loss surfaces and need different mechanisms to optimize their parameters. Some recurring issues:
- Getting stuck in local minima - stochasticity can help the algorithm hop out of a local minimum, but it isn't principled: it might just as well hop out of the global minimum if the path is steep enough.
- Slow convergence - when the loss surface's cross section is circular, learning is fast; but when it is elliptical, with some dimensions having high curvature, SGD overshoots whenever it steps along those dimensions and oscillates back and forth, missing the direction that actually leads toward the minimum.
- Sensitivity to hyperparameters like learning rate, batch size, momentum. Ideally we should be able to adapt these on the fly so that they don’t matter as much.
- Exploding or vanishing gradients - when gradients are too large they cause numerical instability, and when they are too small they slow learning down. This led to:
    - the proliferation of normalization techniques like batch norm, layer norm, and weight norm;
    - a move away from saturating activation functions like sigmoid and tanh toward ReLU-based activations such as Leaky ReLU and GELU;
    - the use of architectures like ResNets, whose skip connections help gradients flow better;
    - the use of LSTMs and GRUs in RNNs to help with long-term dependencies and guard against vanishing gradients.
- Catastrophic forgetting - a phenomenon where a model forgets previously learned information upon learning new information. This is especially problematic in continual learning scenarios, where the model needs to learn from a stream of data over time. One solution is to train for multiple epochs on the same data, but this is slow and inefficient.
- Loss of plasticity - the model's inability to adapt to new information or changes in the environment. This may happen due to massive overfitting and might be mitigated by regularization techniques like dropout, weight decay, data augmentation, and mixup. Another idea is early stopping, which lets one roll back to an earlier point in training when the model was more plastic. Finally, ensembling via bagging or boosting can give many smaller, less overfit models that are experts in different parts of the input space, combined through some form of gating mechanism that picks the best model for each input, which helps preserve plasticity.
Many mechanisms try to address these issues: batch normalization, layer normalization, adaptive learning rates, learning-rate schedules, per-layer learning rates, momentum, learning-rate warmup, gradient clipping, and weight decay. Regularization techniques like dropout, data augmentation, and mixup also help with optimization. Second-order methods like K-FAC and natural gradients collect more information about the curvature of the loss surface and use it to estimate both the best direction to move in parameter space and the step size.
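As a concrete example of the adaptive-learning-rate idea, here is a minimal RMSProp-style update in NumPy; the hyperparameter values and toy loss are illustrative choices of mine, not prescriptions:

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """One RMSProp-style update: scale each coordinate's step by a running RMS of its
    gradients, so steep (high-gradient) dimensions get smaller effective learning rates."""
    cache = decay * cache + (1 - decay) * grad**2   # running average of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)      # per-coordinate adaptive step
    return w, cache

# Toy usage on an ill-conditioned quadratic loss 0.5 * w^T diag(50, 1) w.
H = np.diag([50.0, 1.0])
w, cache = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    w, cache = rmsprop_step(w, H @ w, cache, lr=0.01)
print(w)  # both coordinates end up near zero at a comparable rate despite very different curvature
```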
However, besides the goal of learning faster, the very mechanism that slows learning down also helps SGD step out of local minima and hop to other regions of the loss surface that might contain better ones.
One might imagine one mechanism that quickly exploits a local region (perhaps using a Kalman filter and second-order information) and a second mechanism, like Bayesian search, that explores other regions of the loss landscape ergodically.
In fact, taking a more Bayesian perspective, we might get even more benefit from this approach by building an ensemble from a mixture of local minima. This would leverage the fact that we visit a number of local minima, and better yet, that we can pick minima that provide diverse predictions. We might do even better with a mixture-of-experts approach, where a gating network picks the best expert for each input. A sketch of this idea follows.
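A minimal sketch of the mixture-of-minima idea, assuming we already have models trained to several different local minima; the hand-coded softmax gating here is a placeholder for whatever gating network one would actually learn:

```python
import numpy as np

def mixture_of_minima_predict(x, experts, gate_scores):
    """Combine predictions from models trained to different local minima.
    `experts` is a list of prediction functions; `gate_scores` maps an input
    to unnormalized gating scores, one per expert."""
    scores = gate_scores(x)
    gate = np.exp(scores - scores.max())
    gate /= gate.sum()                            # softmax over experts
    preds = np.array([f(x) for f in experts])     # one prediction per local minimum
    return gate @ preds                           # gated average

# Toy usage: two 'experts' standing in for models found at different minima (hypothetical).
experts = [lambda x: 1.0 * x, lambda x: -0.5 * x + 2.0]
gate_scores = lambda x: np.array([x, -x])         # prefer expert 0 for large x, expert 1 for small x
print(mixture_of_minima_predict(3.0, experts, gate_scores))
```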
Insights
Deeper view
The most important aspect of this work is the methodology used to analyze the behavior of SGD and its variants, together with the visualizations that let us see the dynamics of SGD in action.
When the sharpness $S$ (the curvature along the top Hessian direction) exceeds $2/\mu$, where $\mu$ is the learning rate, the optimizer diverges along that direction until it reaches a region in which the sharpness falls back below $2/\mu$.
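To make this concrete, here is a tiny NumPy demonstration (my own illustration, not from the paper) of gradient descent on a one-dimensional quadratic with curvature $S$: the iterates contract when $S < 2/\mu$ and blow up once $S > 2/\mu$.

```python
import numpy as np

def gd_on_quadratic(curvature, lr, w0=1.0, steps=30):
    """Gradient descent on loss(w) = 0.5 * curvature * w**2.
    Each step multiplies w by (1 - lr * curvature), so |w| shrinks iff curvature < 2 / lr."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w
    return w

mu = 0.1                                          # learning rate
print(gd_on_quadratic(curvature=15.0, lr=mu))     # 15 < 2/0.1 = 20: converges toward 0
print(gd_on_quadratic(curvature=25.0, lr=mu))     # 25 > 20: diverges, |w| grows every step
```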
If we track the "sharpness", i.e. the top Hessian eigenvalue and its direction, over time, we should be able to combine it with Bayesian filtering and smoothing to find a much better trajectory.
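A minimal sketch of the filtering half of that idea, assuming we already have noisy per-minibatch sharpness estimates (e.g. from power iteration on Hessian-vector products); the noise levels and the synthetic signal are made up for illustration:

```python
import numpy as np

def kalman_filter_sharpness(noisy_sharpness, process_var=0.5, obs_var=4.0):
    """Scalar Kalman filter: treat the true sharpness as a slowly drifting latent state
    and the per-minibatch eigenvalue estimates as noisy observations of it."""
    est, var = noisy_sharpness[0], obs_var
    filtered = []
    for z in noisy_sharpness:
        var += process_var                    # predict: the sharpness may have drifted
        gain = var / (var + obs_var)          # update: weigh observation against prediction
        est += gain * (z - est)
        var *= (1 - gain)
        filtered.append(est)
    return np.array(filtered)

# Toy usage: a sharpness signal that ramps up toward 2/mu and then plateaus, observed with noise.
rng = np.random.default_rng(0)
true_s = np.concatenate([np.linspace(5, 20, 80), np.full(40, 20.0)])
noisy_s = true_s + rng.normal(scale=2.0, size=true_s.size)
print(kalman_filter_sharpness(noisy_s)[-5:])  # hovers near 20, i.e. 2/mu for mu = 0.1
```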
The edge of stability is reached as the loss decreases, bringing us closer to the bottom of the loss canyon. Without adjusting our step size, we will eventually enter a region where the canyon is too narrow and start bouncing upwards. The loss increases and the landscape becomes wider (we might be in a new canyon), and so we keep dropping down until we hit some new narrow region. The issue, though, seems to be restricted to one (or a few) dimensions: in SGD we only look at a minibatch, so each step only moves us along a few directions.
So one thinks that if we could take a step toward the middle of the canyon, rather than hopping up and down at the edge of stability, we would reach the local minimum faster.
In fact, second-order methods approximate the local curvature and can use it to find the correct direction and step size, as in the sketch below.
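A small illustration (my own, on a toy quadratic) of why a curvature-corrected step heads straight for the canyon floor while plain gradient descent bounces across it:

```python
import numpy as np

# Ill-conditioned quadratic "canyon": steep across the canyon, shallow along it.
H = np.diag([50.0, 1.0])
loss_grad = lambda w: H @ w

def run(step_fn, w0=np.array([1.0, 1.0]), steps=20):
    w = w0.copy()
    for _ in range(steps):
        w = step_fn(w)
    return w

gd_step = lambda w: w - 0.039 * loss_grad(w)                  # lr just below 2/50: oscillates across the steep axis
newton_step = lambda w: w - np.linalg.solve(H, loss_grad(w))  # curvature-corrected step: jumps to the minimum

print(run(gd_step))      # steep coordinate still ringing near 0.36, shallow coordinate only about halved
print(run(newton_step))  # both coordinates at zero after the first step
```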
AFAIK, batch norm effectively lets us take different step sizes in different dimensions.
It seems there is nothing very special about the top unstable Hessian dimensions: as we continue gradient descent we hug the landscape more closely, and more and more dimensions will reach $S > 2/\mu$, leading the optimizer to diverge along them as well.
Rich Sutton has an old paper with a step-size adaptation algorithm that he calls Incremental Delta-Bar-Delta (IDBD), which learns a per-weight learning rate online.
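For reference, here is a minimal NumPy sketch of IDBD for linear regression, based on my reading of Sutton's 1992 description (per-weight log step sizes adapted with a meta step size); any deviation from the paper's exact formulation is mine, and the toy data and constants are invented:

```python
import numpy as np

def idbd(X, y, theta=0.01, beta0=np.log(0.05)):
    """Incremental Delta-Bar-Delta for linear regression: each weight keeps its own
    log step size beta[i], adapted online with meta step size theta."""
    n = X.shape[1]
    w = np.zeros(n)
    beta = np.full(n, beta0)   # log of per-weight learning rates
    h = np.zeros(n)            # decaying memory of recent weight updates
    for x, target in zip(X, y):
        delta = target - w @ x                     # prediction error
        beta += theta * delta * x * h              # meta update of the log step sizes
        alpha = np.exp(beta)                       # per-weight learning rates
        w += alpha * delta * x                     # delta-rule update with per-weight rates
        h = h * np.clip(1 - alpha * x * x, 0, None) + alpha * delta * x
    return w, np.exp(beta)

# Toy usage: only the first feature is relevant; its step size typically ends up the largest.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=2000)
w, alphas = idbd(X, y)
print(np.round(w, 2), np.round(alphas, 3))  # w ~ [3, 0, 0, 0, 0]
```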
Resources:
Without going into too much detail, here are a few resources I found useful for understanding SGD and its variants better:
(J. M. Cohen et al. 2025) Understanding Optimization in Deep Learning with Central Flows
(J. Cohen et al. 2021) Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability
(Jastrzebski et al. 2020) The Break-Even Point on Optimization Trajectories of Deep Neural Networks
(Andreyev and Beneventano 2025) Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
(Damian, Nichani, and Lee 2023) Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
Citation
@online{bochman2025,
author = {Bochman, Oren},
title = {Stochastic Gradient {Descent} -\/- a {Deep} {Dive}},
date = {2025-10-09},
url = {https://orenbochman.github.io/posts/2025/2025-10-09-SGD/},
langid = {en}
}