Regularization by gradient descent and getting rid of pesky learning rates

News Dec 06, 2017

Regularization of weights in a neural network, or in any nonlinear model, is a hot topic in AI because it helps train models that generalize better for prediction, classification, or anything else AI is capable of doing these days. Whether it is predicting sales in financial evaluations, tagging potentially habitable planets in space, or measuring incoming hazardous impacts in self-driving cars, weight regularization is incredibly important for ensuring that these functions perform within some logical framework. But despite these wonderful attributes, regularization comes at the cost of compromising accuracy on the training set if its learning constant (or rate), typically denoted as lambda (or λ), is set too high. AI scientists generally advise being extremely cautious about choosing this learning rate and typically advise setting it on the order of 1e-4. That’s how cautious one has to be about this learning rate.
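To make that trade-off concrete, here is a minimal sketch, assuming a one-weight least-squares model with a fixed L2 penalty of the form λ·w². The data and the function names `ridge_fit` and `training_error` are made up for illustration and are not taken from the paper. It prints the training error for a few values of λ so the effect of raising it can be seen directly.

```python
# Illustrative only: a one-parameter least-squares fit with a fixed L2 penalty.
# The data and all names below are assumptions made up for this sketch.

def ridge_fit(xs, ys, lam):
    """Minimise sum((w*x - y)^2) + lam * w^2 over the single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def training_error(xs, ys, w):
    """Sum of squared residuals on the training data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x, so the unpenalised fit is w = 2

for lam in (0.0, 1e-4, 1.0, 100.0):
    w = ridge_fit(xs, ys, lam)
    err = training_error(xs, ys, w)
    print(f"lambda={lam:g}  w={w:.4f}  training error={err:.4f}")
```

With λ around 1e-4 the fitted weight barely moves, while a large λ pulls it toward zero and the training error rises with it.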

However, a recent research study published in the Neural, Parallel and Scientific Computations Journal (Ravishankar, 2015) revealed that this contraption of a learning rate can be controlled entirely by the computer, without much human input and without compromising training accuracy at all. It turns out that if the Majestic King Lambda of Weight Regularization is applied independently to each weight in the model, he is nothing but a mere descendant of the poor Mr. Gradient Descent.

Mathematically, this lambda must be computed as a vector orthogonal to the gradient-descent error-gradient weight update vector, which effectively helps achieve the minimum weight norm for regularization. This vector acts alongside the original error-gradient, and therefore accuracy on the training set is not compromised at all. Other tricks (or sorcery) required to ensure that this approach keeps leading to the minimum weight norm include keeping the inner product of the lambda vector with its operating weight vector below zero at all times and using adaptive step-sizes to maintain convergence.
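The two geometric conditions above can be sketched in a few lines. This is an assumption-laden illustration, not the published algorithm: the projection used in the hypothetical helper `regularization_direction` is just one simple way to build a vector that is orthogonal to the error gradient and has a non-positive inner product with the weights. The fixed step sizes `eta` and `mu` are placeholders for the adaptive ones the article describes.

```python
# A geometric sketch only, NOT the published algorithm: the projection below is
# one simple way to build a vector with the two properties described above.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def regularization_direction(w, g):
    """Return r = -(w - proj_g(w)): the negated part of w orthogonal to g.

    By construction dot(r, g) == 0 (up to rounding), and
    dot(r, w) = -(|w|^2 - (w.g)^2 / |g|^2) <= 0 by the Cauchy-Schwarz inequality.
    """
    gg = dot(g, g)
    if gg == 0.0:
        # No error gradient: fall back to plain shrinkage toward the origin.
        return [-wi for wi in w]
    scale = dot(w, g) / gg
    return [-(wi - scale * gi) for wi, gi in zip(w, g)]

w = [3.0, 1.0]        # hypothetical weight vector
g = [1.0, 1.0]        # hypothetical error gradient at w
r = regularization_direction(w, g)

eta, mu = 0.1, 0.1    # hypothetical fixed step sizes (adaptive in the article)
w_next = [wi - eta * gi + mu * ri for wi, gi, ri in zip(w, g, r)]

print("r =", r, " r.g =", dot(r, g), " r.w =", dot(r, w))
print("w_next =", w_next)
```

Because `r` is orthogonal to `g`, the first-order change in training error, `dot(g, w_next - w)`, depends only on the `eta` term. The `mu * r` step on its own shrinks the weight norm for a small enough `mu`, since `dot(r, w)` is negative.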

The biggest problem in AI or Machine Learning research is dealing with learning rates or adaptive step-sizes, no matter how many heuristics we devise for them (LeCun et al., 2013). Even Dr. Yann LeCun, one of the most distinguished AI scientists, found this extremely annoying and devised a way to deal with it for good (LeCun et al., 2013). While his approach is more statistical in nature, mine relies on the methods of Lyapunov Stability. This means that in LeCun’s method, the optimum learning rate is computed by first approximating the mean and variance of the weight update dynamics from the gradient information available during training, while in my method it is computed from the same gradient information but without the mean-variance approximations. Nonetheless, both ensure convergence and have similar forms. Lyapunov Stability, however, made it much more comfortable for me to explain convergence in terms of z-domain (or discrete-domain) poles. The adaptive step-sizes were formulated for both the error-gradient and the lambda vector, which resulted in an automated way of dealing with them. The figures below show how the overall algorithm performs over linear and nonlinear spaces.
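As a side illustration of the z-domain view, here is a textbook sketch rather than the paper's derivation. It assumes a one-dimensional quadratic error E(w) = 0.5·h·w² with a hypothetical curvature `h`. Gradient descent on it is the linear recurrence w[k+1] = (1 - eta·h)·w[k], so its single pole sits at 1 - eta·h and the iteration converges only when that pole lies inside the unit circle, i.e. 0 < eta < 2/h. The function names are made up for this example.

```python
# Textbook illustration of a step-size stability bound, not the paper's derivation.
# For E(w) = 0.5 * h * w**2 the gradient is h * w, so gradient descent is the
# linear recurrence w[k+1] = (1 - eta * h) * w[k], with a single pole at 1 - eta * h.

def pole(eta, h):
    """Location of the discrete-time pole of the gradient-descent recurrence."""
    return 1.0 - eta * h

def run_gradient_descent(w0, eta, h, steps):
    """Iterate w <- w - eta * dE/dw for the quadratic error and return the final w."""
    w = w0
    for _ in range(steps):
        w = w - eta * (h * w)
    return w

h = 4.0  # hypothetical curvature; the stable range is 0 < eta < 2 / h = 0.5
for eta in (0.1, 0.45, 0.6):
    p = pole(eta, h)
    w_final = run_gradient_descent(1.0, eta, h, 50)
    verdict = "converges" if abs(p) < 1.0 else "diverges"
    print(f"eta={eta}  pole={p:+.2f}  |w_50|={abs(w_final):.3e}  {verdict}")
```

An adaptive step-size rule can be read as a way of keeping that pole inside the unit circle at every iteration, instead of relying on a hand-picked constant.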

 

The future of this weight regularization algorithm holds a lot of potential, because it obtains the true minimum weight norms (or parameter norms) for whatever problem it is trying to solve. This is something that areas of research such as optimal control are striving to achieve. However, in the context of AI or Machine Learning, for which this algorithm was primarily built, it would enable the creation of models that provide better-informed decisions, i.e. with the problem of overfitting minimized or completely eradicated. Compared with the industry standard of setting empirical values or using other techniques such as Dropout regularization (as is done in Computer Vision and Natural Language Processing applications today), this would effectively save engineers and scientists the hassle of dealing with bias-variance problems and yield well-generalized, production-ready models in one shot for real-world deployment.

 

References

  1. U. Ravishankar, “A Lyapunov based Adaptive and Stable Neural Network Weight Regularization Algorithm,” Neural, Parallel and Scientific Computations (NPSC), vol. 23, pp. 343–356, 2015.
  2. Y. LeCun, T. Schaul, and S. Zhang, “No More Pesky Learning Rates,” International Conference on Machine Learning, pp. 343–351, 2013.

 

The Author: Udhaya Ravishankar has recently joined the Brainpool as a freelance data scientist and gave a talk at the Pitch your Research event. Previously, he was a researcher at the Idaho National Laboratory, where he gained most of his knowledge and experience in AI, Machine Learning and Optimal Control. Udhaya holds an MS degree in Electrical Engineering from the University of Southern California in the area of digital and analog micro-chip design, but also has a keen interest in neural network techniques for multimedia systems. It was through this interest that he found an opportunity to work with Dr. Alice Parker at USC on her biomimetic real-time cortex project and eventually moved into the field of AI. His current interests include developing algorithms to solve contemporary problems in neural network model training.