What is it about?

AI models learn through training algorithms. Stochastic Gradient Descent (SGD), one of the most popular choices, depends on two main hyperparameters: the number of data examples shown to the model at each learning step, called the "batch size", and the magnitude of the learning steps, called the "learning rate". We have shown that the choice of learning rate and batch size identifies three different regimes in which the SGD algorithm operates. In the first regime, corresponding to small batch sizes and large learning rates, the learning process takes small, random steps. In this case, the process is noisy and allows the AI model to explore solutions that it would not have found otherwise. In the second regime, corresponding to large learning rates and large batch sizes, the process takes large initial steps that strongly affect the final solution. In the third regime, corresponding to large batches and smaller learning rates, the learning process is more predictable and less prone to random exploration. Depending on the application, each of these regimes has different benefits and drawbacks in terms of training speed and final performance of the AI model.
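
To make the role of the two hyperparameters concrete, here is a minimal sketch of a single mini-batch SGD update, written in Python with NumPy. This is not the setup studied in the paper; the least-squares problem and all names (sgd_step, learning_rate, batch_size) are hypothetical, chosen only to show where batch size and learning rate enter the algorithm.

    import numpy as np

    def sgd_step(w, X, y, learning_rate, batch_size, rng):
        """One mini-batch SGD step on a toy least-squares loss (illustrative only)."""
        # Sample a mini-batch: small batches give a noisier gradient estimate,
        # large batches approach the full (deterministic) gradient.
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on the mini-batch.
        grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)
        # The learning rate sets the magnitude of each learning step.
        return w - learning_rate * grad

    # Toy usage: small batch + large learning rate -> noisy, exploratory updates;
    # large batch + small learning rate -> smoother, more predictable updates.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=1000)
    w = np.zeros(10)
    for _ in range(500):
        w = sgd_step(w, X, y, learning_rate=0.05, batch_size=32, rng=rng)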


Why is it important?

How the hyperparameters of training algorithms affect the AI learning process is not well understood, so their choice usually relies on expensive grid searches. Our work makes a significant step toward solving this problem by identifying the distinct regimes in which the SGD algorithm operates. This result is important because state-of-the-art AI models are usually trained with the SGD algorithm or its variations. Understanding it is therefore a fundamental step toward understanding the solutions found by deep networks and toward choosing the hyperparameters in a principled way.

Perspectives

I enjoyed working on this project because it gave me the opportunity to use my background as a physicist to tackle open questions in AI. In fact, I think that important results often come from the cross-pollination of different disciplines and mindsets.

Antonio Sclocchi
École Polytechnique Fédérale de Lausanne

Read the Original

This page is a summary of: On the different regimes of stochastic gradient descent, Proceedings of the National Academy of Sciences, February 2024.
DOI: 10.1073/pnas.2316301121.
