What is it about?

Symbolic Regression (SR) is a branch of machine learning which attempts to learn analytic expressions (i.e. equations) that fit data accurately and in a highly interpretable manner. Traditionally, this involves optimising two objectives simultaneously, with the aim of finding symbolic expressions which cannot be made more accurate without becoming more complex. While the loss is typically straightforward to quantify as the negative logarithm of the likelihood, complexity is an inherently ambiguous notion. Rather than using two separate objectives, one can combine accuracy and simplicity into a single goodness-of-fit statistic, either using Bayesian methods or applying the minimum description length (MDL) principle, which considers the length of the code needed to transmit the data with the help of a given function.

In this paper we compare the Bayesian and MDL methods and propose two upgrades to the Bayesian approach. First, we show that the ranking of functions is essentially arbitrary unless one is careful about how prior knowledge on the allowed values of a function's parameters is incorporated, and we propose a method to circumvent this problem. Second, we introduce a principled weighting of functions based on an n-gram language model, trained on equations previously seen in the context the user is considering. This is sensitive both to the arrangement of operators relative to one another and to the frequency with which each operator occurs. Our approach is designed to quantify a scientist’s prior belief that a function such as sin(x) + sin(y) is much more likely to appear than sin(sin(x+y)), despite the two expressions containing the same variables and operators. We demonstrate that our methods perform well relative to literature standards on benchmarks, and we apply our techniques to a real-world dataset from the field of cosmology.
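To make the language-model idea concrete, the sketch below implements a bigram (n = 2) model over operator tokens with add-one smoothing, trained on a toy corpus of pre-order token sequences. This is a minimal illustration under assumed conventions: the function name `train_bigram`, the tokenisation and the toy corpus are all hypothetical, not the paper's implementation.

```python
import math
from collections import Counter

def train_bigram(corpus, alpha=1.0):
    """Count operator bigrams in a corpus of tokenised equations,
    with add-alpha smoothing so unseen pairs get non-zero probability."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for tokens in corpus:
        padded = ["<s>"] + tokens          # start-of-sequence marker
        vocab.update(padded)
        unigrams.update(padded[:-1])       # contexts
        bigrams.update(zip(padded[:-1], padded[1:]))
    V = len(vocab)

    def log_prob(tokens):
        """Log-probability of an operator sequence under the model."""
        padded = ["<s>"] + tokens
        lp = 0.0
        for prev, cur in zip(padded[:-1], padded[1:]):
            lp += math.log((bigrams[(prev, cur)] + alpha)
                           / (unigrams[prev] + alpha * V))
        return lp

    return log_prob

# Tiny illustrative corpus of "previously seen" equations,
# written as pre-order traversals of their expression trees.
corpus = [
    ["+", "sin", "x", "sin", "y"],   # sin(x) + sin(y)
    ["+", "x", "sin", "y"],          # x + sin(y)
    ["sin", "+", "x", "y"],          # sin(x + y)
]
log_prob = train_bigram(corpus)

# The prior prefers sin(x) + sin(y) over sin(sin(x + y)), even though
# both contain the same variables and operators.
print(log_prob(["+", "sin", "x", "sin", "y"]))   # sin(x) + sin(y)
print(log_prob(["sin", "sin", "+", "x", "y"]))   # sin(sin(x + y))
```

Because the log-prior sums over adjacent operator pairs, arrangements frequently seen in the training corpus (such as a sum of unary functions) score higher than unseen nestings such as sin(sin(x+y)), which is precisely the preference described above.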


Why is it important?

As scientists, we claim to understand physical phenomena if we can write down equations which accurately describe them. One of the goals of SR is therefore to uncover “physical laws” from data, since this approach formalises the search over candidate equations which can accurately describe the data. To do this in a principled manner, one should not resort to the somewhat ad hoc methods previously constructed to pick the best equations, but instead seek a well-motivated selection criterion. This work develops more sophisticated model selection criteria based on Bayesian statistics and information theory, leading to new approaches for selecting equations that optimally balance accuracy with simplicity.

If we truly are to automate the scientific process of converting data to equations, we want our algorithms to suggest candidate expressions which a human could plausibly have written down. We apply our language model to a real-world SR problem drawn from the field of cosmology to assess its ability to do this, and find that it achieves this aim with remarkable success. We reanalyse a sample of Type Ia supernovae to attempt to learn the equation describing the expansion rate of the Universe as a function of time. Without a language model, equations with nested powers are preferred, but these are physically unreasonable. After training a language model on a compilation of scientific equations, such functions become disfavoured, and we rediscover the exact result from General Relativity as our fourth-best candidate function. Our methods have therefore demonstrated their ability to rediscover physical laws from real-world, noisy datasets. Applying them to novel datasets thus offers the tantalising possibility of achieving automated scientific discovery, as opposed to mere rediscovery.
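For reference, the result from General Relativity mentioned above is the Friedmann equation; assuming the standard flat ΛCDM form (the paper's parameterisation of the constants may differ), it reads:

```latex
% Friedmann equation for a flat Lambda-CDM universe: the expansion
% rate H as a function of redshift z. H_0 is the Hubble constant and
% \Omega_m the matter density parameter.
H^2(z) = H_0^2 \left[ \Omega_m (1 + z)^3 + (1 - \Omega_m) \right]
```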

Perspectives

I believe that SR can aid scientific discovery and assist in our interpretation of (or even be used instead of) more traditional ML methods, given the simplicity of the equations produced and their clear extrapolation behaviour when applied to data outside the range of the training set. Anyone who has experimented with SR algorithms will have found that the resulting equations can become long and “nasty looking” very quickly, which prevents this aim from being achieved. I hope this work can help produce more reasonable-looking equations in a way that optimally balances accuracy, simplicity and interpretability.

Dr Deaglan Bartlett
Institut d'Astrophysique de Paris

Read the Original

This page is a summary of: Priors for symbolic regression, Proceedings of the Companion Conference on Genetic and Evolutionary Computation (GECCO '23 Companion), July 2023, ACM (Association for Computing Machinery).
DOI: 10.1145/3583133.3596327.
