Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches

Rohit Goswami; Hannes Jónsson

doi:10.1002/cphc.202500730

What is it about?

Gaussian process (GP) regression can accelerate saddle point searches by building a surrogate energy surface on the fly, reducing the number of expensive quantum mechanical force evaluations. But as the GP collects data during a search, the cost of updating the model grows cubically with dataset size. On large problems, the surrogate model itself becomes the bottleneck. We solve this with farthest-point sampling guided by the Earth Mover's Distance, a metric that measures structural similarity between molecular configurations in a permutation-invariant way. At each step, only a compact subset of geometrically diverse configurations feeds the hyperparameter optimization, while all data remain available for predictions. The cost of model updates stays nearly constant regardless of how many data points have been collected. Three stability controls complement the pruning: a logarithmic barrier on the signal variance, an oscillation detector that expands the training subset when the optimizer becomes unstable, and a data-driven trust radius based on the same transport distance. On a benchmark of 238 molecular reactions, the method (OT-GP) halves the mean wall time relative to its predecessor, requires a median of fewer than 30 force evaluations, and raises the success rate from 86% to 96%.
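To make the split between fitting and predicting concrete, here is a minimal sketch on a one-dimensional toy problem. It is not the OT-GP code. The squared-exponential kernel, the grid search over length scales, the fixed noise term, and all function names are assumptions chosen for illustration. The Cholesky factorization is the cubically scaling step, and during the hyperparameter search it only ever sees the pruned subset.

```python
import numpy as np


def rbf_kernel(xa, xb, length_scale, signal_var):
    """Squared-exponential kernel on 1D inputs (illustrative choice)."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)


def log_marginal_likelihood(x, y, length_scale, signal_var=1.0, noise=1e-4):
    """GP log marginal likelihood; the Cholesky step costs O(n^3)."""
    k = rbf_kernel(x, x, length_scale, signal_var) + noise * np.eye(len(x))
    chol = np.linalg.cholesky(k)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    return (
        -0.5 * y @ alpha
        - np.log(np.diag(chol)).sum()
        - 0.5 * len(x) * np.log(2.0 * np.pi)
    )


def fit_on_subset(x, y, subset, length_scales):
    """Choose the length scale using only the pruned subset of the data."""
    xs, ys = x[subset], y[subset]
    scores = [log_marginal_likelihood(xs, ys, ls) for ls in length_scales]
    return length_scales[int(np.argmax(scores))]


def predict(x_train, y_train, x_new, length_scale, signal_var=1.0, noise=1e-4):
    """Posterior mean that conditions on ALL collected data."""
    k = rbf_kernel(x_train, x_train, length_scale, signal_var)
    k += noise * np.eye(len(x_train))
    k_star = rbf_kernel(x_new, x_train, length_scale, signal_var)
    return k_star @ np.linalg.solve(k, y_train)


# Toy data standing in for energies collected during a search.
x = np.linspace(0.0, 3.0, 15)
y = np.sin(2.0 * x)
subset = np.array([0, 3, 6, 9, 12, 14])  # hypothetical pruned subset
grid = [0.2, 0.5, 1.0, 2.0]

best_ls = fit_on_subset(x, y, subset, grid)
x_new = np.array([0.5, 1.5, 2.5])
pred = predict(x, y, x_new, best_ls)
```

In this sketch each hyperparameter evaluation factorizes a 6x6 matrix however many points have been collected, while the prediction still conditions on all 15.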

Why is it important?

Previous GP-accelerated saddle searches reduced force evaluation counts by an order of magnitude but often failed to deliver practical wall-time improvements because the hyperparameter optimization overhead grew with every new observation. The model also suffered from instabilities -- diverging variance, oscillating hyperparameters -- that caused silent failures on difficult systems. OT-GP addresses both problems through the same geometric framework. The optimal transport metric provides a physically motivated measure for data selection (pruning) and for determining when the model can be trusted (trust radius). The stability controls were each designed to address a specific failure mode diagnosed on real chemical systems. The combination of 2x wall-time reduction, 96% success rate, and fewer than 30 median force evaluations makes automated exploration of reaction networks practical at the DFT level. The method also provides training data for machine-learned interatomic potentials, since it generates exactly the high-energy transition state geometries that are underrepresented in typical training sets.
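The two roles of the transport distance, data selection and trust radius, can be sketched as follows. This is an illustration under stated assumptions, not the published implementation. The distance here matches atoms of the same species by optimal assignment and averages the matched displacements. The seed choice for the sampling and the fixed `radius` are also assumptions, and the data-driven radius of the paper is not reproduced. All function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def emd_distance(pos_a, pos_b, species):
    """Permutation-invariant distance between two configurations.

    Assumption: same-species atoms are matched by optimal assignment and
    the result is the mean matched displacement over all atoms.
    """
    pos_a = np.asarray(pos_a, dtype=float)
    pos_b = np.asarray(pos_b, dtype=float)
    species = np.asarray(species)
    total = 0.0
    for s in np.unique(species):
        idx = np.flatnonzero(species == s)
        a, b = pos_a[idx], pos_b[idx]
        # Pairwise Euclidean distances between atoms of this species.
        cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
        rows, cols = linear_sum_assignment(cost)
        total += cost[rows, cols].sum()
    return total / len(species)


def farthest_point_subset(configs, species, n_keep):
    """Greedy farthest-point sampling under emd_distance."""
    n = len(configs)
    if n <= n_keep:
        return list(range(n))
    selected = [n - 1]  # assumption: seed with the most recent configuration
    min_dist = np.array([emd_distance(configs[-1], c, species) for c in configs])
    while len(selected) < n_keep:
        # Pick the configuration farthest from everything selected so far.
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        d_new = np.array([emd_distance(configs[nxt], c, species) for c in configs])
        min_dist = np.minimum(min_dist, d_new)
    return selected


def within_trust_radius(candidate, configs, species, radius):
    """True if the candidate lies within `radius` of some stored configuration."""
    nearest = min(emd_distance(candidate, c, species) for c in configs)
    return bool(nearest <= radius)


# Relabeling identical atoms leaves the distance at zero.
mol = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
swapped = mol[::-1].copy()
d_swap = emd_distance(mol, swapped, ["H", "H"])

# Three nearly identical configurations and two distant ones.
configs = [np.array([[x, 0.0, 0.0]]) for x in (0.0, 0.01, 0.02, 5.0, 10.0)]
kept = farthest_point_subset(configs, ["H"], 3)
```

In the toy data the sampling keeps one representative of the clustered configurations together with the two distant ones, which is the kind of geometric diversity the pruning aims for.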

Perspectives

This work builds on the GP-dimer implementation from my earlier paper. That code reduced force evaluation counts, but the GP model updates kept getting slower as the search progressed, sometimes negating the savings entirely. Instabilities caused roughly 14% of searches to fail without clear diagnostics. This was personally a very big deal for me: it was highlighted as a cover feature, which my wife helped design, and it was in honor of my father's 60th birthday and the associated conference. All told, it is one of my personal favorites. The idea of pruning the training set for hyperparameter optimization -- keeping all data for predictions but fitting on a representative subset -- came from recognizing that hyperparameters are mathematical fitting tools, not physical constants. They do not need the full dataset to converge. The Earth Mover's Distance turned out to be the right metric for selecting that subset. It respects atomic permutations (the same molecule with atoms relabeled looks identical) and provides a continuous measure of structural similarity. The same distance then serves as a trust radius, so the model does not extrapolate into regions where it has no data. The code is integrated in eOn (https://eondocs.org) and will form the basis for future work on automated reaction mechanism discovery.
Rohit Goswami
University of Iceland

This page is a summary of: Adaptive Pruning for Increased Robustness and Reduced Computational Overhead in Gaussian Process Accelerated Saddle Point Searches, ChemPhysChem, February 2026, Wiley, DOI: 10.1002/cphc.202500730.

Contributors

The following have contributed to this page

Rohit Goswami
University of Iceland

Geometry-aware data pruning halves wall time for GP-accelerated saddle searches
