What is it about?

This paper studies interleaving, an online evaluation method for search ranking algorithms that has been shown to be orders of magnitude more sensitive than traditional A/B tests. The simplest interleaving method can produce biased, incorrect results, while existing methods that fix the bias are either less sensitive or impractical to implement in large-scale systems such as Amazon Search. We introduce a novel interleaving method that is unbiased, sensitive, and simple to implement. Based on 10 large-scale e-commerce experiments spanning billions of search queries, we report a 60x sensitivity gain of our new method over A/B testing. We analyze the theoretical and empirical properties of our method and compare it with alternative interleaving techniques in the context of large-scale experiments.
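For readers unfamiliar with how interleaving experiments work mechanically, here is a minimal sketch of classic team-draft interleaving, one of the standard methods from the interleaving literature. This is background only, not the paper's debiased balanced method; the function and variable names are illustrative. Two rankers' result lists are merged into one list shown to the user, each result is credited to the ranker that contributed it, and clicks on credited results decide the winner:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=10):
    """Merge two rankings into one list of up to k results (team-draft style).

    Returns the interleaved list and a credit map: doc -> "A" or "B",
    recording which ranker contributed each shown document.
    """
    interleaved, credit, seen = [], {}, set()
    count = {"A": 0, "B": 0}            # picks made by each ranker so far
    pos = {"A": 0, "B": 0}              # cursor into each ranking
    rankings = {"A": ranking_a, "B": ranking_b}

    while len(interleaved) < k and (pos["A"] < len(ranking_a) or pos["B"] < len(ranking_b)):
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if count["A"] != count["B"]:
            team = "A" if count["A"] < count["B"] else "B"
        else:
            team = random.choice(["A", "B"])
        # Skip documents already placed; if this ranker is exhausted, try the other.
        for t in (team, "B" if team == "A" else "A"):
            r = rankings[t]
            while pos[t] < len(r) and r[pos[t]] in seen:
                pos[t] += 1
            if pos[t] < len(r):
                doc = r[pos[t]]
                interleaved.append(doc)
                credit[doc] = t
                seen.add(doc)
                count[t] += 1
                break
        else:
            break  # both rankings exhausted
    return interleaved, credit

def score_clicks(credit, clicked_docs):
    """Count clicks per ranker; the ranker with more credited clicks wins the query."""
    wins = {"A": 0, "B": 0}
    for doc in clicked_docs:
        if doc in credit:
            wins[credit[doc]] += 1
    return wins
```

Because every user sees a single merged list instead of being split into two traffic buckets, each query yields a head-to-head comparison, which is the intuition behind interleaving's sensitivity advantage over A/B testing.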


Why is it important?

A/B testing has become a bottleneck for search ranking innovation because online experiments typically take several weeks and consume large fractions of search traffic to reach statistically significant conclusions. Our novel interleaving method achieves a 60x sensitivity gain over A/B testing, so we can evaluate the same ranking innovations in far less time and with far less search traffic. That also means less user exposure to potentially suboptimal rankings. Additionally, our method is practical to implement for large-scale experiments, making it a game changer for speeding up search innovation in the e-commerce setting.

Read the Original

This page is a summary of: Debiased Balanced Interleaving at Amazon Search, October 2022, ACM (Association for Computing Machinery). DOI: 10.1145/3511808.3557123.
