What is it about?
Large Language Models (LLMs), such as those behind tools like ChatGPT, are trained on huge amounts of data from many sources. However, some of that data can be secretly “poisoned” by attackers who plant hidden patterns—called backdoors—that make the model behave incorrectly when those patterns appear. Our research explores a way to “clean” these models after they have already been trained, without needing to know what the hidden trigger is. We do this by gradually removing certain parts of the model’s attention mechanism—called attention heads—that contribute least to normal model accuracy. We test six different pruning strategies, including methods guided by gradients, randomness, reinforcement learning, and Bayesian uncertainty. Through experiments on language models fine-tuned for sentiment analysis, we show that pruning can reduce the effect of hidden backdoors while keeping the model’s normal performance high. In short, we provide a practical way to make open-source language models safer to use, even when their origins are uncertain.
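To give a flavor of the core idea, here is a minimal, hypothetical sketch of the greedy "remove the head that hurts clean accuracy least" loop described above. This is not the paper's actual code: the "heads" below are toy scoring functions standing in for transformer attention heads, and the data, weights, and trigger feature are all invented for illustration.

```python
# Hedged sketch (not the paper's implementation): iteratively mask the
# attention head whose removal costs the least clean accuracy.

def accuracy(head_mask, heads, data):
    """Fraction of examples classified correctly using only unmasked heads."""
    correct = 0
    for x, label in data:
        score = sum(h(x) for h, keep in zip(heads, head_mask) if keep)
        pred = 1 if score > 0 else 0
        correct += int(pred == label)
    return correct / len(data)

def prune_least_important(heads, data, n_prune):
    """Greedily mask the head whose removal hurts clean accuracy least."""
    mask = [True] * len(heads)
    for _ in range(n_prune):
        best_idx, best_acc = None, -1.0
        for i, keep in enumerate(mask):
            if not keep:
                continue
            trial = mask.copy()
            trial[i] = False  # try pruning head i
            acc = accuracy(trial, heads, data)
            if acc > best_acc:
                best_idx, best_acc = i, acc
        mask[best_idx] = False  # commit the least-damaging prune
    return mask

# Toy setup: heads 0 and 1 carry the real signal; head 2 only fires on a
# hypothetical rare "trigger" feature x[2], mimicking a backdoored head.
heads = [
    lambda x: 2.0 * x[0],   # useful head
    lambda x: 1.5 * x[1],   # useful head
    lambda x: -5.0 * x[2],  # suspicious head keyed to a rare feature
]
clean_data = [
    ((1, 1, 0), 1), ((-1, -1, 0), 0), ((1, 0, 0), 1),
    ((-1, 0, 0), 0), ((0, 1, 0), 1),
]

mask = prune_least_important(heads, clean_data, n_prune=1)
print(mask)  # the trigger-keyed head contributes nothing on clean data,
             # so it is pruned first: [True, True, False]
```

Because the trigger feature never appears in clean data, the backdoored head contributes nothing to clean accuracy and is the cheapest head to remove, which is the intuition behind defending without knowing the trigger. The six strategies studied in the paper (gradient-guided, random, reinforcement learning, Bayesian uncertainty, and others) differ mainly in how they score head importance, not in this outer prune loop.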
Why is it important?
As artificial intelligence models become more widespread, they are also becoming attractive targets for hidden attacks. These backdoor attacks can cause serious harm—spreading misinformation, bias, or even malicious outputs when specific triggers are used. Our work offers a new way to defend against these threats that doesn’t rely on knowing the trigger or having access to a clean version of the model. This is one of the first systematic studies to compare multiple attention-head pruning strategies for AI safety. The findings show that pruning, a relatively simple technique, can meaningfully reduce backdoor risks while maintaining accuracy. This approach is timely because it can be integrated into existing AI development pipelines to improve model trustworthiness at low cost.
Perspectives
Writing this paper was an exciting journey because it bridges two worlds—machine learning optimization and cybersecurity. We wanted to show that safety doesn’t always require expensive retraining or access to secret data; sometimes, careful engineering and smart pruning are enough to make models safer. I hope this work inspires other researchers and practitioners to think creatively about defending large language models. As AI continues to shape society, improving its robustness against hidden attacks isn’t just a technical challenge—it’s a responsibility to ensure safer, fairer, and more transparent AI systems.
SANTOSH CHAPAGAIN
Utah State University
This page is a summary of: Pruning Strategies for Backdoor Defense in LLMs, November 2025, ACM (Association for Computing Machinery), DOI: 10.1145/3746252.3760946.