What is it about?
Large Language Models (LLMs) like ChatGPT are powerful tools, but they can be tricked by malicious inputs—a type of attack called "prompt injection" or "jailbreaking"—into producing harmful content. Existing defenses tend to overreact: they often refuse harmless questions just because the wording sounds suspicious (for example, refusing to explain "how to kill a process" in programming). On the other hand, defenses that try to be smarter usually require expensive retraining of the AI model itself. In this study, we propose a new framework that uses multiple AI agents working together as two teams. The "Analysis Team" examines whether an incoming prompt is genuinely harmful or just superficially suspicious, while the "Generation Team" creates tricky borderline examples to help the Analysis Team learn from its mistakes. Instead of retraining the AI, the system learns by accumulating logs of past judgments and feedback—a technique called In-Context Learning (ICL). Experiments on three LLMs and five public datasets showed that our method improved the F1-score by an average of 16.6 points compared to the unmodified models, while reducing unnecessary refusals of benign questions.
Featured Image
Photo by Aerps.com on Unsplash
Why is it important?
As LLMs are deployed in more areas of daily life—customer service, education, healthcare, coding assistance—two opposing risks grow at the same time. If safety measures are too weak, attackers can extract harmful information. If they are too strict, the AI becomes useless because it refuses ordinary, legitimate questions. Striking the right balance is one of the central challenges in making AI trustworthy. What makes our approach significant is that it improves safety without retraining the model. Traditional safety methods like fine-tuning or Reinforcement Learning from Human Feedback (RLHF) require enormous computational resources, large amounts of labeled data, and specialized expertise—putting them out of reach for many developers, smaller organizations, and researchers working with open-source models. Our framework only needs to accumulate logs of past interactions, which means it can be deployed quickly and updated continuously as new attack methods emerge. This is especially valuable for open-source LLMs that often lack the rigorous safety alignment of large commercial models, helping to make a wider range of AI systems safe enough for real-world use.
Perspectives
This research opens several directions worth exploring further. First, our method works particularly well for LLMs with weaker built-in safety alignment, but its benefit shrinks for models that are already heavily aligned. Understanding how to make ICL-based defenses complement, rather than conflict with, strong internal safety mechanisms is an important next step. Second, we found that the framework still struggles with certain kinds of benign prompts—especially those involving fictional settings (such as questions about video games) or metaphorical expressions (such as "set the party on fire" when DJing). Real human language is full of such ambiguity, and teaching AI agents to recognize context without over-speculating about hidden intentions remains an open challenge. We are considering introducing a third classification label such as "cannot be determined from the text alone" to handle genuinely ambiguous cases more honestly. Third, while the multi-agent design improves accuracy, it also increases inference time and token consumption compared to running a single model. Future work will focus on optimizing log retrieval, reducing the number of agents involved when possible, and exploring whether the framework scales effectively to larger models. Finally, our experiments used only English datasets. Because the framework relies on semantic reasoning rather than language-specific features, we believe it should generalize to other languages—including Japanese—but this needs to be verified empirically. Extending the evaluation across languages and a wider variety of LLMs will help clarify how broadly applicable this approach really is.
Yuichi Sei
University of Electro-Communications
Read the Original
This page is a summary of: Addressing Prompt Injection in Large Language Models via In-Context Learning, Computers Materials & Continua, January 2026, Tsinghua University Press,
DOI: 10.32604/cmc.2026.078188.
You can read the full text:
Contributors
The following have contributed to this page







