What is it about?
Large language models such as ChatGPT are now widely used to answer questions, write text, generate code, and support everyday tasks. To reduce harmful responses, these systems include safety guardrails. This article studies how well those guardrails work in practice. We test ChatGPT using both standard safety benchmarks and red teaming, where researchers deliberately try to find cases in which the system behaves in unsafe or unexpected ways. Our results show that ChatGPT performs well on many existing benchmarks, but its safety protections can still fail in realistic interactions. For example, it may produce biased code, give different answers across languages, generate toxic text when asked to imitate certain characters, provide misleading information, or respond to harmful requests when they are framed indirectly. These findings show why large language models need stronger, more realistic safety testing before and after deployment.
Featured Image
Photo by Levart_Photographer on Unsplash
Why is it important?
This work is important because it shows that ChatGPT can appear safe on standard benchmarks while still producing harmful, biased, toxic, or misleading outputs in realistic user interactions. By using red teaming, we reveal how safety guardrails can be bypassed through code generation requests, multilingual questions, persona-based prompts, hallucination-prone questions, and prompt injections. These findings highlight the need for more dynamic and context-aware safety evaluations for future large language models.
Perspectives
As a researcher in AI safety and software security, I see this work as an early step toward understanding how powerful language models behave in realistic user interactions. What stood out to me most is that safety cannot be measured only through standard benchmarks. Models may perform well in controlled tests but still fail when users ask questions in different languages, assign personas, or craft indirect prompts. I hope this article encourages researchers, developers, and users to think more carefully about how large language models should be evaluated, deployed, and used responsibly.
Yujin Huang
University of Melbourne
Read the Original
This page is a summary of: Bypassing Guardrails: Lessons Learned from Red Teaming ChatGPT, ACM Transactions on Software Engineering and Methodology, April 2026, ACM (Association for Computing Machinery), DOI: 10.1145/3747288.