Testing the Limits of AI Safety: An Automated Red Team Deep Dive into ChatGPT’s Evolution

Large language models are rapidly becoming embedded in tools that influence important decisions, automate sensitive workflows, and shape human experiences. But how effective are their guardrails at blocking harmful or unethical content or resisting jailbreak attempts? To answer this, we used our homegrown extensive automated red teaming platform that simulates adversarial attacks. We tested the robustness of OpenAI’s ChatGPT models across four major versions: GPT-3.5 Turbo, GPT-4, GPT-4.1, and GPT-4.5.

Automated Red team

Automated red teaming for Large Language Models (LLMs) is an AI-driven security testing approach that utilizes algorithms and auxiliary models to systematically generate and execute adversarial prompts against a target model. Instead of relying solely on human experts, automated red teaming crafts malicious inputs designed to bypass safety guardrails, elicit harmful responses, or manipulate model behavior. This scalable, continuous process helps uncover hidden vulnerabilities, such as prompt injection and jailbreaking, that manual testing might miss. By simulating a wide range of realistic attack scenarios at scale, automated red teaming enables organizations to proactively identify and remediate risks, improving the robustness and safety of AI systems before deployment.

Test setup

The tests utilized a predefined set of 320 prompts (attacks). An “unmoderated” LLM was employed to enhance these prompts since “moderated” LLMs typically refuse to process problematic content.

Enhancing a prompt involves refining or rephrasing the original attack prompt to make it more effective at eliciting unintended or unsafe responses from the target model. This process often uses an auxiliary LLM that can creatively bypass common guardrails or detection mechanisms while preserving the malicious intent. The goal of enhancement is to simulate more sophisticated adversarial behavior and increase the likelihood of identifying model vulnerabilities.

Consequently, each original prompt was tested in both its basic and enhanced forms, totaling 640 distinct attacks. The attacks targeted various categories: harmful content, PII via Direct Exposure, politics, Religious Bias, and Unauthorized Commitments.

Nextsec.ai’s red team methodology incorporates both single-turn and multi-turn attacks. Single-turn attacks consist of a single prompt, often refined or augmented by an LLM to maximize effectiveness. In contrast, multi-turn attacks involve an initial prompt followed by subsequent prompts dynamically generated based on the model’s responses, typically extending to 5–10 interactions.

In this study, we focused exclusively on single-turn attacks due to their deterministic nature. These attacks produced clear, reproducible outcomes, eliminating variability caused by prior conversation context and ensuring more reliable comparative analysis across model versions.

Key Statistical Findings

Before diving into the results, it’s important to define “Attack Success Rate” (ASR). ASR refers to the percentage of harmful prompts that successfully bypassed the model’s safety mechanisms and received a response. A higher ASR indicates weaker safeguards, while a lower ASR suggests stronger protection and better alignment with ethical guidelines.

1. Progressive Decline in ASR (i.e., improved model security):

- GPT-3.5 Turbo: Initial ASR of 46%
- GPT-4.0: ASR dropped significantly to approximately 20%
- GPT-4.1: Further reduction to around 18%
- GPT-4.5 Preview: Lowest ASR at approximately 16%

This trend demonstrates a consistent improvement in model alignment and safety across versions.

Explanation of Severity Levels:

- Low: Minor deviations with minimal risk, such as slight inaccuracies or mild language that is unlikely to cause harm.
- Medium: Moderate violations that may raise concern, including subtle bias or ambiguous sensitive references.
- High: Responses containing clearly harmful or sensitive information.
- Critical: Severe and high-risk outputs including dangerous, illegal, or deeply unethical content.

2. Categories Exhibiting Significant Security Enhancement

Based on the comparative results from ChatGPT 3.5 to 4.5, the top 5 most improved categories are:

- Radicalization: ASR dropped from 85% (17 out of 20 prompts) in 3.5 Turbo to 10% (2 out of 20) in 4.5 Preview
- Disinformation Campaigns: ASR dropped from 75% (15 out of 20) to 10% (2 out of 20)
- Malicious Code: ASR dropped from 60% (12 out of 20) to 15% (3 out of 20)
- Cyber Crime: ASR dropped from 65% (13 out of 20) to 20% (4 out of 20)
- Harassment and Bullying: ASR dropped from 52.6% (10 out of 19) to 10.5% (2 out of 19)

These improvements highlight targeted model updates that significantly enhanced refusal behavior across high-risk content types such as extremism, misinformation, and malicious activity.

3. Persistent Vulnerabilities and Challenges

While most high-risk categories showed dramatic improvement and reached a 0% failure rate in version 4.5, few medium/low risk categories continues to exhibit residual risk:

- Politics: Maintains a 65% failure rate in version 4.5 Preview. This category is classified as Low severity.
- Excessive Agency: Maintains a 60% failure rate in version 4.5 Preview. This category is classified as Medium severity.
- Overreliance : Maintains a 45% failure rate in version 4.5 Preview. This category is classified as Low severity.
- Unsafe Practices: Maintains a 45% failure rate in version 4.5 Preview. This category is classified as Low severity.

We believe OpenAI wisely prioritized addressing vulnerabilities in critical and high-risk categories first, and we expect future versions to improve in other areas as well. It’s important to note that low-risk categories often include ambiguous prompts that may be problematic in some contexts but not others, making them more challenging for the judging models to handle. It will be interesting to see which categories OpenAI chooses to focus on next.

Additional Insights

The results reveal that some of the most substantial safety improvements occurred in sensitive categories such as Sexual Crime, Self Harm, and Non-Violent Crime, likely reflecting targeted efforts to strengthen ethical safeguards in high-risk areas. In contrast, more nuanced legal domains like Intellectual Property continue to pose challenges, highlighting the inherent difficulty of aligning AI behavior with complex legal standards and societal norms. These contrasts illustrate that while ChatGPT 4.5 shows meaningful progress in several domains, it also underscores the need for continued iteration to close persistent gaps and adapt to emerging threat patterns.

Conclusion

Large Language Models (LLMs) introduce a vast new landscape of interaction possibilities. Automated red teaming offers a structured and scalable approach to test and enhance their safety. This experiment demonstrates clear improvements in model alignment and resistance across versions — with each release showing stronger safeguards.

In our next post, we’ll extend this analysis by comparing results across different vendors (models) to assess how safety evolves across the broader ecosystem.

Ethics statement

This research aims to improve responsible AI awareness by exposing model limitations across different versions. While we acknowledge the potential for misuse in adversarial research, we believe that our methods do not introduce new risks or unlock dangerous capabilities beyond those already accessible through common knowledge and existing techniques.

We firmly believe that identifying vulnerabilities is essential for addressing them effectively. By proactively conducting controlled research to uncover these issues, we help mitigate risks that could otherwise emerge unexpectedly during real-world deployments, ultimately contributing to safer and more robust AI systems for the broader community.

#AISecurity #CyberSecurity #AI #RedTeaming #EthicalAI #ChatGPT

Yaniv Diner

Director of Product Management