AI red teaming

AI red teaming is the practice of adversarially probing an AI system to find vulnerabilities — jailbreaks, prompt injections, harmful outputs, biased decisions, or factual hallucinations.

What is AI red teaming?

AI red teaming applies the security industry's red-team concept to AI systems: a structured adversarial probe to find weaknesses before they're exploited in production. For frontier language models the targets typically include jailbreaks (bypassing safety policies), prompt injection (untrusted input overriding system instructions), harmful content generation, biased decisions across demographic slices, factual hallucinations, and capability uplift for dangerous workflows.

The discipline is recent — most foundational AI providers (OpenAI, Anthropic, Google DeepMind, Meta) have had internal red teams since 2022-2023. Third-party red-team programs are emerging; the U.S. AI Safety Institute and the UK AI Security Institute formally evaluate frontier models pre-release.

What buyers should look for

Mature AI vendors disclose: a documented red-team program, the categories of harm tested (jailbreaks, prompt injection, biased outputs, dangerous-capability evals), whether the program runs continuously vs only at release, whether external red teams participate, and how findings are tracked and remediated. The absence of any of these is a yellow flag — not disqualifying, but worth raising in procurement diligence.

Red teaming in regulation

The EU AI Act requires "robustness, accuracy, and cybersecurity" testing for high-risk AI systems (Article 15), which most providers interpret to include red teaming. NIST AI RMF Generative AI Profile (NIST AI 600-1) explicitly recommends red teaming under the Measure function. The U.S. Executive Order on Safe, Secure, and Trustworthy AI (October 2023) mandated red-team reporting for foundation models above a compute threshold.