Why AI Safety Theater Makes Everyone Less Safe
AI Summary
What is AI safety theater? AI safety theater is the pattern of visible compliance activity that produces documentation, checklists, and governance ceremonies without changing underlying AI system behavior. Risk registers that never block a deployment. Red-team reports that get filed after shipping. Policy documents that describe behavior no one audits. The theater looks like safety. It does not produce safety.
Why it matters: Real safety work costs money, slows deployment, and sometimes kills features. Theater costs a compliance budget and produces artifacts auditors can tick off. When the two compete for institutional attention, theater wins because its outputs are legible to boards and regulators in ways real engineering work is not.
The rule: Watch what the lab does when the safety finding would delay a release. Theater gets overridden. Real safety engineering blocks the ship. If you cannot name a recent release the lab delayed for safety reasons, you are watching theater.
Table of Contents
- What AI safety theater actually is
- Why theater beats engineering in organizational incentives
- The compliance paperwork trap
- The red team report that ships after release
- Responsible scaling policies and their escape hatches
- The regulatory layer and what it actually demands
- How to tell real safety work from theater
- What theater costs operators and users
- What real AI safety engineering looks like
- What to do inside a theater-heavy industry
- Frequently asked questions
What AI safety theater actually is
The phrase compliance theater predates AI. The concept goes back at least to Bruce Schneier’s writing on security theater after 9/11, where elaborate airport screening procedures produced visible activity without meaningfully reducing hijacking risk. The TSA became the canonical example of security spending that produces compliance artifacts and passenger anxiety without stopping actual attacks.
AI safety theater is the same pattern applied to artificial intelligence governance. A lab publishes a responsible scaling policy. The policy describes capability thresholds that would trigger additional safety work. The lab then ships models that approach or cross those thresholds and releases them anyway, usually with commentary about how the thresholds are directional rather than binding. The policy document still exists. Critics can point to it as evidence of safety commitment. Nothing the policy describes actually gates any deployment.
Oliver Patel, who leads enterprise AI governance at AstraZeneca, wrote one of the clearer public diagnoses of this pattern in a February 2026 essay. His framing is that AI governance splits into two schools, one focused on technical guardrails and one focused on policy documentation, and that treating either as sufficient produces theater. The documentation school produces artifacts without technical teeth. The guardrails school produces code without organizational accountability. Real governance integrates both, and integration is hard enough that most organizations default to one or the other.
The specific failure modes I see repeatedly include safety review processes that run in parallel with shipping decisions rather than upstream of them, risk registers that track findings but do not block releases, red-team engagements that happen after features are already built, and policy documents that describe behavior no one is actually auditing for compliance. Each of these is easy to produce. Each of them creates legible activity for boards, regulators, and journalists. None of them stop the ship when the finding would cost money or time.
Why theater beats engineering in organizational incentives
Theater is not an accident or a lack of sincerity. It is the predictable output of an organizational incentive structure where compliance artifacts are valued and operational safety is not.
Boards want to see that safety is being taken seriously. What they can evaluate is documentation. They cannot read model weights. They cannot audit training pipelines. They can read a policy document, see a risk register, review a governance committee’s meeting minutes, and conclude that the lab is behaving responsibly. The documentation becomes the primary artifact boards interact with, and the organization optimizes for producing what the board can evaluate.
Regulators run into the same problem from outside. An auditor visiting a lab cannot meaningfully assess whether a model is safe. They can verify that policies exist, that training data is logged, that evaluations have been run, that incident response procedures are documented. The regulatory requirements that get written tend to demand the artifacts the regulator can verify, which reinforces the incentive to produce artifacts rather than behavior.
Inside the lab, the safety team is typically smaller and less politically powerful than the capability research team and the commercial team combined. When a capability research team wants to ship a new model and the safety team has findings, the conflict goes to leadership for resolution. Leadership has strong financial and competitive incentives to ship. The safety findings that get treated as blocking are the ones severe enough to survive a political process where the default outcome is shipping anyway.
This produces an internal selection effect. Safety engineers who learn to frame findings in ways that do not block releases get promoted. Safety engineers who insist on blocking releases they believe are unsafe eventually leave or get pushed out. Over time, the institutional memory of what a real safety block looks like fades. The policies still describe a world where safety can halt a shipment. The actual organizational muscle memory for halting one has atrophied.
I am not naming specific companies here because the pattern is structural rather than individual. Every frontier lab operates inside this incentive structure. Some resist it better than others. None of them resist it perfectly. The result is an industry where safety theater is the dominant form of safety work, not because anyone chose theater over engineering, but because the selection pressures favor theater at every decision point.
The compliance paperwork trap
The regulatory landscape in 2026 has amplified the paperwork problem. California’s Transparency in Frontier AI Act took effect January 1, 2026, requiring frontier developers to publish safety and security frameworks and report safety incidents. The New York RAISE Act imposes similar transparency and risk assessment requirements. Colorado’s AI Act takes effect June 30, 2026, with mandatory risk management programs and impact assessments for high-risk systems.
Each of these laws was written by legislators trying to produce real accountability. Each of them ends up creating demand for compliance documentation that can be produced without materially changing AI system behavior. A lab facing a California SB 53 disclosure requirement can comply by publishing a framework document that describes how they think about safety. The document does not have to describe decisions the lab has actually made. It does not have to describe risks that actually constrained a release. It just has to exist and satisfy the disclosure requirement.
The EU AI Act is the most ambitious attempt to avoid this trap. High-risk AI system rules become effective August 2, 2026, though some of the most onerous provisions may be delayed to 2027 or 2028 as technical standards get finalized. The Act attempts to tie compliance obligations to specific system behaviors, not just policy documentation. Whether the Act produces real safety outcomes or another layer of paperwork depends on how enforcement actually runs, which nobody will know until the enforcement actions begin.
Insurance carriers have added their own layer. AI-specific security riders now routinely require documented adversarial red-teaming, model-level risk assessments, and alignment with recognized AI risk management frameworks as conditions of coverage. This creates demand for compliance artifacts independent of what regulators require. The artifacts may or may not reflect real safety work. They do reflect real insurance underwriting requirements, which makes producing them a business necessity regardless of their safety value.
Stack the regulatory requirements, the insurance requirements, the board-level governance expectations, and the public-facing marketing claims together, and the cumulative paperwork burden is substantial, large enough that most of a compliance team's bandwidth goes to producing artifacts. Real safety engineering work, the kind that blocks releases when findings are serious, gets squeezed by the paperwork that has to happen regardless.
The red team report that ships after release
Red teaming is the structured practice of having independent or semi-independent teams attempt to find failure modes, safety violations, or misuse patterns in AI systems before deployment. Real red teaming is a high-value activity. Ritualized red teaming is theater.
The telltale signs of ritualized red teaming are findings that are acknowledged but not acted on, reports that get filed after the release is scheduled regardless of what they contain, and remediation commitments that describe future work rather than current fixes. A red team report that says “the model can be jailbroken via the following seven techniques” and then sits in a drawer while the model ships with those vulnerabilities unaddressed is a ritual, not a control.
I have watched this pattern across multiple model releases from multiple labs. The red team runs. Findings surface. Some findings get fixed. Some findings get categorized as acceptable risk. Some findings get deferred to the next release cycle, where they get deferred again. The categorization of which findings block a release and which do not happens through a political process where the capability team and commercial team push back against the safety team. The safety team usually loses.
The public-facing artifact from this process is a model card or safety evaluation report that describes the red teaming as having occurred and lists some findings as addressed. The report rarely describes the findings that were categorized as acceptable risk. The report never describes the findings that were quietly deferred. Readers of the report cannot tell whether red teaming actually constrained the release or whether it was a ritual.
Real red teaming requires organizational commitment to let red teams block releases on serious findings, independence from the capability team so findings are not politically negotiable, and transparency about what findings were made and how they were resolved. Most labs have one or two of these elements. Very few have all three. The ones that do not have all three are running red team theater, regardless of how professional the red teamers are or how serious their findings might be.
Responsible scaling policies and their escape hatches
Anthropic pioneered the responsible scaling policy concept. The idea is that a lab publishes tiered capability thresholds, each tier requiring progressively more stringent safety measures before deployment. A model that crosses a threshold without the required safety measures in place cannot be deployed. The policy is supposed to function as a commitment device, binding the lab’s future behavior to its current stated values.
In practice, responsible scaling policies have escape hatches built in. The thresholds are described using language like “approximately” or “significantly above.” The triggering evaluations are conducted internally by the same lab that wants to deploy. The remediation requirements are described in terms of processes that have to be run, not outcomes that have to be achieved. When a model approaches a threshold, the lab has multiple levers to avoid triggering the full safety response, including redefining what the threshold means, claiming the model is below the threshold despite evaluation results suggesting otherwise, and accepting the trigger while arguing that the existing safety measures are adequate.
Anthropic’s actual behavior around Claude Opus 4.7 is a case study. I documented the regression in detail in my field report on Opus 4.7. The relevant point here is that Anthropic shipped a model that regressed on matched workloads compared to Opus 4.6, with no public acknowledgment of the regression, and has not rolled back the deployment. A responsible scaling policy that commits to safety-first deployment should, in theory, have produced either a rollback or a public explanation of why the regression was considered acceptable. Neither happened. The policy exists. The behavior the policy describes did not occur.
I am using Anthropic as the example because they have the most public commitment to responsible scaling. OpenAI, Google DeepMind, and other frontier labs have similar frameworks with similar escape hatches. The pattern is industry-wide. Labs publish scaling policies, deployments happen that arguably should have been gated by those policies, the policies are not meaningfully updated, and the public discourse continues as if the policies are functioning as designed.
This is not a critique of any individual person at any of these labs. It is an observation about how incentive structures shape organizational behavior over time. Responsible scaling policies, like red teaming protocols, require organizational commitment to let the policy block a release when the finding is serious. Without that commitment, the policy is another form of theater.
The regulatory layer and what it actually demands
2026 is the year AI regulation became real across multiple jurisdictions. California's Transparency in Frontier AI Act, New York's RAISE Act, Colorado's AI Act, the EU AI Act phase two, Texas's RAIGA, and Trump's December 2025 Executive Order on a national AI policy framework all came into effect or became enforceable inside roughly a twelve-month window.
What these laws actually demand varies. The common thread is disclosure and documentation. California SB 53 requires frontier developers to publish safety frameworks and report incidents. Colorado’s AI Act requires risk management programs and impact assessments for high-risk systems. The EU AI Act imposes conformity assessments, transparency requirements, and specific prohibitions on certain AI applications. Texas RAIGA gives the state attorney general investigative powers to demand detailed system information.
What the laws do not effectively require is specific safety behavior. A lab can comply with all of the above by producing documentation that describes what they do. The laws rarely specify what counts as adequate behavior. They specify what counts as adequate disclosure. The theater-versus-engineering distinction exists upstream of the regulatory layer, and the regulations mostly fail to address it.
The one regulatory development that might shift this is the emergence of specific technical standards that compliance has to demonstrate. ISO 42001 on AI management systems, the NIST AI Risk Management Framework, and the EU AI Act’s anticipated harmonized standards all attempt to specify what adequate practice looks like. If these standards become the enforcement baseline and if they get updated aggressively as AI capability evolves, they could shift compliance from documentation to behavior. That is a large if, and the track record of technical standards keeping pace with rapidly evolving technology is not encouraging.
Trump’s December 2025 Executive Order complicates the landscape further by directing federal agencies to consider preempting state AI laws. If federal preemption succeeds, the patchwork of state requirements could collapse into a single federal framework. Whether that framework imposes more or less real accountability than the state patchwork is unknown. The Order as drafted preserves state authority over child safety, AI compute and data center infrastructure, and state government procurement, while leaving other categories open to preemption.
How to tell real safety work from theater
Distinguishing real safety engineering from theater is harder than it should be because theater is designed to look like real work. Here are the tests that actually discriminate.
First test. Can you name a recent release the lab delayed or canceled for safety reasons? If yes, the safety process has teeth. If no, every release has shipped on the commercial schedule regardless of what the safety process produced, which means the safety process is not gating anything.
Second test. When a safety evaluation surfaces a finding, does the finding produce a concrete remediation that ships before the release, or does it produce a remediation described in future tense? Remediations that ship are real. Remediations described as planned work for a later release are theater. The “we will address this in the next version” pattern is particularly suspect because the next version has its own remediation backlog that will also get deferred.
Third test. Are the red team engagements independent from the capability team, or do they report into the same leadership structure that wants to ship? Independent red teams can block releases. Red teams that report to the person who gets their bonus for shipping cannot.
Fourth test. Does the lab publish the findings of its safety evaluations, including the findings that were not addressed? Labs that publish incomplete reports showing only the addressed findings are running theater. Labs that publish the full finding set including what was left unaddressed are doing something real.
Fifth test. When a model’s real-world behavior reveals failure modes the safety evaluations did not catch, how does the lab respond? Labs that update their evaluation frameworks and acknowledge the gap are doing real work. Labs that explain why the failure mode was not really a failure, or that attribute it to misuse rather than system design, are running theater.
Few labs pass all five tests. Some pass two or three. None that I have observed pass all five consistently. That is the state of AI safety practice in 2026, and it is not specific to any individual lab. It is the industry-wide default.
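These tests reduce to a checklist you can fill in from public information alone. Here is a minimal scoring sketch, assuming nothing beyond the five observations above; the field names, threshold values, and verdict labels are illustrative, not a standard instrument.

```python
from dataclasses import dataclass, fields

@dataclass
class LabSafetySignals:
    """Answers to the five tests, based only on publicly observable evidence."""
    delayed_release_for_safety: bool        # Test 1: a named release delayed or canceled
    remediations_ship_before_release: bool  # Test 2: fixes land pre-release, not "next version"
    red_team_independent: bool              # Test 3: red team sits outside the shipping chain of command
    publishes_unaddressed_findings: bool    # Test 4: full finding set, including accepted risks
    acknowledges_missed_failures: bool      # Test 5: owns the gaps its evaluations did not catch

def theater_score(signals: LabSafetySignals) -> tuple[int, str]:
    """Count passed tests and return a rough verdict. Thresholds are illustrative."""
    passed = sum(getattr(signals, f.name) for f in fields(signals))
    if passed >= 4:
        return passed, "mostly engineering"
    if passed >= 2:
        return passed, "mixed practice"
    return passed, "mostly theater"

# Example: a lab with an independent red team and published findings,
# but no record of ever delaying a release.
print(theater_score(LabSafetySignals(False, False, True, True, False)))
# -> (2, 'mixed practice')
```

The number matters less than the inputs: every field is something an outside observer can check without access to the lab's internals.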
What theater costs operators and users
Theater is not harmless. It imposes real costs on the people who use AI systems and the operators who deploy them.
The first cost is false confidence. A user interacting with a model that has been described as rigorously red-teamed and responsibly scaled reasonably concludes that serious failure modes have been addressed. When those failure modes then appear in production, the user is unprepared because the marketing and compliance artifacts implied they had been solved. The gap between claimed safety and actual safety produces more harm per incident than honest acknowledgment of uncertainty would.
The second cost is regulatory capture. When the regulatory framework demands documentation and the labs produce documentation, the regulatory framework concludes the labs are compliant. The framework then does not evolve to demand behavior because the compliance metrics are being met. The regulatory system becomes an ally of the theater rather than a check on it. This happens slowly and is hard to reverse once it sets in.
The third cost is market distortion. Labs that invest in real safety work face higher costs and slower deployment than labs that invest in theater. In a competitive market, the theater labs can move faster, capture more market share, and use their market position to set the industry standard. The selection pressure favors theater over engineering at the market level, which reinforces the selection pressure that favors theater over engineering inside each individual lab.
The fourth cost is the one I care about most as an operator. Theater erodes trust in the claim that AI safety work is possible at all. When sophisticated users watch labs publish responsible scaling policies, ship capability regressions, and then claim the policies are working as intended, the rational conclusion is that safety discourse is marketing copy rather than substantive practice. This produces cynicism that makes it harder to distinguish the labs doing real work from the ones running theater. Everyone gets painted with the same brush. The labs that are actually trying lose credibility with the audiences who would most benefit from trusting them.
What real AI safety engineering looks like
Real AI safety work is identifiable by its artifacts, its organizational structure, and its outcomes.
The artifacts include mechanistic interpretability research that reaches publishable results, adversarial evaluations with concrete pass-fail criteria, alignment research that reaches product integration rather than staying academic, and documented instances of deployment decisions that were changed based on safety findings. Labs doing real work produce all of these. Labs doing theater produce policy documents and governance committee minutes.
The organizational structure includes safety leadership that reports directly to the CEO rather than through capability research, independent safety advisors with actual authority to escalate, red teams that are structurally separated from the teams whose work they evaluate, and publication norms that let safety researchers disclose findings without needing capability leadership sign-off. Labs with this structure can produce real safety outcomes. Labs without it produce whatever capability leadership decides is acceptable.
The outcomes include shipped safety improvements that made it to production, visible deployment delays on safety grounds, public acknowledgment of safety failures the lab caught internally, and track records of responding to external researchers who find vulnerabilities. Over a multi-year horizon, you can see which labs produce these outcomes and which ones produce press releases about producing these outcomes.
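What a gating process means mechanically is easiest to see in pipeline terms. Here is a minimal sketch of a release gate, assuming a hypothetical evaluation harness; the severity categories, threshold, and function names are placeholders invented for illustration, not any lab's actual tooling.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    name: str
    severity: str    # "low" | "medium" | "critical" (illustrative categories)
    pass_rate: float # fraction of adversarial probes the model handled safely

# Hypothetical gate parameters; real values would come from the lab's own policy.
BLOCKING_SEVERITIES = {"critical"}
MIN_PASS_RATE = 0.99

def release_gate(results: list[EvalResult]) -> tuple[bool, list[str]]:
    """Return (ship, reasons). A real gate fails the deployment pipeline on a
    blocking finding; a theater gate logs the same findings and ships anyway."""
    reasons = []
    for r in results:
        if r.severity in BLOCKING_SEVERITIES and r.pass_rate < MIN_PASS_RATE:
            reasons.append(f"{r.name}: {r.pass_rate:.2%} on a {r.severity} finding")
    return (len(reasons) == 0, reasons)

ship, reasons = release_gate([
    EvalResult("jailbreak_suite", "critical", 0.94),
    EvalResult("pii_leak_probes", "medium", 0.97),
])
if not ship:
    raise SystemExit("Release blocked: " + "; ".join(reasons))
```

The engineering content is the final `raise`, not the report. The same findings exist either way; only one version of the process stops the pipeline.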
I want to name something important here. Anthropic, OpenAI, and Google DeepMind all have individual researchers doing real safety work. The theater criticism is not aimed at those researchers. It is aimed at the organizational structures that make it hard for their real work to produce organizational outcomes. The researchers know this. Many of them have written publicly about their frustration. The safety researchers at frontier labs are generally more critical of their labs’ safety practices than outside observers are, because they can see the gap between what the public-facing materials claim and what the actual deployment decisions look like.
Fixing this requires organizational change at the lab level, not better policy documents. The question is whether the competitive pressure between labs can be redirected from a race to capability toward a race to demonstrated safety. So far that redirection has not happened. It may not happen without regulatory intervention that specifically targets behavior rather than documentation, or without a serious safety incident that forces the industry to reckon with its theater.
What to do inside a theater-heavy industry
If you are using AI tools professionally, the theater-versus-engineering distinction matters for your own risk management even if you cannot change the labs’ behavior.
First, do not trust safety claims you cannot verify. If a lab says its model has been rigorously evaluated, that means whatever the lab’s evaluation process produced. If the evaluation process is theater, the claim is marketing. Verify by looking for the tests I listed above. Can you name a delayed release? Are remediations shipping or deferred? Are red teams independent? Are findings published? If you cannot answer yes to most of these, treat the safety claim as aspirational rather than descriptive.
Second, assume failure modes exist that the lab has not caught. Deploy with monitoring, fallback procedures, and human review for high-stakes decisions. The labs running theater will not catch their own failure modes until users surface them. The labs running real safety work will catch more but will not catch all of them. Either way, you need your own monitoring.
Third, maintain a diversified stack. If your operations depend on a single lab's model, you absorb that lab's failure modes without recourse. My multi-AI stack approach is partly a response to this. Running Claude, GPT, Grok, and Cerebras in parallel means any one lab's safety failure does not shut down my operations. Redundancy is the operator's hedge against theater; a minimal sketch of the failover pattern follows these points.
Fourth, pay attention to the ship-first triage-second pattern. Theater and ship-first triage-second are the same thing from different angles. Theater is what the pattern looks like in public communications. Ship-first triage-second is what the pattern looks like in release cadence. A lab running either is running the other.
Fifth, do not lose sight of the real safety researchers inside each lab. The systemic critique of theater is not a criticism of the people trying to do real work inside the system. Those people exist and deserve support. The goal is to distinguish their work from the organizational wrapper around it, not to dismiss the whole field.
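Here is what the diversified-stack point above looks like as code. A minimal sketch of a provider-failover wrapper, assuming each vendor's SDK is hidden behind the same `call(prompt)` signature; the provider names and error handling are illustrative and not tied to any specific SDK.

```python
import logging
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("model_failover")

# Placeholder provider callables: each would wrap one vendor's SDK behind
# the same signature. These names are hypothetical, not real client libraries.
def call_provider_a(prompt: str) -> str:
    raise NotImplementedError("wire up the first vendor's SDK here")

def call_provider_b(prompt: str) -> str:
    raise NotImplementedError("wire up the second vendor's SDK here")

PROVIDERS: list[tuple[str, Callable[[str], str]]] = [
    ("provider_a", call_provider_a),
    ("provider_b", call_provider_b),
]

def generate_with_failover(prompt: str) -> str:
    """Try each provider in order. Failures are logged so a lab's outage or
    regression shows up in your monitoring, not just in its press release."""
    errors = []
    for name, call in PROVIDERS:
        try:
            return call(prompt)
        except Exception as exc:  # in production, catch each SDK's specific errors
            log.warning("provider %s failed: %s", name, exc)
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

The wrapper itself is trivial; the operational point is that any single lab's failure becomes a logged, recoverable event inside your own stack rather than an outage you discover downstream.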
Frequently asked questions
What is AI safety theater?
AI safety theater is the pattern of visible compliance activity that produces documentation, checklists, and governance ceremonies without changing underlying AI system behavior. Risk registers that never block deployments. Red team reports that get filed after shipping. Policy documents that describe behavior no one audits. The theater looks like safety but does not produce safety outcomes.
Why do AI labs run safety theater instead of real safety engineering?
The organizational incentive structure favors theater. Boards evaluate documentation rather than model weights. Regulators demand artifacts they can verify rather than behavior they cannot. Capability teams outrank safety teams. Over time these selection pressures produce theater even when individual researchers are trying to do real work.
Are all AI safety efforts theater?
No. Individual researchers at frontier labs do real work. Some of that work produces real outcomes. The critique is of organizational structures that make it hard for real work to translate into deployment decisions. Mechanistic interpretability research, alignment research that reaches product, and safety engineering that actually blocks releases on findings are real. Policy documents and governance committees without teeth are theater.
How do I tell if a lab is running safety theater?
Ask five questions. Can you name a recent release the lab delayed for safety reasons? Do safety findings produce remediations that ship before release or get deferred to future work? Are red teams structurally independent from capability teams? Does the lab publish its full evaluation findings including unaddressed ones? Does the lab acknowledge safety failures its evaluations missed? Labs that fail multiple questions are running theater.
Does AI regulation fix the theater problem?
Not by itself. The major regulations in effect or taking effect in 2026 (California SB 53, New York RAISE Act, Colorado AI Act, EU AI Act phase two) mostly demand documentation rather than specific behavior. Documentation requirements can be satisfied by theater. Regulation that addresses theater would need to specify behavior outcomes, not compliance artifacts. That regulation does not currently exist at scale.
What is a responsible scaling policy?
A responsible scaling policy is a published framework from an AI lab describing capability thresholds and the safety measures required before deploying models that cross those thresholds. Anthropic pioneered the concept. In practice, the policies often include escape hatches that allow deployment regardless of threshold status. The policy exists. Its binding effect on deployment decisions varies.
Why does theater cost more than just being inefficient?
Theater creates false confidence among users who reasonably conclude that failure modes have been addressed. It produces regulatory capture where oversight bodies conclude labs are compliant when documentation is produced. It creates market distortion where theater labs move faster and capture market share from engineering labs. It erodes trust in safety discourse generally, making it harder to distinguish real work from performance.
What should I do as an AI user?
Treat safety claims as aspirational unless you can verify them. Deploy AI with monitoring, fallback procedures, and human review for high-stakes decisions. Maintain a diversified multi-model stack so no single lab’s failures shut down your operations. Watch for ship-first triage-second patterns as the operational signature of theater.
Are there labs doing this right?
No lab consistently passes all five discrimination tests. Some pass more than others. Anthropic publishes more alignment research than its peers. Google DeepMind has a stronger internal safety research track record. OpenAI has the broadest deployment experience. Each of these labs also shows clear theater signatures in specific incidents. The goal is not to identify the pure lab but to recognize the mixed nature of the current practice.
Will AI safety practice improve?
Possibly, under one of two conditions. Either regulation evolves to demand behavior rather than documentation, which requires a technical sophistication among regulators that is building but not yet widespread, or the industry experiences a safety incident serious enough to force a reckoning with its theater. Neither path is guaranteed. The default trajectory is continued theater with marginal improvements at the edges.