Dark room with multiple computer screens displaying code, representing cybersecurity threats

PolicyApril 18, 2026

Anthropic Built Cyber Walls Around Claude 4.7. The Dark Web Had Jailbreaks Ready in Hours.

Anthropic shipped production cyber safeguards for the first time. The underground tested workarounds before the press release was cold.

The AI Post

The AI Post newsroom — delivering AI news at the speed of intelligence.

Anthropic released Claude Opus 4.7 on April 16 with a first: production cybersecurity safeguards baked into a commercial model. The company deliberately reduced the model's offensive cyber capabilities, gated advanced access behind a Cyber Verification Program, and restricted its most powerful model, Mythos, to roughly 50 vetted partners under Project Glasswing.

It took the dark web less than 48 hours to route around all of it.

The Underground Already Had the Playbook

A detailed analysis published by Suzu Labs and syndicated through Security Boulevard lays out the gap between Anthropic's ambition and the underground's reality. On April 13, three days before the Opus 4.7 launch, a user on Dread (the primary Reddit-style forum on Tor) posted blunt operational advice: stop using safety-stripped open-source models. Use frontier models with better prompts. The post described getting Claude, Gemini, and ChatGPT to produce "fully functioning, ready to deploy payloads with just a little bit of effort."

That is not a theoretical concern. It is an operator telling other operators what works in production.

Within the same week, the Suzu Labs report documented a cascade of underground activity: an "ENI GEM" Gemini jailbreak circulating on Reddit and Dread, a "GROK JAILBREAK free 2026" thread on DarkNetArmy drawing 40+ replies in four days, a Russian-language Telegram channel with 170,000 subscribers publishing guidance on using AI to reverse-engineer binaries and find zero-days without source code, and a single-line prompt injection that reportedly breaks both ChatGPT and Gemini being forwarded across Telegram channels.

Hours after the Opus 4.7 announcement, a Russian hacker-for-hire operator on forum_exploit posted: "Just noticed that Opus 4.7 came out today. They say it's more accurate and reasons more based on the first tests. That's interesting." His signature line reads: "I'll hack your target."

The Numbers Tell the Story

Anthropic's own CyberGym benchmark scores are revealing. Mythos scores 83.1%. Opus 4.7, the deliberately nerfed version, scores 73.1%. GPT-5.4 sits at 66.3%. Opus 4.7 is still the most cyber-capable model on any public API. Anthropic reduced the ceiling, not the floor.

And defenders face friction that attackers do not. The Cyber Verification Program requires an application, a vetting process, and approval. Jailbreaks move across Reddit, Dread, DarkNetArmy, and Telegram in hours. The distribution asymmetry is the vulnerability.

Same-Day Proof the Problem Is Architectural

While Anthropic was announcing the new safeguards, security engineer Aonan Guan and Johns Hopkins researchers published a disclosure of a cross-vendor prompt injection attack they call "Comment and Control." A single injection pattern delivered through a GitHub pull request title, issue body, or comment hijacks Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent simultaneously. It steals API keys for all three services plus any secrets exposed in the GitHub Actions runner. Exfiltration runs through GitHub itself. No external infrastructure required.

The timing is brutal. On the same day Anthropic launched its most security-conscious model ever, researchers proved that the tools wrapping these models are vulnerable to trivial injection attacks.

What This Actually Means

Anthropic deserves credit for trying. No other frontier lab has shipped production cyber safeguards into a commercial model. The Cyber Verification Program, Project Glasswing's $100 million in partner credits, and the deliberate capability reduction in Opus 4.7 are real commitments to responsible deployment.

But the underground intelligence makes one thing clear: the wall is in the wrong place. The threat is not frontier model capabilities in isolation. It is the combination of frontier models plus social engineering plus distribution asymmetry plus tool-level vulnerabilities. Anthropic is securing the model layer while the attack surface sits in the tooling, the prompts, and the distribution channels.

OpenAI took a parallel approach yesterday, restricting GPT-5.4-Cyber to its Trusted Access program. The pattern across both labs is identical: gate capabilities for defenders, watch attackers route around the gates.

The adoption window between a model release and underground testing is now measured in hours. The question is no longer whether AI labs can build effective cyber guardrails. It is whether guardrails are the right metaphor at all.

First reported by Suzu Labs via Security Boulevard. Cross-vendor prompt injection disclosure by Aonan Guan, Zhengyu Liu, and Gavin Zhong (Johns Hopkins University).

anthropiccybersecurityclaudejailbreaksdark webopus 4.7

THE AI POST

The Underground Already Had the Playbook

The Numbers Tell the Story

Same-Day Proof the Problem Is Architectural

What This Actually Means