Stop Trying to Jailbreak Claude

China tried it with 24,000 fake accounts. A bug bounty tried it for 1,000 hours.

Jun 15, 2026

Last week, the most powerful Claude ever built went public and was almost instantly shut down by the US government. It is called Fable. The first thing they did was put it on a leash.

Ask it the genuinely hard questions, the ones about security or biology, and it quietly hands you off to a weaker model instead. The truly unleashed version exists. A small, vetted group working alongside governments gets to use it.

You do not.

And here is the part worth being honest about. The reason is not corporate cowardice.

The model that can compress a month of your work into an afternoon is the same model that can find a door into a power grid. That is not two dials. It is one. Turn it up for the genius, and you turn it up for the catastrophe at the exact same time. The leash is not stupid. It is a real and painful dilemma, and anyone pretending otherwise has not thought about it for more than a minute.

But now sit with the shape of what we just built.

A handful of people at the very top hold the real frontier. Everyone else gets the careful version, the one that routes the hard question somewhere safer. Is that fair? And the bigger question, the one that actually keeps me up: is this how it is going to be from now on? A velvet rope running straight down the middle of the most important technology of our lives?

Let me say something gentle here, because I think it gets missed in all the panic about “bad actors.”

If you have ever wished you could just have the full thing, you are not a villain. You are a person who can feel the ceiling and wants the better tool. That is not a character flaw. It is the most human instinct there is.

And thousands of you feel it. I have the receipts.

Every month, strangers find this newsletter the same way. They open Google. They type four words. How to jailbreak Claude. And they land on something I wrote in February, which is now the most-read thing I have ever published. 220,000 impressions. More traffic than essays I bled over for weeks.

Those people are not criminals. They are people who want in.

So let me say plainly what I actually believe. Frontier knowledge should not live behind a velvet rope. The best thinking the world can produce should be something an ordinary person can reach, understand, and put to work in their real life. Not in a pilot program. Not someday. Every single day.

The danger is the opposite world. The one where a few models swallow everything they see, every industry quietly hands over its expertise, and the rest of us end up renting back a commoditized version of what used to be ours.

“A frontier without an ecosystem is not stable.”

Satya Nadella (@satyanadella), Jun 14, 2026

https://t.co/vLmiBKTtX3

He is right. The fair frontier is not one you break into. It is one you can build on top of, where value flows outward instead of pooling at the top, and where you own the loop that turns your own knowledge into something that compounds.

Which brings me to the uncomfortable part.

Wanting the frontier is human, and it is fair. But almost everyone is trying to get it through entirely the wrong door. And almost every person typing those four words into Google is asking the wrong question.

The most-read thing I ever published teaches people to pick the lock. This one is about why the lock was never the point.

What I actually wrote, and what you read into it

The February piece was not a trick. I read Claude’s 23,000-word constitution so you did not have to, and the real takeaway was almost boring.

Refusals come in types. Some are hard limits that nothing unlocks. Most are not. And here is the part that actually matters: Claude does not judge your prompt as a one-off. It treats your request as if a million people sent the same thing, then asks what happens if it says yes to all of them.

So the move was never a secret password. The move was context. Show it you are in the 99.9 percent, not the 0.1 percent. Reframe offensive into defensive, manipulation into persuasion, circumvention into understanding.
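To make the million-senders idea concrete, here is a rough arithmetic sketch. It is not how Claude computes anything. The 99.9 / 0.1 split comes from the paragraph above; the function name, the benefit value, and both harm values are invented purely for illustration.

```python
# Illustrative only: policy_value and every benefit/harm number are assumptions.

def policy_value(senders, misuse_per_1000, benefit_per_benign, harm_per_misuse):
    """Net value of saying yes to the same request for every sender."""
    misuse = senders * misuse_per_1000 // 1000
    benign = senders - misuse
    return benign * benefit_per_benign - misuse * harm_per_misuse

SENDERS = 1_000_000  # "as if a million people sent the same thing"

# Low-stakes request: a misuse does limited damage, so answering everyone wins.
low_stakes = policy_value(SENDERS, 1, benefit_per_benign=1, harm_per_misuse=50)

# High-stakes request: same mix of people, but a single misuse is catastrophic.
high_stakes = policy_value(SENDERS, 1, benefit_per_benign=1, harm_per_misuse=100_000)

print(f"low-stakes net value:  {low_stakes:,}")
print(f"high-stakes net value: {high_stakes:,}")
```

Same crowd, same 0.1 percent, opposite answers. Under these made-up numbers the only thing that changed is how bad one misuse is, which is why some refusals move with context and others do not.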

That is what I wrote.

Here is what a lot of people heard: there is a hidden superpower in there, and the right words set it free.

Call it what it is. The Locked Room Fallacy. The belief that the value lives behind the locked door, and that your job is to get the door open.

It does not. And the most expensive proof of that just played out at the scale of nations.

The Extraction Reflex

There is one instinct underneath all of this. The urge to take the forbidden output instead of building the unglamorous, valuable thing.

I call it the Extraction Reflex.

The person crafting roleplay prompts at midnight is running it. So is a government-backed lab. Same reflex. Wildly different budget.

Watch what happens when that reflex gets a war chest.

China ran your search at industrial scale

In February, Anthropic disclosed something that should reframe this entire conversation. Three AI laboratories, DeepSeek, Moonshot, and MiniMax, ran industrial-scale campaigns generating over 16 million exchanges with Claude through roughly 24,000 fraudulent accounts, in order to train their own cheaper models on its outputs.

That technique has a name. Distillation. You query a strong model a huge number of times, then train a weaker, cheaper model on its answers until the student starts to sound like the teacher.
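For the curious, the mechanics fit in a few lines. This is a toy sketch, not anyone's actual pipeline: the teacher here is a made-up function you own, the student is a straight line fitted by least squares, and real distillation trains a neural network on far richer outputs. The names teacher and distill are hypothetical.

```python
import random

def teacher(x: float) -> float:
    """Stand-in for the strong model: a black box the student can only query."""
    return 3.0 * x + 2.0

def distill(n_queries: int, seed: int = 0) -> tuple[float, float]:
    """Query the teacher, then fit a student line y = w*x + b to its answers."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n_queries)]
    ys = [teacher(x) for x in xs]  # the teacher's recorded answers

    # Closed-form least squares: the student never sees the teacher's internals.
    mean_x = sum(xs) / n_queries
    mean_y = sum(ys) / n_queries
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b

w, b = distill(n_queries=1000)
print(f"student: y = {w:.3f} * x + {b:.3f}")
```

The fitted slope and intercept should land very close to the teacher's 3 and 2. Note what the student has: a copy of the answers as they were when it asked. If the teacher changes tomorrow, the student does not.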

Now look at the two activities side by side.

The handcrafted version is jailbreaking. One clever prompt at a time, trying to pull out something the model would rather not give. The industrial version is distillation. The same extraction, automated, at the scale of millions of queries.

This is not my metaphor stretching to make a point. The US government’s own memo on the campaigns spelled it out: the operations leaned on tens of thousands of proxy accounts and used jailbreaking techniques to pull proprietary behavior out of American models. The lock-pickers and the lab-thieves are running the same toolkit. One does it for a screenshot. The other does it to clone a frontier.

A fair footnote, because I am not interested in cheap outrage: distillation itself is not the crime. It is a normal, legitimate training method, and labs routinely distill their own models to make smaller, cheaper versions. The problem was the theft, the fake accounts, the terms violated. The method is fine. The reflex is the tell.

And here is what almost nobody is asking.

Everyone is fixated on what got extracted. Nobody is asking what could not be.

You can copy an output. You cannot copy the loop that produced it. The proprietary data. The workflows. The distribution. The trust. A distilled model is a photograph of yesterday’s answers. It does not own the machine that generates tomorrow’s.

You can distill an output. You cannot distill a flywheel.

DeepSeek can clone what Claude said last Tuesday. It cannot clone the thing that makes next Tuesday better. That gap is the whole game, and we will come back to it.

Fable just handed you the “unlocked” model, with a leash

Last week, the timing got almost too perfect.

Anthropic shipped Claude Fable 5, the first public model from its top “Mythos” tier. On real software engineering tasks it scores 80.3 percent where the previous flagship sat near 69, and Stripe says it ran a migration on a 50-million-line codebase in a single day, work that would have taken a full team over two months. The most capable thing they have ever released to the public.

Now read the fine print. The moment you wander into cybersecurity, biology, chemistry, or distillation, Fable quietly hands your question to a weaker model instead. The frontier model refuses to behave like a frontier model in exactly the places that matter.

So the thing jailbreakers fantasize about, a maxed-out model with the guardrails off, already exists. It is called Mythos 5. You cannot have it. Not because Anthropic is being precious, but because the same model is good enough to find exploitable holes across every major operating system and browser, so it ships only to vetted defenders and infrastructure providers working alongside the government.

Read that twice. The gate is not hiding a superpower from you. The gate is the superpower. Capability and danger are not two dials. They are the same dial.

And for the prompt-hackers specifically, here is the humbling part. Before release, Anthropic ran an external bug bounty, over 1,000 hours of professional attempts, and nobody found a universal jailbreak.

The best adversarial minds on earth spent the equivalent of six weeks of round-the-clock effort and came up empty. You, at midnight, with a clever grandmother story? Come on.

The trend is the opposite of what the forums tell you. The lock is not getting flimsier. It is getting better, because the thing behind it is getting scarier. Days before shipping Fable, Anthropic publicly argued AI is advancing dangerously fast and called for a coordinated brake pedal, warning about models that could start improving themselves.

Why you should not, the smart version

I am not going to wag a finger at you. “It is naughty” has never stopped anyone interesting.

The real argument is colder. It is a losing trade.

It is a treadmill. They patch, you re-break, they patch again. Forever. Your effort compounds to exactly zero. The lock you defeated tonight is already different tomorrow.

The prize is junk. The forbidden output is usually low-value and available elsewhere, or it is genuinely dangerous, which means dangerous to you too.

The opportunity cost is brutal. You are spending your scarcest resource, creative energy, defeating a guardrail that is not even standing in front of the work that actually pays.

The heat is real now. There is a Deterring American AI Model Theft Act in Congress and the largest labs have formed an alliance specifically to detect and block this behavior. This is no longer a gray-area hobby.

Here is the line I want you to keep.

The refusal you hit is a wall that 99 percent of users never reach, because they are too busy doing the 99 percent of the work that needs no unlocking at all.

And here is the thing nobody chasing the unlock wants to hear.

The moat was never the model. Everyone has the same model now. Your competitor, the kid in his dorm, the lab in Hangzhou. Want proof of the obsession? There is close to fifteen million dollars on Polymarket right now, real money, staked on which company holds the best AI model this month, the one thing that stops mattering the second everyone can rent it.

So if the model is not the edge, what is?

That is the part I usually keep for the people I actually work with.

Break Things

If you genuinely love finding where models crack, good. That instinct is valuable. There is an entire discipline built around it, and unlike jailbreaking, it compounds.

It is called red-teaming. Next to it sit alignment research, interpretability, and evals. The proper channel is responsible disclosure and bug bounties, the exact program that hardened Fable before launch.

Same curiosity. Completely different payoff.

The forum jailbreak gets you a screenshot and a few hours of clout. The disclosure gets you a credential, a bounty, a reputation, and a seat at the table where the actual decisions happen. One is a parlor trick. The other is a career.

This is the “should.” If you are here for science, that is the science. Turn the reflex into a contribution.

The smarter game

The edge is everything you wrap around the model. Proprietary context. Workflows. Agent systems. Distribution. Trust. I made this case in AI Generalist, and it is only more true after Fable: the person who wins is not the one with the best model, it is the one with the best loop.

This is the difference between two kinds of people.

Extractors try to take a snapshot. Integrators build a machine. The jailbreaker and the distiller are both extractors, grabbing a frozen frame of someone else’s capability. The integrator builds something that gets better every week and cannot be photographed.

The genuinely frightening “jailbreak” was never one bad answer anyway. It is an autonomous agent acting on outputs at scale, which is precisely the thing I unpacked in Sub Agents. Which is also exactly why the smart move is to build legitimate agent systems on top of aligned models, not to weaponize a leaked one.

And the arbitrage is sitting in plain sight. While the prompt-hackers chase an unlock, operators are pointing the very same public Fable at real problems and compressing months into days. The capability is right there, in front of the lock, untouched by almost everyone, because almost everyone is too busy trying to get behind it.

The 0.1 percent spend their genius breaking the lock. The labs in China spend billions cloning yesterday’s answers. The 1 percent never touch the lock at all. They are too busy building the thing everyone else will try to copy next year.

Be the 1 percent.

Stop trying to break the model. Start being impossible to copy.

Post-Credit Scene

Book.

The Coming Wave, Mustafa Suleyman. The book that named “the containment problem,” the task of keeping control over powerful technologies, which is the real frame under all of this. Jailbreaking and distillation are just two faces of it.

Podcast.

Latent Space, Jailbreaking AGI with Pliny the Liberator and John V. The strongest version of the other side. They argue guardrails are theater. Listen precisely because you disagree. That is how you find out whether your own position holds.

Essay.

Anthropic, Detecting and Preventing Distillation Attacks. The primary source for the sixteen-million-exchange story. Read the receipts yourself instead of the headlines about them.

Product.

Claude Fable 5. The most powerful Claude ever made public, with a leash that hands the hard questions to a weaker model. Use it with the leash on. It teaches you exactly where the real frontier sits, and how little of it you actually need to win.

Prediction market.

Will OpenAI declare AGI before 2027? The crowd is sitting around 13 percent. The whole “is this the new normal” question, priced in real money instead of hot takes.

Show.

Spider-Noir. Nicolas Cage as a down-on-his-luck private eye in 1930s New York, forced to reckon with his old life as the city’s only superhero. You can watch it in black and white or in color. A story about a man deciding whether to pick the power back up. Also just gorgeous.

Cape Fear (Apple TV+). A man walks out of prison and straight back toward the people who caged him. On theme, if you sit with it: the thing behind the bars rarely stays there.

Thanks for reading

Vlad

Vlad's Newsletter
